Your Turn Solutions

Reading files

1. Have a look at the parameters of read_csv to solve the following problems:

Read the first two lines of the file into an object called midwest_names

Read everything EXCEPT the first two lines into an object called midwest_data

Solution

# Adjust the path/URL if you are reading a local copy of the file

library(tidyverse)
midwest_names <- read_csv(
  "http://unl-statistics.github.io/R-workshops/r-format/data/midwest.csv", n_max= 2, 
  col_names = FALSE)
midwest_data <- read_csv(
  "http://unl-statistics.github.io/R-workshops/r-format/data/midwest.csv", 
  skip = 2, col_names = FALSE)
head(midwest_data)
# A tibble: 6 × 11
  X1       X2       X3 X4        X5 X6        X7 X8        X9 X10      X11
  <chr>    <chr> <dbl> <chr>  <dbl> <chr>  <dbl> <chr>  <dbl> <chr>  <dbl>
1 1994-Nov <NA>  NA    <NA>   NA    <NA>   NA    28-Nov  1.12 <NA>   NA   
2 1994-Dec 5-Dec  1.09 12-Dec  1.06 19-Dec  1.04 26-Dec  1.03 <NA>   NA   
3 1995-Jan 2-Jan  1.02 9-Jan   1.05 16-Jan  1.03 23-Jan  1.05 30-Jan  1.06
4 1995-Feb 6-Feb  1.04 13-Feb  1.04 20-Feb  1.03 27-Feb  1.05 <NA>   NA   
5 1995-Mar 6-Mar  1.05 13-Mar  1.04 20-Mar  1.05 27-Mar  1.06 <NA>   NA   
6 1995-Apr 3-Apr  1.09 10-Apr  1.11 17-Apr  1.12 24-Apr  1.15 <NA>   NA   
head(midwest_names)
# A tibble: 2 × 11
  X1         X2       X3    X4       X5    X6      X7    X8    X9    X10   X11  
  <chr>      <chr>    <chr> <chr>    <chr> <chr>   <chr> <chr> <chr> <chr> <chr>
1 Year-Month Week 1   <NA>  Week 2   <NA>  Week 3  <NA>  Week… <NA>  Week… <NA> 
2 <NA>       End Date Value End Date Value End Da… Value End … Value End … Value

2. The NHANES (National Health and Nutrition Examination Survey) publishes data in the SAS xport format:

Scroll to the bottom, choose one of the datasets (Demographics, Examination, etc.). Download the Data file (XPT)

Use read_xpt() from the haven package (or read.xport() from foreign) to load the file into R

Briefly examine the dataset you’ve imported (use head or tail, etc)

library(haven) 
sas.data <- read_xpt("../data/DEMO_G.XPT")
sas.data %>% head(5)
# A tibble: 5 × 48
   SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 62161        7        2        1       22       NA        3        3        2
2 62162        7        2        2        3       NA        1        1        1
3 62163        7        2        1       14       NA        5        6        2
4 62164        7        2        2       44       NA        3        3        1
5 62165        7        2        2       14       NA        4        4        2
# ℹ 39 more variables: RIDEXAGY <dbl>, RIDEXAGM <dbl>, DMQMILIZ <dbl>,
#   DMQADFC <dbl>, DMDBORN4 <dbl>, DMDCITZN <dbl>, DMDYRSUS <dbl>,
#   DMDEDUC3 <dbl>, DMDEDUC2 <dbl>, DMDMARTL <dbl>, RIDEXPRG <dbl>,
#   SIALANG <dbl>, SIAPROXY <dbl>, SIAINTRP <dbl>, FIALANG <dbl>,
#   FIAPROXY <dbl>, FIAINTRP <dbl>, MIALANG <dbl>, MIAPROXY <dbl>,
#   MIAINTRP <dbl>, AIALANGA <dbl>, WTINT2YR <dbl>, WTMEC2YR <dbl>,
#   SDMVPSU <dbl>, SDMVSTRA <dbl>, INDHHIN2 <dbl>, INDFMIN2 <dbl>, …

Summarizing with dplyr

1. Use filter to get a subset of the pitch dataset

Ex. Filter the data down to left-handed pitchers who throw a curveball with a spin rate of at least 3200 rpm (spin_rate) and where the play results in a ball (action_result).

Pipe (%>%) the subset into ggplot2 and create a plot

Solution

pitch <- read_csv("https://unl-statistics.github.io/R-workshops/r-format/data/pitch.csv")

pitch %>%
  filter(spin_rate >= 3200 & pitcher_hand == "L" & pitch_type == "CU") %>%
  ggplot(aes(x=action_result, y=spin_rate)) +
  geom_point(aes(color=as.factor(playerid)))  #Your graph could be different

2. Look up the help file for the function slice.

Use slice on the arranged pitch dataset to select a single row

Use slice to select multiple rows

pitch %>% 
  arrange(desc(playerid), spin_rate) %>% 
  slice(11)
# A tibble: 1 × 8
   ...1 playerid pitcher_hand pitch_type spin_rate ab_result action_result adj_h
  <dbl>    <dbl> <chr>        <chr>          <dbl> <chr>     <chr>         <dbl>
1   664    15540 R            CU             3042. "\\N"     F                 0
pitch %>% 
  arrange(desc(playerid), spin_rate) %>% 
  slice(1:5)
# A tibble: 5 × 8
   ...1 playerid pitcher_hand pitch_type spin_rate ab_result action_result adj_h
  <dbl>    <dbl> <chr>        <chr>          <dbl> <chr>     <chr>         <dbl>
1   966    16669 R            CU             3056. "\\N"     C                 0
2  1136    16669 R            CU             3072. "\\N"     B                 0
3  1165    16669 R            CU             3075. "\\N"     B                 0
4  1411    16669 R            CU             3165. "\\N"     C                 0
5   649    15686 R            CU             3033. "K"       S                 0

3. Select only playerid, spin_rate, and action_result

Group by both playerid and action_result and find the mean and standard deviation of spin_rate

Pipe (%>%) the summaries into a ggplot2 histogram

Solution

pitch %>%
  select(playerid, spin_rate, action_result) %>%
  group_by(playerid, action_result) %>%
  summarise(mean_spin = mean(spin_rate), sd_spin = sd(spin_rate)) %>%
  ggplot(aes(x = mean_spin)) + geom_histogram()

4. Based on your (limited) knowledge of baseball, decide what counts as a “successful” curveball. Then determine which pitchers threw the most successful curveballs!

Note: There are many different ways of answering this question. None are wrong, and you don’t need to know anything about baseball to try. Consider the criteria a successful curveball needs to meet. Ex. A successful curveball needs a velocity above 90 mph and a spin rate over 3100 rpm.

Your answer may be different

Solution

# Consider a success to be any strike (S), catch (C), or foul ball (F)
SScurve <- pitch %>%
  select("playerid", "action_result", "ab_result", "adj_h") %>%  
  arrange(desc(playerid)) %>% 
  mutate(successfulCU = ifelse(
    (action_result %in% c("C","S","F")), 1, 0)) %>%
  group_by(playerid) %>% 
  mutate(totalSSCU=sum(successfulCU)) %>% 
  mutate(percentSSCU=totalSSCU / n())

SScurve %>% head(4)
# A tibble: 4 × 7
# Groups:   playerid [1]
  playerid action_result ab_result adj_h successfulCU totalSSCU percentSSCU
     <dbl> <chr>         <chr>     <dbl>        <dbl>     <dbl>       <dbl>
1    16669 C             "\\N"         0            1         2         0.5
2    16669 B             "\\N"         0            0         2         0.5
3    16669 B             "\\N"         0            0         2         0.5
4    16669 C             "\\N"         0            1         2         0.5
#Calculate successful curveball percentages
percentages <- SScurve %>% 
  distinct(playerid, totalSSCU, percentSSCU)

#Look at some graphs to see what the data actually looks like now.
ggplot(data = percentages) + 
  geom_point(aes(x = percentSSCU, y = totalSSCU,
                 colour = playerid))

#Filter down to get the best pitchers with a minimum of 50 curveballs thrown 
percentages %>% 
  filter(totalSSCU > 50) %>% 
  arrange(desc(percentSSCU)) %>% ggplot() +
   geom_point(aes(x = percentSSCU, y = totalSSCU, colour = playerid))

5. The dataset ChickWeight is part of the core packages that come with R.

Create a line plot with each line representing the weight of each Chick

Solution

ChickWeight <- ChickWeight

ChickWeight %>% 
  ggplot(aes(x=Time, y=weight, group=Chick, color=Diet)) +
  geom_line() + 
  facet_wrap(~Diet)

Focus on weight on day 21. Draw side-by-side dotplots of weight by diet.

ChickWeight %>%   
  filter(Time==21) %>% 
  ggplot(aes(x=Diet)) +
  geom_point(aes(y=weight, color=Diet), size=3)

Bonus: Use summarize to compute the average weight on day 21 under each diet. Overlay the dotplots with error bars around the average weight under each diet (see ?geom_errorbar)

ChickW1 <- ChickWeight %>% 
  filter(Time==21) %>% 
  group_by(Diet) %>% 
  summarize(mean_weight = mean(weight, na.rm=TRUE),
            se_weight = sd(weight, na.rm=TRUE)/sqrt(n())) # standard error of the mean

ChickWeight %>% 
  filter(Time==21) %>% 
  ggplot(aes(x=Diet)) +
  geom_point(aes(y=weight), size=2) +
  geom_errorbar(data = ChickW1, 
                aes(ymin = mean_weight - 1.96*se_weight, 
                    ymax = mean_weight + 1.96*se_weight, 
                    colour = Diet), width = .3) +
  geom_point(data=ChickW1, aes(y=mean_weight, color=Diet), size=3)

Tidy Data

1. The Iowa Data Portal is a wealth of information on and about the State of Iowa.

Assess the ‘messiness’ of the data. List issues that prevent us from working with the data directly. Which of these issues are of type (1) or (2) of messiness?

url <- "https://data.iowa.gov/api/views/3adi-mht4/rows.csv"
campaign <- readr::read_csv(url)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Solution
- Date is text, in the format Month/Day/Year (Messy 2)
- City coordinates are a combination of city name, state, zip code, and geographic latitude and longitude (Messy 2)
- Expenditure amount is a textual expression, not a number (messy, but of a different kind)
- No Messy 1? Problems of type Messy 1 are typically hard to detect and are often up to interpretation or dependent on the analysis to be done.
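
A minimal sketch of how these issues could be cleaned up, assuming the relevant columns are named Date and Expenditure Amount (check names(campaign) for the actual names in the download):

library(lubridate)

# Hypothetical column names; adjust to match names(campaign)
campaign_clean <- campaign %>%
  mutate(Date = mdy(Date),                                           # Month/Day/Year text -> Date
         `Expenditure Amount` = parse_number(`Expenditure Amount`))  # strip "$" and "," -> numeric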

2. During the 1870 census, data on people’s occupations were collected. The dataset occupation-1870 contains state-level aggregates of occupation by gender.

Use tidyr to get the data into a long format.

Separate the occupation.gender type variable into two variables.

Spread the data such that you can draw scatterplots of values for men against women, faceted by occupation.

occupation <- read_csv("https://unl-statistics.github.io/R-workshops/r-format/data/occupation-1870.csv")
occupation_long <- occupation %>% 
  pivot_longer(Agriculture.Male:School.Female, names_to = "Occupation", values_to = "Measurement") %>%
  separate(Occupation, c("Occupation", "Gender"), sep="\\.") %>%
  pivot_wider(names_from = "Gender", values_from = "Measurement")
head(occupation_long, 10)
# A tibble: 10 × 5
   `Area name`       Total.Population Occupation      Male Female
   <chr>                        <dbl> <chr>          <dbl>  <dbl>
 1 Alabama                     996992 Agriculture   226768  64860
 2 Alabama                     996992 Manufacturing  15078   1992
 3 Alabama                     996992 Trade          14345     90
 4 Alabama                     996992 Service        19449  22676
 5 Alabama                     996992 School         38600  38539
 6 Arizona Territory             9658 Agriculture     1284      1
 7 Arizona Territory             9658 Manufacturing    883    156
 8 Arizona Territory             9658 Trade            588      3
 9 Arizona Territory             9658 Service         2979    136
10 Arizona Territory             9658 School            79     70

3. The flights dataset contains information on over 300,000 flights that departed from New York City in the year 2013.

Using the flights data, create a new column Date using lubridate. You will need to paste together the columns year, month, and day in order to do this. See the paste function.

Use dplyr to calculate the average departure delay for each date.

Plot the date versus the average departure delay

library(lubridate)
flights <- read.csv("http://unl-statistics.github.io/R-workshops/r-format/data/flights.csv")

flights$date <- ymd(paste(flights$year, flights$month, flights$day, sep = "-"))
delay.dat <- flights %>% 
  group_by(date) %>% 
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE))

ggplot(delay.dat) + geom_line(aes(date, dep_delay))

4. Read in the billboard top 100 music data:

Use tidyr to convert this data into a long format.

Use ggplot2 to create a plot similar to the one shown in the exercise.
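
A rough sketch for this exercise, using the copy of the Billboard data that ships with tidyr (tidyr::billboard, one row per track with weekly chart ranks in columns wk1 through wk76) in place of the workshop file; the plot is just one plausible version of the target figure.

billboard_long <- billboard %>%
  pivot_longer(starts_with("wk"),
               names_to = "week", names_prefix = "wk",
               values_to = "rank", values_drop_na = TRUE) %>%
  mutate(week = as.integer(week))

# One line per track, following its chart position over time (rank 1 at the top)
ggplot(billboard_long, aes(x = week, y = rank, group = track)) +
  geom_line(alpha = 0.3) +
  scale_y_reverse()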

Joining Data

1. Load the Lahman package into your R session.

Join (relevant pieces of) the People (formerly Master) data set and the HallOfFame data.

Output the names of individuals with more than 15 attempts (an attempt is each time the playerID shows up in the Hall of Fame data). Make sure to deal with missing values appropriately.

Solution

library(Lahman)

Voted <- HallOfFame %>% 
  filter(!is.na(ballots)) %>% 
  group_by(playerID) %>% 
  arrange(yearID) %>% 
  mutate(attempt = order(yearID))

Voted <- left_join(Voted, 
                   People %>% dplyr::select(playerID, nameFirst, nameLast), 
                   by = "playerID") %>% 
  filter(attempt > 15)

Names <- Voted %>% mutate(FullName = paste(nameFirst, nameLast))
Names$FullName %>% unique()
 [1] "Rabbit Maranville" "Dazzy Vance"       "Ray Schalk"       
 [4] "Ross Youngs"       "Edd Roush"         "Zack Wheat"       
 [7] "Hank Gowdy"        "Eppa Rixey"        "Hack Wilson"      
[10] "Red Faber"         "Red Ruffing"       "Phil Rizzuto"     
[13] "Pee Wee Reese"     "Mickey Vernon"     "Don Newcombe"     
[16] "Al Dark"