Data Wrangling

Using tidyr

What is tidy data?

Aspect of data cleaning that structures a dataset to facilitate analysis
Resource: follow along tidyr vignette
available as vignette("tidy-data", package="tidyr")
vignette is version of the tidy data paper with updated code

Outline

Different sources of messiness
Key-Value pairs
tidyr: pivot_wider and pivot_longer, separate

Data is usually in a spreadsheet format, but…

There are different ways of encoding the same information:

Option 1:

Patient	Treatment A	Treatment B
John Smith	NA	18
Jane Doe	4	1
Mary Johnson	6	7

Option 2:

Patient	John Smith	Jane Doe	Mary Johnson
A	NA	4	6
B	18	1	7

Option 3:

Patient	Treatment	Score
John Smith	A	NA
Jane Doe	A	4
Mary Johnson	A	6
John Smith	B	18
Jane Doe	B	1
Mary Johnson	B	7

Sources of Messiness (Review)

Five main ways tables of data tend not to be tidy:

Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.

What we have and what we want

Wide format

some variables are spread out across columns.
typically uses less space to display
how you would typically choose to present your data
far less repetition of labels and row elements

Long format

each variable is a column
each observation is a row
is likely not the data’s most compact form

Remember from Graphics

Tidy Data

Each variable is a column
Each observation is a row
Each type of observational unit forms a table

Patient	Treatment	Score
John Smith	A	NA
John Smith	B	18
Jane Doe	A	4
Jane Doe	B	1
Mary Johnson	A	6
Mary Johnson	B	7

What makes data tidy?

Patient	Treatment	Score
John Smith	A	NA
John Smith	B	18
Jane Doe	A	4
Jane Doe	B	1
Mary Johnson	A	6
Mary Johnson	B	7

Treatment and Patient uniquely describe a single row in the dataset.
Treatment and Patient are key variables,
Score is a measurement variable
this makes Treatment-Patient and Score a key-value pair

Key-value pairs (KVP)

Key-Value pairs (KVP) - also attribute-value, field-value, name-value: abstract data representation that allows a lot of flexibility

One way of telling whether a data set is tidy is to check that all keys for a value are aligned in one row:

Key-value pairs (KVP)

Keys/Identifiers:

Identify a record (must be unique)
Example: Indices on an random variable
Fixed by design of experiment (known in advance)
May be single or composite (may have one or more variables)

Values/Measures:

Collected during the experiment (not known in advance)
Usually numeric quantities

Let’s ask for help…

Pivot Monsters drawing by Allison Horst & adjusted by R ladies Sydney

Tidying data - Plan of attack

Messy (1): tidyr functions:

pivot_longer (data,names_to,values_to, ...): take multiple columns and collapse into key-value pairs

pivot_wider (data,names_from,values_from,values_fill = NA): spread a key-value pair across multiple columns.

Messy (2): tidyr function separate (data, col, into, sep = " "): separate one column into multiple columns

Messy (3): dplyr : a combination of the functions discussed previously today

Messy (4): dplyr functionality join (and friends) to combine multiple data sets

Untidy data

french_fries <- read_csv("frenchfries.csv")
head(french_fries)

	time	treatment	subject	rep	potato	buttery	grassy	rancid	painty
61	1	1	3	1	2.9	0.0	0.0	0.0	5.5
25	1	1	3	2	14.0	0.0	0.0	1.1	0.0
62	1	1	10	1	11.0	6.4	0.0	0.0	0.0
26	1	1	10	2	9.9	5.9	2.9	2.2	0.0
63	1	1	15	1	1.2	0.1	0.0	1.1	5.1
27	1	1	15	2	8.8	3.0	3.6	1.5	2.3

This format is not ideal for data analysis

library(ggplot2)

ggplot(french_fries) + 
geom_boxplot(aes(x="1_buttery", y=buttery), fill = "cyan4") +
geom_boxplot(aes(x = "2_grassy", y = grassy), fill = "darkorange2") +
geom_boxplot(aes(x = "3_painty", y = painty), fill = "darkorchid1") +
geom_boxplot(aes(x = "4_potato", y = potato), fill = "chartreuse3") +
geom_boxplot(aes(x = "5_rancid", y = rancid), fill = "deeppink") +
xlab("variable") + ylab("rating")

Let’s start with Messy (1)

Use pivot_longer and pivot_wider

Tidy your data using `pivot_longer`

When pivoting longer, you need to specify: - the cols (identifiers) - the names_to (new column name) - the values_to (measures from cols)

`pivot_longer` Example

french_fries_long <- french_fries %>% 
  pivot_longer(cols = potato:painty, #5:9
               names_to = "variable", 
               values_to = "rating")

time	treatment	subject	rep	variable	rating
1	1	3	1	potato	2.9
1	1	3	1	buttery	0.0
1	1	3	1	grassy	0.0
1	1	3	1	rancid	0.0
1	1	3	1	painty	5.5
1	1	3	2	potato	14.0

This format is better suited for data analysis

ggplot(french_fries_long) + 
  geom_boxplot(aes(x = variable, y = rating, fill = variable))

Long to Wide

In certain applications, we may wish to take a long dataset and convert it to a wide dataset (Perhaps displaying in a table).

time	treatment	subject	rep	variable	rating
1	1	3	1	potato	2.9
1	1	3	1	buttery	0.0
1	1	3	1	grassy	0.0

Tidy your data using `pivot_wider`

When pivoting wider, you need to specify: - the names_from (column name that we want to widen) - the values_from (measures from column we want to widen)

`pivot_wider` Example

french_fries_wide <- french_fries_long %>% 
  pivot_wider(names_from = variable, 
              values_from = rating, 
              values_fill = NA)

time	treatment	subject	rep	potato	buttery	grassy	rancid	painty
1	1	3	1	2.9	0.0	0.0	0.0	5.5
1	1	3	2	14.0	0.0	0.0	1.1	0.0
1	1	10	1	11.0	6.4	0.0	0.0	0.0
1	1	10	2	9.9	5.9	2.9	2.2	0.0

We are now back to our original format

`pivot_wider` (cont.)

We can also use the pivot_wider function to compare replicates within the French Fry dataset

french_fries_wide <- french_fries_long %>% 
  pivot_wider(names_from = rep, 
              values_from = rating)

time	treatment	subject	variable	1	2
1	1	3	potato	2.9	14
1	1	3	buttery	0.0	0
1	1	3	grassy	0.0	0

Compare replicates of tested French Fry Data

french_fries_wide %>%
  ggplot(aes(x = `1`, y = `2`)) + 
  geom_point() +
  facet_wrap(~variable) + 
  geom_abline(colour = "grey50")

Motivating Example (separate function)

states	dates
Lincoln/NE	Aug-2000
Chicago/IL	Sep-2001
Ames/IA	Oct-2002
Nashville/TN	Feb-2001

Basic Idea: `separate`

How do I handle those values in my dataset? Let’s use separate

`separate` Example 1

df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df

     x
1 <NA>
2  a.b
3  a.d
4  b.c

df %>% separate(x, into = c("A", "B"))

     A    B
1 <NA> <NA>
2    a    b
3    a    d
4    b    c

`separate` Example 2

df2 <- data.frame(y = c("Lincoln, NE", "Chicago, IL", "Ames, IA", "Nashville, TN"))
df2

              y
1   Lincoln, NE
2   Chicago, IL
3      Ames, IA
4 Nashville, TN

df2 %>% separate(y, c("City", "State"), sep = ", ")

       City State
1   Lincoln    NE
2   Chicago    IL
3      Ames    IA
4 Nashville    TN

Your Turn

The Iowa Data Portal is a wealth of information on and about the State of Iowa.

The website Campaign Expenditures provides data on campaign expenditures. The code below reads the data into an R session.

url <- "https://data.iowa.gov/api/views/3adi-mht4/rows.csv"
campaign <- readr::read_csv(url, guess_max = 50000)

Assess the ‘messiness’ of the data. List issues that prevent us from working with the data directly. Which of these issues are of type (1) or (2) of messiness?

Problems with the data

Date is text, in the format of Month/Day/Year (Messy 2)
city coordinates are a combination of City name, state, zip code and geographic latitude and longitude. (Messy 2)
expenditure amount is a textual expression, not a number (Messy different)

No Messy 1? - problems of type Messy 1 are typically hard to detect and often up to interpretation/dependent on the analysis to be done.

Your Turn

During the 1870 census data on people’s occupation was collected. The data occupation-1870 contains state-level aggregates of occupation by gender.

Use tidyr to get the data into a long format (use pivot_longer).
Separate the occupation.gender type variable into two variables.
“Spread” (use pivot_wider) the data such that you can draw scatterplots of values for men against women facetted by occupation.

Solution

occupation <- read_csv("https://unl-statistics.github.io/R-workshops/r-format/data/occupation-1870.csv")

occupation_long <- occupation %>% 
  pivot_longer(Agriculture.Male:School.Female,names_to = "Occupation", values_to = "Measurement") %>%
  separate(Occupation, c("Occupation", "Gender"), sep="\\.") %>%
  pivot_wider(names_from = "Gender", values_from = "Measurement")

head(occupation_long, 10)

# A tibble: 10 × 5
   `Area name`       Total.Population Occupation      Male Female
   <chr>                        <dbl> <chr>          <dbl>  <dbl>
 1 Alabama                     996992 Agriculture   226768  64860
 2 Alabama                     996992 Manufacturing  15078   1992
 3 Alabama                     996992 Trade          14345     90
 4 Alabama                     996992 Service        19449  22676
 5 Alabama                     996992 School         38600  38539
 6 Arizona Territory             9658 Agriculture     1284      1
 7 Arizona Territory             9658 Manufacturing    883    156
 8 Arizona Territory             9658 Trade            588      3
 9 Arizona Territory             9658 Service         2979    136
10 Arizona Territory             9658 School            79     70

Solution

occupation_long %>% ggplot() + geom_point(aes(x = Male, y = Female)) + facet_wrap(~Occupation)

Dates and Times

Dates are deceptively hard to work with in R.

Example: 02/05/2012. Is it February 5th, or May 2nd?

Other things are difficult too:

Time zones
POSIXct format in base R is challenging

The lubridate package helps tackle some of these issues.

Basic Lubridate Use

library(lubridate) #<< #New Package Alert!!

now()
today()
now() + hours(4)
today() - days(2)

[1] "2024-02-09 08:16:34 CST"
[1] "2024-02-09"
[1] "2024-02-09 12:16:34 CST"
[1] "2024-02-07"

Parsing Dates

ymd("2013-05-14")
mdy("05/14/2013")
dmy("14052013")
ymd_hms("2013:05:14 14:50:30", tz = "America/Chicago")

[1] "2013-05-14"
[1] "2013-05-14"
[1] "2013-05-14"
[1] "2013-05-14 14:50:30 CDT"

Your Turn

The flights dataset contains information on over 300,000 flights that departed from New York City in the year 2013.

flights <- 
  read.csv("http://unl-statistics.github.io/R-workshops/r-format/data/flights.csv")

Using the flights data, create a new column Date using lubridate. You will need to paste together the columns year, month, and day in order to do this. See the paste function.
Use dplyr to calculate the average departure delay for each date.
Plot the date versus the average departure delay

SOLUTION

flights$date <- ymd(paste(flights$year, flights$month, flights$day, sep = "-"))
delay.dat <- flights %>% 
    group_by(date) %>% 
    summarise(dep_delay = mean(dep_delay, na.rm = TRUE))


ggplot(delay.dat) + geom_line(aes(date, dep_delay))

Your Turn

Read in the billboard top 100 music data:

billboard <- 
  read.csv("http://unl-statistics.github.io/R-workshops/r-format/data/billboard.csv")

Use tidyr to convert this data into a long format.
Use ggplot2 to create something like this:

Data Wrangling

What is tidy data?

Outline

Data is usually in a spreadsheet format, but…

Sources of Messiness (Review)

What we have and what we want

Remember from Graphics

Tidy Data

What makes data tidy?

Key-value pairs (KVP)

Key-value pairs (KVP)

Let’s ask for help…

Tidying data - Plan of attack

Untidy data

This format is not ideal for data analysis

Let’s start with Messy (1)

Tidy your data using pivot_longer

pivot_longer Example

This format is better suited for data analysis

Long to Wide

Tidy your data using pivot_wider

pivot_wider Example

pivot_wider (cont.)

Compare replicates of tested French Fry Data

Motivating Example (separate function)

Basic Idea: separate

separate Example 1

separate Example 2

Your Turn

Problems with the data

Your Turn

Solution

Solution

Dates and Times

Basic Lubridate Use

Parsing Dates

Your Turn

SOLUTION

Your Turn

Tidy your data using `pivot_longer`

`pivot_longer` Example

Tidy your data using `pivot_wider`

`pivot_wider` Example

`pivot_wider` (cont.)

Basic Idea: `separate`

`separate` Example 1

`separate` Example 2