Data Wrangling

Reading Files

Project Management

Important to understand your file structure and where you save everything
Working directory
“Default location where R will look for files you want to load and where it will put any files you save” source

#Returns the file path of the current working directory
getwd()

#Set Working Directory
setwd()

Easiest way to set Working Directory:
- Session -> Set Working Directory -> Choose Directory
- Point and Click Method

Excel

Data Files in Excel

Formats .xls and .csv - what’s the difference?
File extensions .xls and .xlsx are proprietary Excel formats, binary files
csv is an extension for Comma Separated Value files. They are text files - directly readable.
Motivating Example: Gas prices in the Midwest since 1994

MOTIVATING EXAMPLE

Gas Prices in the Midwest

Year-Month	Week 1	…3	Week 2	…5	Week 3	…7	Week 4	…9	Week 5	…11
NA	End Date	Value	End Date	Value	End Date	Value	End Date	Value	End Date	Value
1994-Nov	NA	NA	NA	NA	NA	NA	28-Nov	1.122	NA	NA
1994-Dec	5-Dec	1.086	12-Dec	1.057	19-Dec	1.039	26-Dec	1.027	NA	NA
1995-Jan	2-Jan	1.025	9-Jan	1.046	16-Jan	1.031	23-Jan	1.054	30-Jan	1.055
1995-Feb	6-Feb	1.045	13-Feb	1.04	20-Feb	1.031	27-Feb	1.052	NA	NA
1995-Mar	6-Mar	1.053	13-Mar	1.042	20-Mar	1.048	27-Mar	1.065	NA	NA

What’s the difference `read_csv` vs. `read_*` ??

read_csv is just one way to read a file using the readr package:

read_delim: the most generic function. Use the delim argument to read a file with any type of delimiter
read_tsv: read tab separated files
read_lines: read a file into a vector that has one element per line of the file
read_file: read a file into a single character element
read_table: read a file separated by space

Your Turn

Have a look at the parameters of read_csv to solve the following problems:

Read the first two lines of the file into an object called midwest_names
Read everything EXCEPT the first two lines into an object called midwest_data

Hint: Visit http://unl-statistics.github.io/R-workshops/r-format/data/midwest.csv to download the data into your current working directory (getwd()) and look at it in Excel to help figure out what the issue is.

Your Turn Solutions

midwest_names <- read_csv(
"midwest.csv", 
  n_max= 2, col_names = FALSE)


midwest_data <- read_csv(
"midwest.csv", 
  skip = 2, col_names = FALSE)

Quick Lesson in Data Cleaning

library(lubridate) # works with dates
library(tidyverse)
values <- c(midwest_data$X3, midwest_data$X5, midwest_data$X7, 
            midwest_data$X9, midwest_data$X11)
dates <- c(paste(midwest_data$X1, midwest_data$X2, sep = "-"), 
           paste(midwest_data$X1, midwest_data$X4, sep = "-"),
           paste(midwest_data$X1, midwest_data$X6, sep = "-"),
           paste(midwest_data$X1, midwest_data$X8, sep = "-"),
           paste(midwest_data$X1, midwest_data$X10, sep = "-"))

dates <- dates[!is.na(values)] #removing NAs
values <- values[!is.na(values)] #removing Nas

dates <- ymd(dates)

midwest_gas <- data_frame(date = dates, price = values)
midwest_gas <- arrange(midwest_gas, dates)

Now we can make a plot!

library(ggplot2)

ggplot(midwest_gas, aes(x = date, y = price)) + geom_line()

Reading Excel Data

Download the midwest.xls file to your current working directory (getwd())

library(readxl)
midwest2 <- read_excel("https://github.com/unl-statistics/R-workshops/raw/main/r-format/data/midwest.xls") 
head(midwest2)

Something isn’t quite right here…

Year-Month	Week 1	...3	Week 2	...5	Week 3	...7	Week 4	...9	Week 5	...11
NA	End Date	Value	End Date	Value	End Date	Value	End Date	Value	End Date	Value
1994-Nov	NA	NA	NA	NA	NA	NA	39779	1.1220000000000001	NA	NA
1994-Dec	39786	1.0860000000000001	39793	1.0569999999999999	39800	1.0389999999999999	39807	1.0269999999999999	NA	NA

Reading Excel Data: Attempt #2

The 2nd row of the excel file also contains variable names.

Let’s skip the first this time. We can then rename the first column.

midwest2 <- read_excel("https://github.com/unl-statistics/R-workshops/raw/main/r-format/data/midwest.xls", skip = 1) #<<
names(midwest2)[1] <- "Year-Month"
head(midwest2)

Year-Month	End Date...2	Value...3	End Date...4	Value...5	End Date...6	Value...7	End Date...8	Value...9	End Date...10	Value...11
1994-Nov	NA	NA	NA	NA	NA	NA	2012-11-28	1.122	NA	NA
1994-Dec	2012-12-05	1.086	2012-12-12	1.057	2012-12-19	1.039	2012-12-26	1.027	NA	NA
1995-Jan	2012-01-02	1.025	2012-01-09	1.046	2012-01-16	1.031	2012-01-23	1.054	2012-01-30	1.055

Better, but not yet perfect…

Other Data Types

`haven` Package

Other file formats can be read using the functions from package haven
SPSS: read_spss
SAS: read_xpt
Stata: read_dta

Useful Packages for Other Data Files

foreign package reads from Minitab, Systat, etc. files.
rgdal package reads shapefiles (for maps)
sf package reads arcgis files, etc. sf package

R knowledge Spatial, by Allison Horst

Your Turn (~2 minutes)

The NHANES (National Health and Nutrition Survey) publishes data in the SAS xport format:

https://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2013

Scroll to the bottom, choose one of the datasets (Demographics, Examination, etc.). Download the Data file (XPT)
Use read_xpt() to load the file into R
Briefly examine the dataset you’ve imported (use head or tail, etc)

Solution

library(haven) 
sas.data <- read_xpt("../data/DEMO_G.XPT")
sas.data %>% head(2)

# A tibble: 2 × 48
   SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 62161        7        2        1       22       NA        3        3        2
2 62162        7        2        2        3       NA        1        1        1
# ℹ 39 more variables: RIDEXAGY <dbl>, RIDEXAGM <dbl>, DMQMILIZ <dbl>,
#   DMQADFC <dbl>, DMDBORN4 <dbl>, DMDCITZN <dbl>, DMDYRSUS <dbl>,
#   DMDEDUC3 <dbl>, DMDEDUC2 <dbl>, DMDMARTL <dbl>, RIDEXPRG <dbl>,
#   SIALANG <dbl>, SIAPROXY <dbl>, SIAINTRP <dbl>, FIALANG <dbl>,
#   FIAPROXY <dbl>, FIAINTRP <dbl>, MIALANG <dbl>, MIAPROXY <dbl>,
#   MIAINTRP <dbl>, AIALANGA <dbl>, WTINT2YR <dbl>, WTMEC2YR <dbl>,
#   SDMVPSU <dbl>, SDMVSTRA <dbl>, INDHHIN2 <dbl>, INDFMIN2 <dbl>, …