Data Wrangling
This workshop will prepare you for dealing with messy data by walking you through real-life examples. We will work on improving your programming skills and help you move beyond copy-and-paste. We will discuss how to write functions to reduce duplication in your code and automate common tasks, and how to use iteration to reduce duplication further. You will leave with skills that allow you to tackle problems with more ease.
The course will be data-centric, using many different data sets to illustrate the techniques suited to different problems.
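The idea of replacing copy-and-paste with a function plus iteration can be sketched as follows. This is a minimal illustration in base R, not material from the workshop files: the data frame `df` and the helper `rescale01` are hypothetical names invented for this example.

```r
# Hypothetical example data: three numeric columns on different scales.
df <- data.frame(
  a = c(1, 5, 10),
  b = c(20, 40, 60),
  c = c(0.5, 0.25, 1)
)

# Copy-and-paste version (error-prone, every line must be edited by hand):
# df$a <- (df$a - min(df$a)) / (max(df$a) - min(df$a))
# df$b <- (df$b - min(df$b)) / (max(df$b) - min(df$b))
# df$c <- (df$c - min(df$c)) / (max(df$c) - min(df$c))

# Function version: the rescaling logic is written once.
# rescale01 is a hypothetical helper that maps a numeric vector onto [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Iteration: apply the function to every column instead of repeating the call.
# Assigning into df[] keeps the result a data frame with the same column names.
df[] <- lapply(df, rescale01)

print(df)
```

Writing the logic once means a fix or change happens in a single place, and `lapply` removes the remaining repetition across columns.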
Timetable
Time | Topic | Lectures and Resources |
---|---|---|
9:00 - 9:15 | Introduction | Overview of reading basic file types (.xls, .csv, .txt, .xport) and of more general functions: filter, join, … |
9:15 - 10:05 | Reading Files | Excel files vs. text, data organization. Files: 2-Files.R, midwest.csv, midwest.xls |
10:05 - 10:30 | Break | |
10:30 - 12:15 | Summarizing with dplyr | Pipe operator and dplyr verbs. Files: 3-dplyr.R, pitch.csv |
12:15 - 1:15 | Lunch Break (on your own) | |
1:15 - 2:45 | Tidy Data | Restructuring data with pivot_wider, pivot_longer, and separate. Files: 4-tidyr.R, frenchfries.csv, billboard.csv, flights.csv, occupation-1870.csv |
2:45 - 3:00 | Break | |
3:00 - 3:55 | Joining Data | Joining data frames using SQL-based logic. Files: 5-joining.R, boxoffice.csv, baseball.csv |
3:55 - 4:00 | Evaluation | Help us make the workshops better! |
Your Turn Solutions
Useful Links
- The Split-Apply-Combine Strategy for Data Analysis, Journal of Statistical Software, 2011
- Overview of base apply functions
- dplyr and tidyr Cheat Sheet