Data Wrangling

Intro

Brief Outline

Data wrangling gives you the flexibility to work with data that arrives in many different formats.

Data Wrangling Flowchart by Hadley Wickham and Garrett Grolemund

  • Reading in files from various formats (.xls, .xlsx, .csv, .txt, .sps, .xport)
  • Summarizing data using dplyr (a toolbox of data cleaning functions)
  • Tidying data for future analyses and reproducibility
  • Joining Data from different Data Sources

What is Data Wrangling?

Data wrangling is the process of cleaning, organizing, and transforming raw data into a format that analysts can use for prompt decision-making. It is also referred to as data cleaning.

R knowledge Data Wrangling, by Allison Horst

Why do you need “Data Wrangling” Skills?

  • Improve data usability by converting raw data into a compatible format for the end system

  • Quickly build data flows within an intuitive user interface

  • Schedule and automate the data-flow process

  • Integrate various types of information and their sources (like databases, web services, files, etc.)

  • Process very large volumes of data and share data-flow techniques easily

Source

Messy data

Happy families are all alike; every unhappy family is unhappy in its own way. - Leo Tolstoy

Comparison of work benches, by Allison Horst

Five main ways tables of data tend not to be tidy:

  1. Column headers are values, not variable names.

  2. Multiple variables are stored in one column.

  3. Variables are stored in both rows and columns.

  4. Multiple types of observational units are stored in the same table.

  5. A single observational unit is stored in multiple tables.
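
As a rough illustration of the first two problems in the list above, the sketch below tidies a small made-up table with tidyr. The data frame `messy` and its column names (`site_plot`, `count`, and so on) are hypothetical and chosen only for this example.

```r
library(tidyr)

# Hypothetical messy table: the year columns are values, not variable names
# (problem 1), and `site_plot` packs two variables into one column (problem 2).
messy <- data.frame(
  site_plot = c("A_1", "A_2", "B_1"),
  `2019` = c(10, 12, 7),
  `2020` = c(11, 15, 9),
  check.names = FALSE  # keep "2019" and "2020" as literal column names
)

# Problem 1: move the year headers into a `year` column
long <- pivot_longer(messy, cols = c(`2019`, `2020`),
                     names_to = "year", values_to = "count")

# Problem 2: split `site_plot` into separate `site` and `plot` columns
tidy <- separate(long, col = site_plot, into = c("site", "plot"), sep = "_")

tidy
```

After both steps each row holds one observation (one site, plot, and year) and each column holds one variable.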

Motivation

Make Friends with Tidy Data, by Allison Horst

Extended Outline

READING FILE TYPES

  • What file types can be read in with R?
  • Reading in different file types
  • Formatting your data: a tidy data discussion (review from the Graphics lecture)
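
A minimal sketch of reading a file into R, assuming a small made-up table. The executed part uses only base R and a temporary file so that it is self-contained. The commented lines name readers from the readxl and haven packages, with placeholder file names that do not refer to real files.

```r
# Write a small hypothetical table to a temporary .csv file
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(90, 85, 77)), path, row.names = FALSE)

# Base R reader for comma-separated files
scores <- read.csv(path)
str(scores)

# Other formats (not run here; file names are placeholders):
# read.delim("data.txt")            # tab-delimited text, base R
# readxl::read_excel("data.xlsx")   # .xls and .xlsx
# haven::read_sav("data.sav")       # SPSS data files
# haven::read_xpt("data.xpt")       # SAS transport files
```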

DPLYR PACKAGE

  • filter, mutate, select, summarise, group_by, and arrange
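
The six dplyr verbs can be chained in a single pipeline. The sketch below uses a made-up `scores` table, and all column names are hypothetical. It is one possible ordering of the verbs, not a required one.

```r
library(dplyr)

# Hypothetical data: one row per student test score
scores <- data.frame(
  student = c("Ana", "Ben", "Cy", "Dee", "Eli"),
  class   = c("A", "A", "B", "B", "B"),
  points  = c(45, 30, 50, 20, 40),
  max_pts = c(50, 50, 50, 50, 50)
)

class_summary <- scores %>%
  filter(points >= 30) %>%                      # keep rows meeting a condition
  mutate(pct = 100 * points / max_pts) %>%      # add a computed column
  select(class, student, pct) %>%               # keep only the columns needed
  group_by(class) %>%                           # define groups for the summary
  summarise(n = n(), mean_pct = mean(pct)) %>%  # collapse to one row per group
  arrange(desc(mean_pct))                       # sort the result

class_summary
```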

TIDYR PACKAGES

  • What is tidy data?
  • pivot_longer, pivot_wider, and separate functions
  • lubridate package basics
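
A short sketch of reshaping from long to wide and then parsing dates, assuming a made-up table of measurements. The data frame and its column names are hypothetical.

```r
library(tidyr)
library(lubridate)

# Hypothetical long table: one row per date and measurement type
long <- data.frame(
  date    = c("2021-01-15", "2021-01-15", "2021-02-20", "2021-02-20"),
  measure = c("temp", "rain", "temp", "rain"),
  value   = c(3.5, 12, 5.1, 8)
)

# pivot_wider: spread the measurement names into their own columns
wide <- pivot_wider(long, names_from = measure, values_from = value)

# lubridate: parse the text dates, then pull out date components
wide$date  <- ymd(wide$date)
wide$month <- month(wide$date)
wide$year  <- year(wide$date)

wide
```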

JOINING DATASETS

  • Basic set theory logic (joining/combining datasets)
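
The set logic behind joins can be sketched with two small made-up tables that share a key column. The table names and the `id` key are hypothetical.

```r
library(dplyr)

# Hypothetical tables sharing the key column `id`
patients <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
visits   <- data.frame(id = c(2, 3, 4), clinic = c("North", "South", "East"))

both      <- inner_join(patients, visits, by = "id")  # ids in both tables (intersection)
all_left  <- left_join(patients, visits, by = "id")   # every patient, NA where no visit
either    <- full_join(patients, visits, by = "id")   # ids in either table (union)
no_visits <- anti_join(patients, visits, by = "id")   # patients with no match in visits
```

Here `inner_join` keeps ids 2 and 3, `full_join` keeps ids 1 through 4, and `anti_join` returns only the patient with id 1.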