Homework: Data Transformations

Author

Your Name

Note: This assignment must be submitted in github classroom.

This week’s assignment uses data from Tidy Tuesday (link) and relates to food consumption and CO2 emissions.

# Credit to Kasia and minorly edited to create output file and test plot
# Blog post at https://r-tastic.co.uk/post/from-messy-to-tidy/
library(rvest)
library(dplyr)

url <- "https://web.archive.org/web/20191224072125/https://www.nu3.de/blogs/nutrition/food-carbon-footprint-index-2018"

# scrape the website
url_html <- read_html(url)

# extract the HTML table
whole_table <- url_html %>% 
  html_nodes('table') %>%
  html_table(fill = TRUE) %>%
  .[[1]]

table_content <- whole_table %>%
  select(-X1) %>% # remove redundant column
  filter(!dplyr::row_number() %in% 1:3) # remove redundant rows

raw_headers <- url_html %>%
  html_nodes(".thead-icon") %>%
  html_attr('title')

tidy_bottom_header <- raw_headers[28:length(raw_headers)]
# tidy_bottom_header[1:10]

raw_middle_header <- raw_headers[17:27]
# raw_middle_header

tidy_headers <- c(
  rep(raw_middle_header[1:7], each = 2),
  "animal_total",
  rep(raw_middle_header[8:length(raw_middle_header)], each = 2),
  "non_animal_total",
  "country_total")

# tidy_headers

combined_colnames <- paste(tidy_headers, tidy_bottom_header, sep = ';')
colnames(table_content) <- c("Country", combined_colnames)

table_content <- table_content %>%
  mutate_at(vars(2:26), as.numeric)

The code above reads the data in from the Wayback Machine’s archived version of the original webpage and gets it into tabular form.

Your job is to complete the following tasks:

Describe the state of the data set, table_content.
- What are the variables in the data set?
  - var1
  - var2
  - (add more as necessary)
- Is it in tidy form? What principles of tidy data does this violate?
  Your answer here
- What steps do you need to take to get it into tidy form?
  2. (add more steps as necessary)
Sketch out what the final (tidy) data set will look like. You can use markdown table syntax or a picture here, but if you use a picture, upload it to imgur and include the image link in this document USING PROPER MARKDOWN SYNTAX.
Write R or python code for each step in the process you identified in #1. Show what the data looks like at each step using head(). Each step should be in a different code chunk.
For each food type (you may have to remove total values), plot the relationship between Carbon output and Consumption (use facets to get separate plots for each type of food). What do you notice for each plot? If you want to reduce carbon emissions, what foods should you eat less of?
Look at the plot above again. Do you have any concerns about the data? The data source?