Data Explorations

Stat 251

2026-02-12

Exploratory Data Analysis

  • For observational data (as opposed to data from a designed study), a descriptive data exploration is often the only thing we can do

  • But with any new data set, you should do some initial exploration: what are the assumptions (what have you been told about the data?), and do the implied expectations hold up?

  • Make sure to read through the EDA chapter

Homework 4 - Reading Data with Cookies

  • Asking questions is hard

  • Make sure that your questions and insights respect the data structure

  • One row in the cookies data set is … ?

  • The quantity variable is deceptively simple. Why do we have to be careful when drawing conclusions involving this variable?

Homework 4 - Asking Questions

  • Stay away from “what is the best, highest, lowest, average, …”: if the answer is a single number, you do not gain actionable insight.

  • What are your expectations regarding the data?

  • One way of asking questions is to rephrase an expectation in the form of ‘(how) does the data deviate from …?’

  • Generally, when the data does not meet expectations, we find weird stuff …

Homework 4 - Proportions of …

  • It is very easy to come up with proportions … and surprisingly(?) tricky to come up with the proportion the question asks for

“Using R and Python, generate a table that shows what proportion of recipes contain each type of ingredient, for the most common 20 ingredients.”

  • What are the most common 20 ingredients?

  • What is in the numerator? What is the denominator?
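To see why the choice of numerator and denominator matters, here is a minimal sketch on made-up data. It assumes a long format with one row per ingredient line in a recipe; the recipe labels, the ingredient names, and the helper `recipes_containing` are hypothetical and are not taken from the cookies data set.

```python
# Hypothetical long-format data: one row per ingredient line in a recipe.
rows = [
    ("A", "flour"), ("A", "sugar"), ("A", "sugar"),  # recipe A lists sugar twice
    ("B", "flour"), ("B", "butter"),
    ("C", "flour"),
]

# Denominator the question asks for: the number of distinct recipes.
n_recipes = len({recipe for recipe, _ in rows})

# Tempting but different denominator: the number of rows.
n_rows = len(rows)


def recipes_containing(ingredient):
    """Numerator: number of distinct recipes listing the ingredient at least once."""
    return len({recipe for recipe, ing in rows if ing == ingredient})


# Proportion of recipes that contain flour.
share_of_recipes = recipes_containing("flour") / n_recipes

# Proportion of rows that are flour: a valid proportion, but a different question.
share_of_rows = sum(ing == "flour" for _, ing in rows) / n_rows

print(share_of_recipes, share_of_rows)
```

Both numbers are proportions, but only the first answers "what proportion of recipes contain flour?". Counting distinct recipes also keeps the repeated sugar line in recipe A from being counted twice.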

Split - Apply - Combine

“Using R and Python, generate a table that shows what proportion of recipes contain each type of ingredient, for the most common 20 ingredients.”

  • split on ingredient (group_by or .groupby)

  • create a summary (summarize or agg) with a function that counts the number of distinct recipes per ingredient (n_distinct or nunique), then divide by the total number of recipes

  • combination back into a dataset happens automatically
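The three steps above could be sketched in pandas roughly as follows. This is an illustration under assumptions, not the homework solution: the data frame `cookies` and its columns `recipe` and `ingredient` are made up, and "most common" is taken here to mean "appears in the most distinct recipes", which is only one possible reading of the question.

```python
import pandas as pd

# Hypothetical long-format data: one row per ingredient line in a recipe.
cookies = pd.DataFrame({
    "recipe": ["A", "A", "A", "B", "B", "C"],
    "ingredient": ["flour", "sugar", "sugar", "flour", "butter", "flour"],
})

# Denominator: total number of distinct recipes in the data.
total_recipes = cookies["recipe"].nunique()

proportions = (
    cookies.groupby("ingredient")                # split on ingredient
    .agg(n_recipes=("recipe", "nunique"))        # apply: distinct recipes per ingredient
    .assign(proportion=lambda d: d["n_recipes"] / total_recipes)
    .sort_values("proportion", ascending=False)  # most common ingredients first
    .head(20)                                    # keep at most the top 20
    .reset_index()                               # ingredient back as a regular column
)

print(proportions)
```

The result has one row per ingredient; pandas assembles it from the per-group summaries without an explicit combine step.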

Homework 4