2026-02-12
For observational data (rather than data from a designed study), a descriptive data exploration is often the only thing we can do.
But with any new data set, you should do some initial exploration: what are the assumptions (what have you been told about the data?), and do the implied expectations hold up?
Make sure to read through the EDA chapter
Asking questions is hard
Make sure that your questions and insights respect the data structure
One row in the cookies data set is … ?
The quantity variable is deceptively simple. Why do we have to be careful when drawing conclusions involving this variable?
Stay away from “what is the best, highest, lowest, average, …”: if you get a single number, you do not gain actionable insight.
What are your expectations regarding the data?
One way of asking questions is to rephrase an expectation in the form of ‘(how) does the data deviate from …?’
Generally, when the data does not meet expectations, we find weird stuff …
“Using R and Python, generate a table that shows what proportion of recipes contain each type of ingredient, for the most common 20 ingredients.”
What are the most common 20 ingredients?
What is in the numerator? What is the denominator?
Split: group on ingredient (group_by in R, .groupby in Python)
Apply: create a summary (summarize or agg) with a function that counts the number of distinct recipes (n_distinct or nunique), then divide by the total number of recipes
Combine: putting the pieces back into a single data set happens automatically
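The steps above can be sketched in Python with pandas. This is a minimal illustration, not the course solution: the toy data frame stands in for the cookies data, the column names recipe_id and ingredient are assumptions, and it assumes one row per ingredient listed in a recipe.

```python
import pandas as pd

# Toy stand-in for the cookies data (assumed structure: one row per
# ingredient listed in a recipe). Column names are hypothetical.
cookies = pd.DataFrame({
    "recipe_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "ingredient": ["flour", "sugar", "egg", "flour", "butter",
                   "flour", "sugar", "sugar", "vanilla"],
})

# Denominator: number of distinct recipes in the whole data set.
n_recipes = cookies["recipe_id"].nunique()

# Split on ingredient, then count distinct recipes per ingredient.
# This is the numerator; nunique avoids double-counting an ingredient
# that appears more than once within the same recipe.
recipes_per_ingredient = cookies.groupby("ingredient")["recipe_id"].nunique()

# Keep the most common ingredients (20 in the task; the toy data has fewer)
# and divide by the denominator. The result is a regular data frame again.
top = recipes_per_ingredient.nlargest(20)
proportions = (top / n_recipes).rename("proportion").reset_index()

print(proportions)
```

Recipe 3 lists sugar twice on purpose: counting rows instead of distinct recipes would overstate how many recipes contain it, which is one way a summary can fail to respect the data structure.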