Chocolate Chip Cookies

Author

Your Name

Published

September 30, 2024

Note: This assignment must be submitted through GitHub Classroom.

Reading In the Data

First, read in the CSV data of cookie ingredients. Make sure that the resulting data frame has an appropriate type for each column; these should match the types given in the documentation in the README.md file.
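As a rough illustration of setting column types at read time, here is a minimal pandas sketch. The column names (`recipe_id`, `ingredient`, `quantity`, `unit`), the inline CSV text, and the chosen types are all assumptions made up for this example; the real names and types are the ones documented in README.md.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; these column names and
# values are invented for illustration only.
csv_text = """recipe_id,ingredient,quantity,unit
1,flour,2.25,cup
1,sugar,0.75,cup
2,flour,2.0,cup
2,butter,1.0,cup
"""

# Assumed types; replace them with the types documented in README.md.
column_types = {
    "recipe_id": "int64",
    "ingredient": "string",
    "quantity": "float64",
    "unit": "category",
}

# Passing dtype= makes pandas apply the types while parsing,
# instead of guessing them from the values.
cookies = pd.read_csv(io.StringIO(csv_text), dtype=column_types)

# Check that each column ended up with the intended type.
print(cookies.dtypes)
```

With the real data, the `io.StringIO(...)` argument would be replaced by the path to the CSV file.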

Exploratory Data Analysis

Exploratory data analysis is the process of getting familiar with your dataset. To get started, this blog post provides a nice checklist to get you thinking:

What question(s) are you trying to solve (or prove wrong)?

What kind of data do you have and how do you treat different types?

What’s missing from the data and how do you deal with it?

Where are the outliers and why should you care about them?

How can you add, change or remove features to get more out of your data?

Generating Questions

Generate at least 5 questions you might explore using this database of cookie ingredients.

Skimming the Data

One thing we often want to do during EDA is to examine the quality of the data - are there missing values? What quirks might exist in the dataset?
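A quick first pass at these questions can be done with plain pandas before reaching for a dedicated package. The sketch below uses a small made-up data frame (the column names and values are assumptions, not the real cookie data) to count missing values per column and summarize a numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real columns come from the cookie CSV.
cookies = pd.DataFrame(
    {
        "recipe_id": [1, 1, 2, 2, 3],
        "ingredient": ["flour", "sugar", "flour", None, "butter"],
        "quantity": [2.25, 0.75, 2.0, 1.0, np.nan],
    }
)

# Count missing values in each column.
missing_counts = cookies.isna().sum()
print(missing_counts)

# Summary statistics for a numeric column; extreme min or max values
# can hint at outliers or data-entry quirks.
print(cookies["quantity"].describe())
```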

The skimr package in R, and the similar skimpy package in Python (which has a much better name, in my opinion), can help provide visual summaries of the data.

Install both packages, and read the package documentation (R, Python).

[Part 1] Use each package and generate summaries of your data that require the use of at least some non-default options in each package’s skim function.

[Part 2] Write 1-2 sentences about what you can tell from each summary display you generate. Did you discover anything new about the data?

Generating Tables

Another useful technique for exploratory data analysis is to generate summary tables. You may want to use the dplyr package in R (group_by or count functions), as well as the groupby and count methods in pandas (Python example, R example).

[Part 1] Using R and Python, generate a table that shows what proportion of recipes contain each type of ingredient, for the 20 most common ingredients.

[Part 2] Print out a character string that lists all of the ingredients that do not appear in at least 20 recipes.
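One possible shape for these two tables in pandas is sketched below. It assumes a long-format data frame with one row per recipe and ingredient, using hypothetical column names `recipe_id` and `ingredient`, and it uses tiny toy data with thresholds of 2 rather than the assignment's 20 so that the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical long-format data: one row per (recipe, ingredient) pair.
cookies = pd.DataFrame(
    {
        "recipe_id": [1, 1, 1, 2, 2, 3, 3, 3],
        "ingredient": [
            "flour", "sugar", "egg",
            "flour", "butter",
            "flour", "sugar", "vanilla",
        ],
    }
)

n_recipes = cookies["recipe_id"].nunique()

# Number of distinct recipes that use each ingredient, most common first.
recipes_per_ingredient = (
    cookies.groupby("ingredient")["recipe_id"]
    .nunique()
    .sort_values(ascending=False)
)

# Proportion of recipes containing each of the most common ingredients.
top_n = 2  # the assignment asks for 20
proportions = (recipes_per_ingredient / n_recipes).head(top_n)
print(proportions)

# Ingredients that appear in fewer than min_recipes recipes, as one string.
min_recipes = 2  # the assignment asks for 20
rare = recipes_per_ingredient[recipes_per_ingredient < min_recipes].index
rare_string = ", ".join(sorted(rare))
print(rare_string)
```

Counting distinct `recipe_id` values, rather than rows, keeps a recipe that lists the same ingredient twice from being counted twice.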

(Delete this note, but you can include data values inline in markdown text by using backticks, at least in R. For instance, here is R’s built-in value for pi: 3.1415927. Unfortunately, this doesn’t work in Python using the knitr markdown engine, but you can print the list out in Python anyway using a code chunk.)

Visualization

Using whatever plotting system you are comfortable with in R or Python, create a couple of useful exploratory data visualizations that address one of the questions you wrote above, or another question that you’ve come up with as you’ve worked on this assignment.

[Part 1] Create at least one plot (it doesn’t have to be pretty) that showcases an interesting facet of the data.

[Part 2] Write 2-3 sentences about what you can learn from that plot and what directions you might want to investigate from here.
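As one example of a quick exploratory plot, the sketch below draws a horizontal bar chart of how many recipes use each ingredient, using matplotlib. The data frame, its column names, and the output file name `ingredient_counts.png` are all assumptions for illustration; any plotting system would do.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data in the same long format as the summary tables.
cookies = pd.DataFrame(
    {
        "recipe_id": [1, 1, 1, 2, 2, 3, 3, 3],
        "ingredient": [
            "flour", "sugar", "egg",
            "flour", "butter",
            "flour", "sugar", "vanilla",
        ],
    }
)

# Number of distinct recipes per ingredient, sorted so the most common
# ingredient ends up at the top of the horizontal bar chart.
counts = cookies.groupby("ingredient")["recipe_id"].nunique().sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(list(counts.index), counts.to_numpy())
ax.set_xlabel("Number of recipes")
ax.set_ylabel("Ingredient")
ax.set_title("Recipes using each ingredient (toy data)")
fig.tight_layout()

# Hypothetical output file name.
fig.savefig("ingredient_counts.png")
```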