library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
Practice Final Exam 2025
Note: This assignment must be submitted in GitHub Classroom.
Note: The points on this document are intended to provide you with an idea of how points will be allocated on the actual exam. This “Homework” assignment is worth 10 points.
The real exam is due at 3pm on May 15, 2025.
I will grade the exam as it is pushed to GitHub. I cannot grade the exam that exists on your computer, so please double-check your GitHub repository to ensure that the file on GitHub is the file you want me to grade.
For each of these problems, you may choose to solve the problem in either R or python. I have provided both R and python chunks. You should feel free to remove the unused chunk for each question.
(5 points) Multipart problems have only one code chunk. Please put your code under the comment corresponding to the part you are working on. This will help me to grade your work more efficiently.
Rules
You may use the textbook and your notes on this exam.
If you need to search for ‘how do I do X in Y language’, you may use Google or DuckDuckGo, but you must 1) document that you did the search, and 2) provide a link to the website you used to get a solution.
AI and LLM usage is strictly forbidden. Use of any unauthorized resources will result in a 0 on this exam.
You must be able to explain how any code you submit on this exam works.
- Oral exams based on your submissions will be held on Friday, May 16.
- You will be notified of the need for an oral exam by 8pm on Thursday, May 15.
- If you are notified that an oral exam is required, you must schedule a time for the exam on Thursday night.
- Oral exam times will be available on Friday at 30-minute intervals between 9am and 2pm.
If you get stuck, you may ask Dr. Vanderplas for the solution to the problem you are stuck on, at the cost of the points which would be awarded for that problem. This is designed to get you un-stuck and allow you to complete multi-part problems.
(5 points) Your submitted qmd file must compile without errors. Use `error=TRUE` in a chunk if it is supposed to return an error or if you cannot get the code to work properly.
Data
The data for this practice exam comes from TidyTuesday.
Ravelry describes itself as a social networking and organizational tool for knitters, crocheters, designers, spinners, weavers and dyers.
Setting Up
Packages
Load any additional packages you need for the rest of the exam in the setup chunks. I have started by loading some basic packages in R and Python. Python packages use the aliases that are consistently used in class and in the textbook.
import pandas as pd
import seaborn as sns
import seaborn.objects as so
import matplotlib.pyplot as plt
Load Data
yarn <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-10-11/yarn.csv')
yarn = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-10-11/yarn.csv')
<string>:1: DtypeWarning: Columns (0,13) have mixed types. Specify dtype option on import or set low_memory=False.
Data Codebook
variable | class | description |
---|---|---|
discontinued | logical | discontinued true/false |
gauge_divisor | double | gauge divisor - The number of inches that equal min_gauge to max_gauge stitches |
grams | double | Unit weight in grams |
id | double | id |
machine_washable | logical | machine washable true/false |
max_gauge | double | max gauge - The max number of stitches that equal gauge_divisor |
min_gauge | double | min gauge - The min number of stitches that equal gauge_divisor |
name | character | name |
permalink | character | permalink |
rating_average | double | rating average - The average rating out of 5 |
rating_count | double | rating count |
rating_total | double | rating total |
texture | character | texture - Texture free text |
thread_size | character | thread size |
wpi | double | wraps per inch |
yardage | double | yardage |
yarn_company_name | character | Yarn company name |
yarn_weight_crochet_gauge | logical | Yarn weight crochet gauge - Crochet gauge for the yarn weight category |
yarn_weight_id | double | Yarn weight ID - Identifier for the yarn weight category |
yarn_weight_knit_gauge | character | Yarn weight knit gauge - Knit gauge for the yarn weight category |
yarn_weight_name | character | Yarn weight name - Name for the yarn weight category |
yarn_weight_ply | double | Yarn weight ply - Ply for the yarn weight category |
yarn_weight_wpi | character | Yarn weight wraps per inch - Wraps per inch for the yarn weight category |
texture_clean | character | Texture clean - Texture with some light text cleaning |
Initial Exploration
- (5 points) Determine the number of rows and columns in the data, using appropriate commands.
How many of the recorded yarns are machine washable?
- (2 points) Use a simple calculation to answer this question. (Simple = you should not need any loops, custom functions, or if statements)
Hint: You may want to exclude NA values before you do your calculation.
- (3 points) Explain how your code works, including any changes in variable types that occur during the computation.
Your explanation goes here
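As a rough illustration of the kind of "simple calculation" meant here, the sketch below works on a small made-up data frame rather than the real `yarn` data. The names `toy`, `washable`, and `n_washable` and all of the values are hypothetical; it shows one possible approach in pandas, not the expected answer.

```python
import pandas as pd

# Toy stand-in for `yarn` (hypothetical values, not the real data).
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "machine_washable": [True, None, False, True],
})

# Number of rows and columns.
n_rows, n_cols = toy.shape

# Because of the missing value, pandas stores this column as `object`.
# Dropping the NA and casting to bool gives a True/False column;
# summing it then treats True as 1 and False as 0.
washable = toy["machine_washable"].dropna().astype(bool)
n_washable = int(washable.sum())

print(n_rows, n_cols, n_washable)
```

The type changes to describe in an explanation would be object to bool (after removing the missing value) and bool to integer (inside the sum).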
How many missing values are recorded for each variable?
- (5 points) Write a loop that will iterate through each column of the data set and calculate the number of missing values in that column. Store the answer in a new vector variable.
(5 points) List the column that has the: (Fill this in)
- Most NA values
- Fewest NA values
- Median NA values
# Write a loop that will iterate through each column of the data set and calculate the number of missing values in that column. Store the answer in a new vector variable.
# Which column has the most, fewest, and median NA values?
# Write a loop that will iterate through each column of the data set and calculate the number of missing values in that column. Store the answer in a new vector variable.
# Which column has the most, fewest, and median NA values?
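The loop described above might look something like the following in Python. This is a minimal sketch on a made-up data frame: the names `toy`, `na_counts`, and `na_by_col` and the values are assumptions for illustration, and a Python list stands in for the "vector".

```python
import pandas as pd

# Toy stand-in for `yarn` (hypothetical values, not the real data).
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "grams": [50.0, None, 100.0, None],
    "texture": [None, None, None, "plied"],
})

# Loop over the columns and store one missing-value count per column.
na_counts = []
for col in toy.columns:
    na_counts.append(int(toy[col].isna().sum()))

# Attach column names and sort so the extremes sit at the two ends.
na_by_col = pd.Series(na_counts, index=toy.columns).sort_values()
fewest = na_by_col.index[0]
most = na_by_col.index[-1]
# With an even number of columns this picks the upper of the two middle ones.
median_col = na_by_col.index[len(na_by_col) // 2]

print(na_by_col)
```

If several columns tie for the same count, the order among them after sorting is arbitrary, so the "most" and "fewest" columns are only unique when there are no ties.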
Data Transformations and Summaries
Consider the column `yarn_company_name`.
- (3 points) Count up the number of times that each company appears in the data set.
- (2 points) Store the data from only the 10 most common brands in the variable `common_yarn`.
- (5 points) Is this data in tidy form? Explain why or why not.
- (5 points) Sketch the process to create a data set from `common_yarn` that has three columns: `yarn_company_name`, `yarn_weight_name`, and `n`, the number of entries from each company with that yarn weight.
- (5 points) Execute the transformation you planned out in 4.4.
- (5 points) Create a plot of the brand and yarn weight using a tile geometric object, mapping `n` to the fill color of the tile. If you cannot figure out the tile plot, you can use a scatterplot to make a similar (though less easily readable) plot, mapping point size to `n` instead.
# Count up the number of times that each company appears in the data set.
# Store the data from only the 10 most common brands in the variable `common_yarn`.
# Execute the transformation you planned out in 4.4 to create a data set from `common_yarn` that has three columns: `yarn_company_name`, `yarn_weight_name`, and `n`, the number of entries from each company with that yarn weight.
# Create a plot of the brand and yarn weight using a tile geometric object, mapping `n` to the fill color of the tile. If you cannot figure out the tile plot, you can use a scatterplot to make a similar (though less easily readable) plot, mapping point size to `n` instead.
# Count up the number of times that each company appears in the data set.
# Store the data from only the 10 most common brands in the variable `common_yarn`.
# Execute the transformation you planned out in 4.4 to create a data set from `common_yarn` that has three columns: `yarn_company_name`, `yarn_weight_name`, and `n`, the number of entries from each company with that yarn weight.
# Create a plot of the brand and yarn weight using a tile geometric object, mapping `n` to the fill color of the tile. If you cannot figure out the tile plot, you can use a scatterplot to make a similar (though less easily readable) plot, mapping point size to `n` instead.
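The counting, filtering, and summarizing steps listed in the comments above could be sketched as follows. The example uses a tiny made-up data frame and keeps the top 2 companies instead of the top 10 so the result is easy to check by hand; `toy`, `company_counts`, `top_names`, and `weight_counts` are hypothetical names. Plotting is left out.

```python
import pandas as pd

# Toy stand-in for `yarn` (hypothetical values, not the real data).
toy = pd.DataFrame({
    "yarn_company_name": ["A", "A", "A", "B", "B", "C"],
    "yarn_weight_name": ["DK", "DK", "Aran", "DK", "Lace", "DK"],
})

# Count how often each company appears (sorted from most to least common).
company_counts = toy["yarn_company_name"].value_counts()

# Keep only rows from the most common companies (2 here, 10 on the exam).
top_names = company_counts.head(2).index
common_yarn = toy[toy["yarn_company_name"].isin(top_names)]

# One row per company/weight combination, with the count stored in `n`.
weight_counts = (
    common_yarn
    .groupby(["yarn_company_name", "yarn_weight_name"])
    .size()
    .reset_index(name="n")
)

print(weight_counts)
```

The resulting table has one row per company and yarn weight pair, which is the layout a tile plot needs: one row per tile, with `n` available for the fill.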
Data Cleaning
Consider the columns `texture` and `texture_clean`.
- (2 points) For how many rows of `yarn` do `texture` and `texture_clean` differ?
Your answer goes here, supported by code below
- (3 points) Examine the rows where `texture` and `texture_clean` differ. What function(s) do you think were used to transform `texture` to `texture_clean`?
Your answer goes here (supporting code should be in the chunk below)
- (5 points) Describe/sketch the process you would use to identify the top 15 common descriptor words/phrases.
Your answer goes here.
- (5 points) Create a variable, `top_descriptors`, that contains the top 15 common descriptor words/phrases.
- (5 points) Describe/sketch the process you would use to determine whether a yarn texture description contains one of the keywords in your `top_descriptors` list.
Your answer goes here
- (5 points) Write a function corresponding to the operations described in 5.5, `is_top_descriptor`. Your function should take a texture description and return `True` if one of the `top_descriptors` is present, and `False` if none of the `top_descriptors` are present.
- (5 points) Create a new column in `yarn`, `common_texture`, which is `True` if `texture_clean` contains one of the popular keywords and `False` otherwise.
# For how many rows of `yarn` do `texture` and `texture_clean` differ?
# Examine the rows where `texture` and `texture_clean` differ. What function(s) do you think were used to transform `texture` to `texture_clean`?
# Create a variable, `top_descriptors`, that contains the top 15 common descriptor words/phrases.
# Write a function corresponding to the operations described in 5.5, `is_top_descriptor`. Your function should take a texture description and return `True` if one of the `top_descriptors` is present, and `False` if none of the `top_descriptors` are present.
# Create a new column in `yarn`, `common_texture`, which is `True` if `texture_clean` contains one of the popular keywords and `False` otherwise.
# For how many rows of `yarn` do `texture` and `texture_clean` differ?
# Examine the rows where `texture` and `texture_clean` differ. What function(s) do you think were used to transform `texture` to `texture_clean`?
# Create a variable, `top_descriptors`, that contains the top 15 common descriptor words/phrases.
# Write a function corresponding to the operations described in 5.5, `is_top_descriptor`. Your function should take a texture description and return `True` if one of the `top_descriptors` is present, and `False` if none of the `top_descriptors` are present.
# Create a new column in `yarn`, `common_texture`, which is `True` if `texture_clean` contains one of the popular keywords and `False` otherwise.
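To make the steps in the comments above concrete, here is a small plain-Python sketch on made-up texture strings. It rests on several assumptions that are not confirmed by the data: that cleaning amounts to trimming whitespace and lowercasing, that descriptors are separated by commas, and that a simple substring check is an acceptable notion of "contains". The top 2 descriptors are used instead of the top 15, and the second argument to `is_top_descriptor` is a convenience added for this sketch.

```python
from collections import Counter

# Made-up texture values (hypothetical, not taken from the real data).
textures = ["Plied", "plied, smooth", "Smooth ", None, "single ply", "plied"]

# Assumed cleaning: strip whitespace and lowercase, leaving missing values alone.
cleaned = [t.strip().lower() if t is not None else None for t in textures]

# Count rows where the raw and cleaned versions differ.
n_differ = sum(
    1 for raw, clean in zip(textures, cleaned)
    if raw is not None and raw != clean
)

# Split each description on commas (an assumption) and tally the pieces.
counts = Counter(
    piece.strip()
    for t in cleaned if t is not None
    for piece in t.split(",")
    if piece.strip()
)
top_descriptors = [word for word, _ in counts.most_common(2)]


def is_top_descriptor(description, descriptors):
    """Return True if any descriptor occurs as a substring of the description."""
    if description is None:
        return False
    text = description.lower()
    return any(d in text for d in descriptors)


common_texture = [is_top_descriptor(t, top_descriptors) for t in cleaned]
print(n_differ, top_descriptors, common_texture)
```

Substring matching is the simplest choice but can over-match (a short keyword may appear inside a longer, unrelated word), so matching on whole comma-separated pieces would be a stricter alternative.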
Data Transformation
Consider the variable `yarn_weight_wpi`.
- (5 points) Print out the unique values for `yarn_weight_wpi` in the `yarn` dataset.
- (5 points) Describe a process for converting this variable to a representative integer. What assumptions or simplifying decisions would you make?
My answer is on page ___ of my scratch paper
The steps to convert the variable to an integer are…
- (5 points) Write a function, `convert_weight_int(x)`, that takes a character string describing the weight and converts it to a representative integer.
- (5 points) Use your function to create a new column in yarn, `yarn_weight_wpi_num`.
- (2 points) What is the mean yarn weight for Bernat yarns?
If you had trouble with 6.4, you can use `prob6pt4.csv` to complete this problem.
- (5 points) Remove any rows with an NA for `yarn_weight_wpi_num`.
If you had trouble with 6.4, you can use `prob6pt4.csv` to complete this problem.
# Print out the unique values for `yarn_weight_wpi` in the `yarn` dataset.
# Write a function, `convert_weight_int(x)` that takes a character string describing the weight and converts it to a representative integer.
# Use your function to create a new column in yarn, `yarn_weight_wpi_num`.
# What is the mean yarn weight for Bernat yarns?
# If you had trouble with 6.4, you can use `prob6pt4.csv` to complete this problem.
# Remove any rows with an NA for `yarn_weight_wpi_num`.
# If you had trouble with 6.4, you can use `prob6pt4.csv` to complete this problem.
# Print out the unique values for `yarn_weight_wpi` in the `yarn` dataset.
# Write a function, `convert_weight_int(x)` that takes a character string describing the weight and converts it to a representative integer.
# Use your function to create a new column in yarn, `yarn_weight_wpi_num`.
# What is the mean yarn weight for Bernat yarns?
# If you had trouble with 6.4, you can use `prob6pt4.csv` to complete this problem.
# Remove any rows with an NA for `yarn_weight_wpi_num`.
# If you had trouble with 6.4, you can use `prob6pt4.csv` to complete this problem.
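One possible shape for the conversion described in the comments above is sketched below. It assumes, without having checked the real data, that `yarn_weight_wpi` holds strings such as `"9-11"`, `"14+"`, or `"7"`. The simplifying decisions are to pull out every whole number in the string and return their midpoint, rounded down, and to return `None` when no number is present. The data frame `toy` and its values are made up.

```python
import re

import pandas as pd


def convert_weight_int(x):
    """Convert a wraps-per-inch string to one representative integer.

    Assumed rule: take all whole numbers in the string and return their
    mean, rounded down. Non-strings and strings without digits give None.
    """
    if not isinstance(x, str):
        return None
    numbers = [int(n) for n in re.findall(r"\d+", x)]
    if not numbers:
        return None
    return sum(numbers) // len(numbers)


# Toy stand-in for `yarn` (hypothetical values, not the real data).
toy = pd.DataFrame({
    "yarn_company_name": ["Bernat", "Bernat", "Other", "Other"],
    "yarn_weight_wpi": ["9-11", "7", None, "14+"],
})

# New numeric column; missing results become NaN, so the column is float.
toy["yarn_weight_wpi_num"] = pd.to_numeric(
    toy["yarn_weight_wpi"].map(convert_weight_int)
)

# Mean for one company (NaN values are skipped by default).
is_bernat = toy["yarn_company_name"] == "Bernat"
bernat_mean = toy.loc[is_bernat, "yarn_weight_wpi_num"].mean()

# Drop rows where the conversion produced a missing value.
complete = toy.dropna(subset=["yarn_weight_wpi_num"])

print(bernat_mean, len(complete))
```

Other defensible choices include taking the lower bound of a range or rounding the midpoint up; whichever rule is used should be stated alongside the function.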