# Create sample data
<- data.frame(x = c(rnorm(100, 10), rnorm(100, 20)),
df y = rep(c("Group 1", "Group 2"), each = 100))
<- data.frame(y = NULL, mean = NULL)
df_means
# For each y group, what is the mean of x?
for (i in unique(df$y)) {
<- subset(df, y == i)
sub_df <- rbind(df_means,
df_means data.frame(y = i, mean = mean(sub_df$x, na.rm = T)))
}
df_means## y mean
## 1 Group 1 10.07680
## 2 Group 2 19.90693
# Demonstration of na.rm
mean(c(NA, 1, 2, 3), na.rm = T) # Remove NAs
## [1] 2
mean(c(NA, 1, 2, 3), na.rm = F) # Don't remove NAs
## [1] NA
2024 Midterm
Download the starter qmd file here
Instructions
Completing the exam in both languages will earn you 20% extra credit on the exam and will also prove that you’re kicking butt in this class.
Create a folder for this exam on your computer.
Save this file into that folder.
When you are finished with the exam, compile the document. Make sure all of your code runs!
Upload a zip file of your work and any necessary file dependencies to Canvas.
Ground Rules
You may use the textbook and the internet (but the normal rules apply - you must be able to explain your answer!)
You may NOT confer with other people or AI entities - including posting on StackOverflow, Reddit, etc.
You may ask clarifying questions of Dr. Vanderplas by email/zoom or in person
You may use R or Python for any of these tasks, but your code must be reproducible - I must be able to run your quarto file on my machine. I have provided R chunks in the correct locations in this file - change them to Python if you wish.
You should have at least one code chunk for each numbered task below.
Dataset
The data for this exam are taken from the happy
dataset in the classdata
R package. I’ve exported the data to CSV for you at this link: https://raw.githubusercontent.com/srvanderplas/stat151-homework/main/happy.csv
Description
The data is a small sample of variables related to happiness from the general social survey (GSS). The GSS is a yearly cross-sectional survey of Americans, run since 1972. We combine data for more than 25 years to yield over 60 thousand observations, and of the over 5,000 variables, we select some variables that are related to happiness:
Format
A data frame with 62466 rows and 11 variables
Details
year. year of the response, 1972 to 2018.
age. age in years: 18–89 (89 stands for all 89 year olds and older).
degree. highest education: lt high school, high school, junior college, bachelor, graduate.
finrela. how is your financial status compared to others: far below, below average, average, above average, far above.
happy. happiness: very happy, pretty happy, not too happy.
health. health: excellent, good, fair, poor.
marital. marital status: married, never married, divorced, widowed, separated.
sex. sex: female, male.
polviews. from extremely conservative to extremely liberal.
partyid. party identification: strong republican, not str republican, ind near rep, independent, ind near dem, not str democrat, strong democrat, other party.
wtssall. probability weight. 0.39–8.74
Tasks
Read in the data and create a data frame that you will work with for this exam.
Create a new column variable,
decade
, in your data frame.
- You will need to take the response year and truncate it to the decade, so that 1972 becomes 1970 and 1989 becomes 1980. You can use a series of logical statements if you want, but it may be more effective to find a numerical function or combination of functions that will perform the operation you want.
floor()
andmath.floor()
in R and python respectively are good places to start. - Create a scatterplot (use
geom_point
) of your happy year vs decade to show that your approach succeeded.
- Create a new data set by iterating through each year to find the proportion of people who are very happy. Use a for loop. Using your new data frame, plot the proportion of very happy people over time.
Note: You may have to pass an argument to the mean function to tell it to exclude missing values from the calculation, such asna.rm
orskipna
. Or, you can remove the NAs from happy using a function likena.omit
ordropna
, but be careful to only drop rows with an NA in variables we care about, likehappy
oryear
.
The code below provides an example of how to create a summary dataset and handle NAs in R and python. You may modify this code to help you answer part 3.
import pandas as pd
import numpy as np
# Create a new data frame
= pd.DataFrame({
df 'y': np.repeat(['Group1', 'Group2'], (100, 100)),
'x': np.concatenate((np.random.normal(loc = 10, size = 100), np.random.normal(loc = 12, size = 100)), axis = None)
})
# Create an empty dataframe
= pd.DataFrame(columns = ['y', 'mean'])
df_means
# For each age, how many values?
for i in np.unique(df.y):
# Create the subset
= df.loc[df.y == i]
df_sub # Drop NAs from the data frame
# This step isn't necessary because mean() uses skipna = T by default
# df_sub = df_sub.dropna(subset = ['x', 'y'])
# Add a new row to the end of df_means
len(df_means.index)] = [i, df_sub.x.mean()]
df_means.loc[
# Demonstrating skipna parameter of mean
'y':[1, 2, 3, np.nan]}).y.mean(skipna = True)
pd.DataFrame({## 2.0
'y':[1, 2, 3, np.nan]}).y.mean(skipna = False)
pd.DataFrame({## nan
Solutions
Copy the list of tasks here and put your code for each task below the task description. Your code should be well commented.
If you cannot figure out how to do a task, provide a list of steps that you think are necessary to accomplish that task and provide as much detail in terms of how you would accomplish that in code as possible.
If your code does not compile, add , error = T
to the chunk header so that the rest of the document will still compile.