2024 Midterm

Week07
Exam
Week07
Author

Your Name

Published

March 7, 2024

Download the starter qmd file here

Instructions

Completing the exam in both languages will earn you 20% extra credit on the exam and will also prove that you’re kicking butt in this class.

  1. Create a folder for this exam on your computer.

  2. Save this file into that folder.

  3. When you are finished with the exam, compile the document. Make sure all of your code runs!

  4. Upload a zip file of your work and any necessary file dependencies to Canvas.

Ground Rules

  • You may use the textbook and the internet (but the normal rules apply - you must be able to explain your answer!)

  • You may NOT confer with other people or AI entities - including posting on StackOverflow, Reddit, etc.

  • You may ask clarifying questions of Dr. Vanderplas by email/zoom or in person

  • You may use R or Python for any of these tasks, but your code must be reproducible - I must be able to run your quarto file on my machine. I have provided R chunks in the correct locations in this file - change them to Python if you wish.

  • You should have at least one code chunk for each numbered task below.

Dataset

The data for this exam are taken from the happy dataset in the classdata R package. I’ve exported the data to CSV for you at this link: https://raw.githubusercontent.com/srvanderplas/stat151-homework/main/happy.csv

Description

The data is a small sample of variables related to happiness from the general social survey (GSS). The GSS is a yearly cross-sectional survey of Americans, run since 1972. We combine data for more than 25 years to yield over 60 thousand observations, and of the over 5,000 variables, we select some variables that are related to happiness:

Format

A data frame with 62466 rows and 11 variables

Details

  • year. year of the response, 1972 to 2018.

  • age. age in years: 18–89 (89 stands for all 89 year olds and older).

  • degree. highest education: lt high school, high school, junior college, bachelor, graduate.

  • finrela. how is your financial status compared to others: far below, below average, average, above average, far above.

  • happy. happiness: very happy, pretty happy, not too happy.

  • health. health: excellent, good, fair, poor.

  • marital. marital status: married, never married, divorced, widowed, separated.

  • sex. sex: female, male.

  • polviews. from extremely conservative to extremely liberal.

  • partyid. party identification: strong republican, not str republican, ind near rep, independent, ind near dem, not str democrat, strong democrat, other party.

  • wtssall. probability weight. 0.39–8.74

Tasks

  1. Read in the data and create a data frame that you will work with for this exam.

  2. Create a new column variable, decade, in your data frame.

  1. You will need to take the response year and truncate it to the decade, so that 1972 becomes 1970 and 1989 becomes 1980. You can use a series of logical statements if you want, but it may be more effective to find a numerical function or combination of functions that will perform the operation you want.
    floor() and math.floor() in R and python respectively are good places to start.
  2. Create a scatterplot (use geom_point) of your happy year vs decade to show that your approach succeeded.
  1. Create a new data set by iterating through each year to find the proportion of people who are very happy. Use a for loop. Using your new data frame, plot the proportion of very happy people over time.
    Note: You may have to pass an argument to the mean function to tell it to exclude missing values from the calculation, such as na.rm or skipna. Or, you can remove the NAs from happy using a function like na.omit or dropna, but be careful to only drop rows with an NA in variables we care about, like happy or year.

The code below provides an example of how to create a summary dataset and handle NAs in R and python. You may modify this code to help you answer part 3.

# Create sample data
df <- data.frame(x = c(rnorm(100, 10), rnorm(100, 20)),
                 y = rep(c("Group 1", "Group 2"), each = 100))

df_means <- data.frame(y = NULL, mean = NULL)

# For each y group, what is the mean of x?
for (i in unique(df$y)) {
  sub_df <- subset(df, y == i)
  df_means <- rbind(df_means, 
                    data.frame(y = i, mean = mean(sub_df$x, na.rm = T)))
}

df_means
##         y     mean
## 1 Group 1 10.07680
## 2 Group 2 19.90693

# Demonstration of na.rm
mean(c(NA, 1, 2, 3), na.rm = T) # Remove NAs
## [1] 2
mean(c(NA, 1, 2, 3), na.rm = F) # Don't remove NAs
## [1] NA
import pandas as pd
import numpy as np

# Create a new data frame
df = pd.DataFrame({
  'y': np.repeat(['Group1', 'Group2'], (100, 100)), 
  'x': np.concatenate((np.random.normal(loc = 10, size = 100), np.random.normal(loc = 12, size = 100)), axis = None)
  })

# Create an empty dataframe
df_means = pd.DataFrame(columns = ['y', 'mean'])

# For each age, how many values?
for i in np.unique(df.y):
  # Create the subset
  df_sub = df.loc[df.y == i]
  # Drop NAs from the data frame
  # This step isn't necessary because mean() uses skipna = T by default
  # df_sub = df_sub.dropna(subset = ['x', 'y']) 
  # Add a new row to the end of df_means
  df_means.loc[len(df_means.index)] = [i, df_sub.x.mean()]


# Demonstrating skipna parameter of mean
pd.DataFrame({'y':[1, 2, 3, np.nan]}).y.mean(skipna = True)
## 2.0
pd.DataFrame({'y':[1, 2, 3, np.nan]}).y.mean(skipna = False)
## nan

Solutions

Copy the list of tasks here and put your code for each task below the task description. Your code should be well commented.

If you cannot figure out how to do a task, provide a list of steps that you think are necessary to accomplish that task and provide as much detail in terms of how you would accomplish that in code as possible.

If your code does not compile, add , error = T to the chunk header so that the rest of the document will still compile.

Read in the data and create a data frame

Create a new variable, decade

Happy People per Year