Stat 151 Final Exam

Week15

Exam

Author

Your Name

Published

2024-05-08

Instructions

This exam is due at 6pm on May 10, 2024. Your exam MUST BE PUSHED TO GITHUB CLASSROOM by 6pm. Please double-check your GitHub repository to ensure that the file on GitHub is the file you want me to grade.
For each of these problems, you may choose to solve the problem in either R or python.
The chunks I’ve provided are R chunks, but you are free to change the code type to python; I just want to ensure that your answers are where I expect them to be.

Rules

You may use the textbook, your notes, and google on this exam, but you may not post this exam and ask for help on any site.
It is ok to google, for instance, how to convert a string to a list of characters, but it is not ok to google how to solve the entire question. Please ask if you are concerned about any possible edge cases.
You may NOT confer with other people or AI entities - including posting on StackOverflow, Reddit, etc. Pre-existing posts on SO are fair game, though.
You must be able to explain how any code you submit on this exam works. Oral exams based on your submissions will be held M-W May 13-15 (i.e., during finals week).
You may ask clarifying questions of Dr. Vanderplas or Muxin Hua by email/zoom or in person
There are a total of 70 points available on this exam.
If you get stuck, you may email Dr. Vanderplas for the solution to the problem you are stuck on, at the cost of the points which would be awarded for that problem. Please specify the part and question number if you decide to use this option. This is designed to get you un-stuck and allow you to complete problems.
(6 points) Your submitted qmd file must compile without errors.
Use error=TRUE in a chunk if it is supposed to return an error (for instance, if you are demonstrating error handling).

Wordle!

Wordle is a web-based word game created by Josh Wardle and currently owned by the New York Times. Players get 6 attempts to guess a five-letter word. On each attempt, letters that are in the solution but not in the correct place are indicated with yellow tiles, and letters that are in the correct place are indicated with green tiles.

You can play wordle here.

In this assignment, you will work with two data sets:

wordle-words.csv, which contains the valid guesses, the answers for all 2315 days of predetermined wordle solutions in the original version of the game, and the Google N-gram word frequency of each word between 1970 and 2019. This data was modified from the csv provided by https://github.com/steve-kasica/wordle-words/.
guess-words.csv, which contains an additional 1881 valid guesses which were added when the New York Times took over the game in January of 2022. These words were obtained from this list and I used fetch.py to obtain the N-gram word frequency for these words.

Because of changes in the NYT version of wordle, 7 of the original 2315 solutions have been removed from the game. In order to additionally safeguard against any spoilers for wordle players, I have shuffled the solution days in this data set so that all actual solutions are still marked as such, but the days reported in this data set are not the actual days on which the words will appear. I used the file shuffle-answers.R to perform this operation (you don’t need to use this file, but I like to be transparent).
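
To illustrate what shuffling the solution days means, here is a minimal Python sketch on a toy list of records. It is not the contents of shuffle-answers.R; the helper name shuffle_solution_days, the seed, and the tuple layout are assumptions made for illustration. The idea is that day values are permuted among solution words only, so every solution keeps a day and every non-solution keeps a missing day.

```python
import random

# Toy records in the shape of wordle-words.csv: (word, day); day is None
# for words that are valid guesses but never solutions.
records = [
    ("exact", 139),
    ("exalt", 449),
    ("exams", None),
    ("query", 1495),
    ("queyn", None),
]

def shuffle_solution_days(rows, seed=0):
    """Permute the day values among solution rows only (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    days = [day for _, day in rows if day is not None]
    rng.shuffle(days)
    shuffled_days = iter(days)
    # Solutions receive the next shuffled day; non-solutions stay None.
    return [
        (word, next(shuffled_days) if day is not None else None)
        for word, day in rows
    ]

shuffled = shuffle_solution_days(records)
```

After the shuffle, whether a word is a solution can still be read from whether its day is missing, but the day itself no longer reveals when the word appears.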

Part 1: Read in and Clean the Data

(6 points total)

(2 points) Read in the file wordle-words.csv and store the data in a variable named orig. Read in the file guess-words.csv and store the data in a variable named guess.

(4 points) Is the orig data in tidy form? Explain - if it is, list the features of tidy data and how this data set meets the requirements. If not, list the features of tidy data this data set violates, and explain what a tidy form of the data would look like.

Part 2: Functions

(24 points total)

In this part of the exam, use only the orig words.

You may need to write a for loop to check whether each letter is (or is not) in the vector of words. If you start with a vector that has 0 for each word in the list, you can use addition operations to ensure that only words with all of the provided letters are returned.
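
The counting idea in this hint can be sketched in Python on a toy word list. This is only an illustration of the technique, assuming the required letters are distinct; the variable names are made up for the example and the real data would come from the word column.

```python
# Toy word list; in the exam the words would come from the `word` column.
words = ["adieu", "aurei", "uraei", "query", "exact"]
required = ["a", "e", "i", "u"]

# Start with a count of 0 for every word.
hits = [0] * len(words)

# For each required letter, add 1 to the count of every word containing it.
for letter in required:
    for i, word in enumerate(words):
        if letter in word:
            hits[i] += 1

# A word is kept only if it matched every required letter.
matches = [word for word, n in zip(words, hits) if n == len(required)]
print(matches)
```

A word whose count is lower than the number of required letters is missing at least one of them, which is why comparing the count to len(required) works as the filter.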

In R, the letters object contains all 26 valid lowercase Latin letters.
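
If you are working in Python instead, a close counterpart is string.ascii_lowercase from the standard library. The sketch below also shows how an R slice such as letters[1:19] would translate, since R indexes from 1 and includes both endpoints.

```python
import string

# Python counterpart of R's built-in `letters` vector.
letters = list(string.ascii_lowercase)

# R indexes from 1 and includes both endpoints, so letters[1:19] in R
# corresponds to letters[0:19] in Python ("a" through "s").
first_19 = letters[:19]
```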

(4 points) Write a function named filter_yellow(words, x) that takes the words data and a vector of lowercase letters, x, which are known to be in the wordle solution, and returns all possible words containing every one of those letters.

filter_yellow(orig, c("a", "e", "i", "u")) 
#        word   occurrence day
# 110   adieu 2.186597e-06  NA
# 653   aurei 7.694299e-08  NA
# 11968 uraei 1.093829e-08  NA

filter_yellow(orig, c("q", "u", "e", "y")) 
#       word   occurrence  day
# 8779 query 8.135131e-06 1495
# 8782 queyn 5.189541e-10   NA
# 8783 queys 3.685211e-10   NA
# 8818 quyte 5.250099e-09   NA

(2 points) Modify filter_yellow(words, x) so that the function will return a useful error if the words data frame does not have a column named word.
(2 points) Modify filter_yellow(words, x) so that the function will return a warning if any characters provided are not letters (a-z or A-Z), and then will drop the non-letter characters from the vector before returning results using the valid letters.
(2 points) Modify filter_yellow(words, x) so that the function can accept uppercase or lowercase letters, but will convert the letters to lowercase automatically without an error.
(4 points) Write a function named filter_black(words, y) that takes the words data and a vector of letters, y, which aren’t in the wordle solution and returns all possible words which do not contain any of those letters. Use the same error handling code that you added in parts 2-4.

filter_black(orig, letters[1:19])
# [1] "tutty" "vutty"

filter_black(orig, c("S", "T", "C", "H", "Y", "R", "P", "L", "K", "G", "N", "M", "D", "F", "B", "W"))
# [1] "ajiva" "aquae" "avize" "jaxie" "jeeze" "juvie" "ouija"
# [8] "ozzie" "qajaq" "queue" "zoaea" "zoeae" "zooea"

(4 points) Write a function named filter_green(words, z) that takes the words data and a string of 5 letters, z (with unknowns indicated by _), and returns all words which have the specified letters in the specified position. Add error handling that identifies invalid characters and flags strings that are not of length 5.

filter_green(orig, z = "EXA__")
#    word   occurrence day
# 1 exact 2.450183e-05 139
# 2 exalt 5.109440e-07 449
# 3 exams 2.825810e-06  NA

filter_green(orig, z = "summer")
# Error in filter_green(orig, z = "summer") : nchar(z) == 5 is not TRUE
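
One way to think about the positional clue is as a regular expression in which each _ matches any single character. The Python sketch below shows that idea together with the two checks (length and valid characters) on a toy word list. The helper name green_pattern and its error messages are assumptions for illustration, not a required interface.

```python
import re

def green_pattern(z):
    """Turn a 5-character clue such as "EXA__" into a compiled regex (sketch)."""
    if len(z) != 5:
        raise ValueError(f"expected 5 characters, got {len(z)}: {z!r}")
    z = z.lower()
    if not re.fullmatch(r"[a-z_]{5}", z):
        raise ValueError(f"only letters and '_' are allowed: {z!r}")
    # Each '_' becomes '.', which matches any single character.
    return re.compile(z.replace("_", "."))

# Toy word list standing in for the `word` column.
words = ["exact", "exalt", "exams", "query"]
pattern = green_pattern("EXA__")
matches = [w for w in words if pattern.fullmatch(w)]
```

Here matches holds the three words starting with "exa", and calling green_pattern("summer") raises a ValueError because the clue has 6 characters.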

(4 points) Use your functions to determine what the remaining valid words are in the following situation. Based on the word occurrence frequency, which would you guess first?
Hint: You should be able to chain your functions together with pipes if you’ve set things up as instructed.

(2 points) How would you improve your functions to make your wordle game better?

Part 3: Assessing Word Frequency

(12 points total)

(2 points) Join the two data frames of valid wordle words to create a data frame named all_words of all valid words.

(2 points) Create a new column in your data frame, solution, that is TRUE for words which are solutions to wordle puzzles, and FALSE for words which are not solutions but are valid guesses.

(4 points) Using ggplot2 or plotnine, create side-by-side boxplots which show the frequency of words by solution. You may want to use a transformation (e.g. scale_y_log10()) to make these values easier to visualize.
If you cannot figure this part out, email Dr. Vanderplas for code at the cost of a deduction of the 4 points you would get for this question.
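
To see why a log transformation may help, consider a few occurrence values taken from the example output earlier in this exam. The Python sketch below is only a numeric illustration, not plotting code, and it assumes every occurrence value is strictly positive (a zero would need separate handling before taking a logarithm).

```python
import math

# Occurrence values copied from the example output earlier in this exam.
occurrence = {
    "query": 8.135131e-06,
    "quyte": 5.250099e-09,
    "queyn": 5.189541e-10,
    "queys": 3.685211e-10,
}

# On the raw scale the largest value dwarfs the rest...
ratio = max(occurrence.values()) / min(occurrence.values())

# ...while log10 maps each value to its order of magnitude.
log_occurrence = {word: math.log10(freq) for word, freq in occurrence.items()}
spread = max(log_occurrence.values()) - min(log_occurrence.values())
```

The raw values differ by a factor of more than ten thousand, while on the log10 scale the same values span only a little over four units, which is much easier to show on a single axis.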

(2 points) Make sure your plot has a title and descriptive axis labels.

(2 points) Write at least 2 sentences interpreting the chart you created in question 3.

Part 4: Decoding Wordle-Bot

(22 points total)

The New York Times added Wordle-Bot, which helps analyze your wordle guesses and critique your approach, when it acquired the game from its original creator.

This portion of the exam will help you create some of the data you would need to build your own wordle-bot clone.

(2 points) Starting with all_words, the data frame you created for Part 3, create a data frame named answers that contains only words which are wordle solutions.

(4 points) Split the letters of the words in the answers data frame into separate columns, so that there is a column for all of the first letters, all of the 2nd letters, and so on. You may find it helpful to first mutate a new column and then use a function such as unnest_wider() to get the data into a form that you can work with. A few lines of the desired data structure are provided below.
If you cannot figure this part out, email Dr. Vanderplas for code at the cost of a deduction of the 4 points you would get for this question.

# A tibble: 2,315 × 9
   word    occurrence   day X1    X2    X3    X4    X5    solution
   <chr>        <dbl> <int> <chr> <chr> <chr> <chr> <chr> <lgl>   
 1 aback 0.00000113    1628 a     b     a     c     k     TRUE    
 2 abase 0.0000000618   459 a     b     a     s     e     TRUE    
 3 abate 0.00000104    1781 a     b     a     t     e     TRUE    
 4 abbey 0.00000143     841 a     b     b     e     y     TRUE
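
For readers working in Python, the same wide layout can be sketched with plain dictionaries, using the four words shown above. The helper name split_letters is an assumption for illustration; the column names X1 to X5 follow the table.

```python
# The four example words from the table above.
words = ["aback", "abase", "abate", "abbey"]

def split_letters(word):
    """Return a dict with one entry per position: X1, X2, ..., X5."""
    return {f"X{i}": ch for i, ch in enumerate(word, start=1)}

# One row per word, with the letters spread across positional columns.
rows = [{"word": w, **split_letters(w)} for w in words]
```

A list of dictionaries like rows can be passed to pandas.DataFrame(rows) if you prefer to continue with a data frame.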

(4 points) Describe the form of the data you would need in order to compare the relative frequency of each letter in the alphabet by word position (e.g. 1st letter, 2nd letter, 3rd letter…). Feel free to upload and include a sketch using markdown syntax, if it is easier to sketch out the form of the data you would need.
(4 points) Write code that will take you from the form of the data you got in question 2 to the structure you described in question 3.

(2 points) Create a new variable named letter_type that has values vowel and consonant. For the purposes of this question, ‘y’ should be treated as a vowel.
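
The classification rule itself is small enough to sketch in Python. The names VOWELS and letter_type below are illustrative assumptions; the only rule taken from the question is that 'y' counts as a vowel.

```python
VOWELS = set("aeiouy")  # 'y' counts as a vowel for this question

def letter_type(letter):
    """Classify a single letter as 'vowel' or 'consonant'."""
    return "vowel" if letter.lower() in VOWELS else "consonant"

# Classify each distinct letter of an example word.
types = {ch: letter_type(ch) for ch in "abbey"}
```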

(4 points) Create a bar chart with letter on the x-axis and frequency on the y-axis. Facet by position so that your chart has 5 rows and a single column. The bars should be shaded according to letter_type, and your plot should have informative axis labels and a useful title.

(2 points) How would you improve this chart so that viewers could more clearly see which letters correspond to each position? Justify your answer, using some of the information on chart perception we discussed in this class.