filter_yellow(orig, c("a", "e", "i", "u"))
## Error in filter_yellow(orig, c("a", "e", "i", "u")): could not find function "filter_yellow"
# word occurrence day
# 110 adieu 2.186597e-06 NA
# 653 aurei 7.694299e-08 NA
# 11968 uraei 1.093829e-08 NA
filter_yellow(orig, c("q", "u", "e", "y"))
## Error in filter_yellow(orig, c("q", "u", "e", "y")): could not find function "filter_yellow"
# word occurrence day
# 8779 query 8.135131e-06 1495
# 8782 queyn 5.189541e-10 NA
# 8783 queys 3.685211e-10 NA
# 8818 quyte 5.250099e-09 NA
Stat 151 Final Exam
Instructions
This exam is due at 6pm on May 10, 2024. Your exam MUST BE PUSHED TO GITHUB CLASSROOM by 6pm. Please double-check your github repository to ensure that the file that is on github is the file you want me to grade.
For each of these problems, you may choose to solve the problem in either R or python.
The chunks I’ve provided are R chunks, but you are free to change the code type to python; I just want to ensure that your answers are where I expect them to be.
Rules
You may use the textbook, your notes, and google on this exam, but you may not post this exam and ask for help on any site.
It is ok to google, for instance, how to convert a string to a list of characters, but it is not ok to google how to solve the entire question. Please ask if you are concerned about any possible edge cases. You may NOT confer with other people or AI entities, including by posting on StackOverflow, Reddit, etc. Pre-existing posts on SO are fair game, though.
You must be able to explain how any code you submit on this exam works. Oral exams based on your submissions will be held M-W May 13-15 (i.e. during finals week).
You may ask clarifying questions of Dr. Vanderplas or Muxin Hua by email/zoom or in person.
There are a total of 70 points available on this exam.
If you get stuck, you may email Dr. Vanderplas for the solution to the problem you are stuck on, at the cost of the points which would be awarded for that problem. Please specify the part and question number, if you decide to use this option. This is designed to get you un-stuck and allow you to complete problems
(6 points) Your submitted qmd file must compile without errors. Use error=TRUE in a chunk if it is supposed to return an error (for instance, if you are demonstrating error handling).
Wordle!
Wordle is a web-based word game created by Josh Wardle and currently owned by the New York Times. Players get 6 attempts to guess a five-letter word. On each attempt, letters that are in the solution but not in the correct place are indicated with yellow tiles, and letters that are in the correct place are indicated with green tiles.
You can play wordle here.
In this assignment, you will work with two data sets:
wordle-words.csv, which contains the valid guesses, answers for all 2315 days of predetermined wordle solutions in the original version of the game, and google N-gram word frequency of each word between 1970 and 2019. This data was modified from the csv provided by https://github.com/steve-kasica/wordle-words/.
guess-words.csv, which contains an additional 1881 valid guesses which were added when the New York Times took over the game in January of 2022. These words were obtained from this list and I used fetch.py to obtain the N-gram word frequency for these words.
Because of changes in the NYT version of wordle, 7 of the original 2315 solutions have been removed from the game. In order to additionally safeguard against any spoilers for wordle players, I have shuffled the solution days in this data set so that all actual solutions are still marked as such, but the days reported in this data set are not the actual days on which the words will appear. I used the file shuffle-answers.R to perform this operation (you don’t need to use this file, but I like to be transparent).
Part 1: Read in and Clean the Data
(6 points total)
- (2 points) Read in the file wordle-words.csv and store the data in a variable named orig. Read in the file guess-words.csv and store the data in a variable named guess.
- (4 points) Is the orig data in tidy form? Explain: if it is, list the features of tidy data and how this data set meets the requirements. If not, list the features of tidy data this data set violates, and explain what a tidy form of the data would look like.
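As a minimal sketch of the reading step, the following uses base R's read.csv() on a small stand-in file written to a temporary location, so that it runs without the exam data. The column names (word, occurrence, day) are taken from the example output shown elsewhere in this exam; the name `toy` is hypothetical.

```r
# Write a tiny stand-in file so the sketch runs without the exam data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("word,occurrence,day",
             "exact,2.450183e-05,139",
             "exams,2.825810e-06,NA"), tmp)

# read.csv() returns a data frame; "NA" entries are read as missing values.
toy <- read.csv(tmp, stringsAsFactors = FALSE)
str(toy)
```

For the exam itself, the file path would simply be the name of the provided csv file.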
Part 2: Functions
(24 points total)
In this part of the exam, use only the orig words.
You may need to write a for loop to check whether each letter is (or is not) in the vector of words. If you start with a vector that has 0 for each word in the list, you can use addition operations to ensure that only words with all of the provided letters are returned.
In R, the letters object contains all 26 valid lowercase Latin letters.
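The counting hint above can be sketched on a toy vector. This is only an illustration of the "start at 0 and add" idea, not a complete answer: the word list and the names `needed`, `hits`, and `result` are made up for the example.

```r
# Toy word list; the exam's data frame `orig` is not used here.
words <- c("adieu", "query", "crane", "audio", "quiet")
needed <- c("a", "u")   # letters that must all appear

# Start with a count of 0 for every word.
hits <- rep(0, length(words))

# Add 1 to a word's count each time it contains one of the needed letters.
for (letter in needed) {
  hits <- hits + grepl(letter, words, fixed = TRUE)
}

# Words containing every needed letter have a count equal to length(needed).
result <- words[hits == length(needed)]
print(result)
```

The same pattern works for excluded letters: keep the words whose count is still 0 after the loop.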
- (4 points) Write a function named filter_yellow(words, x) that uses the words data and a vector of lowercase letters, x, which are in the wordle solution, and returns all possible words containing those letters.
- (2 points) Modify filter_yellow(words, x) so that the function will return a useful error if the words data frame does not have a column named word.
- (2 points) Modify filter_yellow(words, x) so that the function will return a warning if any characters provided are not letters (A-z), and then will drop the non-letter characters from the vector before returning results using the valid letters.
- (2 points) Modify filter_yellow(words, x) so that the function can accept uppercase or lowercase letters, but will convert the letters to lowercase automatically without an error.
- (4 points) Write a function named filter_black(words, y) that takes a vector of letters, y, which aren’t in the wordle solution and returns all possible words which do not contain any of those letters. Use the same error handling code that you added in parts 2-4.
filter_black(orig, letters[1:19])
## Error in filter_black(orig, letters[1:19]): could not find function "filter_black"
# [1] "tutty" "vutty"
filter_black(orig, c("S", "T", "C", "H", "Y", "R", "P", "L", "K", "G", "N", "M", "D", "F", "B", "W"))
## Error in filter_black(orig, c("S", "T", "C", "H", "Y", "R", "P", "L", : could not find function "filter_black"
# [1] "ajiva" "aquae" "avize" "jaxie" "jeeze" "juvie" "ouija"
# [8] "ozzie" "qajaq" "queue" "zoaea" "zoeae" "zooea"
- (4 points) Write a function named filter_green(words, z) that takes a string of 5 letters, z (with unknowns indicated by _), and returns all words which have the specified letters in the specified positions. Add error handling that identifies invalid characters and flags strings that are not of length 5.
filter_green(orig, z = "EXA__")
## Error in filter_green(orig, z = "EXA__"): could not find function "filter_green"
# word occurrence day
# 1 exact 2.450183e-05 139
# 2 exalt 5.109440e-07 449
# 3 exams 2.825810e-06 NA
filter_green(orig, z = "summer")
## Error in filter_green(orig, z = "summer"): could not find function "filter_green"
# Error in filter_green(orig, z = "summer") : nchar(z) == 5 is not TRUE
- (4 points) Use your functions to determine what the remaining valid words are in the following situation. Based on the word occurrence frequency, which would you guess first?
Hint: You should be able to chain your functions together with pipes if you’ve set things up as instructed.
- (2 points) How would you improve your functions to make your wordle game better?
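The error-handling requirements in this part can be sketched with a few small helpers. These are hypothetical building blocks under stated assumptions, not the expected solution: the names `check_words`, `clean_letters`, and `green_pattern` are invented for illustration, and turning the green-tile string into a regular expression is one possible approach among several.

```r
# Hypothetical helper: stop with a useful message if `word` is missing.
check_words <- function(words) {
  if (!is.data.frame(words) || !("word" %in% names(words))) {
    stop("`words` must be a data frame with a column named `word`.")
  }
  invisible(TRUE)
}

# Hypothetical helper: lowercase the input, warn about non-letters, drop them.
clean_letters <- function(x) {
  x <- tolower(x)            # accept upper- or lowercase input
  ok <- x %in% letters       # `letters` holds "a" through "z"
  if (any(!ok)) {
    warning("Dropping non-letter characters: ", paste(x[!ok], collapse = ", "))
  }
  x[ok]
}

# Hypothetical helper: turn a green-tile string such as "EXA__" into a regex.
green_pattern <- function(z) {
  stopifnot(is.character(z), length(z) == 1, nchar(z) == 5)
  z <- tolower(z)
  if (grepl("[^a-z_]", z)) {
    stop("`z` may contain only letters and underscores.")
  }
  # Each "_" becomes ".", which matches any single character.
  paste0("^", gsub("_", ".", z, fixed = TRUE), "$")
}

# Toy data; the exam's `orig` data frame is not used here.
toy <- data.frame(word = c("exact", "exalt", "exams", "extra", "query"),
                  stringsAsFactors = FALSE)
check_words(toy)
kept <- suppressWarnings(clean_letters(c("Q", "u", "3")))
matches <- toy[grepl(green_pattern("EXA__"), toy$word), , drop = FALSE]
print(kept)
print(matches$word)
```

Because each helper takes its input and returns a cleaned value (or a data frame), functions built this way can keep the data frame as their first argument and be chained with pipes.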
Part 3: Assessing Word Frequency
(12 points total)
- (2 points) Join the two data frames of valid wordle words to create a data frame named all_words of all valid words.
- (2 points) Create a new column in your data frame, solution, that is TRUE for words which are solutions to wordle puzzles, and FALSE for words which are not solutions but are valid guesses.
- (4 points) Using ggplot2 or plotnine, create side-by-side boxplots which show the frequency of words by solution. You may want to use a transformation (e.g. scale_y_log10()) to make these values easier to visualize.
If you cannot figure this part out, email Dr. Vanderplas for code at the cost of a deduction of the 4 points you would get for this question.
- (2 points) Make sure your plot has a title and descriptive axis labels.
- (2 points) Write at least 2 sentences interpreting the chart you created in part 3.
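For the first two steps of this part, a minimal base R sketch on toy data might look like the following. It makes two assumptions that should be checked against the real files: that both data frames share the same column names (rbind() stacks rows rather than performing a keyed join), and that a word is a solution exactly when its day is not missing. The names `orig_toy`, `guess_toy`, and `all_toy` are hypothetical.

```r
# Toy stand-ins for the two data frames; the real column names are assumed.
orig_toy <- data.frame(word = c("exact", "exams"),
                       occurrence = c(2.45e-05, 2.83e-06),
                       day = c(139L, NA),
                       stringsAsFactors = FALSE)
guess_toy <- data.frame(word = "zooea",
                        occurrence = 1e-10,
                        day = NA_integer_,
                        stringsAsFactors = FALSE)

# Stack the rows; rbind() requires identical column names in both frames.
all_toy <- rbind(orig_toy, guess_toy)

# Assumption: a word is a solution when it has a non-missing solution day.
all_toy$solution <- !is.na(all_toy$day)
print(all_toy)
```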
Part 4: Decoding Wordle-Bot
(22 points total)
The New York Times added Wordle-Bot, which helps analyze your wordle guesses and critique your approach, when it acquired the game from its original creator.
This portion of the exam will help you create some of the data you would need to build your own wordle-bot clone.
- (2 points) Starting with all_words, the data frame you created for Part 3, create a data frame named answers that contains only words which are wordle solutions.
- (4 points) Split the letters of the words in the answers data frame into separate columns, so that there is a column for all of the first letters, all of the 2nd letters, and so on. You may find it helpful to first mutate a new column and then use a function such as unnest_wider() to get the data into a form that you can work with. A few lines of the desired data structure are provided below.
If you cannot figure this part out, email Dr. Vanderplas for code at the cost of a deduction of the 4 points you would get for this question.
# A tibble: 2,315 × 9
  word    occurrence   day X1    X2    X3    X4    X5    solution
  <chr>        <dbl> <int> <chr> <chr> <chr> <chr> <chr> <lgl>
1 aback 0.00000113    1628 a     b     a     c     k     TRUE
2 abase 0.0000000618   459 a     b     a     s     e     TRUE
3 abate 0.00000104    1781 a     b     a     t     e     TRUE
4 abbey 0.00000143     841 a     b     b     e     y     TRUE
- (4 points) Describe the form of the data you would need in order to compare the relative frequency of each letter in the alphabet by word position (e.g. 1st letter, 2nd letter, 3rd letter…). Feel free to upload and include a sketch using markdown syntax, if it is easier to sketch out the form of the data you would need.
- (4 points) Write code that will take you from the form of the data you got in question 2 to the structure you described in question 3.
- (2 points) Create a new variable named letter_type that has values vowel and consonant. For the purposes of this question, ‘y’ should be treated as a vowel.
- (4 points) Create a bar chart with letter on the x-axis and frequency on the y-axis. Facet by position so that your chart has 5 rows and a single column. The bars should be shaded according to letter_type, and your plot should have informative axis labels and a useful title.
- (2 points) How would you improve this chart so that viewers could more clearly see which letters correspond to each position? Justify your answer, using some of the information on chart perception we discussed in this class.
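The data-shaping steps in this part (split letters into columns, reshape to one row per letter and position, count, and classify) can be sketched in base R on three toy words. This swaps in strsplit() and table() in place of the tidyr approach the exam suggests, so it is an alternative illustration and not the intended solution; `answers_toy`, `wide`, `long`, and `counts` are hypothetical names.

```r
answers_toy <- data.frame(word = c("aback", "abase", "abbey"),
                          stringsAsFactors = FALSE)

# Split each word into its letters: one row per word, one column per position.
letter_mat <- do.call(rbind, strsplit(answers_toy$word, ""))
colnames(letter_mat) <- paste0("X", 1:5)
wide <- cbind(answers_toy, as.data.frame(letter_mat, stringsAsFactors = FALSE))

# Long form: one row per (word, position, letter) combination.
# as.vector() reads the matrix column by column, i.e. position by position.
long <- data.frame(
  word     = rep(wide$word, times = 5),
  position = rep(1:5, each = nrow(wide)),
  letter   = as.vector(letter_mat),
  stringsAsFactors = FALSE
)

# Count how often each letter appears in each position; drop empty cells.
counts <- as.data.frame(table(letter = long$letter, position = long$position),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Classify letters; per the exam, "y" is treated as a vowel.
counts$letter_type <- ifelse(counts$letter %in% c("a", "e", "i", "o", "u", "y"),
                             "vowel", "consonant")
print(counts)
```

A table in this long shape, with one row per letter and position, is the kind of input a faceted bar chart expects: letter maps to the x-axis, Freq to the y-axis, position to the facets, and letter_type to the fill.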