String Processing and Data Wrangling in R and Python
Note: This assignment must be submitted in github classroom.
Background Information
In this assignment, you will analyze poems written by two of Reddit’s famous poets: Poem_for_your_sprog
and SchnoodleDoodleDo
. These poems were scraped using the RedditExtractoR
package on June 15, 2023. Code Note: This package no longer works because of changes to reddit’s API that became effective on July 1, 2023.
poem_for_your_sprog
writes poems in response to other people’s comments (‘sprog’ means child) on AskReddit
posts. SchnoodleDoodleDo
writes poems in response to comments and posts on subreddits that are usually related to cute or wholesome content, such as aww
, WhatsWrongWithYourDog
, wholesomememes
, and more.
You can find the CSV files of each writer’s reddit comments in sprog_poems.csv
and schnoodle_poems.csv
, respectively. Historical versions of these CSVs are also saved in the repo for posterity’s sake, as a number of users left Reddit in June 2023 and deleted their accounts and/or comments.
As part of this analysis, you should learn how to do the following tasks:
- Split a vector of strings into components (lines)
- Create a new data frame with the components as a new variable
- Replace characters to clean up your data
- Compute basic numerical statistics on strings (word count, capitalization, use of punctuation and symbols, number of syllables, etc.)
- Combine this information to assess poetry styles
Content Warning
These poems may address adult topics and/or use strong or vulgar language, as they are taken from Reddit. I have not censored them in any way because I want you to work with data that is realistic. You should find a few test cases which you are comfortable with to use to test out your code, but you do not need to read all of the poems (and in both cases, I’ve provided you with approximately 1000 samples, so it’s probably too much to read in any case).
Here are 5 indices (counting from 1) which are approximately G or PG in content for each poet, using the most recent snapshot CSV (20230806):
- Sprog poems: 1, 2, 3, 6, 7
- SchnoodleDoodleDo poems: 1, 2, 3, 4, 5
These should be sufficient for you to test your code even if you are worried about being exposed to adult themes or language during this activity.
Assignment Instructions
Choose a poet and complete the following tasks in R or Python. Once you finish the set of tasks provided here, start on the other poet in the other language.
Using R to analyze <Replace this with poet’s username>’s poems
Splitting Poems into Lines
Split the poem into lines and create a data frame that has
- a new column called
poem_id
, numbered from one to the total number of poems by the poet in question - a new variable called
line
that contains the text of each line of the poem (one row per poem line). To get this, you may need to split the comment string by the endline character (\n
) and then unnest or expand your data frame. - a new variable called
line_no
that contains the line number of the poem. This should be computed per poem.
Ensure that you are splitting lines by a string which makes sense. Some poems have multiple paragraphs (stanzas) and may have a blank line in between; you want to preserve this blank line as it will help you make sense of the poem. Some poems instead have blank lines between every line of the poem; in these cases, you may want to split by e.g. \n\n
instead of \n
.
See Part 1 Checkpoint for an example of what the output should look like from each poet.
Summarizing Poems
For each poem, create a summary data frame that contains:
- Average number of words per line in the poem
- Number of lines in the poem
- Number of characters which are not letters or spaces (punctuation, numbers, and any non-ASCII characters)
- Number of uppercase letters in the poem
- Number of lowercase letters in the poem
Create a plot showing the distribution of the number of words per line across all poems you have in your dataset. Create another plot showing the number of lines in the poem for all poems in your dataset. What does this tell you about the “average” style of the poet in question?
See Part 2 Checkpoint for an example of what the output should look like from each poet.
Poetry Analysis
Choose some characteristic(s) of the poet’s style to explore graphically. If necessary, create a subset of the data with poems relevant to your question before you generate numerical summaries. You may want to clean up the data and remove lines which contain quotes (e.g. start with a >
character) or horizontal lines in reddit markdown (e.g. only have ----
).
Some ideas to get you started thinking:
Number of syllables per line may be used to infer rhyming scheme and/or poetry style. You can use the
qdap
package in R, which contains thesyllable_sum
function to count the number of syllables in a sentence. You may need to use a for-loop or program a custom function to use the syllables function on every entry in your data frame column.Use of non-alphabetic characters. Schnoodle often uses emoji and other text annotations to convey emotions and excitement - how often do these types of annotations appear in their poems?
Common characters and phrases. Sprog writes poems frequently about a character named Timmy (who often meets a horrible end). Do these poems have a common format/style/rhyme scheme?
Sprog often writes longer poems with multiple stanzas. Identify which poems have multiple stanzas (Hint, look for blank lines in a systematic pattern) and show the distribution of stanza length, stanza variation within poems, and number of stanzas in each poem.
Schnoodle often misspells words intentionally (using e.g.
fren
instead offriend
) to convey that they are writing using an animal’s voice. What proportion of words are misspelled in each poem? How much does this proportion vary? Based on the distribution, how likely is it, in your opinion, that Schnoodle misspells words more often when speaking as one type of animal than another? You can use thehunspell
package in R to detect whether words are likely misspelled.
Using Python to analyze <Replace this with poet’s username>’s poems
Splitting Poems into Lines
Split the poem into lines and create a data frame that has
- a new column called
poem_id
, numbered from one to the total number of poems by the poet in question - a new variable called
line
that contains the text of each line of the poem (one row per poem line). To get this, you may need to split the comment string by the endline character (\n
) and then unnest or expand your data frame. - a new variable called
line_no
that contains the line number of the poem. This should be computed per poem.
Ensure that you are splitting lines by a string which makes sense. Some poems have multiple paragraphs (stanzas) and may have a blank line in between; you want to preserve this blank line as it will help you make sense of the poem. Some poems instead have blank lines between every line of the poem; in these cases, you may want to split by e.g. \n\n
instead of \n
. See Part 1 Checkpoint for an example of what the output should look like from each poet.
Summarizing Poems
For each poem, create a summary data frame that contains:
- Average number of words per line in the poem
- Number of lines in the poem
- Number of characters which are not letters or spaces (punctuation, numbers, and any non-ASCII characters)
- Number of uppercase letters in the poem
- Number of lowercase letters in the poem
Create a plot showing the distribution of the number of words per line across all poems you have in your dataset. Create another plot showing the number of lines in the poem for all poems in your dataset. What does this tell you about the “average” style of the poet in question?
Poetry Analysis
Choose some characteristic(s) of the poet’s style to explore graphically. If necessary, create a subset of the data with poems relevant to your question before you generate numerical summaries. You may want to clean up the data and remove lines which contain quotes (e.g. start with a >
character) or horizontal lines in reddit markdown (e.g. only have ----
).
Some ideas to get you started thinking:
Number of syllables per line may be used to infer rhyming scheme and/or poetry style. You can use the
syllables
python package. You may need to use a for-loop or program a custom function to use the syllables function on every entry in your data frame column.Use of non-alphabetic characters. Schnoodle often uses emoji and other text annotations to convey emotions and excitement - how often do these types of annotations appear in their poems?
Common characters and phrases. Sprog writes poems frequently about a character named Timmy (who often meets a horrible end). Do these poems have a common format/style/rhyme scheme?
Sprog often writes longer poems with multiple stanzas. Identify which poems have multiple stanzas (Hint, look for blank lines in a systematic pattern) and show the distribution of stanza length, stanza variation within poems, and number of stanzas in each poem.
Schnoodle often misspells words intentionally (using e.g.
fren
instead offriend
) to convey that they are writing using an animal’s voice. What proportion of words are misspelled in each poem? How much does this proportion vary? Based on the distribution, how likely is it, in your opinion, that Schnoodle misspells words more often when speaking as one type of animal than another? You can use one of the python packages discussed in this post to detect whether words are likely misspelled.