If you choose not to attend the rally (or you are otherwise unable to attend class), please submit the completed qmd file by email.
Introduction
Make sure to install the arrow package in R before you proceed with this assignment.
Make your own big data
Create a data frame, df, with 10,000,000 rows and two columns.
The first column, x, should be a randomly sampled number from a N(0, 5) distribution.
The second column, y, should be a randomly sampled letter from the letters vector (sampling with replacement, uniform probability).
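A minimal sketch of the construction described above, shown at a smaller size so it runs quickly. It assumes the 5 in N(0, 5) is the standard deviation; if it is meant as the variance, use sd = sqrt(5) instead. The seed and the value of n are illustrative choices, not part of the assignment.

```r
set.seed(42)   # hypothetical seed, only so the sketch is reproducible
n <- 1e5       # illustrative size; the assignment asks for 10,000,000 rows

df <- data.frame(
  x = rnorm(n, mean = 0, sd = 5),         # assumes 5 is the standard deviation
  y = sample(letters, n, replace = TRUE)  # sample() draws uniformly by default
)

str(df)
```

Setting n <- 1e7 should give the full-size data frame, at the cost of more memory and time.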
Save your data as an Arrow Table using Table$create(df). You can find out more about the Table class by running ?Table after loading the arrow package.
library(arrow)
Attaching package: 'arrow'
The following object is masked from 'package:utils':
timestamp
?Table # Learn more about the Table class
Table$public_methods |> names() # These methods (functions attached to the Table object) will override columns with the same names
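As a rough illustration of Table$create(), the sketch below converts a tiny data frame and inspects the result. The data frame small_df and its values are made up for the example; the same calls should apply to the full df.

```r
library(arrow)

# Hypothetical toy data, standing in for the 10,000,000-row df
small_df <- data.frame(x = c(-1.5, 2.0, 3.25), y = c("a", "b", "e"))

tbl <- Table$create(small_df)

class(tbl)     # the Table is an R6 object, not a data.frame
tbl$num_rows   # number of rows stored in the Table
tbl$schema     # column names and Arrow data types
```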
Save your data to disk as a parquet file named mydata.parquet. Read the data back in from the parquet format as pardf. What has changed?
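One possible way to do the round trip, sketched on toy data with write_parquet() and read_parquet() from the arrow package. The file name example.parquet and the temporary directory are assumptions made for the sketch; the assignment itself asks for mydata.parquet.

```r
library(arrow)

# Hypothetical toy data and file path, used only for this sketch
small_df <- data.frame(x = c(-1.5, 2.0, 3.25), y = c("a", "b", "e"))
path <- file.path(tempdir(), "example.parquet")

write_parquet(small_df, path)   # write the data to disk in parquet format
pardf <- read_parquet(path)     # read it back into R

# Compare the object before and after the round trip
class(small_df)
class(pardf)
str(pardf)
```

Comparing class() and str() on both objects is one way to look for what changed.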
dplyr and Arrow
Filtering Rows
Use arrow’s dplyr interface to filter and keep only the rows where \(x>0\) and \(y\in\{a, e, i, o, u\}\).
What function is necessary to convert the results back into a data frame at the end of the statement?
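The sketch below shows the general shape of a dplyr pipeline on an Arrow Table, using made-up data. The verbs are translated to Arrow operations, and the rows only come back into R memory once the result is collected. The names small_df, tbl, and res are illustrative.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data
small_df <- data.frame(
  x = c(-1.5, 2.0, 3.25, 0.5),
  y = c("a", "b", "e", "u")
)
tbl <- Table$create(small_df)

res <- tbl |>
  filter(x > 0, y %in% c("a", "e", "i", "o", "u")) |>  # keep positive x and vowels
  collect()                                            # pull the result into R

res
```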
Summary Statistics
Group by the letter and compute the number of rows, the mean of \(x\) values, and the standard deviation of \(x\) values.
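A hedged sketch of a grouped summary on an Arrow Table, again on toy data. The output column names n, mean_x, and sd_x are arbitrary choices for the example, and arrange() is applied after collect() only to make the row order predictable.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data
small_df <- data.frame(
  x = c(1, 3, 2, 4, 6),
  y = c("a", "a", "b", "b", "b")
)

stats <- Table$create(small_df) |>
  group_by(y) |>
  summarise(
    n      = n(),      # number of rows per letter
    mean_x = mean(x),  # mean of x per letter
    sd_x   = sd(x)     # standard deviation of x per letter
  ) |>
  collect() |>
  arrange(y)

stats
```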
Reflection
In 3-4 sentences, explain where you see yourself using Arrow in the future, and when it would make sense to use Arrow rather than e.g. a base R data.frame or a tibble.