Arrow and Parquet

If you choose not to attend the rally (or you are otherwise unable to attend class), please submit the completed qmd file by email.

Introduction

Make sure to install the arrow package in R before you proceed with this assignment.

Make your own big data

Create a data frame, df, with 10,000,000 rows and two columns.

The first column, x, should be a randomly sampled number from an N(0, 5) distribution.
The second column, y, should be a randomly sampled letter from the letters vector (sampling with replacement, uniform probability).

Save your data as an arrow Table using Table$create(df). You can find out more about the Table class by running ?Table after loading the arrow package.

library(arrow)


Attaching package: 'arrow'

The following object is masked from 'package:utils':

    timestamp

?Table # Learn more about the table class
Table$public_methods |> names() # These methods (functions attached to the Table object) will override columns with the same names

 [1] "column"                "ColumnNames"           "nbytes"               
 [4] "RenameColumns"         "GetColumnByName"       "RemoveColumn"         
 [7] "AddColumn"             "SetColumn"             "ReplaceSchemaMetadata"
[10] "field"                 "serialize"             "to_data_frame"        
[13] "cast"                  "SelectColumns"         "Slice"                
[16] "Equals"                "Validate"              "ValidateFull"         
[19] "clone"

# Your data frame code goes here
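
As a rough illustration of the pattern, here is a minimal sketch on a much smaller data set. The names `small_df` and `small_tbl`, the seed, and the row count are hypothetical, and the sketch assumes the 5 in N(0, 5) is a standard deviation; check which parameterization is intended before reusing it.

```r
library(arrow)

set.seed(1)  # hypothetical seed, only so the sketch is reproducible
n <- 1000    # small stand-in; the assignment asks for 10,000,000 rows

small_df <- data.frame(
  x = rnorm(n, mean = 0, sd = 5),         # ASSUMPTION: 5 is the sd, not the variance
  y = sample(letters, n, replace = TRUE)  # uniform sampling with replacement
)

# Convert the data frame to an Arrow Table and inspect it
small_tbl <- Table$create(small_df)
small_tbl$num_rows
small_tbl$schema
```

Printing the schema is a quick way to see how Arrow has typed each column.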

Save, Read, and Compare

Save your data to disk as a Parquet file named mydata.parquet. Read the data back in from the Parquet format as pardf. Compare pardf with the original df: what has changed?
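
The round trip can be sketched on a toy data frame as follows. The object names `toy` and `toy_back` and the temporary file path are hypothetical; `write_parquet()` and `read_parquet()` are the arrow functions assumed to be appropriate here.

```r
library(arrow)

toy <- data.frame(x = c(-1.5, 0.2, 3.1), y = c("a", "b", "e"))

# Write to a temporary Parquet file, then read it back
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)
toy_back <- read_parquet(path)

# Things worth comparing between the original and the round-tripped object
class(toy)
class(toy_back)
object.size(toy)
file.size(path)
```

Comparing classes and sizes in this way is one starting point for answering the question; other differences may also be worth looking for.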

`dplyr` and Arrow

Filtering Rows

Use arrow’s dplyr interface to filter and keep only the rows where $x>0$ and $y\in\{a, e, i, o, u\}$.

What function is necessary to convert the results back into a data frame at the end of the statement?
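
To see why a conversion step is needed, the sketch below applies a dplyr verb to a toy Arrow Table and inspects what comes back. It assumes the dplyr package is installed (the instructions above only mention arrow), and `toy_tbl` and `q` are hypothetical names.

```r
library(arrow)
library(dplyr)

toy_tbl <- Table$create(
  data.frame(x = c(-1.5, 0.2, 3.1, 4.0), y = c("a", "b", "e", "z"))
)

# filter() on an Arrow Table builds a query; it does not return a data frame
q <- toy_tbl |>
  filter(x > 0, y %in% c("a", "e"))

class(q)  # inspect the class of the result
print(q)  # shows a description of the query rather than rows of data
```

The result describes the work to be done rather than holding the filtered rows, which is why the statement needs one more step at the end.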

Summary Statistics

Group by the letter and compute the number of rows, the mean of $x$ values, and the standard deviation of $x$ values.
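
A minimal sketch of a grouped summary on a toy Arrow Table is shown below. It assumes dplyr is installed; the names `toy_tbl`, `res`, `n_rows`, `mean_x`, and `sd_x` are hypothetical. The sketch ends with `compute()`, which evaluates the query but keeps the result as an Arrow Table, so converting to a data frame is still left to do.

```r
library(arrow)
library(dplyr)

toy_tbl <- Table$create(
  data.frame(x = c(1, 2, 3, 4), y = c("a", "a", "b", "b"))
)

res <- toy_tbl |>
  group_by(y) |>
  summarize(
    n_rows = n(),      # number of rows per letter
    mean_x = mean(x),  # mean of x per letter
    sd_x   = sd(x)     # standard deviation of x per letter
  ) |>
  compute()            # evaluate in Arrow; result is still an Arrow Table

res$num_rows
names(res)
```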

Reflection

In 3-4 sentences, explain where you see yourself using Arrow in the future, and when it would make sense to use Arrow rather than, for example, a base R data.frame or a tibble.