Introduction to Column-Oriented Data Storage
Deep Dive into Parquet
Working with Arrow in R
Querying Parquet with Different Engines
Arrow Datasets for Larger-than-Memory Operations
Partitioning Strategies
Hands-on Workshop: Analysis with PUMS Data
These slides are borrowed from the JSM Big Data 2025 workshop.
Data has to be represented somehow, both during analysis and in storage.
The shape and characteristics of that representation impact performance.
What if you could speed up a key part of your analysis by 30x and reduce your storage by 10x?
Row-oriented
|ID|Name |Age|City |
|--|-----|---|--------|
|1 |Alice|25 |New York|
|2 |Bob |30 |Boston |
|3 |Carol|45 |Chicago |
Column-oriented
ID: [1, 2, 3]
Name: [Alice, Bob, Carol]
Age: [25, 30, 45]
City: [New York, Boston, Chicago]
And you already use column-oriented data frames!
… but you are probably still storing your data in a fundamentally row-oriented way, such as CSV (see the sketch below).
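In R, a data.frame is literally a list of column vectors, so the in-memory representation is already column-oriented. A minimal illustration using the table above:

```r
# A data.frame is a list of equal-length column vectors
df <- data.frame(
  ID   = c(1, 2, 3),
  Name = c("Alice", "Bob", "Carol"),
  Age  = c(25, 30, 45),
  City = c("New York", "Boston", "Chicago")
)

is.list(df)  # TRUE: the data.frame is a list of columns
df$Age       # each column is one contiguous vector: 25 30 45
```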
https://github.com/arrowrbook/book/releases/download/PUMS_subset/PUMS.subset.zip
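A minimal sketch of fetching and unpacking the subset (the local file name is an assumption):

```r
# Download the PUMS subset used below and unzip it into the working directory
url <- "https://github.com/arrowrbook/book/releases/download/PUMS_subset/PUMS.subset.zip"
download.file(url, destfile = "PUMS.subset.zip")  # destination name is an assumption
unzip("PUMS.subset.zip")
```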
As a CSV file
```
   user  system elapsed
  2.278   0.048   2.347
```
As a zipped CSV file
```
   user  system elapsed
  0.268   0.000   0.293
```
As a CSV file with arrow
```
   user  system elapsed
  0.931   0.277   0.196
```
As a Parquet file
```
   user  system elapsed
  0.105   0.015   0.120
```
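The timings above are the output of R's system.time(). A sketch of how such a comparison could be run is below; the file names are assumptions standing in for the PUMS subset in each format, and the exact numbers will differ by machine:

```r
library(arrow)

system.time(df  <- read.csv("pums.csv"))                  # base R CSV reader
system.time(df  <- read.csv(unz("pums.zip", "pums.csv"))) # CSV inside a zip archive
system.time(tbl <- read_csv_arrow("pums.csv"))            # arrow's multithreaded CSV reader
system.time(tbl <- read_parquet("pums.parquet"))          # Parquet via arrow
```

Note that in the arrow CSV timing above, elapsed time (0.196) is smaller than user time (0.931): arrow parses the file on multiple threads, so CPU time is spread across cores and wall-clock time drops.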
Are there any differences?
```
   user  system elapsed
  0.012   0.000   0.011
```

```
   user  system elapsed
  0.301   0.053   0.084
```
Languages with native Parquet support: e.g., C++, Java, Python, R, Go, Rust
Systems with Parquet integration: e.g., Apache Spark, Hive, Trino, DuckDB, BigQuery
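Because Parquet is an open standard, the same file can be queried from many engines without conversion. A minimal sketch using DuckDB from R (the file name is an assumption):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# DuckDB can query a Parquet file in place, straight from SQL
dbGetQuery(con, "SELECT COUNT(*) FROM 'pums.parquet'")
dbDisconnect(con, shutdown = TRUE)
```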
Arrow Table: an eager, fully in-memory structure, analogous to a data frame.
Arrow Dataset: a lazy view over one or more files on disk; queries are built up without reading any data, and execution happens only when collect() is called (see the sketch below).
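A minimal sketch of the lazy Dataset workflow (the path and column name are assumptions):

```r
library(arrow)
library(dplyr)

# Open a directory of Parquet files as a Dataset: only metadata is read here
ds <- open_dataset("person/")

# Build the query lazily; no data is scanned yet
ds |>
  filter(year == 2018) |>   # 'year' is an assumed column
  summarise(n = n()) |>
  collect()                 # data is read into memory only at collect()
```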
Hive Partitioning
Directory format: column=value
Example:
person/
├── year=2018/
│ ├── state=NY/
│ │ └── data.parquet
│ └── state=CA/
│ └── data.parquet
├── year=2019/
│ ├── ...
Self-describing structure
Standard in big data ecosystem
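arrow's write_dataset() produces exactly this Hive-style column=value layout by default when given partitioning columns. A sketch, assuming a data frame df with year and state columns:

```r
library(arrow)

# Writes person/year=2018/state=NY/part-0.parquet and so on
write_dataset(df, "person/", partitioning = c("year", "state"))
```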
Non-Hive Partitioning
Directory format: value
Example:
person/
├── 2018/
│ ├── NY/
│ │ └── data.parquet
│ └── CA/
│ └── data.parquet
├── 2019/
│ ├── ...
Requires specifying column names when reading
Less verbose directory names
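With a non-Hive layout the column names are absent from the paths, so they must be supplied when opening the dataset. A sketch, assuming the directory tree above:

```r
library(arrow)

# Map the unnamed directory levels to columns in order: year, then state
ds <- open_dataset("person/", partitioning = c("year", "state"))
```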
Resources: