Lists and Functional Programming

purrr

  • dplyr handles data frames
  • purrr handles lists

Remember: a data frame is a list
of vectors with equal length.
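
A quick way to check this claim, as a minimal sketch. The data frame small_df and its columns are made up for illustration.

```r
# Hypothetical example data frame (names are illustrative only)
small_df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

is.list(small_df)      # a data frame is a list ...
length(small_df)       # ... with one element per column
lengths(small_df)      # ... and every element has the same length
small_df[["score"]]    # list-style extraction returns the column vector
```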

A list can also be stored inside a data frame, as a list-column.

List-Columns

An atomic vector holds single values of one type. But not all vectors are atomic: a list is also a vector, and each of its elements can be any object, including another vector.

library(tibble)
df <- tibble(x = 1:3, 
             y = list(rnorm(5), 
                      rnorm(15), 
                      rnorm(25)))
df
# A tibble: 3 × 2
      x y         
  <int> <list>    
1     1 <dbl [5]> 
2     2 <dbl [15]>
3     3 <dbl [25]>

y is a list-column.

List-Columns

We can handle list-columns in multiple ways:

  1. Unnest into an atomic vector (tidyr)
  2. Work with the list-column using map_* and similar functions (purrr)

library(tidyr)

df |> unnest(y) |> head()
# A tibble: 6 × 2
      x       y
  <int>   <dbl>
1     1  0.428 
2     1 -0.657 
3     1 -0.830 
4     1 -0.0448
5     1  0.231 
6     2  1.84  
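
The second option, working on the list-column directly, might look like the sketch below. It rebuilds a tibble with the same shape as df; the column names n_obs and y_mean and the result name df_summary are illustrative choices, not part of the original example. The values are random, so no output is shown.

```r
library(tibble)
library(dplyr)
library(purrr)

# Same shape as df above: an integer column and a list-column of samples
df <- tibble(x = 1:3,
             y = list(rnorm(5), rnorm(15), rnorm(25)))

df_summary <- df |>
  mutate(n_obs  = map_int(y, length),  # one integer per list element
         y_mean = map_dbl(y, mean))    # one double per list element

df_summary
```

The list-column stays intact, and each summary is an ordinary atomic column next to it.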

Map Functions

The map_* family of functions

  1. Takes a list
  2. Applies a function to each element of the list
  3. Returns a list (map) or a vector (map_xxx) of type xxx

map is like a pipe-friendly version of base R's lapply, and the map_xxx variants are type-stable: they return a vector of type xxx or fail with an error.
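
A minimal sketch of the difference between map and a typed variant. The input list samples is hypothetical and only serves the illustration.

```r
library(purrr)

# Hypothetical input list (illustrative only)
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

as_list <- map(samples, mean)      # always a list, like lapply(samples, mean)
as_dbl  <- map_dbl(samples, mean)  # always a double vector

# range() returns two values per element, so map_dbl() raises an error
# instead of silently changing the shape of the result
bad <- tryCatch(map_dbl(samples, range), error = function(e) "error")
```

By comparison, sapply(samples, range) would quietly return a matrix, which is the kind of surprise the typed variants are meant to avoid.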

Demonstrating Map

I want to show the Central Limit Theorem using simulation, using the Exponential(5) distribution.

The distribution of the (suitably standardized) sample mean converges to a normal distribution as \(n\rightarrow\infty\), provided \(\mu\) exists and \(\sigma^2 <\infty\).
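
As a worked instance, assuming the 5 in Exponential(5) is the rate \(\lambda\) (this matches the rate=5 argument passed to rexp in the simulation code):

\[
\mu = \frac{1}{\lambda} = 0.2, \qquad \sigma^2 = \frac{1}{\lambda^2} = 0.04, \qquad \frac{\sqrt{n}\,(\overline{x} - \mu)}{\sigma} \xrightarrow{d} N(0, 1),
\]

so for large \(n\) the sample mean is approximately \(N(0.2,\ 0.04/n)\), with standard deviation \(0.2/\sqrt{n}\).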

  1. Select values of \(n\): 1, 3, 5, 10, 20, 50, 100, 500, 1000

  2. Decide how many times you want to draw a sample for each value of \(n\).
    For \(i=1, ..., 500\), draw a sample of size \(n\) from Exp(5)

  3. Set up a tibble with \(n\) as a column, with each \(n\) repeated 500 times.

  4. For each row in the tibble, draw a sample.

  5. For each sample, calculate \(\overline{x}\)

  6. Plot the \(\overline{x}_i\) values

Demonstrating Map

library(dplyr)
library(purrr)
sim_df <- tibble(n = rep(c(1, 3, 5, 10, 20, 50, 100, 500, 1000), each = 500)) |>
  # Create a list-column with mutate + map*
  mutate(sample = map(n, ~rexp(., rate=5))) |>
  # Summarize a list-column with mutate + map*
  mutate(mean = map_dbl(sample, mean), 
         sd = map_dbl(sample, sd))

head(sim_df)
# A tibble: 6 × 4
      n sample      mean    sd
  <dbl> <list>     <dbl> <dbl>
1     1 <dbl [1]> 0.0610    NA
2     1 <dbl [1]> 0.458     NA
3     1 <dbl [1]> 0.190     NA
4     1 <dbl [1]> 0.0260    NA
5     1 <dbl [1]> 0.166     NA
6     1 <dbl [1]> 0.0375    NA

Demonstrating Map

library(ggplot2)

ggplot(sim_df, aes(x = mean)) + geom_histogram() + facet_wrap(~n, scales = "free_x")
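
The histograms can be complemented with a numeric check. The sketch below is a smaller, hypothetical version of the simulation (three values of n, 200 draws each, arbitrary seed) and compares the spread of the simulated means with 0.2 / sqrt(n), the value expected when 5 is the rate of the exponential distribution. The names check_df, xbar, mean_of_means, sd_of_means, and theory_sd are illustrative.

```r
library(tibble)
library(dplyr)
library(purrr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

check_df <- tibble(n = rep(c(5, 50, 500), each = 200)) |>
  # Draw one sample per row and keep only its mean
  mutate(xbar = map_dbl(n, \(size) mean(rexp(size, rate = 5)))) |>
  group_by(n) |>
  summarize(mean_of_means = mean(xbar),
            sd_of_means   = sd(xbar)) |>
  # Theoretical standard deviation of the sample mean: sigma / sqrt(n)
  mutate(theory_sd = 0.2 / sqrt(n))

check_df
```

If the simulation behaves as expected, mean_of_means should sit close to 0.2 for every n, and sd_of_means should shrink roughly in step with theory_sd.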

Demonstrating Nest/Unnest

I want to show the Central Limit Theorem using simulation, using the Exponential(5) distribution.

The distribution of the (suitably standardized) sample mean converges to a normal distribution as \(n\rightarrow\infty\), provided \(\mu\) exists and \(\sigma^2 <\infty\).

  1. Select values of \(n\): 1, 3, 5, 10, 20, 50, 100, 500, 1000

  2. Decide how many times you want to draw a sample for each value of \(n\).
    For \(i=1, ..., 500\), draw a sample of size \(n\) from Exp(5)

  3. Set up a tibble with \(n\) as a column, with each \(n\) repeated 500 times. Define index \(i\) as another column.

  4. For each row in the tibble, draw a sample. Unnest the samples.

  5. For each sample, calculate \(\overline{x}_i\), grouping by \(n\) and \(i\)

  6. Plot the \(\overline{x}_i\) values

Demonstrating Nest/Unnest

library(dplyr)
library(purrr)
library(tidyr)
sim_df <- tibble(n = rep(c(1, 3, 5, 10, 20, 50, 100, 500, 1000), each = 500)) |>
  # Create i column: replicate index 1..500 within each n
  group_by(n) |>
  mutate(i = 1:n()) |>
  ungroup() |>
  # Create a list-column with mutate + map*
  mutate(sample = map(n, ~rexp(., rate=5))) |>
  unnest(sample)

res_df <- sim_df |>
  group_by(n, i) |>
  # Summarize the unnested samples with group_by + summarize
  summarize(mean = mean(sample), .groups = "drop")

head(sim_df)
# A tibble: 6 × 3
      n     i sample
  <dbl> <int>  <dbl>
1     1     1 0.245 
2     1     2 0.0958
3     1     3 0.0535
4     1     4 0.0369
5     1     5 0.193 
6     1     6 0.263 
head(res_df)
# A tibble: 6 × 3
      n     i   mean
  <dbl> <int>  <dbl>
1     1     1 0.245 
2     1     2 0.0958
3     1     3 0.0535
4     1     4 0.0369
5     1     5 0.193 
6     1     6 0.263 
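
The map approach and the unnest approach should give the same sample means when they see the same random draws. The sketch below checks this on a small, hypothetical design (three sample sizes, four replicates each, arbitrary seed); the names design, by_map, and by_unnest are illustrative.

```r
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical small design: 3 sample sizes, 4 replicates each
design <- tibble(n = rep(c(2, 5, 10), each = 4)) |>
  group_by(n) |>
  mutate(i = row_number()) |>
  ungroup()

# Approach 1: keep the list-column and summarize it with map_dbl
set.seed(42)  # arbitrary seed so both approaches see the same draws
by_map <- design |>
  mutate(sample = map(n, \(size) rexp(size, rate = 5)),
         mean   = map_dbl(sample, mean)) |>
  select(n, i, mean)

# Approach 2: unnest the samples, then group_by + summarize
set.seed(42)
by_unnest <- design |>
  mutate(sample = map(n, \(size) rexp(size, rate = 5))) |>
  unnest(sample) |>
  group_by(n, i) |>
  summarize(mean = mean(sample), .groups = "drop")

all.equal(by_map$mean, by_unnest$mean)
```

Which approach to prefer is largely a matter of taste: the list-column version keeps one row per sample, while the unnested version exposes every draw to ordinary dplyr verbs at the cost of a much longer table.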

Demonstrating Nest/Unnest

library(ggplot2)

ggplot(res_df, aes(x = mean)) + geom_histogram() + facet_wrap(~n, scales = "free_x")