Your Turn Solutions

Motivating Example

Try playing with chunks of code from this session to further investigate the shedding data:

Get a summary of the daily shedding values (use the shed data set)

#shed <- read_csv("https://unl-statistics.github.io/R-workshops/01-r-intro/data/daily_shedding.csv")

shed <- read.csv("../data/daily_shedding.csv")


summary(shed$daily_shedding)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   3.912   3.765   5.521  13.218

Make side by side boxplots of final weight gain by treatment group (use the final_shed data set)

library(tidyverse)

final_shed <- shed %>% 
  group_by(pignum) %>% 
  mutate(gain = pig_weight[time_point == 21] - pig_weight[time_point == 0]) %>% 
  filter(time_point == 21) %>% 
  ungroup() %>% 
  select(-c(4:9))

ggplot(final_shed) + 
  geom_boxplot( aes(treatment, gain, fill = treatment))

Compute a Wilcoxon rank sum test for the control vs. the “Bglu” treatment group

wilcox.test(total_shedding ~ treatment, data = final_shed,
            subset = treatment %in% c("control", "Bglu"))


    Wilcoxon rank sum exact test

data:  total_shedding by treatment
W = 23, p-value = 0.04326
alternative hypothesis: true location shift is not equal to 0

Basics

Using the R Reference Card (and the Help pages, if needed), do the following:

Find out how many rows and columns the ‘iris’ data set has. Figure out at least 2 ways to do this. Hint: “Variable Information” section on the first page of the reference card!

nrow(iris)

[1] 150

ncol(iris)

[1] 5

dim(iris)

[1] 150   5

Use the rep function to construct the following vector: 1 1 2 2 3 3 4 4 5 5. Hint: “Data Creation” section of the reference card

rep(1:5, each = 2)

 [1] 1 1 2 2 3 3 4 4 5 5

Use rep to construct this vector: 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

rep(1:5, times = 3)

 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
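As a small aside on how the two arguments differ, the sketch below combines each and times in one call and also passes a vector to times. It is not part of the original exercises, and the object names (both, per_element) are only illustrative.

```r
# each repeats every element in place; times then repeats the whole result
both <- rep(1:2, each = 2, times = 2)
print(both)         # 1 1 2 2 1 1 2 2

# a vector-valued times gives a separate repeat count for each element
per_element <- rep(1:3, times = c(3, 2, 1))
print(per_element)  # 1 1 1 2 2 3
```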

Find out how many pigs had a total shedding value of less than 30 log10 CFUs. Hint: if you use the sum function on a logical vector, it’ll return how many TRUEs are in the vector:

sum(c(TRUE, TRUE, FALSE, TRUE, FALSE))

[1] 3

# final_shed has one row per pig, so counting rows counts pigs
sum(final_shed$total_shedding < 30)

[1] 5

More Challenging: Calculate the sum of the total shedding log10 CFUs of all pigs with a total shedding value of less than 30 log10 CFUs

sum(shed$total_shedding[shed$total_shedding < 30])

[1] 1930.403
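The same count-then-subset pattern can be tried without the shedding file. The sketch below uses the built-in iris data as a stand-in; the threshold of 6 and the object names are arbitrary choices for illustration.

```r
# logical vector: TRUE where the condition holds, FALSE elsewhere
long_sepal <- iris$Sepal.Length > 6

# sum() counts the TRUEs; mean() gives the proportion of TRUEs
n_long <- sum(long_sepal)
prop_long <- mean(long_sepal)

# the same logical vector can pick out the values before summing them
total_long <- sum(iris$Sepal.Length[long_sepal])

print(c(count = n_long, proportion = prop_long, total = total_long))
```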

Which pigs have a total shedding value less than or equal to 30 OR are in the Acid treatment group?

shedding <- final_shed$total_shedding
treatment <- final_shed$treatment
id <- (shedding <= 30 | treatment == "Acid")
final_shed[id,]

# A tibble: 15 × 7
   pignum time_point pig_weight daily_shedding treatment total_shedding  gain
    <int>      <int>      <dbl>          <dbl> <chr>              <dbl> <dbl>
 1     97         21       24.9           0    RPS                  0    12.9
 2    181         21       34.6           3.91 RPS                 23.8  16.9
 3    321         21       28.8           3.91 RPS                 28.0  13.5
 4    373         21       33.3           0    RPS                 29.1  16.7
 5    392         21       29.3           0    RPS                 19.6  16.2
 6     26         21       23.8           4.61 Acid                35.0  12.3
 7     52         21       34.7           3.91 Acid                34.0  17.2
 8    126         21       24.3           3.91 Acid                48.4  11.9
 9    152         21       29.7           3.91 Acid                48.6  14.6
10    178         21       33.3           3.91 Acid                35.8  16.6
11    211         21       28.6           0    Acid                30.9  14.4
12    348         21       27.2           3.91 Acid                58.5  12.9
13    361         21       19.5           0    Acid                40.1  10.0
14    378         21       31.6           3.91 Acid                34.0  15.7
15    426         21       35.0           3.91 Acid                33.5  17.6
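For comparison, here is a sketch of the same kind of OR filter written two ways on the built-in mtcars data, which stands in for final_shed. The cut-offs (mpg above 30, 8 cylinders) are arbitrary and only serve the illustration.

```r
# | compares element by element, which is what row selection needs;
# || expects single values, so it is not the right tool here
keep <- mtcars$mpg > 30 | mtcars$cyl == 8
by_index <- mtcars[keep, ]

# subset() evaluates the condition inside the data frame,
# so the columns can be named without the mtcars$ prefix
by_subset <- subset(mtcars, mpg > 30 | cyl == 8)

print(nrow(by_index))
```

Both forms should return the same rows; which one to use is mostly a matter of readability.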

Data Structures

In row_matrix, set ‘byrow = FALSE’ to see what happens

matrix(
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  nrow = 3,  
  ncol = 3,        
  byrow = FALSE        
)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
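To see the effect of byrow side by side, the following sketch builds both versions from the same values and checks that one is the transpose of the other. The names by_col and by_row are illustrative.

```r
# byrow = FALSE (the default) fills the matrix column by column
by_col <- matrix(1:9, nrow = 3, ncol = 3, byrow = FALSE)

# byrow = TRUE fills it row by row
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

print(by_col)
print(by_row)

# for a square matrix built from the same values, transposing one gives the other
same <- identical(by_row, t(by_col))
print(same)
```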

Recreate this matrix.

matrix(c(2,4,6,8,10,12), nrow=3, ncol = 2, byrow = FALSE)

     [,1] [,2]
[1,]    2    8
[2,]    4   10
[3,]    6   12

Make a data frame with column 1: 1,2,3,4,5,6 and column 2: a,b,a,b,a,b

mydf <- data.frame(col1 = 1:6, col2 = rep(c("a", "b"), times = 3))

Select only rows with value “a” in column 2 using a logical vector

mydf[mydf$col2 == "a",]

  col1 col2
1    1    a
3    3    a
5    5    a

names(mydf) <- c("Bulldogs", "Tigers")

mtcars is a built-in data set, like iris. Extract the 4th row of the mtcars data.

mtcars[4,]

                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Create another column in the flower data frame, that is the sum of Sepal Width, Sepal Length, Petal Width and Petal Length.

flower <- iris

total <- flower$Sepal.Length + flower$Sepal.Width + flower$Petal.Length + flower$Petal.Width

flower2 <- data.frame(flower, total)

head(flower2)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species total
1          5.1         3.5          1.4         0.2  setosa  10.2
2          4.9         3.0          1.4         0.2  setosa   9.5
3          4.7         3.2          1.3         0.2  setosa   9.4
4          4.6         3.1          1.5         0.2  setosa   9.4
5          5.0         3.6          1.4         0.2  setosa  10.2
6          5.4         3.9          1.7         0.4  setosa  11.4
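An alternative sketch for the same task, assuming you are happy to add the column directly to a copy of the data: assigning to a new name with $ creates the column, and rowSums() adds up the selected columns for every row.

```r
flower <- iris

# the four measurement columns to add up
measure_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# assigning to a name that does not exist yet adds a new column
flower$total <- rowSums(flower[, measure_cols])

head(flower)
```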

Step Further. Create a data frame of observations where Sepal Length is greater than 6. Then create another column in this data frame that is the sum of Sepal Width and Sepal Length

flower_s <- flower[flower$Sepal.Length>6, ]

total_s <- flower_s$Sepal.Length + flower_s$Sepal.Width

head(data.frame(flower_s, total_s))

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species total_s
51          7.0         3.2          4.7         1.4 versicolor    10.2
52          6.4         3.2          4.5         1.5 versicolor     9.6
53          6.9         3.1          4.9         1.5 versicolor    10.0
55          6.5         2.8          4.6         1.5 versicolor     9.3
57          6.3         3.3          4.7         1.6 versicolor     9.6
59          6.6         2.9          4.6         1.3 versicolor     9.5

Create a list containing a vector and a 2x3 data frame

mylist <- list(vec = 1:6, df = data.frame(x = 1:2, y = 3:4, z = 5:6))

Use indexing to select the data frame from your list

mylist[[2]]

  x y z
1 1 3 5
2 2 4 6

Use further indexing to select the first row from the data frame in your list

mylist[[2]][1,]

  x y z
1 1 3 5
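Because the list elements were given names, they can also be selected by name. The sketch below contrasts double brackets, $, and single brackets on the same list; the object names on the left are illustrative.

```r
mylist <- list(vec = 1:6, df = data.frame(x = 1:2, y = 3:4, z = 5:6))

by_position <- mylist[[2]]   # the data frame itself, selected by position
by_name <- mylist$df         # the same element by name, like mylist[["df"]]
still_a_list <- mylist[2]    # single brackets return a list of length 1

print(class(by_position))
print(class(still_a_list))
```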

View the top 8 rows of mtcars data

head(mtcars, n = 8)

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

What type of object is the mtcars data set?

str(mtcars)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
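str() reports the object type on its first line. If only the type is needed, a more direct sketch is:

```r
obj_class <- class(mtcars)       # "data.frame"
is_df <- is.data.frame(mtcars)   # TRUE
storage <- typeof(mtcars)        # "list": a data frame is stored as a list of columns

print(obj_class)
print(is_df)
print(storage)
```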

How many rows are in iris data set? (try finding this using dim or indexing + length)

dim(iris)

[1] 150   5
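The prompt also mentions indexing plus length. A minimal sketch of that route, with illustrative object names:

```r
# the length of any single column is the number of rows
n_rows <- length(iris[, 1])

# dim() returns c(rows, columns), so its first element is the row count
n_rows_alt <- dim(iris)[1]

# the length of the data frame itself is the number of columns
n_cols <- length(iris)

print(c(n_rows, n_rows_alt, n_cols))
```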

Summarize the values in each column in iris data set

summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

Packages and Programming

Create a function that takes numeric input and provides the mean and a 95% confidence interval for the mean for the data (the t.test function could be useful)

mean_and_ci <- function(x) {
    themean <- mean(x)
    theci <- t.test(x)$conf.int
    
    return(list(mean = themean, ci = theci))
}

Add checks to your function to make sure the data is either numeric or logical. If it is logical convert it to numeric.

mean_and_ci <- function(x) {
    if (!is.numeric(x) && !is.logical(x)) stop("Need logical or numeric data")
    
    x <- as.numeric(x)
    
    themean <- mean(x)
    theci <- t.test(x)$conf.int
    
    return(list(mean = themean, ci = theci))
}
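A quick way to check the new behaviour is to call the function once with logical input and once with input it should reject. The sketch below repeats the definition so it runs on its own; the test vectors are made up for illustration.

```r
mean_and_ci <- function(x) {
    if (!is.numeric(x) && !is.logical(x)) stop("Need logical or numeric data")
    x <- as.numeric(x)
    list(mean = mean(x), ci = t.test(x)$conf.int)
}

# logical input is converted to 0/1, so the mean is the proportion of TRUEs
res <- mean_and_ci(c(TRUE, TRUE, FALSE, TRUE))
print(res$mean)

# character input triggers the check; tryCatch captures the error message
err <- tryCatch(
    mean_and_ci(c("a", "b")),
    error = function(e) conditionMessage(e)
)
print(err)
```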

Loop over the columns of the diamonds data set and apply your function to all of the numeric columns.

# diamonds ships with ggplot2, which is loaded as part of the tidyverse
for (i in colnames(diamonds)) {
  column <- diamonds[[i]]
  if (is.numeric(column)) {
    print(mean_and_ci(column))
  } else {
    print("nope")
  }
}

$mean
[1] 0.7979397

$ci
[1] 0.7939395 0.8019400
attr(,"conf.level")
[1] 0.95

[1] "nope"
[1] "nope"
[1] "nope"
$mean
[1] 61.7494

$ci
[1] 61.73731 61.76150
attr(,"conf.level")
[1] 0.95

$mean
[1] 57.45718

$ci
[1] 57.43833 57.47604
attr(,"conf.level")
[1] 0.95

$mean
[1] 3932.8

$ci
[1] 3899.132 3966.467
attr(,"conf.level")
[1] 0.95

$mean
[1] 5.731157

$ci
[1] 5.721690 5.740624
attr(,"conf.level")
[1] 0.95

$mean
[1] 5.734526

$ci
[1] 5.724887 5.744165
attr(,"conf.level")
[1] 0.95

$mean
[1] 3.538734

$ci
[1] 3.532778 3.544689
attr(,"conf.level")
[1] 0.95
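The same task can be written without an explicit loop. The sketch below is one possible alternative, not the workshop's solution: it uses the built-in iris data in place of diamonds so that it runs without extra packages, keeps only the numeric columns with Filter(), and applies the function with lapply().

```r
mean_and_ci <- function(x) {
    list(mean = mean(x), ci = t.test(x)$conf.int)
}

# keep only the numeric columns (drops Species)
numeric_cols <- Filter(is.numeric, iris)

# apply the function to each remaining column; the result is a named list
results <- lapply(numeric_cols, mean_and_ci)

print(names(results))
print(results$Sepal.Length$mean)
```

The result is a list with one entry per numeric column, which is easier to reuse later than printed output.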