Generalized Linear Models (and more!)

Outline

  • What if my response does not follow a normal distribution?

  • Binomial Example

  • Count Example

  • Final Remarks

Generalized Linear Models (Basics)

If your data is not normally distributed, it will violate the model assumptions of a linear model. You must consider the type of data you have!

Model Scale verses Data Scale

Common Distributions

  • Binomial (binary data or number of successes out of total)

  • Poisson (count data)

  • Negative Binomial (count data)

  • Gamma (right skewed continuous data)

  • Beta (continuous proportion/probability)

GLMS and GLMM’s in R

Packages

library('lme4')
library('emmeans')

Generalized Linear Model (GLM)

help(glm)

Generalized Linear Mixed Model (GLMM)

help(glmer)

Binomial Example

Insects

Data The number of insects killed at various doses of insecticide.

insects_data <- read_csv("https://unl-statistics.github.io/R-workshops/r-modeling/data/insects.csv")
head(insects_data)
Rep Killed Number Insecticide Deposit
1 3 50 A 2.00
1 5 49 A 2.64
1 19 47 A 3.48
1 19 38 A 4.59
  • Killed the number of insects killed at each poison level; a numeric vector
  • Number the number of insects exposed at each poison level; a numeric vector
  • Insecticide the insecticide used; a factor with levels A, B and C
  • Deposit the amount of deposit (insecticide) used in milligrams; a numeric vector

Insects EDA

 Rep        Killed          Number      Insecticide    Deposit     
 1:18   Min.   : 1.00   Min.   :26.00   A:18        Min.   :2.000  
 2:18   1st Qu.:17.00   1st Qu.:48.00   B:18        1st Qu.:2.640  
 3:18   Median :29.00   Median :50.00   C:18        Median :4.035  
        Mean   :27.78   Mean   :47.48               Mean   :4.462  
        3rd Qu.:40.75   3rd Qu.:50.00               3rd Qu.:6.060  
        Max.   :50.00   Max.   :50.00               Max.   :8.000  


Insects Model

Statistical notation

\[y_{ij} \sim \text{Binomial}(\pi_{ij}, N_{ij})\] where \[\log\large(\frac{\pi_{ij}}{1-\pi_{ij}}\large) = \eta_{ij} = (\beta_0 + \tau_i) + (\beta_1 + \beta_{1i})\cdot x_{ij} + (\beta_2 + \beta_{2i})\cdot x_{ij}^2\]

  • \(y_{ij}\) is the number of insects killed for the \(i^{th}\) Insecticide given the \(j^{th}\) deposit.

  • \(N_{ij}\) is the total number of insects that started.

  • \(\eta\) is the overall mean on the model scale.

  • \(\tau_i\) is the effect of the \(i^{th}\) habitat.

  • \(b_j\) is the random variation due to the \(j^{th}\) site.

  • \(logit = \log(\frac{\pi}{1-\pi})\) data scale \(\implies\) model scale (link function)

  • \(\frac{1}{(1+e^{-\eta})}\) model scale \(\implies\) data scale (ilink function)

Insects GLM

 insects_glm <- glm(Killed/Number ~ Insecticide*Deposit + Insecticide*I(Deposit^2),
     weights = Number, #<<
     data = insects_data,
     family = "binomial") #<<
car::Anova(insects_glm)
Analysis of Deviance Table (Type II tests)

Response: Killed/Number
                         LR Chisq Df Pr(>Chisq)    
Insecticide                639.65  2     <2e-16 ***
Deposit                    197.82  1     <2e-16 ***
I(Deposit^2)               104.08  1     <2e-16 ***
Insecticide:Deposit          0.28  2     0.8711    
Insecticide:I(Deposit^2)     0.30  2     0.8612    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insects GLM

summary(insects_glm)

Call:
glm(formula = Killed/Number ~ Insecticide * Deposit + Insecticide * 
    I(Deposit^2), family = "binomial", data = insects_data, weights = Number)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -6.913880   0.687916 -10.050  < 2e-16 ***
InsecticideB               0.676798   0.911342   0.743  0.45770    
InsecticideC               3.053194   1.010612   3.021  0.00252 ** 
Deposit                    2.371302   0.287991   8.234  < 2e-16 ***
I(Deposit^2)              -0.175414   0.026636  -6.586 4.53e-11 ***
InsecticideB:Deposit      -0.154556   0.385324  -0.401  0.68834    
InsecticideC:Deposit       0.066789   0.498966   0.134  0.89352    
InsecticideB:I(Deposit^2)  0.016766   0.036163   0.464  0.64292    
InsecticideC:I(Deposit^2) -0.005033   0.052369  -0.096  0.92344    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1326.798  on 53  degrees of freedom
Residual deviance:   86.522  on 45  degrees of freedom
AIC: 278.27

Number of Fisher Scoring iterations: 5

Insects results

insects_emmeangrid <- ref_grid(insects_glm, at = list(Deposit = unique(insects_data$Deposit))) #<<
insects_emmeans <- emmeans(insects_emmeangrid, specs =~Insecticide:Deposit, type = "response")
insects_emmeans
Insecticide Deposit prob SE df asymp.LCL asymp.UCL
A 2.00 0.054 0.012 Inf 0.034 0.083
B 2.00 0.080 0.015 Inf 0.055 0.116
C 2.00 0.573 0.037 Inf 0.499 0.643
A 2.64 0.133 0.018 Inf 0.102 0.172
B 2.64 0.184 0.020 Inf 0.148 0.226
C 2.64 0.789 0.020 Inf 0.746 0.826

Compare Insecticides

insects_pairs <- emmeans(insects_emmeangrid, specs =~Insecticide | Deposit, type = "response")
pairs(insects_pairs, infer = c(T,T))
Deposit = 2:
 contrast odds.ratio      SE  df asymp.LCL asymp.UCL null z.ratio p.value
 A / B        0.6474 0.20481 Inf   0.30846    1.3589    1  -1.374  0.3544
 A / C        0.0421 0.01198 Inf   0.02165    0.0820    1 -11.141  <.0001
 B / C        0.0651 0.01661 Inf   0.03580    0.1184    1 -10.707  <.0001

Deposit = 2.64:
 contrast odds.ratio      SE  df asymp.LCL asymp.UCL null z.ratio p.value
 A / B        0.6800 0.13910 Inf   0.42104    1.0983    1  -1.885  0.1429
 A / C        0.0410 0.00809 Inf   0.02581    0.0651    1 -16.184  <.0001
 B / C        0.0603 0.01091 Inf   0.03944    0.0921    1 -15.525  <.0001

Deposit = 3.48:
 contrast odds.ratio      SE  df asymp.LCL asymp.UCL null z.ratio p.value
 A / B        0.7104 0.10409 Inf   0.50388    1.0015    1  -2.334  0.0513
 A / C        0.0398 0.00892 Inf   0.02351    0.0673    1 -14.375  <.0001
 B / C        0.0560 0.01229 Inf   0.03347    0.0936    1 -13.132  <.0001

Deposit = 4.59:
 contrast odds.ratio      SE  df asymp.LCL asymp.UCL null z.ratio p.value
 A / B        0.7257 0.12800 Inf   0.47999    1.0972    1  -1.818  0.1637
 A / C        0.0386 0.01159 Inf   0.01912    0.0780    1 -10.843  <.0001
 B / C        0.0532 0.01568 Inf   0.02669    0.1061    1  -9.960  <.0001

Deposit = 6.06:
 contrast odds.ratio      SE  df asymp.LCL asymp.UCL null z.ratio p.value
 A / B        0.7005 0.13204 Inf   0.45039    1.0896    1  -1.888  0.1421
 A / C        0.0379 0.01316 Inf   0.01679    0.0855    1  -9.424  <.0001
 B / C        0.0541 0.01847 Inf   0.02429    0.1204    1  -8.542  <.0001

Deposit = 8:
 contrast odds.ratio      SE  df asymp.LCL asymp.UCL null z.ratio p.value
 A / B        0.5985 0.15779 Inf   0.32261    1.1102    1  -1.947  0.1257
 A / C        0.0382 0.02450 Inf   0.00849    0.1717    1  -5.090  <.0001
 B / C        0.0638 0.04130 Inf   0.01399    0.2909    1  -4.251  0.0001

Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates 
Intervals are back-transformed from the log odds ratio scale 
P value adjustment: tukey method for comparing a family of 3 estimates 
Tests are performed on the log odds ratio scale 

Insects results plot

insects_predgrid <- ref_grid(insects_glm, at = list(Deposit = seq(min(insects_data$Deposit), max(insects_data$Deposit), 0.2)))
insects_preds <- emmeans(insects_predgrid, specs =~Insecticide:Deposit, type = "response")
insects_preds
Insecticide Deposit prob SE df asymp.LCL asymp.UCL
A 2.0 0.054 0.012 Inf 0.034 0.083
B 2.0 0.080 0.015 Inf 0.055 0.116
C 2.0 0.573 0.037 Inf 0.499 0.643
A 2.2 0.073 0.014 Inf 0.049 0.106
B 2.2 0.106 0.017 Inf 0.077 0.145
C 2.2 0.652 0.029 Inf 0.594 0.707

Insects results plot

insects_preds %>%
  as_tibble() %>%
  ggplot(aes(x = Deposit, shape = Insecticide, color = Insecticide, fill = Insecticide)) +
  geom_line(aes(y = prob)) + #<<
  geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), color = NA, alpha = 0.2) + #<<
  geom_point(data = insects_data, aes(y = Killed/Number), position = position_jitter(width = 0.2, height = 0)) + #<<
  theme_bw() +
  theme(aspect.ratio = 1) +
  scale_color_locuszoom() +
  scale_fill_locuszoom() +
  scale_y_continuous("Mortality (%)", labels = scales::percent)

Count Example

Birds

An ecologist is investigating the impact of habitat type on the population of a certain type of bird. To do this, data were obtained from a study in which 24 randomly sampled sites were observed. Each site contains habitats of both types. Bird counts were obtained from each habitat type.

birds_data <- read_csv("https://unl-statistics.github.io/R-workshops/r-modeling/data/birds.csv")
head(birds_data) 
Site Habitat Count
1 1 27
1 2 209
2 1 1
2 2 0
3 1 12
3 2 21

Birds EDA

summary(birds_data)
      Site          Habitat        Count       
 Min.   : 1.00   Min.   :1.0   Min.   :  0.00  
 1st Qu.: 6.75   1st Qu.:1.0   1st Qu.:  6.75  
 Median :12.50   Median :1.5   Median : 18.00  
 Mean   :12.50   Mean   :1.5   Mean   : 76.83  
 3rd Qu.:18.25   3rd Qu.:2.0   3rd Qu.: 51.50  
 Max.   :24.00   Max.   :2.0   Max.   :989.00  
factorCols <- c("Site", "Habitat")
birds_data[,factorCols] <- lapply(birds_data[,factorCols], factor)
summary(birds_data)
      Site    Habitat     Count       
 1      : 2   1:24    Min.   :  0.00  
 2      : 2   2:24    1st Qu.:  6.75  
 3      : 2           Median : 18.00  
 4      : 2           Mean   : 76.83  
 5      : 2           3rd Qu.: 51.50  
 6      : 2           Max.   :989.00  
 (Other):36                           

Birds EDA

Data Scale

“Model Scale”

Birds Model

In words

\[\text{"log(Bird Count) = overall model scale mean + habitat + random site + scale"}\]

Statistical notation

\[y_i | b_j \sim \text{Negative Binomial}(\mu_{ij}, \phi)\] where \[\log(\mu_{ij}) = \eta_{ij} = \eta + \tau_i + b_j\]

  • \(y_{ij}\) is bird count for the \(i^{th}\) habitat at the \(j^{th}\) site.

  • \(\eta\) is the overall mean on the model scale.

  • \(\tau_i\) is the effect of the \(i^{th}\) habitat.

  • \(b_j\) is the random variation due to the \(j^{th}\) site.

  • log: data scale \(\implies\) model scale (link function)

  • exp: model scale \(\implies\) data scale (ilink function)

Birds GLMM

library(optimx)
birds_nb <- glmer.nb(Count ~ Habitat + (1|Site),
                     data = birds_data,
                     nAGQ = 100)
car::Anova(birds_nb)
Analysis of Deviance Table (Type II Wald chisquare tests)

Response: Count
         Chisq Df Pr(>Chisq)
Habitat 0.0095  1     0.9224

Birds GLMM results

summary(birds_nb)
Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 100) [glmerMod]
 Family: Negative Binomial(0.0849)  ( log )
Formula: Count ~ Habitat + (1 | Site)
   Data: birds_data

     AIC      BIC   logLik deviance df.resid 
    20.9     28.4     -6.5     12.9       44 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-0.2912 -0.2663 -0.2262 -0.0861  3.6476 

Random effects:
 Groups Name        Variance  Std.Dev. 
 Site   (Intercept) 1.444e-12 1.202e-06
Number of obs: 48, groups:  Site, 24

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   4.3888     0.7009   6.262  3.8e-10 ***
Habitat2     -0.0966     0.9912  -0.097    0.922    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr)
Habitat2 -0.707
optimizer (Nelder_Mead) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')

Birds GLMM means

birds_emmeans <- emmeans(birds_nb, specs = ~ Habitat, 
                         type = "response") #<<
birds_emmeans
Habitat response SE df asymp.LCL asymp.UCL
1 80.54 56.45 Inf 20.39 318.12
2 73.13 51.25 Inf 18.51 288.85

Birds GLMM compare habitats

\[H_0: \mu_1 = \mu_2 \text{ (The mean bird count for habitat 1 is equal to habitat 2)}\]

\[H_A: \mu_1 \ne \mu_2 \text{ (The mean bird count for habitat 1 is not equal to habitat 2)}\]

pairs(birds_emmeans, infer = c(T,T))
 contrast            ratio   SE  df asymp.LCL asymp.UCL null z.ratio p.value
 Habitat1 / Habitat2   1.1 1.09 Inf     0.158      7.69    1   0.097  0.9224

Confidence level used: 0.95 
Intervals are back-transformed from the log scale 
Tests are performed on the log scale 

Birds GLMM compare habitats

\[H_0: \mu_1 = \mu_2 \text{ (The mean bird count for habitat 1 is equal to habitat 2)}\]

\[H_A: \mu_1 \ne \mu_2 \text{ (The mean bird count for habitat 1 is not equal to habitat 2)}\]

library(multcomp)
library(multcompView)
library(stringr)
birds_letters <- cld(birds_emmeans, Letters = letters, alpha = 0.5) %>% #<<
  mutate(.group = toupper(str_trim(.group))) #<<
birds_letters 
Habitat response SE df asymp.LCL asymp.UCL .group
2 73.13 51.25 Inf 18.51 288.85 A
1 80.54 56.45 Inf 20.39 318.12 A

Count Example: Birds results plot

birds_letters %>%
  as_tibble() %>% #<<
  ggplot(aes(x = Habitat, y = response)) +
  geom_point() +
  geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL), width = 0.1) +
  geom_text(aes(label = .group, y = asymp.UCL), vjust = -0.5, size = 6) + #<<
  geom_point(data = birds_data, aes(y = Count), shape = 1, position = position_jitter(width = 0.2, height = 0)) + #<<
  theme_bw() +
  theme(aspect.ratio = 1) +
  ylab("Number of Birds")

Final Remarks

Models and functions

Model Distribution Experimental Design R Function
Linear Model Normal CRD lm()
Linear Mixed Model Normal Blocking (or more) lmer()
Generalized Linear Model Non-normal CRD glm()
Generalized Linear Mixed Model Non-normal Blocking (or more) glmer()

Model Specifications

Image source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html

Additional models and functions

Repeated Measures

  • Take multiple measures on the same object over a period of time
  • lme() in nlme package

Nonlinear Regression

  • Does not follow the typical polynomial framework
  • nls() in base R, stats package

Generalized Additive Models (GAM)

  • Estimates smoothing splines
  • gam() in mcgv package