Regression modeling workflows

ID 529: Data Management and Analytic Workflows in R

Jarvis Chen




Thursday, January 16, 2025

Follow along



https://bit.ly/id529_regression_models

  • Clone the repository and open it as an R Project in RStudio.

  • Open the id529_day4_regression_models.R script. You can follow along, annotate, and/or run the code in your own R session.

Some goals for today


  • Practice reading someone else’s R code and annotating to help you understand what is going on

  • Review and consolidate some of the concepts we’ve been learning about

    • Use a dplyr workflow to prepare our dataset for analysis
    • Write a quick function (using concepts of functional programming)
    • Use ggplot to make a figure
  • Fit some regression models

    • lm() for linear regression
    • glm() for generalized linear regression
    • What arguments do these functions take?
    • What is contained in the resulting model objects?

Some goals for today

  • Learn how to extract output of interest from model objects

    • using base R
    • using broom::tidy, broom::augment, and broom::glance
  • Learn how to create some pretty tables

    • Learn about the gtsummary package

Priority List


What to prioritize in understanding the code in the example

  • Using dplyr code for data cleaning/management

  • Calling lm() and glm()

  • Using summary(), coef(), confint(), and broom::tidy() to extract and summarize coefficients.

  • Writing our own function to extract coefficients and output to a tibble

  • Using anova() to compare models

  • Using predict() and broom::augment() to extract predictions and residuals

  • Using broom::glance() to extract model fit statistics

Data preparation

  • For this example, we will use the NHANES data from the NHANES package. Note that this is different from the nhanes_id529 data in the ID529data package.
library(tidyverse)
library(NHANES)

df <- NHANES |>
  # Remember that we have to restrict to people 25 and above
  filter(Age >= 25) |>
  # recoding of the variables we're going to use
  mutate(agecat = case_when(
      Age < 35 ~ "25-34",
      Age >= 35 & Age < 45 ~ "35-44",
      Age >= 45 & Age < 55 ~ "45-54",
      Age >= 55 & Age < 65 ~ "55-64",
      Age >= 65 & Age < 75 ~ "65-74",
      Age >= 75 ~ "75+"),
    # We want College Grad to be the reference category for education, so we'll
    # re-order the factor so that it is reversed from the way it came in the NHANES dataset
    Education = factor(Education, 
                       levels = rev(levels(NHANES$Education))),
    # Here we collapse Hispanic and Mexican into the Hispanic category
    racecat = factor(case_when(
      Race1 %in% c("Hispanic", "Mexican") ~ "Hispanic",
      Race1 %in% c("Asian", "Other") ~ "Other Non-Hispanic",
      Race1 == "Black" ~ "Black Non-Hispanic",
      Race1 == "White" ~ "White Non-Hispanic"), 
      levels = c("White Non-Hispanic", "Black Non-Hispanic", "Hispanic", "Other Non-Hispanic"))
  ) |>
  # select just the variables we are going to use in the analysis
  select(ID, SurveyYr, Gender, Age, agecat, Education, racecat, BPSysAve, SmokeNow)

A basic call to lm()

  • You’ve probably seen something like this in a basic statistics class
lm_model1 <- lm(BPSysAve ~ factor(Education), 
                data=df)
  • This creates an object called lm_model1 where we are storing the results of having fit a linear regression model with BPSysAve as the dependent variable and categories of Education as the independent variables.

  • To see the actual results, we have to do something like

print(lm_model1)

Call:
lm(formula = BPSysAve ~ factor(Education), data = df)

Coefficients:
                    (Intercept)    factor(Education)Some College  
                        118.563                            3.750  
   factor(Education)High School  factor(Education)9 - 11th Grade  
                          5.140                            4.688  
     factor(Education)8th Grade  
                          7.921  

Summarizing the model object

summary(lm_model1)

Call:
lm(formula = BPSysAve ~ factor(Education), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.252 -11.563  -1.703   8.686 103.686 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     118.5634     0.3907 303.467  < 2e-16 ***
factor(Education)Some College     3.7503     0.5588   6.712 2.09e-11 ***
factor(Education)High School      5.1401     0.6190   8.303  < 2e-16 ***
factor(Education)9 - 11th Grade   4.6882     0.7268   6.451 1.20e-10 ***
factor(Education)8th Grade        7.9206     0.9414   8.414  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.28 on 6319 degrees of freedom
  (247 observations deleted due to missingness)
Multiple R-squared:  0.01887,   Adjusted R-squared:  0.01824 
F-statistic: 30.38 on 4 and 6319 DF,  p-value: < 2.2e-16

Other ways of summarizing the model

anova(lm_model1)
Analysis of Variance Table

Response: BPSysAve
                    Df  Sum Sq Mean Sq F value    Pr(>F)    
factor(Education)    4   36277  9069.3  30.376 < 2.2e-16 ***
Residuals         6319 1886669   298.6                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exploring the model object more fully

class(lm_model1)
[1] "lm"
  • It’s an object of class “lm”, but it is also a list: it contains elements of different types and different lengths
names(lm_model1)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "contrasts"     "xlevels"       "call"         
[13] "terms"         "model"        
  • Note that summary(lm_model1) is also a list that has different elements from lm_model1!
names(summary(lm_model1))
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"    

Exploring the model object more fully

  • Note the difference between the list item called coefficients in the lm_model1 object and the list item called coefficients in the summary(lm_model1) object
lm_model1$coefficients
                    (Intercept)   factor(Education)Some College 
                     118.563395                        3.750341 
   factor(Education)High School factor(Education)9 - 11th Grade 
                       5.140080                        4.688178 
     factor(Education)8th Grade 
                       7.920635 
summary(lm_model1)$coefficients
                                  Estimate Std. Error    t value     Pr(>|t|)
(Intercept)                     118.563395  0.3906963 303.466895 0.000000e+00
factor(Education)Some College     3.750341  0.5587681   6.711802 2.089606e-11
factor(Education)High School      5.140080  0.6190317   8.303420 1.225813e-16
factor(Education)9 - 11th Grade   4.688178  0.7267768   6.450643 1.196256e-10
factor(Education)8th Grade        7.920635  0.9413995   8.413681 4.864442e-17

Extracting quantities of interest

# point estimates
coef(lm_model1)
                    (Intercept)   factor(Education)Some College 
                     118.563395                        3.750341 
   factor(Education)High School factor(Education)9 - 11th Grade 
                       5.140080                        4.688178 
     factor(Education)8th Grade 
                       7.920635 
# confidence limits
confint(lm_model1)
                                     2.5 %     97.5 %
(Intercept)                     117.797497 119.329292
factor(Education)Some College     2.654966   4.845717
factor(Education)High School      3.926568   6.353593
factor(Education)9 - 11th Grade   3.263448   6.112907
factor(Education)8th Grade        6.075172   9.766097
  • We might be interested in looking at the coefficients and their 95% CIs together
cbind(coef(lm_model1), confint(lm_model1))
                                                2.5 %     97.5 %
(Intercept)                     118.563395 117.797497 119.329292
factor(Education)Some College     3.750341   2.654966   4.845717
factor(Education)High School      5.140080   3.926568   6.353593
factor(Education)9 - 11th Grade   4.688178   3.263448   6.112907
factor(Education)8th Grade        7.920635   6.075172   9.766097

Do you want to write a function?

  • Let’s write our own function to extract point estimates and 95% CI from the model object.
f_get_coefficients <- function(model){
  
  # grab the names of the effects in the model
  get_names <- names(coef(model))
  
  # grab the coefficients and the 95% confidence limits
  # and put them in a matrix
  estimates_and_cis <- cbind(coef(model), confint(model))
  
  # put everything into a tibble and return it
  return(tibble(term = get_names, 
         estimate = estimates_and_cis[,1],
         lci = estimates_and_cis[,2],
         uci = estimates_and_cis[,3]))
}
f_get_coefficients(lm_model1)
# A tibble: 5 × 4
  term                            estimate    lci    uci
  <chr>                              <dbl>  <dbl>  <dbl>
1 (Intercept)                       119.   118.   119.  
2 factor(Education)Some College       3.75   2.65   4.85
3 factor(Education)High School        5.14   3.93   6.35
4 factor(Education)9 - 11th Grade     4.69   3.26   6.11
5 factor(Education)8th Grade          7.92   6.08   9.77

broom::tidy

  • The broom package has a function called tidy that extracts model output and puts it in tibble format
broom::tidy(lm_model1)
# A tibble: 5 × 5
  term                            estimate std.error statistic  p.value
  <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                       119.       0.391    303.   0       
2 factor(Education)Some College       3.75     0.559      6.71 2.09e-11
3 factor(Education)High School        5.14     0.619      8.30 1.23e-16
4 factor(Education)9 - 11th Grade     4.69     0.727      6.45 1.20e-10
5 factor(Education)8th Grade          7.92     0.941      8.41 4.86e-17
  • We can add the confidence limits to the tibble using the conf.int=TRUE argument
broom::tidy(lm_model1, conf.int=TRUE)
# A tibble: 5 × 7
  term                  estimate std.error statistic  p.value conf.low conf.high
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)             119.       0.391    303.   0          118.      119.  
2 factor(Education)Som…     3.75     0.559      6.71 2.09e-11     2.65      4.85
3 factor(Education)Hig…     5.14     0.619      8.30 1.23e-16     3.93      6.35
4 factor(Education)9 -…     4.69     0.727      6.45 1.20e-10     3.26      6.11
5 factor(Education)8th…     7.92     0.941      8.41 4.86e-17     6.08      9.77

Subsetting data in calls to lm()

  • Many regression functions in R allow us to subset or restrict the data to a particular group
lm_model1_female <- lm(BPSysAve ~ factor(Education), 
                data=df,
                subset=(Gender=="female"))
broom::tidy(lm_model1_female)
# A tibble: 5 × 5
  term                            estimate std.error statistic  p.value
  <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                       116.       0.568    204.   0       
2 factor(Education)Some College       5.20     0.811      6.42 1.60e-10
3 factor(Education)High School        6.42     0.904      7.10 1.57e-12
4 factor(Education)9 - 11th Grade     6.58     1.11       5.94 3.15e- 9
5 factor(Education)8th Grade          7.99     1.44       5.53 3.37e- 8
  • We could accomplish something similar by calling
lm_model1_female <- lm(BPSysAve ~ factor(Education), 
                data=df |> filter(Gender=="female"))

Restricting to non-missing observations

  • Most regression functions will automatically drop observations with NA values. Even so, we sometimes want to explicitly filter out observations that are missing any of the covariates we plan to use, so that the models we build along the way are fit to the same set of observations and can be compared fairly.
df_completecase <- df |>
  filter(!is.na(BPSysAve) & !is.na(agecat) & !is.na(Gender) & !is.na(racecat))
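As an aside, base R offers complete.cases() for the same purpose. Here is a minimal, self-contained sketch using the built-in airquality data (which has NAs in Ozone and Solar.R), not the NHANES example:

```r
# Base-R alternative: keep only rows with no NAs in the chosen columns.
keep <- complete.cases(airquality[, c("Ozone", "Solar.R")])
aq_completecase <- airquality[keep, ]

nrow(airquality)      # 153 rows in total
nrow(aq_completecase) # 111 rows with both Ozone and Solar.R observed
```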

Fitting multiple models

lm_model1 <- lm(BPSysAve ~ factor(Education), 
                data=df_completecase)
lm_model2 <- lm(BPSysAve ~ factor(Education) + factor(agecat) + Gender, 
                data=df_completecase)
lm_model3 <- lm(BPSysAve ~ factor(Education) + factor(agecat) + Gender + factor(racecat), 
                data=df_completecase)

Models with interactions

  • We also think we should try to fit a model with an interaction between gender and racialized group
lm_model4a <- lm(BPSysAve ~ factor(Education) + factor(agecat) + Gender*factor(racecat), 
                 data=df_completecase)
broom::tidy(lm_model4a)
# A tibble: 17 × 5
   term                                   estimate std.error statistic   p.value
   <chr>                                     <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                             109.        0.598   183.    0        
 2 factor(Education)Some College             2.89      0.512     5.65  1.69e-  8
 3 factor(Education)High School              3.41      0.572     5.97  2.46e-  9
 4 factor(Education)9 - 11th Grade           2.67      0.677     3.94  8.31e-  5
 5 factor(Education)8th Grade                3.68      0.929     3.96  7.58e-  5
 6 factor(agecat)35-44                       4.06      0.616     6.59  4.83e- 11
 7 factor(agecat)45-54                       6.64      0.614    10.8   4.70e- 27
 8 factor(agecat)55-64                      12.7       0.649    19.6   8.15e- 83
 9 factor(agecat)65-74                      15.8       0.736    21.5   3.14e- 99
10 factor(agecat)75+                        23.8       0.832    28.7   6.50e-170
11 Gendermale                                3.70      0.477     7.74  1.11e- 14
12 factor(racecat)Black Non-Hispanic         4.79      0.903     5.30  1.18e-  7
13 factor(racecat)Hispanic                  -0.716     0.904    -0.792 4.28e-  1
14 factor(racecat)Other Non-Hispanic        -3.03      1.08     -2.81  4.95e-  3
15 Gendermale:factor(racecat)Black Non-H…   -0.758     1.30     -0.582 5.61e-  1
16 Gendermale:factor(racecat)Hispanic        1.40      1.19      1.18  2.39e-  1
17 Gendermale:factor(racecat)Other Non-H…    3.63      1.55      2.34  1.92e-  2

Models with interactions


  • When we specify the interaction in the formula using *, we get the main effects of Gender and factor(racecat) as well as their interaction terms
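As a minimal, self-contained sketch (using the built-in mtcars data rather than the NHANES example), a*b in a formula expands to a + b + a:b, so these two fits are identical:

```r
# a*b is shorthand for a + b + a:b; both calls fit the same model.
# Sketch with mtcars, treating am and vs as two-level factors.
fit_star <- lm(mpg ~ factor(am) * factor(vs), data = mtcars)
fit_long <- lm(mpg ~ factor(am) + factor(vs) + factor(am):factor(vs), data = mtcars)

all.equal(coef(fit_star), coef(fit_long)) # TRUE
```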

Models with interactions

  • Another way to specify the interaction is to use the interaction() function to create a categorical variable representing the cross-classified categories
  • Useful when we want to compare to a common referent category.
lm_model4b <- lm(BPSysAve ~ factor(Education) + factor(agecat) + interaction(Gender, factor(racecat)), 
                 data=df_completecase)
broom::tidy(lm_model4b)
# A tibble: 17 × 5
   term                                   estimate std.error statistic   p.value
   <chr>                                     <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                             109.        0.598   183.    0        
 2 factor(Education)Some College             2.89      0.512     5.65  1.69e-  8
 3 factor(Education)High School              3.41      0.572     5.97  2.46e-  9
 4 factor(Education)9 - 11th Grade           2.67      0.677     3.94  8.31e-  5
 5 factor(Education)8th Grade                3.68      0.929     3.96  7.58e-  5
 6 factor(agecat)35-44                       4.06      0.616     6.59  4.83e- 11
 7 factor(agecat)45-54                       6.64      0.614    10.8   4.70e- 27
 8 factor(agecat)55-64                      12.7       0.649    19.6   8.15e- 83
 9 factor(agecat)65-74                      15.8       0.736    21.5   3.14e- 99
10 factor(agecat)75+                        23.8       0.832    28.7   6.50e-170
11 interaction(Gender, factor(racecat))m…    3.70      0.477     7.74  1.11e- 14
12 interaction(Gender, factor(racecat))f…    4.79      0.903     5.30  1.18e-  7
13 interaction(Gender, factor(racecat))m…    7.73      0.952     8.12  5.69e- 16
14 interaction(Gender, factor(racecat))f…   -0.716     0.904    -0.792 4.28e-  1
15 interaction(Gender, factor(racecat))m…    4.38      0.858     5.11  3.40e-  7
16 interaction(Gender, factor(racecat))f…   -3.03      1.08     -2.81  4.95e-  3
17 interaction(Gender, factor(racecat))m…    4.30      1.12      3.82  1.33e-  4

Models with interactions

  • Another way to specify the interaction is with the / (nesting) operator. In this example, factor(racecat)/Gender gives us the gender effect WITHIN each racialized group category

lm_model4c <- lm(BPSysAve ~ factor(Education) + factor(agecat) + factor(racecat)/Gender, 
                 data=df_completecase)
broom::tidy(lm_model4c)
# A tibble: 17 × 5
   term                                   estimate std.error statistic   p.value
   <chr>                                     <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                             109.        0.598   183.    0        
 2 factor(Education)Some College             2.89      0.512     5.65  1.69e-  8
 3 factor(Education)High School              3.41      0.572     5.97  2.46e-  9
 4 factor(Education)9 - 11th Grade           2.67      0.677     3.94  8.31e-  5
 5 factor(Education)8th Grade                3.68      0.929     3.96  7.58e-  5
 6 factor(agecat)35-44                       4.06      0.616     6.59  4.83e- 11
 7 factor(agecat)45-54                       6.64      0.614    10.8   4.70e- 27
 8 factor(agecat)55-64                      12.7       0.649    19.6   8.15e- 83
 9 factor(agecat)65-74                      15.8       0.736    21.5   3.14e- 99
10 factor(agecat)75+                        23.8       0.832    28.7   6.50e-170
11 factor(racecat)Black Non-Hispanic         4.79      0.903     5.30  1.18e-  7
12 factor(racecat)Hispanic                  -0.716     0.904    -0.792 4.28e-  1
13 factor(racecat)Other Non-Hispanic        -3.03      1.08     -2.81  4.95e-  3
14 factor(racecat)White Non-Hispanic:Gen…    3.70      0.477     7.74  1.11e- 14
15 factor(racecat)Black Non-Hispanic:Gen…    2.94      1.21      2.42  1.55e-  2
16 factor(racecat)Hispanic:Gendermale        5.09      1.09      4.68  2.92e-  6
17 factor(racecat)Other Non-Hispanic:Gen…    7.33      1.48      4.96  7.07e-  7

Predictions

  • A common task we might want to do having fit a model is to generate model-based predictions

  • We can use the predict() function to predict the fitted BPSysAve for everyone in the original dataset

# predicting on the original dataset
head(predict(lm_model4b, 
        interval = "prediction"))
       fit      lwr      upr
1 116.2887 85.47004 147.1074
2 116.2887 85.47004 147.1074
3 116.2887 85.47004 147.1074
4 118.7097 87.89539 149.5239
5 115.8203 85.00795 146.6326
6 115.8203 85.00795 146.6326

Predictions

  • We could also predict for a set of new observations
  • The new dataset needs to contain values for all of the covariates that were included in the original model.
  • Here, we predict average systolic blood pressure for Black Non-Hispanic women aged 45-54, one at each of the five education levels.
data_to_predict <- data.frame(Education = c("8th Grade", 
                                            "9 - 11th Grade", 
                                            "High School", 
                                            "Some College", 
                                            "College Grad"),
                      Gender = rep("female",5),
                      agecat = rep("45-54", 5),
                      racecat = rep("Black Non-Hispanic", 5))

print(data_to_predict)
       Education Gender agecat            racecat
1      8th Grade female  45-54 Black Non-Hispanic
2 9 - 11th Grade female  45-54 Black Non-Hispanic
3    High School female  45-54 Black Non-Hispanic
4   Some College female  45-54 Black Non-Hispanic
5   College Grad female  45-54 Black Non-Hispanic

Predictions

predict(lm_model4b, 
        newdata=data_to_predict,
        interval = "confidence")
       fit      lwr      upr
1 124.2857 121.8173 126.7540
2 123.2756 121.2338 125.3175
3 124.0219 122.0992 125.9446
4 123.4971 121.6112 125.3830
5 120.6077 118.6939 122.5215

Predictions using broom::augment()

  • The broom package can also do this for us using the augment() function
broom::augment(lm_model4b, newdata=data_to_predict, interval = "confidence")
# A tibble: 5 × 7
  Education      Gender agecat racecat            .fitted .lower .upper
  <chr>          <chr>  <chr>  <chr>                <dbl>  <dbl>  <dbl>
1 8th Grade      female 45-54  Black Non-Hispanic    124.   122.   127.
2 9 - 11th Grade female 45-54  Black Non-Hispanic    123.   121.   125.
3 High School    female 45-54  Black Non-Hispanic    124.   122.   126.
4 Some College   female 45-54  Black Non-Hispanic    123.   122.   125.
5 College Grad   female 45-54  Black Non-Hispanic    121.   119.   123.

Note that both predict() and broom::augment() can compute confidence intervals (for the mean response) and prediction intervals (for a new individual observation)
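A minimal sketch of the difference, using the built-in mtcars data: confidence intervals bound the mean response, while prediction intervals bound a new individual observation and are therefore wider.

```r
# Fit a simple model and compare the two interval types at the same new point.
fit <- lm(mpg ~ wt, data = mtcars)
newdat <- data.frame(wt = 3)

ci <- predict(fit, newdata = newdat, interval = "confidence")
pi <- predict(fit, newdata = newdat, interval = "prediction")

ci # narrower: uncertainty in the mean mpg at wt = 3
pi # wider: uncertainty in one new car's mpg at wt = 3
```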

Model fit statistics

  • We can also look at a variety of model fit statistics using broom::glance()
bind_rows(broom::glance(lm_model1),
  broom::glance(lm_model2),
  broom::glance(lm_model3),
  broom::glance(lm_model4b)) |>
  bind_cols(c("Model 1", "Model 2", "Model 3", "Model 4"))
# A tibble: 4 × 13
  r.squared adj.r.squared sigma statistic   p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1    0.0189        0.0182  17.3      30.4 4.46e- 25     4 -26991. 53994. 54035.
2    0.183         0.182   15.8     142.  1.78e-268    10 -26411. 52847. 52928.
3    0.190         0.188   15.7     114.  2.11e-276    13 -26386. 52801. 52902.
4    0.191         0.189   15.7      92.9 6.73e-275    16 -26382. 52800. 52921.
# ℹ 4 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
#   ...13 <chr>

Comparing models

  • You may recall from statistics classes that we can compare two nested linear regression models using an F test; the anova() function carries this out for us.
anova(lm_model4b, lm_model3)
Analysis of Variance Table

Model 1: BPSysAve ~ factor(Education) + factor(agecat) + interaction(Gender, 
    factor(racecat))
Model 2: BPSysAve ~ factor(Education) + factor(agecat) + Gender + factor(racecat)
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1   6307 1556063                              
2   6310 1557850 -3   -1787.3 2.4148 0.06461 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
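The same kind of nested-model F test can be sketched in a self-contained way with the built-in mtcars data (an illustration, not part of the NHANES analysis):

```r
# Does adding hp improve on a model with wt alone?
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_big   <- lm(mpg ~ wt + hp, data = mtcars)

anova(fit_small, fit_big) # small Pr(>F) suggests hp adds explanatory power
```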

Some takeaways

  • Like everything in R, when we fit regression models we assign the results to an object that we can then manipulate to extract the most relevant pieces of output

  • These are essentially list objects that contain different elements

  • Don’t be afraid to poke around and explore the elements of the model object.

  • We can often write our own functions to help us extract the elements we’re interested in.

  • The broom package, by David Robinson, has lots of powerful functions to turn models into tidy data. Once you have tidy data, you can manipulate them using many of the techniques you’ve been learning for managing dataframes and tibbles.
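For example (a minimal sketch using the built-in mtcars data), the tibble that broom::tidy() returns can be filtered and sorted with ordinary dplyr verbs:

```r
library(broom)
library(dplyr)

# Tidy the coefficients, drop the intercept, and sort by effect magnitude.
tidy_coefs <- lm(mpg ~ wt + hp + factor(am), data = mtcars) |>
  tidy(conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  arrange(desc(abs(estimate)))

tidy_coefs
```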

Further reading

  • Introduction to broom https://cran.r-project.org/web/packages/broom/vignettes/broom.html

  • broom.mixed documentation (useful if you are fitting random effects or mixed models) https://cran.r-project.org/web/packages/broom.mixed/vignettes/broom_mixed_intro.html