Day 5
Outline of topics:
- Reproducibility and Robustness
- Visualizing and Reporting on Regression
- QR Code Activity
- Data Linkage Methods
- Onikye et al Article
Find this script online here or written out below:
# know how to install packages:
# install.packages("tidyverse")
# set up a project so we could use the {here} package
# dependencies ------------------------------------------------------------
library(tidyverse)
library(broom)
library(here)
library(palmerpenguins)
library(gtsummary)
# read in data ------------------------------------------------------------
# we could use a csv dataset like this:
# df <- readr::read_csv(here("data.csv"))
# or use an example dataset like penguins from palmerpenguins:
# use View to look at it in RStudio
View(penguins)
# data manipulation -------------------------------------------------------
# use group_by and summarize together to create summary statistics per-group
<- penguins |>
penguins_summarized group_by(species) |>
summarize(
mean_flipper_length_mm = mean(flipper_length_mm, na.rm=TRUE))
# know how to use mutate to update columns (either creating new ones or updating
# existing ones):
# here, we'll just convert species to a character vector just for an example
# so then we can next practice making it a factor:
<- penguins |>
penguins mutate(species = as.character(species))
# convert a variable to a factor:
#
# method 1: base R
# here, the levels will be assumed from the output of unique(penguins$species):
$species <- factor(penguins$species)
penguins#
# method 2: dplyr
<- penguins |>
penguins mutate(species = factor(species))
# if I wanted to change the reference category, I could use relevel:
$species <- relevel(penguins$species, 'Chinstrap')
penguins
# or the dplyr way:
<- penguins |>
penguins mutate(species = relevel(species, 'Chinstrap'))
# you can also use forcats::fct_relevel
# data visualization ------------------------------------------------------
# use ggplot2 to make some graphics
# a scatter plot:
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point() +
ggtitle("Penguin Bill Lengths and Depths by Species")
# use ggsave to save your work
ggsave(here("output/penguins_scatterplot.png"), width = 7, height = 5)
# a histogram with facets:
ggplot(penguins, aes(x = flipper_length_mm)) +
geom_histogram() +
facet_wrap(~species) +
ggtitle("Penguin Bill Lengths and Depths by Species")
# again use ggsave and here() to save it within your project
ggsave(here("output/penguins_faceted_histogram.png"), width = 8, height = 3)
# you might also want to plot regression lines in ggplot quickly so
# use geom_smooth:
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point() +
geom_smooth(method = 'lm') +
ggtitle("Linear regression of flipper length on body mass")
# analyze a model -------------------------------------------------------------
<- lm(flipper_length_mm ~ body_mass_g + species, penguins)
model
# use broom::tidy to extract the coefficients and their statistics
<- broom::tidy(model, conf.int = TRUE)
model_output
# visualize model results
|>
model_output filter(term != '(Intercept)') |>
ggplot(aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
geom_pointrange()
# create a table of the results
::tbl_regression(model)
gtsummary
# one example with multiple models --------------------------------------------
<- lm(flipper_length_mm ~ species, penguins)
model1 <- lm(flipper_length_mm ~ species + body_mass_g, penguins)
model2 <- lm(flipper_length_mm ~ species + body_mass_g + island, penguins)
model3
# extract tables of results
<- list(
model_results bind_cols(model = 'model1', broom::tidy(model1, conf.int = TRUE)),
bind_cols(model = 'model2', broom::tidy(model2, conf.int = TRUE)),
bind_cols(model = 'model3', broom::tidy(model3, conf.int = TRUE)))
# make into one data frame
<- bind_rows(model_results)
model_results
# create a plot of covariates from multiple models
|>
model_results filter(term %in% c('speciesGentoo', 'speciesAdelie')) |>
ggplot(
aes(x = estimate,
y = term,
xmin = conf.low,
xmax = conf.high,
color = model,
shape = model)) +
geom_pointrange(position = position_dodge(width = 0.5)) +
ggtitle("Coefficient estimates for species effect",
::str_wrap(
stringrpaste(
"Model 1 includes no other covariates, model 2 includes body mass,",
"and model 3 includes body mass and island effects"
) ))
link to view slides fullscreen
link to slide PDFs
link to view slides fullscreen
link to repository
link to view slides fullscreen
link to follow along code
Preface
In 2021 Brown et al published an article titled A Reproduction of the Results of Onyike et al. (2003) in Meta-Psychology, a journal that is free, open-access and conducts open peer review. The Onikye et al. article they reproducing, Is obesity associated with major depression? Results from the Third National Health and Nutrition Examination Survey, was published in the American Journal of Epidemiology has been cited 1159+ times according to Google Scholar.
I want to point out an interesting section from the About page describing the Meta-Psychology journal that you might keep in mind as you read on:
Prior to publication, all statistical analyses are reproduced by our statistical reproduction team, which consists of the Statistical Editor and our editorial assistant. This makes the article eligible for the reproducibility badge.
Recommended Reading
Please read the article by Brown et al (https://open.lnu.se/index.php/metapsychology/article/view/2071).
Abstract repeated here as a teaser:
Onyike et al. (2003) analyzed data from a large-scale US-American data set, the Third National Health and Nutrition Examination Survey (NHANES-III), and reported an association between obesity and major depression, especially among people with severe obesity. Here, we report the results of a detailed replication of Onyike et al.’s analyses. While we were able to reproduce the majority of these authors’ descriptive statistics, this took a substantial amount of time and effort, and we found several minor errors in the univariate descriptive statistics reported in their Tables 1 and 2. We were able to reproduce most of Onyike et al.’s bivariate findings regarding the relationship between obesity and depression (Tables 3 and 4), albeit with some small discrepancies (e.g., with respect to the magnitudes of standard errors). On the other hand, we were unable to reproduce Table 5, containing Onyike et al.’s findings with respect to the relationship between obesity and depression when controlling for plausible confounding variables—arguably the paper’s most important results—because some of the included predictor variables appear to be either unavailable, or not coded in the way reported by Onyike et al., in the public NHANES-III data sets. We discuss the implications of our findings for the transparency of reporting and the reproducibility of published results.
Their code is freely, publicly accessible on OSFHOME, a file storage service provided by the Open Science Framework from the Center for Open Science.
- Brown et al’s code+file repository: https://osf.io/j32yw/
- Download their code+files (direct link): https://files.osf.io/v1/resources/j32yw/providers/osfstorage/?zip=
Note that in order to run their code, you will either want to a) make a new R project in the folder with their code on your computer, or b) open a new RStudio window, open up their .R file, and use setwd('filepath/goes/here/')
to make sure your R session can run their R code.
Consider the following questions:
- Do you believe that the results Brown et al. have shared are more likely to be correct than those that Onikye et al published? If so, why? If not, why not?
- What do you find compelling about their re-analysis and code?
- What do you find lacking about their re-analysis and code?
- How do you think the non-reproducibility of Onikye et al.’s article could have been avoided?
- When, if at all, do you think articles should be required to share code and data?
- What about in situations where the data relates to private or sensitive data?
- What about in situations where the subject matter is highly politically charged and there might be malicious actors who could see shared data and code as additional surface area to attack?
- Has reading this article made you more skeptical of research publications that don’t share code?
- Do you think for articles where code & data are too sensitive to be shared, is there an alternative process that would make you similarly confident in the stated results?
An important aside
This isn’t a class about stigma and health, but I think being in a Population Health Science program, it’s important to leave the breadcrumbs here for you to do your own followup reading and learning.
Because the articles in this homework discuss body-weight and health, I want to emphatically point out that this subject matter is not at all cut and dry. It’s important to acknowledge that:
- Weight-based stigma is real and causes harm to health through multiple mechanisms including at least discrimination and health care practitioners’ attitudes and behaviors:
- https://ajph.aphapublications.org/doi/full/10.2105/AJPH.2009.159491
- https://onlinelibrary.wiley.com/doi/full/10.1111/obr.12266
- The decision by health organizations to classify obesity as a “disease” is debated:
- https://www.healthline.com/health/is-obesity-a-disease
- The language and terminology that we use can perpetuate stigma:
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051141/?report=classic
- https://news.yale.edu/2012/07/12/choosing-words-wisely-when-talking-patients-about-their-weight
If anything, what I hope you take away from this aside is that data do not speak for themselves, but rather are subject to interpretation and leave room for either the perpetuation or casting aside of pre-existing biases (See “Data Never Speak for Themselves” from Nancy Krieger’s article Structural Racism, Health Inequities, and the Two-Edged Sword of Data: Structural Problems Require Structural Solutions). It’s not enough to engage with open-science practices and leverage sophisticated statistical analyses made possible in programs like R; instead, it’s necessary to combine advances in the state of the art in computing with advances in our conceptual frameworks to do science that can truly shift narratives in ways that benefit marginalized groups.