Data Visualization with ggplot2

some morals

“The simple graph has brought more information to the data analyst’s mind than any other device.” —John Tukey

“It is true that data visualization is part data science and part art. That being said, even the most creative art is supported by theories that explain why it works.”
—Michiko Wolcott

what is tidy data

tidy data is a standard way of mapping the meaning of a dataset to its structure; in tidy data, each variable forms a column, each observation forms a row, and each cell is a single measure
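
as a minimal sketch of what this means in practice, here is a small made-up table that is not tidy (the variable "year" is hiding in the column names), reshaped with pivot_longer(). the data and the names wide and tidy are hypothetical, and the tidyr package is assumed to be installed.

```r
library(tidyr)

# hypothetical toy data: one row per penguin, one column per year (wide, not tidy)
wide <- data.frame(
  penguin   = c("a", "b"),
  mass_2007 = c(3750, 3800),
  mass_2008 = c(3700, 3900)
)

# move the year out of the column names and into its own column
tidy <- pivot_longer(
  wide,
  cols = starts_with("mass_"),
  names_to = "year",
  names_prefix = "mass_",
  values_to = "body_mass_g"
)

tidy
```

after reshaping, each row is one observation (one penguin in one year) and each column is one variable.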

what is tidy data

tidy datasets are all alike, but messy datasets are all unique

what is tidy data

in essence, tidy data is data that can be put through standardized tools

tidy datasets facilitate a standardized workflow

the grammar of graphics

the “grammar of graphics” is the answer to the question “what is a statistical graphic?”

basic structure

the basic structure of a ggplot looks something like:

library(ggplot2)
ggplot(dataset, aes(x = some_column, y = another_column)) + 
  geom_point() + 
  <more options as desired>

let’s do an example

palmerpenguins

here’s the palmerpenguins dataset

library(palmerpenguins)
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

our first ggplot

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point()

our first ggplot

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + 
  geom_point()

our first ggplot

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + 
  geom_point() + 
  xlab("Bill Length [mm]")

our first ggplot

ggplot(penguins,
       aes(
         x = bill_length_mm,
         y = bill_depth_mm,
         color = species,
         shape = species
       )) + 
  geom_point() + 
  xlab("Bill Length [mm]") +
  ylab("Bill Depth [mm]")

our first ggplot

ggplot(penguins,
       aes(
         x = bill_length_mm,
         y = bill_depth_mm,
         color = species,
         shape = species,
         label = species,
         group = species
       )) + 
  stat_ellipse() + 
  geom_point() + 
  geom_label(
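    # label each species at its mean bill length and depth (needs dplyr installed)
    data = penguins |>
      dplyr::group_by(species) |>
      dplyr::summarize(dplyr::across(
        c(bill_length_mm, bill_depth_mm),
        \(x) mean(x, na.rm = TRUE)
      )),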
    alpha = 0.8
  ) + 
  xlab("Bill Length [mm]") +
  ylab("Bill Depth [mm]") + 
  ggtitle("Relationship of Species, Bill Length, and Bill Depth",
          "Penguins observed near Palmer Station, Antarctica, 2007-2009") +
  theme_bw() + 
  theme(legend.position = 'none')

visual channels

visual channels include: position, size, length, shape, color, texture, angle, linetype, transparency, luminance
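
in ggplot2, several of these channels are available as aesthetics inside aes(). a rough sketch, assuming the palmerpenguins data used earlier; which variable goes to which channel here is only an illustrative choice, and channels_plot is a hypothetical name.

```r
library(ggplot2)
library(palmerpenguins)

channels_plot <- ggplot(
  penguins,
  aes(
    x = bill_length_mm,  # position (horizontal)
    y = bill_depth_mm,   # position (vertical)
    color = species,     # color
    shape = island,      # shape
    size = body_mass_g   # size
  )
) +
  # transparency is set to a constant here rather than mapped to a variable
  geom_point(alpha = 0.6)

channels_plot
```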

univariate geoms

geom_bar

# if you want ggplot to tally up your data, use geom_bar
ggplot(penguins, aes(x = species)) + 
  geom_bar()

geom_col

# if your data are already tallied up, use geom_col
library(dplyr) # for count()
counts_df <- count(penguins, species)
counts_df
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
ggplot(counts_df, aes(x = species, y = n)) + 
  geom_col()

geom_histogram

ggplot(penguins, aes(x = flipper_length_mm)) + 
  geom_histogram()

geom_histogram

ggplot(penguins, aes(x = flipper_length_mm)) + 
  geom_histogram(bins = 50)

geom_histogram

ggplot(penguins, aes(x = flipper_length_mm, fill = species)) + 
  geom_histogram()

geom_density

ggplot(penguins, aes(x = flipper_length_mm)) + 
  geom_density()

geom_density

ggplot(penguins, aes(x = flipper_length_mm, fill = species)) + 
  geom_density()

geom_density

ggplot(penguins, aes(x = flipper_length_mm, fill = species)) + 
  geom_density(alpha = 0.7)

bivariate geoms

geom_point

ggplot(penguins, 
       aes(
         x = flipper_length_mm, 
         y = bill_length_mm, 
         color = species)) +
  geom_point()

geom_point

ggplot(penguins, 
       aes(
         x = flipper_length_mm, 
         y = bill_length_mm, 
         color = species)) +
  geom_point(size = 3)

geom_point

ggplot(penguins, 
       aes(
         x = flipper_length_mm, 
         y = bill_length_mm, 
         color = species,
         size = body_mass_g)) +
  geom_point()

geom_point

ggplot(penguins, 
       aes(
         x = flipper_length_mm, 
         y = bill_length_mm, 
         color = species,
         size = body_mass_g)) +
  geom_point(alpha = 0.6)

geom_line

lines are great to use to depict observations that are conceptually connected — either across time, or thematically.

for example, maybe we would want to know if the average flipper length in the population of penguins observed is changing over time.

flipper_length_over_time <- 
  data.frame(
  year = c(2007, 2007, 2007, 
           2008, 2008, 2008, 
           2009, 2009, 2009),
  flipper_length_mm = 
    c(186.5, 192.4, 215.1,
      191.0, 197.7, 217.5,
      192.0, 198.0, 218.4), 
  species = as.factor(
    c(
      "Adelie", "Chinstrap", "Gentoo",
      "Adelie", "Chinstrap", "Gentoo",
      "Adelie", "Chinstrap", "Gentoo"
    )))

ggplot(flipper_length_over_time,
       aes(x = year,
           y = flipper_length_mm,
           color = species)) + 
  geom_line()

geom_line

when using geom_line, I often add a geom_point layer on top of it, so that it is clearer where the observations are.

ggplot(flipper_length_over_time,
       aes(x = year,
           y = flipper_length_mm,
           color = species)) + 
  geom_line() + 
  geom_point()
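
the flipper_length_over_time table used for these line plots is typed in by hand. a sketch of how a similar summary could be computed from penguins directly, assuming dplyr is installed. flipper_by_year is a hypothetical name, and the computed means are not guaranteed to match the hand-typed values exactly.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# mean flipper length per year and species, ignoring missing measurements
flipper_by_year <- penguins |>
  group_by(year, species) |>
  summarize(
    flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
    .groups = "drop"
  )

ggplot(flipper_by_year,
       aes(x = year, y = flipper_length_mm, color = species)) +
  geom_line() +
  geom_point()
```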

geom_area

with a single group of data, geom_area is similar to geom_line, except that it fills in the area beneath the line. with several groups, the areas are stacked on top of each other by default, which can be useful for depicting how a population breaks down into strata over time.

ggplot(population_over_time,
       aes(x = year,
           y = population_size,
           fill = group)) +
  geom_area()

the population_over_time data, in long format with one row per year and group:

year group population_size
1947 Employed 60.323
1947 Unemployed 235.600
1947 Armed.Forces 159.000
1948 Employed 61.122
1948 Unemployed 232.500
1948 Armed.Forces 145.600
1949 Employed 60.171
1949 Unemployed 368.200
1949 Armed.Forces 161.600
1950 Employed 61.187
1950 Unemployed 335.100
1950 Armed.Forces 165.000
1951 Employed 63.221
1951 Unemployed 209.900
1951 Armed.Forces 309.900
1952 Employed 63.639
1952 Unemployed 193.200
1952 Armed.Forces 359.400
1953 Employed 64.989
1953 Unemployed 187.000
1953 Armed.Forces 354.700
1954 Employed 63.761
1954 Unemployed 357.800
1954 Armed.Forces 335.000
1955 Employed 66.019
1955 Unemployed 290.400
1955 Armed.Forces 304.800
1956 Employed 67.857
1956 Unemployed 282.200
1956 Armed.Forces 285.700
1957 Employed 68.169
1957 Unemployed 293.600
1957 Armed.Forces 279.800
1958 Employed 66.513
1958 Unemployed 468.100
1958 Armed.Forces 263.700
1959 Employed 68.655
1959 Unemployed 381.300
1959 Armed.Forces 255.200
1960 Employed 69.564
1960 Unemployed 393.100
1960 Armed.Forces 251.400
1961 Employed 69.331
1961 Unemployed 480.600
1961 Armed.Forces 257.200
1962 Employed 70.551
1962 Unemployed 400.700
1962 Armed.Forces 282.700
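
the slides do not show how population_over_time was built. the values look like they come from R's built-in longley dataset, so one possible reconstruction (an assumption, not necessarily how the table was actually made) uses pivot_longer() from tidyr:

```r
library(tidyr)
library(ggplot2)

# assumption: the table above is the longley dataset reshaped to long format
population_over_time <- pivot_longer(
  longley[, c("Year", "Employed", "Unemployed", "Armed.Forces")],
  cols = -Year,
  names_to = "group",
  values_to = "population_size"
)

# match the lowercase column name used in the plotting code
names(population_over_time)[names(population_over_time) == "Year"] <- "year"

ggplot(population_over_time,
       aes(x = year, y = population_size, fill = group)) +
  geom_area()
```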

geom_text

we can use geom_text similarly to geom_point, but instead of plotting a small mark, geom_text places text on the graph.

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, label = species)) + 
  geom_text()

geom_text

one use of geom_text that I particularly enjoy is labelling exact values that a reader would otherwise have to guesstimate from the axis:

# remember counts_df from earlier? 
counts_df
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
ggplot(counts_df, aes(x = species, y = n)) + 
  geom_col() + 
  geom_text(
    mapping = aes(y = n + 10, label = n),
    size = 8
  )

geom_boxplot

ggplot(penguins, aes(x = species, y = flipper_length_mm)) + 
  geom_boxplot()

geom_boxplot

if I am going to use boxplots, something I often like to do is plot the raw data with jitter behind the boxplots and give the boxplots some transparency.

also, I’m a sucker for colorful figures, so I’ll almost always add color

ggplot(penguins, aes(x = species, y = flipper_length_mm,
                     color = species, fill = species)) + 
  geom_jitter() + 
  geom_boxplot(alpha = 0.6, color = 'black', outlier.color = NA)

geom_violin

ggplot(penguins, aes(x = species, y = flipper_length_mm)) + 
  geom_violin()

geom_violin

a similar layout can be done with geom_violin

ggplot(penguins, aes(x = species, y = flipper_length_mm,
                     color = species, fill = species)) + 
  geom_jitter() + 
  geom_violin(color = 'black', alpha = 0.5)

geom_tile

heatmaps can be created with geom_tile

ggplot(df,
       aes(x = week,
           y = weekday,
           fill = productivity)) + 
  geom_tile()

geom_tile

ggplot(df, aes(x = week, y = weekday, fill = productivity)) + 
  geom_tile() + 
  scale_fill_viridis_c()
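
df is not defined on these slides. a hypothetical stand-in with the same column names, filled with random numbers purely for illustration (productivity_df and its values are made up):

```r
library(ggplot2)

set.seed(1)
day_names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# one row per (week, weekday) combination
productivity_df <- expand.grid(
  week = 1:12,
  weekday = factor(day_names, levels = day_names)
)
productivity_df$productivity <- runif(nrow(productivity_df))

ggplot(productivity_df,
       aes(x = week, y = weekday, fill = productivity)) +
  geom_tile() +
  scale_fill_viridis_c()
```

geom_tile expects one row per tile, so the data should have one row for each combination of the x and y values.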

facets

facet_wrap

faceting generates multiple panels within a visualization, each showing a different subset of the data.

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point() + 
  facet_wrap(~species)

facet_grid

facet_grid gives you more precision in how the faceting panels are laid out: the left-hand side of the formula indicates the rows and the right-hand side indicates the columns.

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point() + 
  facet_grid(sex~species)

Hodu tip!

the easiest way for me to remember which variable corresponds to the columns vs. rows in the facet_grid formula is to remember how formulas usually look, as in y ~ x or y = m*x + b.

y is the variable that corresponds to the vertical dimension and is on the left-hand-side – so y or the left-hand-side corresponds to the rows, while x informs us about the horizontal dimension, so x or the right-hand-side corresponds to the columns.
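
if the formula is still hard to keep straight, facet_grid() also accepts named rows and cols arguments wrapped in vars(). a short sketch; the two plots below should produce the same panel layout (base_plot, by_formula, and by_name are hypothetical names):

```r
library(ggplot2)
library(palmerpenguins)

base_plot <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

# formula interface: rows ~ columns
by_formula <- base_plot + facet_grid(sex ~ species)

# named interface: the same layout, spelled out explicitly
by_name <- base_plot + facet_grid(rows = vars(sex), cols = vars(species))

by_name
```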

statistics

let’s say you have a ton of data such that creating a scatter plot isn’t all that useful.

df <- data.frame(
  # two overlapping clusters: 10,000 points around 0 and 20,000 points around 1.5
  x = c(rnorm(n = 10000, sd = .5), rnorm(n = 20000, mean = 1.5, sd = .35)),
  y = c(rnorm(n = 10000, sd = .5), rnorm(n = 20000, mean = 1.5, sd = .35))
)

ggplot(df, aes(x = x, y = y)) + 
  geom_point()

we want to know if this is one cluster or two… we might look at histograms in x and y.

ggplot(df, aes(x = x)) + 
  geom_histogram()

ggplot(df, aes(x = y)) + 
  geom_histogram()

geom_density2d

but when that still doesn’t settle it, we can estimate the 2d density of the points and draw its contours with geom_density_2d (also spelled geom_density2d)

ggplot(df, aes(x = x, y = y)) + 
  geom_point() + 
  geom_density_2d()

Hodu tip!

another handy way to deal with overplotting is to increase the transparency of the points plotted

ggplot(df, aes(x = x, y = y)) + 
  geom_point(alpha = 0.05)

stat_summary

stat_summary is a very flexible layer that can perform computations for you and depict the results with a geom of your choice

ggplot(penguins, aes(x = species, y = body_mass_g)) + 
  stat_summary(fun = mean, geom = 'point', size = 5)

stat_summary

we can use stat_summary to show us the mean, min and max using a pointrange geom

ggplot(penguins, aes(x = species, y = body_mass_g)) + 
  stat_summary(fun = mean, geom = 'pointrange',
               fun.max = max, fun.min = min,
               size = 5)

stat_summary

we can also use stat_summary to show the mean with a 95% confidence interval by passing mean_cl_normal as fun.data (this helper needs the Hmisc package to be installed)

ggplot(penguins, aes(x = species, y = body_mass_g)) + 
  stat_summary(fun.data = mean_cl_normal, geom = 'pointrange',
               size = 5)
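
fun.data also accepts your own function, as long as it returns a data frame with y, ymin, and ymax columns. a minimal sketch, where median_iqr is a hypothetical helper that summarizes each group by its median and interquartile range:

```r
library(ggplot2)
library(palmerpenguins)

# hypothetical helper: median with the 25th and 75th percentiles
median_iqr <- function(y) {
  data.frame(
    y    = median(y, na.rm = TRUE),
    ymin = quantile(y, 0.25, na.rm = TRUE, names = FALSE),
    ymax = quantile(y, 0.75, na.rm = TRUE, names = FALSE)
  )
}

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  stat_summary(fun.data = median_iqr, geom = "pointrange")
```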

coordinates

scale_x_* and scale_y_*

occasionally we have data that is best represented on a non-linear scale, like the log-scale. ggplot makes it very easy to do this. compare:

ggplot(df, aes(x = x)) + 
  geom_histogram()

ggplot(df, aes(x = x)) + 
  geom_histogram() + 
  scale_x_log10()

scale_color_* and scale_fill_*

often you will want to customize the color and fill palettes in your plots. there are a handful of ways to do this, and they typically fall into 3 categories:

  • categorical
    • scale_*_discrete()
    • scale_*_brewer()
  • quantitative
    • scale_*_continuous()
    • scale_*_distiller()
    • scale_*_gradient()
  • manual
    • scale_*_manual()
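
a short sketch with one scale from each category, using the penguins data. the palette names are standard RColorBrewer palettes; the hex colors in the manual scale are arbitrary picks, and the plot names are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# categorical: a ColorBrewer palette for a discrete variable
categorical_plot <- ggplot(penguins,
       aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

# quantitative: a ColorBrewer palette stretched over a continuous variable
quantitative_plot <- ggplot(penguins,
       aes(x = bill_length_mm, y = bill_depth_mm, color = body_mass_g)) +
  geom_point() +
  scale_color_distiller(palette = "YlGnBu")

# manual: one color per level, matched by name
manual_plot <- ggplot(penguins,
       aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  scale_color_manual(
    values = c(Adelie = "#FF8C00", Chinstrap = "#A034F0", Gentoo = "#159090")
  )

manual_plot
```

naming the values in scale_color_manual() ties each color to a level explicitly, so the colors stay put even if the factor levels are reordered.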

scale_color_* and scale_fill_*

RColorBrewer: https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/

ggplot(df, aes(x = x, y = y, fill = z)) + 
  geom_tile() +
  scale_fill_distiller()

ggplot(df, aes(x = x, y = y, fill = z)) + 
  geom_tile() +
  scale_fill_distiller(palette = 'RdYlBu')

viridis: https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html

ggplot(df, aes(x = x, y = y, fill = z)) + 
  geom_tile() +
  scale_fill_viridis_c()

ggplot(df, aes(x = x, y = y, fill = z)) + 
  geom_tile() +
  scale_fill_viridis_c(option = 'magma')

https://twitter.com/moriah_taylor58/status/1395431000977649665

coordinate labelling

sometimes you want the axis numbers formatted in a particular way, like scientific notation or dollars.

ggplot(df, aes(x = household_income)) + 
  geom_histogram() + 
  scale_x_log10(labels = scales::dollar_format())

ggplot(df, aes(x = program, y = admit_rate)) + 
  geom_point(size = 5) + 
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, NA))
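
df on this slide is not defined. a hypothetical admissions table with the same column names (the programs and rates are made up), plus a quick look at what the scales formatters do on their own:

```r
library(ggplot2)

# hypothetical data for illustration only
admissions <- data.frame(
  program = c("A", "B", "C"),
  admit_rate = c(0.62, 0.35, 0.08)
)

ggplot(admissions, aes(x = program, y = admit_rate)) +
  geom_point(size = 5) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, NA))

# the formatters are ordinary functions that turn numbers into strings
pct <- scales::percent_format()(0.25)   # "25%"
usd <- scales::dollar_format()(1500)    # "$1,500"
```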

themes

themes

you can basically customize every aspect of the theme in ggplot.

plt <- ggplot(df, aes(x = program, y = admit_rate)) + 
  geom_point(size = 5) + 
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, NA))

plt + theme_classic()

plt + theme_linedraw()

plt + theme_dark()

plt + theme_void()
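
beyond the complete themes, individual elements can be adjusted with theme(). a sketch using the built-in mtcars data so that it runs on its own; the specific sizes and colors are arbitrary choices, and scatter and custom are hypothetical names.

```r
library(ggplot2)

scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

custom <- scatter +
  theme_minimal() +                                   # start from a complete theme
  theme(
    axis.title = element_text(size = 14, face = "bold"),   # bigger, bold axis titles
    panel.grid.minor = element_blank(),                     # drop minor grid lines
    plot.background = element_rect(fill = "grey97", color = NA)
  )

custom
```

because theme() is added after theme_minimal(), its settings override the complete theme rather than being overwritten by it.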

labels

use the labs() function to set labels for any aesthetics.

ggplot(penguins,
       aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + 
  geom_point() + 
  labs(
    x = "Bill Length [mm]",
    y = "Bill Depth [mm]",
    color = "Species of Penguin",
    title = "Relationship between Bill Length and Bill Depth"
  )

legend position

the legend position can be moved using the legend.position argument to theme()

ggplot(penguins,
       aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + 
  geom_point() + 
  labs(
    x = "Bill Length [mm]",
    y = "Bill Depth [mm]",
    color = "Species of Penguin",
    title = "Relationship between Bill Length and Bill Depth"
  ) + 
  theme(legend.position = 'bottom')

saving plots

use ggsave() to save your plots. keep in mind that the filename will be interpreted relative to your working directory unless you use an absolute path.

# one option: 
# render your ggplot as normal
ggplot(df, aes(...)) + ...
# then run 
ggsave(filename = "your_filename.png", width = 5, 
  height = 7) # width and height default to inches
  
# another option:
# assign your plot to an object
plt <- ggplot(df, aes(...)) + ...
ggsave(filename = "your_filename.png", 
  plot = plt, # give the plt object to ggsave explicitly
  width = 5, 
  height = 7)

Hodu tip!

saving plots with ggsave() is one of the most important skills in using ggplot2.

it’s great if you can make your graphics in R, but you need to render your graphics to image files in order to be able to integrate them into manuscripts, websites, etc. and ggsave() gives you lots of great options for the filetype, dimensions, resolution, etc.
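
a small sketch of some of those options, using mtcars so that it runs on its own. the file names are arbitrary, and the files are written to a temporary directory here only to avoid cluttering the working directory.

```r
library(ggplot2)

scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

out_dir <- tempdir()
png_path <- file.path(out_dir, "scatter.png")
pdf_path <- file.path(out_dir, "scatter.pdf")

# the file type is inferred from the extension
# raster output: dpi controls the resolution
ggsave(png_path, plot = scatter, width = 5, height = 7, units = "in", dpi = 300)

# vector output: dimensions can be given in other units, e.g. centimeters
ggsave(pdf_path, plot = scatter, width = 12, height = 8, units = "cm")
```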

extensions

ggdist

screenshot of functionality 
from the ggdist package

learn more: https://mjskay.github.io/ggdist/

ggrepel

screenshot of functionality 
from the ggrepel package

learn more: https://ggrepel.slowkow.com/

patchwork

screenshot of functionality 
from the patchwork package

learn more: https://patchwork.data-imaginist.com/
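
as a taste of how an extension plugs in, here is a minimal patchwork sketch, assuming the package is installed. it combines two separate ggplots with the + and / operators; the plot names are hypothetical.

```r
library(ggplot2)
library(patchwork)
library(palmerpenguins)

scatter <- ggplot(penguins,
       aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()

histogram <- ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(bins = 30)

side_by_side <- scatter + histogram  # two panels in one row
stacked <- scatter / histogram       # one panel above the other

side_by_side
```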

key takeaways

  • ggplot2 is based on the grammar of graphics, meaning every plot is made up of: data, aesthetics, geometries, facets, statistics, coordinates, and a theme.
  • ggplot2 is “plug and play” (like legos) in the sense that there are a ton of varied geoms, stats, and customizations that you can combine as interchangeable layers
  • it’s easiest to build up your plots incrementally, starting from an overly simple, crude version and working your way up to something more refined; don’t try to make a masterpiece all in one go because it will make debugging your code harder
  • if the kind of data visualization you want to make isn’t supported out of the box in ggplot2, likely you can either use or create an extension to ggplot2 that will do what you want.
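
the incremental approach from the takeaways can be sketched like this: start crude, then add one piece at a time and re-run the plot after each step, so that any error points at the piece you just added. the particular steps chosen here are only an illustration.

```r
library(ggplot2)
library(palmerpenguins)

# step 1: crude version, just data + aesthetics + one geom
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

# step 2: add another aesthetic mapping
p <- p + aes(color = species)

# step 3: add facets
p <- p + facet_wrap(~island)

# step 4: polish the labels and theme last
p <- p +
  labs(x = "Bill Length [mm]", y = "Bill Depth [mm]", color = "Species") +
  theme_bw()

p
```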