Factors and date-times

ID 529: Data Management and Analytic Workflows in R

Amanda Hernandez




Thursday, January 11th, 2024

11 Jan 2024

01-11-24

2024-01-11

as_date(43840, origin = "1904-01-01")

Follow along



https://bit.ly/fct_datetimeR24

Learning objectives


  • Understand the importance of properly handling factors and date-times in data analysis

  • Learn about challenges and common mistakes when working with factors and date-times in R

  • Be familiar with packages for working with factors and date-times

    • forcats:: for manipulating factors in R

    • lubridate:: for handling date-times

Amanda Hernandez (she/her)


hi there! I recently completed an MS in Environmental Health. I’m currently working in the public sector as a Presidential Management Fellow.

My master’s work was at the intersection of environmental geochemistry + public health, focused on addressing environmental health disparities through community-based participatory research and evidence-based decision-making.

I’m a self-taught R user, Shiny enthusiast, and advocate for coding in light mode.

Factors

What are factors?


  • A factor is an integer vector that uses levels to store attribute information (see the short sketch after this list).

    • Levels serve as the logical link between integers and categorical values.
  • Factors retain the order of your variables through levels.

  • Factors have a lot more rules than character strings

    • Once you understand the rules, you have a lot more manual control over your data (while still being reproducible)
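A minimal sketch of that integer-plus-levels structure (a made-up character vector, not the course data):

x <- c("medium", "low", "high", "low")
f <- factor(x, levels = c("low", "medium", "high"))

levels(f)
[1] "low"    "medium" "high"  
as.integer(f)
[1] 2 1 3 1
typeof(f)
[1] "integer"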

What are factors?


Factors are particularly useful for ordinal data, where the data are categorical but the categories have an inherent order.

Factors are also useful when values are repeated frequently, and there’s a pre-specified set of distinct levels.


For example (a short sketch for days of the week follows this list):

  • Age groups

  • Quantile groups

  • Months/Days of the week
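For the days-of-the-week case, a quick sketch of an ordered factor (made-up values; ordered = TRUE also lets you sort and compare by level order rather than alphabetically):

days_fct <- factor(c("Wed", "Mon", "Fri", "Mon"),
                   levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                   ordered = TRUE)
days_fct
[1] Wed Mon Fri Mon
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
sort(days_fct)
[1] Mon Mon Wed Fri
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun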

Working with factors


Your data may already have variables as factors, or you can set them manually with factor().

The penguins data from the palmerpenguins package have several variables that come pre-set as factors. Based on the column names, which ones seem like good factor candidates?

library(palmerpenguins)
colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Working with factors


The glimpse() function gives us an idea of the class of each column.

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Working with factors


Let’s look at the species column to see how R handles factors.

What do you notice about the output?

  • R returns the values in the order they appear in the dataset
  • It also returns a “levels” statement with the values in alphabetical order
unique(penguins$species)
[1] Adelie    Gentoo    Chinstrap
Levels: Adelie Chinstrap Gentoo

Factor rules


  • R by default returns your data in the order it occurs

  • Factors create an order and retain that order for all future uses of the variable

ggplot(penguins, aes(x = species)) + 
  geom_bar()

Advantages of factors/use cases


  • Retain the order of a variable, even if it is different between facets

    • Improves reproducibility! Between scripts, computers, datasets…
  • Recode variables to have more intuitive labels

  • Regressions/other analyses

    • Set reference levels for categorical data (sketched briefly below)
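For the regression use case, a brief sketch with the penguins data (assuming forcats and the tidyverse are loaded, as elsewhere in these slides; penguins_ref and the model are just illustrative):

# the first level is the reference category by default
levels(penguins$species)
[1] "Adelie"    "Chinstrap" "Gentoo"   

# move Gentoo to the front so it becomes the reference instead
penguins_ref <- penguins %>% 
  mutate(species = fct_relevel(species, "Gentoo"))

# coefficients are now contrasts against Gentoo rather than Adelie
lm(body_mass_g ~ species, data = penguins_ref)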

Incorporating factors into your workflow

  • When you read in data, check how your variables load

    • Do you have factors when you really want strings, or strings when you really want factors?

    • If you do have factors, check the levels with levels() or unique() (a few quick checks are sketched after this list)

  • Plan out your script with pseudocode

    • On your second pass through, think through which stages factors might be most helpful for and add it to your pseudocode

      • Is your data ordinal? Do you want it sorted by another variable? Is there a reference category/group?

      • How do you want to handle NAs?

      • Do you want to set factor order globally or locally?

  • Once you have a first draft script, make sure to check that your factors aren’t doing anything weird
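A few of those checks, sketched with the penguins data (swap in your own dataset and column names):

# what class did each column load as?
sapply(penguins, class)

# if a column is a factor, what are its levels?
levels(penguins$species)
[1] "Adelie"    "Chinstrap" "Gentoo"   

# are NAs present, and how are they being counted?
table(penguins$sex, useNA = "ifany")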

Supplementary factor slides


Topics covered:

  • forcats:: functions (fct_infreq() and fct_rev())
  • Missingness + factors (empty groups, NAs)
  • NHANES example
  • WARNING: unexpected complications 🥴

Changing the factor order: fct_infreq()


We may want to change the default factor order (alphabetical) and rearrange the order on the x axis. forcats:: gives us lots of options for rearranging our factors without having to manually list out all of the levels.

fct_infreq() allows us to sort by occurrence:

ggplot(penguins, aes(x = fct_infreq(species))) + 
  geom_bar()

Changing the factor order: fct_rev()


fct_rev() reverses the factor order:

ggplot(penguins, aes(x = fct_rev(species))) + 
  geom_bar()



Hodu tip! A picture of Hodu booping a bubble

If you’re doing a lot of different analyses and visualization, you may want to change the factor order quite often.

Consider using these functions locally within your code so that you’re not actually changing the underlying dataset and future analyses.

Factor rules: Missingness


  • Factors have a specific set of rules for missing values

    • Factors retain NAs, but do not return NAs as a level by default
    • Helpfully, NAs are not dropped from the analysis just because NA is not a level (see below for how to make NA an explicit level)
    • NAs will always be last in the factor order

Let’s look at the sex column:

unique(penguins$sex)
[1] male   female <NA>  
Levels: female male
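If you do want NA to appear as its own level (e.g., in a table or a legend), one option is base R's addNA(); recent versions of forcats also have fct_na_value_to_level() for the same job. A minimal sketch with addNA():

sex_with_na <- addNA(penguins$sex)

levels(sex_with_na)
[1] "female" "male"   NA      

# NA now gets its own column in the table
table(sex_with_na)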

Factor rules: Missingness


ggplot(penguins, aes(x = sex)) + 
  geom_bar()

Factor rules: Missingness


ggplot(penguins, aes(x = species, fill = sex)) +
  geom_bar(position = "dodge") +
  facet_wrap(~island)

Empty groups + NAs in factors


  • Once levels are set, they will be retained and kept consistent between groups, even when there is nothing in the group

  • NAs are not considered their own group by default

    • They are not dropped, but they aren’t considered a “level”

Empty groups + NAs in factors


Let’s look at just the Adelie penguins.

Because species is a factor, the information about other species is retained, even when there is nothing in that category.

adelie_penguins <- penguins %>% 
  filter(species == "Adelie")

unique(adelie_penguins$species)
[1] Adelie
Levels: Adelie Chinstrap Gentoo


table(adelie_penguins$species)

   Adelie Chinstrap    Gentoo 
      152         0         0 


table(as.character(adelie_penguins$species))

Adelie 
   152 

Empty groups + NAs in factors


ggplot(adelie_penguins, 
       aes(x = sex, fill = species)) +
  geom_bar(position = "dodge") +
  facet_wrap(~island)


ggplot(adelie_penguins, 
       aes(x = sex, fill = species)) +
  geom_bar(position = "dodge") +
  scale_fill_discrete(drop=FALSE) +
  facet_wrap(~island)


Empty groups + NAs in factors


  • Remember: NAs are not considered a factor level, so if we filter out the NAs, they will not be included in the legend even if we specify to keep all levels.
female_penguins <- penguins %>% 
  filter(sex == "female")

ggplot(female_penguins, aes(x = species, fill = sex)) +
  geom_bar(position = "dodge") +
  scale_fill_discrete(drop=FALSE) +
  facet_wrap(~island)

Example: NHANES


Let’s say we want to turn a continuous variable into categorical groups:

  • Age quartiles

  • Clinically relevant blood pressure categories

Creating age quartiles

# what is in the age column? 
summary(nhanes_id529$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.00   23.00   42.00   42.78   60.00   80.00 
# create age quartiles
nhanes_id529$age_quartiles <- ntile(nhanes_id529$age, 4)

# what type of data is age_quartile? 
class(nhanes_id529$age_quartiles)
[1] "integer"
ggplot(nhanes_id529, aes(x = age_quartiles, y = mean_BP)) + 
  geom_boxplot()

Transforming age_quartiles into a factor

nhanes_id529$age_quartiles <- factor(nhanes_id529$age_quartiles)

# now what type of data is age_quartile? 
class(nhanes_id529$age_quartiles)
[1] "factor"
ggplot(nhanes_id529, aes(x = age_quartiles, y = mean_BP)) + 
  geom_boxplot()

Transforming age_quartiles into a factor

What if we wanted the labels to convey more information?

nhanes_id529 <- nhanes_id529 %>% 
  group_by(age_quartiles) %>% 
  mutate(age_quartiles = factor(age_quartiles, 
                                labels = c(paste0("(", min(age), "-", max(age), ")"))))

unique(nhanes_id529$age_quartiles)
[1] (23-42) (60-80) (12-23) (42-60)
Levels: (12-23) (23-42) (42-60) (60-80)

Transforming age_quartiles into a factor

ggplot(nhanes_id529, aes(x = age_quartiles, y = mean_BP)) + 
  geom_boxplot()

Creating blood pressure categories

We also want to create clinically relevant categories of blood pressure:

nhanes_id529$bp_cat <- case_when(nhanes_id529$mean_BP < 90 ~ "low BP",
                           nhanes_id529$mean_BP > 140 ~ "high BP",
                           TRUE ~ "normal BP")

Creating blood pressure categories

Because bp_cat is not yet a factor, R plots the categories in alphabetical order by default. There may be some situations where this is sufficient, but for ordinal data, the order of the categories is important!

# look at blood pressure categories with age
ggplot(nhanes_id529, aes(x = bp_cat, y = age)) + 
  geom_boxplot()

Transforming bp_cat into a factor

# manually set the order
nhanes_id529$bp_cat <- factor(nhanes_id529$bp_cat, levels = c("high BP", "normal BP", "low BP"))

# now the levels will retain the order for us 
unique(nhanes_id529$bp_cat)
[1] normal BP high BP   low BP   
Levels: high BP normal BP low BP
# when we plot it, the order will be determined by the levels
ggplot(nhanes_id529, aes(x = bp_cat, y = age)) + 
  geom_boxplot()

Working with Factors


The forcats:: package (short for “For Categorical”) is a helpful set of functions for working with factors.

forcats package hex logo

Useful forcats:: functions


Check out the forcats:: cheatsheet for more info on how these functions work! A quick sketch of each is shown after this list.

  • fct_drop()

  • fct_relevel()

  • fct_rev()

  • fct_infreq()

  • fct_inorder()
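A quick sketch of what each one does, using the species column (and the adelie_penguins subset from the earlier slides for fct_drop()); output omitted here:

fct_drop(adelie_penguins$species)          # drop unused levels (Chinstrap, Gentoo)
fct_relevel(penguins$species, "Gentoo")    # move Gentoo to the front of the level order
fct_rev(penguins$species)                  # reverse the level order
fct_infreq(penguins$species)               # order levels by how often each occurs
fct_inorder(penguins$species)              # order levels by first appearance in the data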

forcats package hex logo

⚠️ Caution: factors


Factors operate under a very strict set of rules. If you aren't careful, you can accidentally introduce issues into your dataset.

⚠️ Caution: factor → numeric


Let’s say we wanted to transform a “year” column from an integer to a factor to make a plot with a different boxplot for each year:

penguins_fctyr <- penguins %>% 
  mutate(year = factor(year))

ggplot(penguins_fctyr, aes(x = species, y = flipper_length_mm, color = year)) + 
  geom_boxplot()

⚠️ Caution: factor → numeric


Later in your script, you decide you want to include year as a continuous variable, so you transform year into an integer.

penguins_num <- penguins_fctyr %>% 
  mutate(year = as.integer(year))

unique(penguins_num$year)
[1] 1 2 3

Oh no!

So what do you do?

  • In some cases, you may want to set factors locally within a particular piece of your script, rather than globally.
  • For example, you could wrap year in as.factor() within your ggplot() call (another option is sketched after the code below).
ggplot(penguins, aes(x = species, 
                     y = flipper_length_mm, 
                     color = as.factor(year))) + 
  geom_boxplot()
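Alternatively, if the column is already a factor and you need the original numeric values back, a common pattern is to go through character first (penguins_num2 is just an illustrative name):

# as.integer() on a factor returns the level codes (1, 2, 3),
# so convert to character first, then to a number
penguins_num2 <- penguins_fctyr %>% 
  mutate(year = as.integer(as.character(year)))

unique(penguins_num2$year)
[1] 2007 2008 2009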

⚠️ Caution: typos


If you don't have exact matches when you assign factor levels and labels, you'll end up with a lot of NAs.

penguins_sizes <- penguins %>% 
  mutate(size_cat = case_when(bill_length_mm > mean(bill_length_mm, na.rm = T) &
                                bill_depth_mm > mean(bill_depth_mm, na.rm = T) & 
                                flipper_length_mm > mean(flipper_length_mm, na.rm = T) ~ "big penguins",
                              bill_length_mm < mean(bill_length_mm, na.rm = T) &
                                bill_depth_mm < mean(bill_depth_mm, na.rm = T) & 
                                flipper_length_mm < mean(flipper_length_mm, na.rm = T) ~ "small penguins",
                              TRUE ~ "average penguins"))

table(penguins_sizes$size_cat, useNA = "ifany")

average penguins     big penguins   small penguins 
             293               20               31 
penguins_sizes$size_cat <- factor(penguins_sizes$size_cat, levels = c("smol penguins", "average penguins", "big penguins"))

table(penguins_sizes$size_cat, useNA = "ifany")

   smol penguins average penguins     big penguins             <NA> 
               0              293               20               31 

Date-Times

Working with dates


  • What are some challenges you might anticipate working with dates?

Working with dates


Often, we need dates to function as both strings and numbers

  • As strings, we want to have a fair amount of control over how they are presented.

  • As numbers, we may want to add/subtract time, account for time zones, and present them at different scales.

Working with dates


There are lots of packages and functions that are helpful for working with dates. We’ll talk primarily about the lubridate:: package, but the goal today is to understand the components and rules of date-time objects so that you can apply these functions in your work.

Working with dates


There are 3 ways that we will work with date/time data (sketched briefly after this list):

  • dates

  • times

  • date-times
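Roughly, these map onto three classes you'll see in R (a small sketch; the hms:: package used here for times of day comes up again later in these slides):

as_date("2024-01-11")                # a date        -> class "Date"
hms::as_hms("13:30:00")              # a time of day -> class "hms"
as_datetime("2024-01-11 13:30:00")   # a date-time   -> class "POSIXct" (printed in UTC)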

Working with dates 🤯😳


For the most part, we’re going to try to work with dates as Date objects, but you may see a date that defaults to POSIXct.

All computers store dates as numbers, typically as time (in seconds) since some origin. That's all POSIXct is: the time in seconds since 1970-01-01 in the UTC time zone (GMT).
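A quick way to see that underlying number (the specific timestamp is arbitrary):

x <- as.POSIXct("2024-01-11 13:30:00", tz = "UTC")

# seconds since 1970-01-01 00:00:00 UTC
as.numeric(x)
[1] 1704979800

# and back the other way
as.POSIXct(1704979800, origin = "1970-01-01", tz = "UTC")
[1] "2024-01-11 13:30:00 UTC"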



🚨!IMPORTANT!🚨
From this point forward, all times will be presented in 24-hour time!!

Working with dates


To get an idea of how R works with dates, let’s ask for the current date/time:

today()
[1] "2024-01-11"
now()
[1] "2024-01-11 11:34:28 CST"

Working with dates


But how do we actually work with this data? In practice, we might want to:

  • Know how much time has elapsed between two samples
  • Collapse daily measurements into monthly averages
  • Check whether a measurement was taken in the morning or evening
  • Know which day of the week a measurement was taken on
  • Convert time zones

lubridate::


The lubridate:: package is a handy way of storing and processing date-time objects. lubridate:: provides functions for each component of a date-time string:

  • year
  • month
  • day
  • hour
  • minute
  • second

Artwork by @allison_horst

lubridate:: basics

Once you have a date-time object, you can use lubridate:: functions to extract and manipulate the different components.

year(now())
[1] 2024
month(now())
[1] 1
month(now(), label = TRUE)
[1] Jan
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
day(now())
[1] 11
yday(now())
[1] 11
wday(now())
[1] 5
wday(now(), label = TRUE)
[1] Thu
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
hour(now())
[1] 11
minute(now())
[1] 34
second(now())
[1] 29.54731

lubridate:: basics

The function now() returns a date-time object, while today() returns just a date. lubridate:: also has functions that allow us to force the date into a date-time and the date-time into a date:

today()
[1] "2024-01-11"
as_datetime(today())
[1] "2024-01-11 UTC"
now()
[1] "2024-01-11 11:34:29 CST"
as_date(now())
[1] "2024-01-11"

lubridate:: basics

With lubridate::, we can also work with dates as a whole, rather than their individual components. Let’s say we have a character string with the date, but we want R to transform it into a date-time object:

ymd("2024-01-11")
[1] "2024-01-11"
ymd_hms("2024-01-11 13:30:00")
[1] "2024-01-11 13:30:00 UTC"
ymd_hms("2024-01-11 13:30:00", tz = "EST")
[1] "2024-01-11 13:30:00 EST"

Time spans


There are three classes we can apply to our date-times so that we can work with them arithmetically 🧙️

  • durations (measured in seconds)
  • periods (measured in human units such as days, weeks, and months)
  • intervals (have a start and end point)

Time spans


Durations

  • fixed length in seconds
ddays(9)
[1] "777600s (~1.29 weeks)"


Periods

  • “human” times
days(9)
[1] "9d 0H 0M 0S"


Intervals

  • a duration with a start/end point
interval(start = today(), 
         end = today() + days(8))
[1] 2024-01-11 UTC--2024-01-19 UTC

Time spans

Arithmetic operators allowed for different classes of date-time data

From R4DS Chapter 16: Dates and times

Supplemental date-time slides


Topics covered:

  • Durations vs Periods vs Intervals

  • WARNING: How to avoid time travel + other nuances

    • Avoiding issues when collaborating
    • Time is a construct: leap years + daylight saving time
  • Classroom CO_2 example

  • First day live coding demo example

Durations: Subtraction


We can use arithmetic operators with durations:

# how long does this class meet for each day? 
class_length <- ymd_hms("2024-01-11 17:30:00") -  ymd_hms("2024-01-11 13:30:00")
class_length
Time difference of 4 hours



Using an as.duration() wrapper, the result will be returned as a duration object:

as.duration(class_length)
[1] "14400s (~4 hours)"

Durations: Multiplication


What if we want to know the total amount of time you get to spend together? 😊

# how many days does this class meet for? 
class_dates <- c(seq(ymd('2024-01-08'),ymd('2024-01-12'), by = 1),
                 seq(ymd('2024-01-16'),ymd('2024-01-19'), by = 1))
class_dates
[1] "2024-01-08" "2024-01-09" "2024-01-10" "2024-01-11" "2024-01-12"
[6] "2024-01-16" "2024-01-17" "2024-01-18" "2024-01-19"


class_length*length(class_dates)
Time difference of 36 hours
as.duration(class_length)*length(class_dates)
[1] "129600s (~1.5 days)"

Durations: Addition


How else might we calculate it?

class_meeting_times <- tibble(dates =  class_dates,
                              week = c(rep(1, 5), rep(2, 4)),
                              start_time = c(rep(hms("13:30:00"), 9)),
                              start_datetime = c(ymd_hms(paste(dates, start_time), tz = "EST")),
                              end_time = c(rep(hms("17:30:00"), 9)),
                              end_datetime = c(ymd_hms(paste(dates, end_time), tz = "EST")),
                              class_time_int = interval(start = start_datetime,
                                                        end = end_datetime))


class(class_meeting_times$dates)
[1] "Date"
class(class_meeting_times$start_time)
[1] "Period"
attr(,"package")
[1] "lubridate"
class(class_meeting_times$class_time_int)
[1] "Interval"
attr(,"package")
[1] "lubridate"


as.duration(sum(as.duration(class_meeting_times$end_time-class_meeting_times$start_time)))
[1] "129600s (~1.5 days)"

Periods


Since periods operate using “human” time, we can add to periods using functions like minutes(), hours(), days(), and weeks()

class_meeting_times$start_time[1] + hours(4)
[1] "17H 30M 0S"

Intervals


Intervals can be created with the interval() function or with %--%. By default, intervals will be created in the date-time format you input.

interval(start = today(), end = ymd("2024-01-19"))
[1] 2024-01-11 UTC--2024-01-19 UTC
today() %--% ymd("2024-01-19")
[1] 2024-01-11 UTC--2024-01-19 UTC


We can use %within% to check whether a date falls within our interval:

class_date_interval <- interval(start = min(ymd(class_dates)), 
                                end = max(ymd(class_dates)))

#check whether a date happens during class
ymd("2024-01-22") %within% class_date_interval
[1] FALSE
ymd("2024-01-11") %within% class_date_interval
[1] TRUE

⚠️ Caution: Working with intervals, durations, and periods


class_meeting_times$start_time[1] + hours(4)
[1] "17H 30M 0S"
class_meeting_times$start_time[1] + dhours(4)
Error: Incompatible classes: <Period> + <Duration>


# only specifying one time zone leaves you vulnerable to time changes between collaborators/devices!
interval(start = now(), 
         end = ymd_hms("2024-01-11 17:30:00", 
                       tz = "EST"))
[1] 2024-01-11 11:34:30 CST--2024-01-11 16:30:00 CST

⚠️ Caution: time is a construct 🫠



Leap years


leap_year(2024)
[1] TRUE
ymd("2024-01-12") - dyears(1)
[1] "2023-01-11 18:00:00 UTC"
ymd("2023-02-28") + ddays(1)
[1] "2023-03-01"
ymd("2024-02-28") + ddays(1)
[1] "2024-02-29"


Daylight saving time

  • durations measure consistent time in seconds
  • periods work more like “human” time



dst(today())
[1] FALSE
dst("2024-03-10 13:30:00")
[1] TRUE
as_datetime(ymd_hms("2024-03-09 13:30:00")) + dhours(24)
[1] "2024-03-10 13:30:00 UTC"
ymd_hms("2024-03-09 13:30:00") + days(1)
[1] "2024-03-10 13:30:00 UTC"



Hodu tip! A picture of Hodu with some nice pink flowers

The parse_date() function from the parsedate:: package is useful for converting a list of messy dates into a standard format!

parsedate::parse_date(c("11 January 2024",
                        "01/11/2024",
                        "01/11/24"))
[1] "2024-01-11 UTC" "2024-01-11 UTC" "2024-01-11 UTC"

Example: Classroom CO₂

On the first day of ID529 in January 2023, I set up an instrument to log indoor temperature, relative humidity, and CO₂ in G2.

The logger (called a HOBO) was set to collect data at 1-second intervals ~15 minutes before class began and ~15 minutes after class ended.

The data were cleaned and are now in “long” format.

glimpse(hobo_g2)
Rows: 51,303
Columns: 3
$ date_time <chr> "2023-01-09T13:15:00Z", "2023-01-09T13:15:00Z", "2023-01-09T…
$ metric    <chr> "temp_f", "rh_percent", "temp_f", "rh_percent", "temp_f", "r…
$ result    <dbl> 71.834, 34.128, 71.834, 33.995, 71.834, 33.929, 71.834, 33.7…

A picture of a HOBO CO2 logger set up in G2 while Jarvis makes some really great points at the podium

Example: Classroom CO₂

jan23_meeting_times <- tibble(dates =  c(seq(ymd('2023-01-09'),ymd('2023-01-13'), by = 1),
                                        seq(ymd('2023-01-17'),ymd('2023-01-20'), by = 1)),
                             week = c(rep(1, 5), rep(2, 4)),
                             start_time = c(rep(hms("13:30:00"), 9)),
                             start_datetime = c(ymd_hms(paste(dates, start_time), tz = "EST")),
                             end_time = c(rep(hms("17:30:00"), 9)),
                             end_datetime = c(ymd_hms(paste(dates, end_time), tz = "EST")),
                             class_time_int = interval(start = start_datetime,
                                                       end = end_datetime))

hobo_g2_dt <- hobo_g2 %>% 
  mutate(metric = factor(metric, levels = c("co2_ppm", "temp_f", "rh_percent"),
                         labels = c("CO2 (ppm)", "Temperature (F)", "Relative Humidity (%)")), 
         date_time = force_tz(as_datetime(date_time), tz = "EST"), 
         time = hms::as_hms(date_time),
         date = as_date(date_time),
         hour = hour(date_time),
         minute = minute(date_time),
         second = second(date_time))

glimpse(hobo_g2_dt)
Rows: 51,303
Columns: 8
$ date_time <dttm> 2023-01-09 13:15:00, 2023-01-09 13:15:00, 2023-01-09 13:15:…
$ metric    <fct> Temperature (F), Relative Humidity (%), Temperature (F), Rel…
$ result    <dbl> 71.834, 34.128, 71.834, 33.995, 71.834, 33.929, 71.834, 33.7…
$ time      <time> 13:15:00, 13:15:00, 13:15:01, 13:15:01, 13:15:02, 13:15:02,…
$ date      <date> 2023-01-09, 2023-01-09, 2023-01-09, 2023-01-09, 2023-01-09,…
$ hour      <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, …
$ minute    <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, …
$ second    <dbl> 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, …

Example: Classroom CO₂

ggplot(hobo_g2_dt, aes(x = date_time, y = result))  + 
  geom_line() + 
  facet_wrap(~metric, scales = "free_y", ncol = 1) + 
  scale_x_datetime(breaks = scales::date_breaks("30 mins"), date_labels = "%H:%M") + 
  xlab("Time") + 
  ylab("") +
  ggtitle(paste0("Indoor conditions in G2 (", unique(hobo_g2_dt$date), ")"))

Example: Classroom CO₂


Often, we have really granular data that we want to report in some aggregate form. This data is measured in 1-second intervals, but let’s say we wanted to calculate a 1-minute average or an hourly average.

Because we separated date_time into its various components, we can now use group_by() and summarize() to calculate averages.

hobo_g2_1min <- hobo_g2_dt %>% 
  group_by(metric, date, hour, minute) %>% 
  summarize(avg_1min = mean(result)) %>% 
  mutate(date_time = ymd_hm(paste0(date, " ", hour, ":", minute), tz = "EST"))

hobo_g2_1hr <- hobo_g2_dt %>% 
  group_by(metric, date, hour) %>% 
  summarize(avg_1hr = mean(result)) %>% 
  mutate(date_time = ymd_h(paste0(date, hour), tz = "EST"))

Example: Classroom CO₂

ggplot(hobo_g2_dt, aes(x = date_time, y = result))  + 
  geom_line(color = "lightgrey", size = 1) + 
  facet_wrap(~metric, scales = "free_y", ncol = 1) + 
  theme_bw()

Example: Classroom CO₂

ggplot(hobo_g2_dt, aes(x = date_time, y = result))  + 
  geom_line(color = "lightgrey", size = 1) + 
  geom_line(hobo_g2_1min, mapping = aes(x = date_time, y = avg_1min), color = "slateblue3") + 
  facet_wrap(~metric, scales = "free_y", ncol = 1) + 
  theme_bw()

Example: Classroom CO₂

ggplot(hobo_g2_dt, aes(x = date_time, y = result))  + 
  geom_line(color = "lightgrey", size = 1) + 
  geom_line(hobo_g2_1min, mapping = aes(x = date_time, y = avg_1min), color = "slateblue3") + 
  geom_line(hobo_g2_1hr, mapping = aes(x = date_time, y = avg_1hr), color = "paleturquoise2", alpha = 0.7) + 
  facet_wrap(~metric, scales = "free_y", ncol = 1) + 
  theme_bw()

Example: Classroom CO₂


Another thing we might be interested in is whether a measurement occurred during a specific interval, for example, during class time.


We can use %within%, which works similarly to %in% but for date-times.


table(hobo_g2_dt$date_time %within% interval(start = ymd_hms("2023-01-09 13:30:00", tz = "EST"),
                                             end = ymd_hms("2023-01-09 17:30:00", tz = "EST")))

FALSE  TRUE 
 8100 43203 

Example: Classroom CO₂

ggplot(hobo_g2_dt)  + 
  geom_line(aes(x = date_time, y = result), color = "lightgrey", size = 1) + 
  geom_line(hobo_g2_1min, mapping = aes(x = date_time, y = avg_1min), color = "slateblue3") + 
  geom_line(hobo_g2_1hr, mapping = aes(x = date_time, y = avg_1hr), color = "paleturquoise2") +
  geom_rect(data = jan23_meeting_times %>%
              filter(dates %in% hobo_g2_dt$date),
              mapping = aes(xmin = start_datetime,
                            xmax = end_datetime,
                            ymin = 0, ymax = Inf),
              alpha = 0.2, fill = "lightpink") +
  facet_wrap(~metric, scales = "free_y", ncol = 1) + 
  theme_bw()

Example: COVID data


If you can remember back to the demo on the first day, Christian demonstrated how factors and date-times fit into his workflow:

covid <- list(
  readr::read_csv("us-counties-2020.csv"),
  readr::read_csv("us-counties-2021.csv"),
  readr::read_csv("us-counties-2022.csv")
)

# convert to 1 data frame
covid <- bind_rows(covid)


# cleaning covid data -----------------------------------------------------------

# create year_month variable
covid$year_month <- paste0(lubridate::year(covid$date), "-",
                           lubridate::month(covid$date))

# aggregate/summarize by year and month by county
covid <- covid |>
  group_by(geoid, county, state, year_month) |>
  summarize(deaths_avg_per_100k = mean(deaths_avg_per_100k, na.rm=TRUE))

# cast year_month to a factor
year_month_levels <- paste0(rep(2020:2023, each = 12), "-", rep(1:12, 4))
covid$year_month <- factor(covid$year_month, levels = year_month_levels)

Example: COVID data

ggplot(
  covid_by_poverty_level |> filter(! is.na(poverty_cut)),
  aes(x = year_month, 
      y = deaths_avg_per_100k, 
      color = poverty_cut,
      group = poverty_cut)) + 
  geom_line() + 
  scale_color_brewer(palette = 'RdBu', direction = -1) + 
  xlab("Date") + 
  ylab("COVID-19 Mortality per 100k (monthly observations)") + 
  ggtitle("Monthly County COVID-19 Mortality Estimates by Poverty Level in the US") + 
  theme(axis.text.x = element_text(angle = 75, hjust = 1))
Plot showing monthly county COVID-19 mortality estimates by poverty level in the US

Key takeaways

  • Knowing how to manipulate factors and date-times can save you a ton of headaches – you'll have a lot more control over your data, which can help with cleaning, analysis, and visualization!

  • forcats:: and lubridate:: give you a lot of the functionality you might need

    • hms:: is another package for working with times (it stores a time of day as seconds since 00:00:00, so you can easily convert between numeric and hms; a tiny sketch follows these takeaways)
  • Factors and date-times can be tricky

    • Double check things are working as you expect along the way!!

    • If you can’t figure out why something isn’t working, take a break, and revisit it with fresh eyes.

    • There’s lots of documentation out there! The cheatsheets are great, as is The Epidemiologist R Handbook (https://epirhandbook.com/en/working-with-dates.html#working-with-dates-1)
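As a tiny sketch of that hms:: behavior (assuming the hms package is installed):

library(hms)

x <- as_hms("13:30:00")

# a time of day is stored as seconds since midnight
as.numeric(x)
[1] 48600

# and numeric seconds convert straight back to a time of day (13:30:00)
as_hms(48600)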