Data Analysis From Start to Finish

ID 529: Data Management and Analytic Workflows in R

Dean Marengi | Wednesday, January 17th, 2024

Context for what we’ll be discussing

This session walks through the R methods used to go from raw data to data analysis in an exposure assessment project for “EH 263 Analytical Methods and Exposure Assessment”.

The project aimed to explore the additional contribution of food truck exhaust to ultrafine particle (UFP) concentrations in ambient air, above and beyond background levels.

Methods implemented that we have discussed

Project-based workflows
Reading in data with readr
Conditionals and for loops
Writing functions
Data manipulation using dplyr
Combining (linking) datasets
Text processing using stringr
Working with factors, and with dates and times using lubridate
Data visualizations using ggplot2
Functional programming for efficient file imports and modeling

Why food trucks?

UFPs have an aerodynamic diameter of 0.1 µm or less (Li et al., 2016; Moreno-Ríos et al., 2022)

Diesel exhaust substantially contributes to UFP concentrations

Food trucks often use diesel or gas generators to power truck operations
- These generators are typically run for many hours (i.e., for the duration of time trucks are on-site)

Existing regulations

To reduce ambient air pollution from idling vehicles, the Massachusetts Anti-Idling Law limits engine idling to no more than 5 minutes (MA Department of Environmental Protection, n.d.).
- However, there are no restrictions in the Mass. Anti-Idling Law, or other laws, that aim to constrain air-pollution emissions from food truck generators

Hypotheses

UFP concentrations will be higher when food trucks are present compared with when they are not present

UFP concentrations measured at a 5-meter distance from food trucks will be higher than those measured at a 10-meter distance

Study design overview

Figure A: Sampling site configuration at the HSCP. Condensation Particle Counter (CPC) placement in relation to the HSCP outdoor dining area and food truck parking area is shown in red.

Figure B: Approximate timeline for the sampling protocol. Data collection began at approximately 10:00 AM and ended at approximately 11:30 AM. Food truck arrival times varied, but generally occurred between 10:15 and 11:00 AM.

Data processing problems to solve

Part 1: Compiling and organizing UFP data

Compile 34 raw data files that were exported from CPC device software
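One way the compile step could look, as a minimal sketch rather than the project's actual code. The folder, file names, and column names (`time`, `concentration`) are hypothetical stand-ins, and two tiny files are written first so the example runs on its own.

```r
library(readr)
library(purrr)
library(dplyr)

# Write two small stand-in files to a temporary folder
# (hypothetical names and columns, not the real CPC export format)
raw_dir <- file.path(tempdir(), "cpc_raw_compile_demo")
dir.create(raw_dir, showWarnings = FALSE)
write_csv(
  tibble(time = c("10:00:00", "10:00:01"), concentration = c(5200, 5350)),
  file.path(raw_dir, "FT001_CPC1_5m.csv")
)
write_csv(
  tibble(time = c("10:00:00", "10:00:01"), concentration = c(4100, 4050)),
  file.path(raw_dir, "FT001_CPC2_10m.csv")
)

# List every raw file in the folder
files <- list.files(raw_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file with the same column types, then stack the results.
# Naming the list by file name lets bind_rows() record the source file.
col_spec <- cols(time = col_character(), concentration = col_double())

ufp_raw <- files |>
  set_names(basename(files)) |>
  map(function(f) read_csv(f, col_types = col_spec)) |>
  bind_rows(.id = "file_name")

ufp_raw
```

The same pattern scales from 2 files to 34 without changes, because the file list is built from the folder contents rather than typed by hand.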

For each file (Part A):
- Standardize the file structure
- Derive new variables from existing columns and from metadata embedded in file names
  - A sample ID variable
  - File name
  - Several date/time variables
  - An indicator for which CPC device collected the measurements
  - An indicator for which distance the measurements were collected at (5- versus 10-meter)
- Re-organize the columns
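The Part A steps could be wrapped in a single function that is applied to each file. This is a sketch under assumptions: the file naming pattern (`FT001_CPC1_5m_2023-10-12.csv`), the date in it, the helper name `add_file_metadata()`, and the column names are all hypothetical, chosen only to illustrate the technique.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical helper: derive metadata columns from a file name and
# move them to the front of the data frame.
# Assumes file names look like "FT001_CPC1_5m_2023-10-12.csv" and that
# the data has a character `time` column in HH:MM:SS format.
add_file_metadata <- function(data, source_file) {
  data |>
    mutate(
      file_name   = source_file,
      sample_id   = str_extract(file_name, "FT\\d{3}"),          # e.g., "FT001"
      cpc_id      = str_extract(file_name, "CPC\\d+"),           # e.g., "CPC1"
      distance_m  = as.integer(
        str_extract(file_name, "(?<=_)\\d+(?=m)")                # 5 or 10
      ),
      sample_date = ymd(str_extract(file_name, "\\d{4}-\\d{2}-\\d{2}")),
      date_time   = ymd_hms(paste(sample_date, time))            # date + clock time
    ) |>
    # Re-organize: identifiers and date/time first, measurements after
    select(sample_id, file_name, cpc_id, distance_m,
           sample_date, date_time, everything())
}

# Small made-up example
example <- tibble(
  time = c("10:00:00", "10:00:01"),
  concentration = c(5200, 5350)
)

tidy <- add_file_metadata(example, "FT001_CPC1_5m_2023-10-12.csv")
tidy
```

Keeping the logic in one function means every file is standardized the same way, and the function can be passed to `purrr::map()` when importing all files.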

CPC raw data example file (n=34, each with 5000+ rows)

Part 1: Compiling and organizing UFP data (cont.)

For each file (Part B):
- Reference a sample log to “look up” truck arrival times for a given sample ID (e.g., FT001, FT002, etc.)
- Use this truck arrival time to create an indicator for whether a one-second measurement was taken before or after truck arrival (i.e., Pre-truck versus Post-truck)
  - Note: This is needed because the CPCs collected measurements continuously over the whole measurement period, so all pre- and post-truck arrival measurements are contained within a single file
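The lookup in Part B can be sketched as a join followed by a comparison. The data below are made up, the column names (`truck_arrival`, `period`) are hypothetical, and treating a measurement taken exactly at the arrival second as Post-truck is an assumption.

```r
library(dplyr)
library(lubridate)

# Made-up one-second measurements for a single sample
measurements <- tibble(
  sample_id     = "FT001",
  date_time     = ymd_hms("2023-10-12 10:29:59") + 0:2,  # adds 0, 1, 2 seconds
  concentration = c(5200, 5350, 9800)
)

# Made-up sample log: one row per sample ID with the truck arrival time
sample_log <- tibble(
  sample_id     = c("FT001", "FT002"),
  truck_arrival = ymd_hms(c("2023-10-12 10:30:00", "2023-10-13 10:45:00"))
)

flagged <- measurements |>
  # "Look up" the arrival time for each measurement's sample ID
  left_join(sample_log, by = "sample_id") |>
  mutate(
    # Assumption: the arrival second itself counts as Post-truck
    period = if_else(date_time < truck_arrival, "Pre-truck", "Post-truck"),
    period = factor(period, levels = c("Pre-truck", "Post-truck"))
  )

flagged
```

Storing the indicator as a factor with Pre-truck as the first level keeps plots and model output ordered with the pre-truck period as the reference.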

Sample log example file (n=1)

Part 2: Joining measurement and sample log data with covariate data

The Pre- versus Post-truck indicator in Part 1 allows us to join data from a third file containing covariate data

The covariate data file contained:
- Sample ID
- A pre- versus post-truck indicator variable
- Meteorological parameters (e.g., temperature (°F), % relative humidity, wind speed (mph))
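A minimal sketch of the Part 2 join, assuming the covariate file has one row per sample ID and period. All values and column names (`period`, `temp_f`, `rel_humidity_pct`, `wind_speed_mph`) are made up for illustration.

```r
library(dplyr)

# Made-up measurement data that already carries the pre/post indicator
ufp <- tibble(
  sample_id     = c("FT001", "FT001", "FT001"),
  period        = c("Pre-truck", "Post-truck", "Post-truck"),
  concentration = c(5200, 9800, 10100)
)

# Made-up covariate data: one row per sample ID and period
covariates <- tibble(
  sample_id        = c("FT001", "FT001"),
  period           = c("Pre-truck", "Post-truck"),
  temp_f           = c(58, 61),
  rel_humidity_pct = c(45, 42),
  wind_speed_mph   = c(4, 6)
)

# Two-key join: each measurement picks up the covariates for its own
# sample ID and period, and the number of measurement rows is unchanged
analysis_data <- ufp |>
  left_join(covariates, by = c("sample_id", "period"))

analysis_data
```

Joining on both keys matters here: joining on sample ID alone would match every measurement to both the pre- and post-truck covariate rows and duplicate the data.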