Text Manipulation with R

ID 529: Data Management and Analytic Workflows in R

Dean Marengi | Wednesday, January 10th, 2024

Motivation

  • We’ve now been introduced to dplyr and how it can be used to:
    • Write efficient R code to perform basic data manipulation tasks
    • Chain together data manipulation operations in a concise, readable sequence


  • But we have not yet discussed methods for handling and manipulating text data
    • Text manipulation is an important component of the data cleaning process
    • Text strings introduce many challenges, including:
      • Data import issues (data types, data structures)
      • Inconsistent variable names
      • Poor standardization of data


  • Often we need to process and clean text strings before data analysis can occur

Learning objectives

  • Understand the basic principles of text manipulation
    • The ways that text can be handled and manipulated
    • How regular expressions (regexps) can be used to match complex patterns
  • Learn about different approaches for text manipulation in R
    • Base R functions for text manipulation
    • The stringr package, which can help to simplify text manipulation
  • Learn how to implement stringr functions to process text, and prepare data for analysis
    • We will discuss simple examples for some key functions for text manipulation
    • Then work through an applied example to implement these functions
      • To address common text manipulation challenges
      • To demonstrate how stringr functions work together with dplyr

Background

How is text different from other data types?

  • Text strings can contain inconsistent data
    • Unstructured
    • Upper and lower case letters
    • Special characters
    • Punctuation
    • White-space
    • Misspellings
    • Abbreviations
  • Data containing such issues must first be standardized before it is used in an analysis
    • Inconsistent text strings are a common data cleaning challenge in research!

Text Manipulation in Research

  • Standardizing variable names to a common format
    • Convert to all lower-case or simplifying names
    • Remove punctuation or leading/trailing white-space
    • Replace spaces with “_”
  • Pre-processing and cleaning data
    • Standardize data contained in each column
    • Remove text or punctuation in otherwise numeric columns
    • Make consistent ‘levels’ for a factor variable
  • Extracting specific information from the dataset
    • Use text data in one column to generate new, or manipulate existing, columns
  • Identifying files to read into R based on file naming conventions
    • E.g., a subset of csv files, each with results for an individual study participant
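As a brief illustration of this last point, a minimal sketch (the folder name and the file-naming pattern are hypothetical):

# List csv files in a (hypothetical) 'data' folder whose names follow
# a convention like 'participant_<id>.csv'
files <- list.files("data", pattern = "^participant_\\d+\\.csv$", full.names = TRUE)

# Read each matching file and stack the results into one data frame
results <- purrr::map_dfr(files, readr::read_csv)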

Text Manipulation in Research (cont.)

  • We won’t always receive clean, curated datasets
    • Manually cleaning data files is poor practice (error-prone and not reproducible)
    • Must understand how to programmatically manipulate text to address common challenges
  • Goal: Systematically implement a series of cleaning steps to prepare data for analysis
    • Often more than one solution for the same problem
    • Implementation of specific cleaning steps may vary from person-to-person
    • Overall sequence of cleaning steps should be similar, and follow an efficient order

Regular Expressions

  • Regular expression (regex) patterns
    • Sequence of text characters that have a specific meaning
    • Can match specific patterns in text strings
      • Use matches to extract, subset, replace, or perform other operations in a string
  • Learning regex takes time
    • Highly-specific syntax
    • Difficult to read and understand at first
    • Patterns vary widely in their complexity
    • Powerful data processing tool for your toolkit once you become more comfortable
# Load tidyverse
library(tidyverse)

# Create a simple text string
string <- "Regex patterns are a useful tool!"

# Match the word 'useful'
str_view_all(string, pattern = "useful")
[1] │ Regex patterns are a <useful> tool!
# Match 'useful', and anything that comes after
str_view_all(string, pattern = "useful.*[^\\!]")
[1] │ Regex patterns are a <useful tool>!
# Match letter 'e' when not followed by 'x' or by a space and then 'a'
str_view_all(string, pattern = "e(?!x|\\sa)")
[1] │ R<e>gex patt<e>rns are a us<e>ful tool!
# Match exclamation point or spaces
str_view_all(string, pattern = "\\!|\\s")
[1] │ Regex< >patterns< >are< >a< >useful< >tool<!>




“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

Jamie Zawinski (1997)


Regex syntax


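A few common regex building blocks, illustrated with the same string as above (a minimal sketch, not a full syntax reference):

# Anchors, character classes, and quantifiers with str_view_all()
string <- "Regex patterns are a useful tool!"

str_view_all(string, "^Regex")    # ^    anchors the match to the start of the string
str_view_all(string, "tool!$")    # $    anchors the match to the end of the string
str_view_all(string, "[aeiou]")   # []   character class: any one of the listed characters
str_view_all(string, "t{2}")      # {n}  the preceding character exactly n times ('tt')
str_view_all(string, "\\w+")      # \\w+ one or more word characters (letters, digits, _)
str_view_all(string, "a.e")       # .    any single character ('a', anything, 'e')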

How do we manipulate text using R?

  • Base R includes several powerful string manipulation functions
  • stringr package also provides functions to perform text manipulation
    • Easy-to-remember function names and syntax
      • All functions prefixed with str_*
    • Fewer arguments and often more intuitive to implement
  • Base R and stringr can both:
    • Standardize letter case and white space
    • Concatenate strings
    • Split strings
    • Locate patterns in strings
    • Extract patterns from strings
    • Replace sub-strings
    • Quantify the occurrence of patterns

Base R text manipulation

Base R text manipulation overview

  • Base R includes a number of useful functions for text manipulation
    • Converting letter case and trimming white space
    • Concatenating and splitting strings
    • Detecting, substituting, and extracting patterns
  • Pattern matching functions
    • grepl(): Detect patterns in a string
    • grep(): Keep strings where pattern matched
    • sub(): Replace first matches
    • gsub(): Replace all matches
    • regmatches(): Extract first or all matches
    • strsplit(): Split string using matched pattern
  • Combine strings
    • paste(): Concatenate with space
    • paste0(): Concatenate with no space
  • Change letter case of strings
    • tolower(): To lower case
    • toupper(): To upper case
  • Remove white space from strings
    • trimws(): Trim leading/trailing white space
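A quick illustrative sketch of a few of these functions applied to a small example vector:

# Small character vector to experiment with
animals <- c("cat", "dog", " penguin ")

# Detect and subset by pattern
grepl("g", animals)                 # logical: which elements contain 'g'
grep("g", animals, value = TRUE)    # keep only the matching elements

# Replace matched patterns
sub("o", "*", animals)              # replace the first 'o' in each element
gsub("o", "*", animals)             # replace every 'o' in each element

# Combine strings, change case, and trim white space
paste(animals[1], animals[2])       # "cat dog"
paste0(animals[1], animals[2])      # "catdog"
toupper(animals[1])                 # "CAT"
trimws(animals[3])                  # "penguin"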

stringr

stringr overview

  • Part of the core tidyverse package ecosystem
  • Consistent functions for performing string manipulation
  • Functions designed to work with dplyr verbs
    • Text manipulation can be performed in chained sequences with other data cleaning steps!
  • Core functions we will discuss:
    • str_detect(): Detect patterns in a string
    • str_subset(): Keep strings where pattern matched
    • str_replace(): Replace first matched pattern
    • str_replace_all(): Replace all matched patterns
    • str_extract(): Extract first matched pattern
    • str_extract_all(): Extract all matched patterns
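As a quick demonstration of how these functions pair with dplyr verbs, a minimal sketch using a made-up data frame:

# Load the tidyverse (stringr and dplyr included)
library(tidyverse)

# Made-up data frame for illustration
meds <- tibble(id = 1:3,
               medication = c("Zoloft - Depression", "None", "Celexa for Anxiety"))

meds %>%
  # Keep rows where some medication was reported
  filter(!str_detect(medication, "None")) %>%
  # Standardize case, then pull out the drug name (letters before the first space)
  mutate(medication = str_to_lower(medication),
         drug = str_extract(medication, "^[a-z]+"))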

Thinking about vectors

  • We talk a lot about vectors on the subsequent slides (terminology can be confusing!!)
  • A vector can contain a lot of different things (file names, numbers, words, etc.)
  • A simple way to conceptualize a vector is by thinking of it as a column in a table of data
vector <- c("element 1", "element 2", "...", "element n")
data.frame(column = vector)
##      column
## 1 element 1
## 2 element 2
## 3       ...
## 4 element n

stringr case conversion and whitespace


  • Change letter case of string
    • str_to_lower(): To lower-case
    • str_to_upper(): To upper-case
    • str_to_title(): To title-case
    • str_to_sentence(): To sentence-case
    • Default locale is English
  • Remove leading/trailing spaces from a string
    • str_trim(): Trim white space
    • str_squish(): Trim white space/extra space
# Load stringr library
library(stringr)

# Create a simple text string
text_string <- "  Text manipulation with R is    FUN!!!   "

Examples

# To lower case
str_to_lower(text_string)

# To upper case
str_to_upper(text_string)

# To title case
str_to_title(text_string)

# To sentence case
str_to_sentence(text_string)

# Remove white space
str_trim(text_string, side = "both")

# Remove white space, and extra internal spaces
str_squish(text_string)


## [1] "  text manipulation with r is    fun!!!   "
## [1] "  TEXT MANIPULATION WITH R IS    FUN!!!   "
## [1] "  Text Manipulation With R Is    Fun!!!   "
## [1] "  Text manipulation with r is    fun!!!   "
## [1] "Text manipulation with R is    FUN!!!"
## [1] "Text manipulation with R is FUN!!!"

stringr::str_detect()

Function(s): str_detect()

Main arguments

str_detect(string, pattern, negate = FALSE)
  • string = A character vector
  • pattern = The pattern to look for
  • negate = If TRUE, return non-matches

Description

  • Searches for the occurrence of a pattern in a string
  • Returns a logical vector (TRUE or FALSE)
    • Indicates whether the pattern was found in each element of the string
  • Equivalent to grepl()


Examples

# Character vector containing animal names as elements
string <- c("cat", "dog", "penguin", "kangaroo", "lion")

# Elements that match 'dog'
str_detect(string, pattern = "dog")
## [1] FALSE  TRUE FALSE FALSE FALSE

# Elements that do not match 'dog'
str_detect(string, pattern = "dog", negate = TRUE)
## [1]  TRUE FALSE  TRUE  TRUE  TRUE

# Elements that match 'dog' or 'cat'
str_detect(string, pattern = "dog|cat")
## [1]  TRUE  TRUE FALSE FALSE FALSE

# Elements that end with 'n'
str_detect(string, pattern = "n$")
## [1] FALSE FALSE  TRUE FALSE  TRUE

# Elements where 'g' is FOLLOWED BY a vowel
str_detect(string, pattern = "g(?=[aeiou])")
## [1] FALSE FALSE  TRUE  TRUE FALSE

# Elements where 'g' is PRECEDED BY a vowel
str_detect(string, pattern ="(?<=[aeiou])g")
## [1] FALSE  TRUE FALSE FALSE FALSE

# Elements where 'g' is FOLLOWED OR PRECEDED BY a vowel 
str_detect(string, pattern = "g(?=[aeiou])|(?<=[aeiou])g")
## [1] FALSE  TRUE  TRUE  TRUE FALSE

stringr::str_subset()

Function(s): str_subset(), str_which()

Main arguments

str_subset(string, pattern, negate = FALSE)
  • string = A character vector
  • pattern = The pattern to look for
  • negate = If TRUE, returns non-matches

Description

  • Both search for the occurrence of a pattern in a string
  • Both functions return a vector
  • str_subset() returns matched values
  • str_which() returns matched indices
    • Position where matches were found
  • Equivalent to grep()


Examples

string 
## [1] "cat"      "dog"      "penguin"  "kangaroo" "lion"

# Elements that match 'dog'
str_subset(string, "dog")
## [1] "dog"

# Elements that do not match 'dog'
str_subset(string, "dog", negate = TRUE)
## [1] "cat"      "penguin"  "kangaroo" "lion"

# Elements that match 'dog' or 'cat'
str_subset(string, "dog|cat")
## [1] "cat" "dog"

# Elements that end with 'n'
str_subset(string, "n$")
## [1] "penguin" "lion"

# Elements where 'g' is FOLLOWED BY a vowel
str_subset(string, "g(?=[aeiou])")
## [1] "penguin"  "kangaroo"

str_which(string, "g(?=[aeiou])") # returns indices
## [1] 3 4

# Elements where 'g' is PRECEDED BY a vowel
str_subset(string, "(?<=[aeiou])g")
## [1] "dog"

stringr::str_replace()

Function(s): str_replace(), str_replace_all()

Main arguments

str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
  • string = A character vector
  • pattern = The pattern to look for
  • replacement = Character vector of replacements to supplant matches

Description

  • Replaces a matched pattern
  • Returns input vector (with replacements)
  • str_replace(): Replace first match
  • str_replace_all(): Replace all matches
  • Equivalent to sub(), gsub()


Examples

string
## [1] "cat"      "dog"      "penguin"  "kangaroo" "lion"

# Replace letter 'o' for the FIRST match in an element
str_replace(string, pattern = "o", replacement = "*")
## [1] "cat"      "d*g"      "penguin"  "kangar*o" "li*n"

# Replace letter 'o' for ALL matches in an element
str_replace_all(string, "o", "*")
## [1] "cat"      "d*g"      "penguin"  "kangar**" "li*n"

# Replace FIRST vowel matched in an element
str_replace(string, "[aeiou]", "*")
## [1] "c*t"      "d*g"      "p*nguin"  "k*ngaroo" "l*on"

# Replace ALL vowels matched in an element
str_replace_all(string, "[aeiou]", "*")
## [1] "c*t"      "d*g"      "p*ng**n"  "k*ng*r**" "l**n"

# Replace ALL non-vowels matched in an element
str_replace_all(string, "[^aeiou]", "*")
## [1] "*a*"      "*o*"      "*e**ui*"  "*a**a*oo" "*io*"

# Replace o's that occur sequentially in an element ('oo')
# o{2}: read as "o matches exactly 2 times"
str_replace_all(string, "o{2}", "*")
## [1] "cat"     "dog"     "penguin" "kangar*" "lion"

stringr::str_extract()

Function(s): str_extract(), str_extract_all()

Main arguments

str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)
  • string = A character vector
  • pattern = The pattern to look for
  • simplify = If TRUE, returns a character matrix; if FALSE (the default), returns a list of character vectors

Description

  • Extract matching patterns from a string
  • str_extract(): Extract first match
    • Returns character vector
  • str_extract_all(): Extract all matches
    • Returns list of character vectors or a character matrix


Examples

string
## [1] "cat"      "dog"      "penguin"  "kangaroo" "lion"

# Extract first letter 'o' of each vector element
str_extract(string, pattern = "o")
## [1] NA  "o" NA  "o" "o"

# Extract first non-vowel from each element
str_extract(string, pattern = "[^aeiou]")
## [1] "c" "d" "p" "k" "l"

# Extract all non-vowels from each element
str_extract_all(string, pattern = "[^aeiou]")
## [[1]]
## [1] "c" "t"
## 
## [[2]]
## [1] "d" "g"
## 
## [[3]]
## [1] "p" "n" "g" "n"
## 
## [[4]]
## [1] "k" "n" "g" "r"
## 
## [[5]]
## [1] "l" "n"

# Extract character vector for the 5th list element
str_extract_all(string, pattern = "[^aeiou]")[[5]]
## [1] "l" "n"

Case study

Case study: Overview


Coming back to the messy dataset we discussed at the start of the lecture

  • What steps are necessary to prepare these data for analysis?
    • Why are the data problematic as shown?
  • Pseudocode the steps necessary to address the problems you’ve identified.
    • Think about the order in which steps should occur, and why.
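One reasonable ordering of the cleaning steps, sketched as pseudocode comments (the case study that follows walks through roughly this sequence):

# 1. Import the raw csv and inspect the distinct values in each column
# 2. Standardize column names (lower case, trim spaces, consistent separators)
# 3. Standardize values within columns (case, white space, stray punctuation)
# 4. Recode inconsistent response levels (e.g., GAD-7 items) to a common scheme
# 5. Convert text "NA" strings to true missing values
# 6. Coerce columns to their intended types and derive new variables (scores, categories)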

Case study: Import the data

# Load packages
library(kableExtra) # For formatting table output 

# Read in the data
data <- read_csv("data/messy_data.csv", na = character())

# Take a quick look
glimpse(data)
Rows: 15
Columns: 11
$ `Participant ID`   <dbl> 22419, 22491, 22111, 22254, 22248, 22319, 22283, 21582, 21373, 22875, 21866, 21537, 21797, 22313, 226…
$ `Age (Years)`      <dbl> 53, 44, 52, 42, 27, 60, 34, 47, 44, 46, 54, 29, 48, 22, 40
$ `Race/Ethnicity`   <chr> "Asian & Pacific Islander", "Black", "white", "white", "white", "Middle Eastern", "Asian & Pacific Is…
$ `GAD 7 Question 1` <chr> "Nearly every day", "SEVERAL DAYS", "0", "Not at all", "1", "NEarly every Day", "Several days", "2", …
$ `GAD 7 Q2`         <chr> "NEarly every Day", "More than 1/2 the days", "NEarly every Day", "SEVERAL DAYS", "0", "NEarly every …
$ `GAD 7_Q3`         <dbl> 3, 1, 2, 0, 0, 1, 3, 0, 0, 1, 1, 2, 0, 3, 2
$ `GAD 7_Q_4`        <dbl> 0, 0, 3, 3, 2, 1, 0, 0, 3, 2, 2, 1, 3, 2, 0
$ GAD.7.Question.5   <dbl> 2, 3, 2, 0, 1, 1, 0, 2, 1, 2, 3, 3, 0, 0, 3
$ `GAD 7 Question 6` <dbl> 0, 3, 1, 2, 0, 2, 1, 2, 3, 1, 0, 1, 0, 3, 3
$ `GAD 7 Question 7` <dbl> 3, 2, 1, 1, 0, 3, 3, 3, 0, 2, 1, 3, 1, 2, 1
$ Medications        <chr> "none", "None", "None", "Celexa for Anxiety", "Zoloft - Depression", "None", "Zoloft - Depression", "…

Case study: Check distinct values

# Function to get a simple look at all the distinct values that appear in the data
check_distinct <- function(data) {
  # Extract existing data types for each column and store as character vector
  col_types <- map_chr(data, class)
  
  # Helper function to replace NA values and sort the distinct values
  str_clean <- function(x) str_sort(str_replace_na(unique(x), replacement = "."))
  
  # Create data frame view to summarize the messy data
  map(data, ~ str_c(str_clean(as.character(.x)), collapse = " | ")) %>% 
    bind_rows() %>% 
    pivot_longer(everything(), names_to = "col_name", values_to = "distinct_levels") %>% 
    mutate(distinct_count = str_count(distinct_levels, pattern = "(?<!\\.\\s)\\|") + 1) %>%
    add_column(.after = "col_name", type = col_types) %>% 
    kbl() %>%
    kable_styling(font_size = 13)
}

check_distinct(data)
col_name type distinct_levels distinct_count
Participant ID numeric 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
Age (Years) numeric 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
Race/Ethnicity character Asian & Pacific Islander | black | Black | Middle Eastern | NA | white 6
GAD 7 Question 1 character 0 | 1 | 2 | 3 | More than 1/2 the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 10
GAD 7 Q2 character 0 | 1 | More than 1/2 the days | More than half the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 9
GAD 7_Q3 numeric 0 | 1 | 2 | 3 4
GAD 7_Q_4 numeric 0 | 1 | 2 | 3 4
GAD.7.Question.5 numeric 0 | 1 | 2 | 3 4
GAD 7 Question 6 numeric 0 | 1 | 2 | 3 4
GAD 7 Question 7 numeric 0 | 1 | 2 | 3 4
Medications character Buspirone (Anxiety) | Celexa for Anxiety | N/A none | none | None | None Reported | Zoloft - Depression 7

Case study: Clean-up column names

One approach

# Chain together column name clean-up steps
new_data <- data %>% 
  # Convert column names to lower case and remove leading/trailing spaces
  rename_with(~ str_trim(str_to_lower(.))) %>% 
  # Remove spaces (\\s) or periods (\\.) between 'gad' and '7' (i.e., make var prefix gad7)   
  rename_with(~ str_replace_all(., "d\\s7|d\\.7", "d7")) %>% 
  # Remove parentheses, and replace spaces, periods, or forward slashes with "_"
  rename_with(~ str_replace_all(., c("\\(|\\)" = "", "\\s|\\.|\\/" = "_"))) %>% 
  # Name clean-up for specific columns
  rename_with(~ str_replace_all(., c("question_|q_" = "q", 
                                     "ethn.*" = "ethn", 
                                     "med.*" = "psych_meds", 
                                     "ye.*" = "yrs",
                                     ".*id" = "id"))) 

check_distinct(new_data)
col_name type distinct_levels distinct_count
id numeric 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs numeric 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn character Asian & Pacific Islander | black | Black | Middle Eastern | NA | white 6
gad7_q1 character 0 | 1 | 2 | 3 | More than 1/2 the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 10
gad7_q2 character 0 | 1 | More than 1/2 the days | More than half the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 9
gad7_q3 numeric 0 | 1 | 2 | 3 4
gad7_q4 numeric 0 | 1 | 2 | 3 4
gad7_q5 numeric 0 | 1 | 2 | 3 4
gad7_q6 numeric 0 | 1 | 2 | 3 4
gad7_q7 numeric 0 | 1 | 2 | 3 4
psych_meds character Buspirone (Anxiety) | Celexa for Anxiety | N/A none | none | None | None Reported | Zoloft - Depression 7

Case study: Clean-up column names

Another approach

# Write a function for column name clean-up
name_cleanup <- function(data, patterns) {
  data <- data %>%
    # Convert to lower-case and trim white space
    rename_with(~ str_trim(str_to_lower(.))) %>%
    rename_with(~ str_replace_all(., patterns))
  return(data)
}

# Patterns in a "named vector"
# format: pattern = replacement
patterns <- c("d\\s7|d\\.7" = "d7",   # gad7 prefix
              "\\(|\\)" = "",         # parentheses
              "\\s|\\.|\\/" = "_",    # spaces/periods
              "question_|q_" = "q",   # gad7 suffix
              "ethn.*" = "ethn",      # race/ethnicity
              "med.*" = "psych_meds", # medications
              "ye.*" = "yrs",         # age, years suffix
              ".*id" = "id")          # participant id

# Take our original data
data %>% 
  # Then clean-up the variable names
  name_cleanup(., patterns) %>% 
  # Then check-out the result
  check_distinct()
col_name type distinct_levels distinct_count
id numeric 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs numeric 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn character Asian & Pacific Islander | black | Black | Middle Eastern | NA | white 6
gad7_q1 character 0 | 1 | 2 | 3 | More than 1/2 the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 10
gad7_q2 character 0 | 1 | More than 1/2 the days | More than half the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 9
gad7_q3 numeric 0 | 1 | 2 | 3 4
gad7_q4 numeric 0 | 1 | 2 | 3 4
gad7_q5 numeric 0 | 1 | 2 | 3 4
gad7_q6 numeric 0 | 1 | 2 | 3 4
gad7_q7 numeric 0 | 1 | 2 | 3 4
psych_meds character Buspirone (Anxiety) | Celexa for Anxiety | N/A none | none | None | None Reported | Zoloft - Depression 7

Case study: Clean up values

# Begin cleaning up values in the data
new_data <- new_data %>% 
  mutate(
    # Convert strings in character columns to lower case
    # Remove leading and trailing white space
    across(where(is.character), ~ str_trim(str_to_lower(.))),
    # Correct discrepant values in gad7_q1 and _q2 columns
    across(matches("gad.*[1-2]"), ~ str_replace_all(., "1/2", "half")), 
    # Make specific string replacements across columns
    across(everything(), ~ str_replace_all(., c("n/a" = "na", 
                                                     ".*none.*" = "none", 
                                                     "\\s-\\s|\\s\\(" = "_", 
                                                     "\\)" = "", 
                                                     "\\sfor\\s" = "_"))),
    # Replace text string 'na' values with true NA's
    across(everything(), ~ na_if(., "na"))
    ) 

check_distinct(new_data)
col_name type distinct_levels distinct_count
id character 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs character 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn character . | asian & pacific islander | black | middle eastern | white 4
gad7_q1 character 0 | 1 | 2 | 3 | more than half the days | nearly every day | not at all | several days 8
gad7_q2 character 0 | 1 | more than half the days | nearly every day | not at all | several days 6
gad7_q3 character 0 | 1 | 2 | 3 4
gad7_q4 character 0 | 1 | 2 | 3 4
gad7_q5 character 0 | 1 | 2 | 3 4
gad7_q6 character 0 | 1 | 2 | 3 4
gad7_q7 character 0 | 1 | 2 | 3 4
psych_meds character buspirone_anxiety | celexa_anxiety | none | zoloft_depression 4

Case study: Clean up more values

# Correct response values for GAD-7 questions 1 and 2
new_data <- new_data %>% 
  mutate(
    across(matches("gad.*[1-2]"), 
           ~ case_when(
           . == "not at all" ~ "0",
           . == "several days" ~ "1",
           . == "more than half the days" ~ "2",
           . == "nearly every day" ~ "3", 
           TRUE ~ .)
           )
    )

check_distinct(new_data)
col_name type distinct_levels distinct_count
id character 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs character 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn character . | asian & pacific islander | black | middle eastern | white 4
gad7_q1 character 0 | 1 | 2 | 3 4
gad7_q2 character 0 | 1 | 2 | 3 4
gad7_q3 character 0 | 1 | 2 | 3 4
gad7_q4 character 0 | 1 | 2 | 3 4
gad7_q5 character 0 | 1 | 2 | 3 4
gad7_q6 character 0 | 1 | 2 | 3 4
gad7_q7 character 0 | 1 | 2 | 3 4
psych_meds character buspirone_anxiety | celexa_anxiety | none | zoloft_depression 4

Case study: Coerce data types/derive variables

# Coerce data types and generate new variables based on existing columns
new_data <- new_data %>%   
  mutate(
    across(matches("id|age|gad"), as.numeric), 
    indication = str_extract(psych_meds, "(?<=_).*"),
    psych_meds = str_replace_all(psych_meds, "_.*", ""),
    depression = if_else(indication == "depression", 1, 0, missing = NA_real_),
    anxiety = if_else(indication == "anxiety", 1, 0, missing = NA_real_),
    gad7_raw = rowSums(across(matches("gad"))), 
    gad7_cat = 
      case_when(
        gad7_raw <= 4 ~ "minimal",
        gad7_raw %in% c(5:9) ~ "mild",
        gad7_raw %in% c(10:14) ~ "moderate",
        gad7_raw >= 15 ~ "severe"
        ),
    across(matches("race|_cat"), as.factor)
  )
    
check_distinct(new_data)

Case study: Coerce data types/derive variables

col_name type distinct_levels distinct_count
id numeric 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs numeric 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn factor . | asian & pacific islander | black | middle eastern | white 4
gad7_q1 numeric 0 | 1 | 2 | 3 4
gad7_q2 numeric 0 | 1 | 2 | 3 4
gad7_q3 numeric 0 | 1 | 2 | 3 4
gad7_q4 numeric 0 | 1 | 2 | 3 4
gad7_q5 numeric 0 | 1 | 2 | 3 4
gad7_q6 numeric 0 | 1 | 2 | 3 4
gad7_q7 numeric 0 | 1 | 2 | 3 4
psych_meds character buspirone | celexa | none | zoloft 4
indication character . | anxiety | depression 2
depression numeric . | 0 | 1 2
anxiety numeric . | 0 | 1 2
gad7_raw numeric 12 | 13 | 14 | 4 | 7 | 8 | 9 7
gad7_cat factor mild | minimal | moderate 3

Key takeaways

  • Text strings often contain inconsistent data, and are common in research data
    • Data containing such issues must be standardized before it is used in an analysis!
  • Base R and stringr offer a range of functions for manipulating text strings
  • stringr is designed to work with dplyr and other tidyverse packages
    • Text manipulation can be performed in chained sequences with other data cleaning steps!
  • Regular expressions (regex) are a powerful tool for cleaning and preprocessing research data
    • Understanding when and how to implement regex will take some practice!

Resources