Text Manipulation with R

ID 529: Data Management and Analytic Workflows in R

Dean Marengi | Wednesday, January 10th, 2024

Motivation

  • We’ve now been introduced to dplyr and how it can be used to:
    • Write efficient R code to perform basic data manipulation tasks
    • Chain together data manipulation operations in a concise, readable sequence


  • But we have not yet discussed methods for handling and manipulating text data
    • Text manipulation is an important component of the data cleaning process
    • Text strings introduce many challenges, including:
      • Data import issues (data types, data structures)
      • Inconsistent variable names
      • Poor standardization of data


  • Often we need to process and clean text strings before data analysis can occur

Learning objectives

  • Understand the basic principles of text manipulation
    • The ways that text can be handled and manipulated
    • How regular expressions (regexps) can be used to match complex patterns
  • Learn about different approaches for text manipulation in R
    • Base R functions for text manipulation
    • The stringr package, which can help to simplify text manipulation
  • Learn how to implement stringr functions to process text, and prepare data for analysis
    • We will discuss simple examples for some key functions for text manipulation
    • Then work through an applied example to implement these functions
      • To address common text manipulation challenges
      • To demonstrate how stringr functions work together with dplyr

Background

How is text different from other data types?

  • Text strings can contain inconsistent data
    • Unstructured
    • Upper and lower case letters
    • Special characters
    • Punctuation
    • White-space
    • Misspellings
    • Abbreviations
  • Data containing such issues must first be standardized before it is used in an analysis
    • Inconsistent text strings are a common data cleaning challenge in research!

Text Manipulation in Research

  • Standardizing variable names to a common format
    • Convert to all lower-case or simplifying names
    • Remove punctuation or leading/trailing white-space
    • Replace spaces with “_”
  • Pre-processing and cleaning data
    • Standardize data contained in each column
    • Remove text or punctuation in otherwise numeric columns
    • Make consistent ‘levels’ for a factor variable
  • Extracting specific information from the dataset
    • Use text data in one column to generate new, or manipulate existing, columns
  • Identifying files to read into R based on file naming conventions
    • E.g., a subset of csv files, each with results for an individual study participant
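As a brief illustration of this last point, a minimal sketch (the folder name and the file-naming pattern are hypothetical):

# List csv files in a (hypothetical) 'data' folder whose names follow
# a convention like 'participant_<id>.csv'
files <- list.files("data", pattern = "^participant_\\d+\\.csv$", full.names = TRUE)

# Read each matching file and stack the results into one data frame
results <- purrr::map_dfr(files, readr::read_csv)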

Text Manipulation in Research (cont.)

  • We won’t always receive clean, curated datasets
    • Manually cleaning data files is poor practice (error-prone and not reproducible)
    • Must understand how to programmatically manipulate text to address common challenges
  • Goal: Systematically implement a series of cleaning steps to prepare data for analysis
    • Often more than one solution for the same problem
    • Implementation of specific cleaning steps may vary from person-to-person
    • Overall sequence of cleaning steps should be similar, and follow an efficient order

Regular Expressions

  • Regular expression (regex) patterns
    • Sequence of text characters that have a specific meaning
    • Can match specific patterns in text strings
      • Use matches to extract, subset, replace, or perform other operations in a string
  • Learning regex takes time
    • Highly-specific syntax
    • Difficult to read and understand at first
    • Patterns vary widely in their complexity
    • Powerful data processing tool for your toolkit once you become more comfortable
# Load tidyverse
library(tidyverse)

# Create a simple text string
string <- "Regex patterns are a useful tool!"

# Match the word 'useful'
str_view_all(string, pattern = "useful")
[1] │ Regex patterns are a <useful> tool!
# Match 'useful', and anything that comes after
str_view_all(string, pattern = "useful.*[^\\!]")
[1] │ Regex patterns are a <useful tool>!
# Match letter 'e' when not followed by 'x' or by a space and then 'a'
str_view_all(string, pattern = "e(?!x|\\sa)")
[1] │ R<e>gex patt<e>rns are a us<e>ful tool!
# Match exclamation point or spaces
str_view_all(string, pattern = "\\!|\\s")
[1] │ Regex< >patterns< >are< >a< >useful< >tool<!>




“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

Jamie Zawinski (1997)


Regex syntax


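A few common regex building blocks, illustrated with the same string as above (a minimal sketch, not a full syntax reference):

# Anchors, character classes, and quantifiers with str_view_all()
string <- "Regex patterns are a useful tool!"

str_view_all(string, "^Regex")    # ^    anchors the match to the start of the string
str_view_all(string, "tool!$")    # $    anchors the match to the end of the string
str_view_all(string, "[aeiou]")   # []   character class: any one of the listed characters
str_view_all(string, "t{2}")      # {n}  the preceding character exactly n times ('tt')
str_view_all(string, "\\w+")      # \\w+ one or more word characters (letters, digits, _)
str_view_all(string, "a.e")       # .    any single character ('a', anything, 'e')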

How do we manipulate text using R?

  • Base R includes several powerful string manipulation functions
  • stringr package also provides functions to perform text manipulation
    • Easy-to-remember function names and syntax
      • All functions prefixed with str_*
    • Fewer arguments and often more intuitive to implement
  • Base R and stringr can both:
    • Standardize letter case and white space
    • Concatenate strings
    • Split strings
    • Locate patterns in strings
    • Extract patterns from strings
    • Replace sub-strings
    • Quantify the occurrence of patterns

Base R text manipulation

Base R text manipulation overview

  • Base R includes a number of useful functions for text manipulation
    • Converting letter case and trimming white space
    • Concatenating and splitting strings
    • Detecting, substituting, and extracting patterns
  • Pattern matching functions
    • grepl(): Detect patterns in a string
    • grep(): Keep strings where pattern matched
    • sub(): Replace first matches
    • gsub(): Replace all matches
    • regmatches(): Extract first or all matches
    • strsplit(): Split string using matched pattern
  • Combine strings
    • paste(): Concatenate with space
    • paste0(): Concatenate with no space
  • Change letter case of strings
    • tolower(): To lower case
    • toupper(): To upper case
  • Remove white space from strings
    • trimws(): Trim leading/trailing white space
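A quick illustrative sketch of a few of these functions applied to a small example vector:

# Small character vector to experiment with
animals <- c("cat", "dog", " penguin ")

# Detect and subset by pattern
grepl("g", animals)                 # logical: which elements contain 'g'
grep("g", animals, value = TRUE)    # keep only the matching elements

# Replace matched patterns
sub("o", "*", animals)              # replace the first 'o' in each element
gsub("o", "*", animals)             # replace every 'o' in each element

# Combine strings, change case, and trim white space
paste(animals[1], animals[2])       # "cat dog"
paste0(animals[1], animals[2])      # "catdog"
toupper(animals[1])                 # "CAT"
trimws(animals[3])                  # "penguin"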

stringr

stringr overview

  • Part of the core tidyverse package ecosystem
  • Consistent functions for performing string manipulation
  • Functions designed to work with dplyr verbs
    • Text manipulation can be performed in chained sequences with other data cleaning steps!
  • Core functions we will discuss:
    • str_detect(): Detect patterns in a string
    • str_subset(): Keep strings where pattern matched
    • str_replace(): Replace first matched pattern
    • str_replace_all(): Replace all matched patterns
    • str_extract(): Extract first matched pattern
    • str_extract_all(): Extract all matched patterns
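As a quick demonstration of how these functions pair with dplyr verbs, a minimal sketch using a made-up data frame:

# Load the tidyverse (stringr and dplyr included)
library(tidyverse)

# Made-up data frame for illustration
meds <- tibble(id = 1:3,
               medication = c("Zoloft - Depression", "None", "Celexa for Anxiety"))

meds %>%
  # Keep rows where some medication was reported
  filter(!str_detect(medication, "None")) %>%
  # Standardize case, then pull out the drug name (letters before the first space)
  mutate(medication = str_to_lower(medication),
         drug = str_extract(medication, "^[a-z]+"))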

Thinking about vectors

  • We talk a lot about vectors on the subsequent slides (terminology can be confusing!!)
  • A vector can contain a lot of different things (file names, numbers, words, etc.)
  • A simple way to conceptualize a vector is by thinking of it as a column in a table of data
vector <- c("element 1", "element 2", "...", "element n")
data.frame(column = vector)
##      column
## 1 element 1
## 2 element 2
## 3       ...
## 4 element n

stringr case conversion and whitespace


  • Change letter case of string
    • str_to_lower(): To lower-case
    • str_to_upper(): To upper-case
    • str_to_title(): To title-case
    • str_to_sentence(): To sentence-case
    • Default locale is English
  • Remove leading/trailing spaces from a string
    • str_trim(): Trim white space
    • str_squish(): Trim white space/extra space
# Load stringr library
library(stringr)

# Create a simple text string
text_string <- "  Text manipulation with R is    FUN!!!   "

Examples

# To lower case
str_to_lower(text_string)

# To upper case
str_to_upper(text_string)

# To title case
str_to_title(text_string)

# To sentence case
str_to_sentence(text_string)

# Remove white space
str_trim(text_string, side = "both")

# Remove white space, and extra internal spaces
str_squish(text_string)


## [1] "  text manipulation with r is    fun!!!   "
## [1] "  TEXT MANIPULATION WITH R IS    FUN!!!   "
## [1] "  Text Manipulation With R Is    Fun!!!   "
## [1] "  Text manipulation with r is    fun!!!   "
## [1] "Text manipulation with R is    FUN!!!"
## [1] "Text manipulation with R is FUN!!!"

stringr::str_detect()

Function(s): str_detect()

Main arguments

str_detect(string, pattern, negate = FALSE)
  • string = A character vector
  • pattern = The pattern to look for
  • negate = If TRUE, return non-matches

Description

  • Searches for the occurrence of a pattern in a string
  • Returns a logical vector (TRUE or FALSE)
    • Indicates whether the pattern was found in each element of the string
  • Equivalent to grepl()


Examples

# Character vector containing animal names as elements
string <- c("cat", "dog", "penguin", "kangaroo", "lion")

# Elements that match 'dog'
str_detect(string, pattern = "dog")
## [1] FALSE  TRUE FALSE FALSE FALSE

# Elements that do not match 'dog'
str_detect(string, pattern = "dog", negate = TRUE)
## [1]  TRUE FALSE  TRUE  TRUE  TRUE

# Elements that match 'dog' or 'cat'
str_detect(string, pattern = "dog|cat")
## [1]  TRUE  TRUE FALSE FALSE FALSE

# Elements that end with 'n'
str_detect(string, pattern = "n$")
## [1] FALSE FALSE  TRUE FALSE  TRUE

# Elements where 'g' is FOLLOWED BY a vowel
str_detect(string, pattern = "g(?=[aeiou])")
## [1] FALSE FALSE  TRUE  TRUE FALSE

# Elements where 'g' is PRECEDED BY a vowel
str_detect(string, pattern ="(?<=[aeiou])g")
## [1] FALSE  TRUE FALSE FALSE FALSE

# Elements where 'g' is FOLLOWED OR PRECEDED BY a vowel 
str_detect(string, pattern = "g(?=[aeiou])|(?<=[aeiou])g")
## [1] FALSE  TRUE  TRUE  TRUE FALSE

stringr::str_subset()

Function(s): str_subset(), str_which()

Main arguments

str_subset(string, pattern, negate = FALSE)
  • string = A character vector
  • pattern = The pattern to look for
  • negate = If TRUE, returns non-matches

Description

  • Both search for the occurrence of a pattern in a string
  • Both functions return a vector
  • str_subset() returns matched values
  • str_which() returns matched indices
    • Position where matches were found
  • Equivalent to grep()


Examples

string 
## [1] "cat"      "dog"      "penguin"  "kangaroo" "lion"

# Elements that match 'dog'
str_subset(string, "dog")
## [1] "dog"

# Elements that do not match 'dog'
str_subset(string, "dog", negate = TRUE)
## [1] "cat"      "penguin"  "kangaroo" "lion"

# Elements that match 'dog' or 'cat'
str_subset(string, "dog|cat")
## [1] "cat" "dog"

# Elements that end with 'n'
str_subset(string, "n$")
## [1] "penguin" "lion"

# Elements where 'g' is FOLLOWED BY a vowel
str_subset(string, "g(?=[aeiou])")
## [1] "penguin"  "kangaroo"

str_which(string, "g(?=[aeiou])") # returns indices
## [1] 3 4

# Elements where 'g' is PRECEDED BY a vowel
str_subset(string, "(?<=[aeiou])g")
## [1] "dog"

stringr::str_replace()

Function(s): str_replace(), str_replace_all()

Main arguments

str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
  • string = A character vector
  • pattern = The pattern to look for
  • replacement = Character vector of replacements to supplant matches

Description

  • Replaces a matched pattern
  • Returns input vector (with replacements)
  • str_replace(): Replace first match
  • str_replace_all(): Replace all matches
  • Equivalent to sub(), gsub()


Examples

string
## [1] "cat"      "dog"      "penguin"  "kangaroo" "lion"

# Replace letter 'o' for the FIRST match in an element
str_replace(string, pattern = "o", replacement = "*")
## [1] "cat"      "d*g"      "penguin"  "kangar*o" "li*n"

# Replace letter 'o' for ALL matches in an element
str_replace_all(string, "o", "*")
## [1] "cat"      "d*g"      "penguin"  "kangar**" "li*n"

# Replace FIRST vowel matched in an element
str_replace(string, "[aeiou]", "*")
## [1] "c*t"      "d*g"      "p*nguin"  "k*ngaroo" "l*on"

# Replace ALL vowels matched in an element
str_replace_all(string, "[aeiou]", "*")
## [1] "c*t"      "d*g"      "p*ng**n"  "k*ng*r**" "l**n"

# Replace ALL non-vowels matched in an element
str_replace_all(string, "[^aeiou]", "*")
## [1] "*a*"      "*o*"      "*e**ui*"  "*a**a*oo" "*io*"

# Replace o's that occur sequentially in an element ('oo')
# o{2}: read as "o matches exactly 2 times"
str_replace_all(string, "o{2}", "*")
## [1] "cat"     "dog"     "penguin" "kangar*" "lion"

stringr::str_extract()

Function(s): str_extract(), str_extract_all()

Main arguments

str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)
  • string = A character vector
  • pattern = The pattern to look for
  • simplify = If TRUE, returns a character matrix; if FALSE (the default), returns a list of character vectors

Description

  • Extract matching patterns from a string
  • str_extract(): Extract first match
    • Returns character vector
  • str_extract_all(): Extract all matches
    • Returns list of character vectors or a character matrix


Examples

string
## [1] "cat"      "dog"      "penguin"  "kangaroo" "lion"

# Extract first letter 'o' of each vector element
str_extract(string, pattern = "o")
## [1] NA  "o" NA  "o" "o"

# Extract first non-vowel from each element
str_extract(string, pattern = "[^aeiou]")
## [1] "c" "d" "p" "k" "l"

# Extract all non-vowels from each element
str_extract_all(string, pattern = "[^aeiou]")
## [[1]]
## [1] "c" "t"
## 
## [[2]]
## [1] "d" "g"
## 
## [[3]]
## [1] "p" "n" "g" "n"
## 
## [[4]]
## [1] "k" "n" "g" "r"
## 
## [[5]]
## [1] "l" "n"

# Extract character vector for the 5th list element
str_extract_all(string, pattern = "[^aeiou]")[[5]]
## [1] "l" "n"

Case study

Case study: Overview


Coming back to the messy dataset we discussed at the start of the lecture

  • What steps are necessary to prepare these data for analysis?
    • Why are the data problematic as shown?
  • Pseudocode the steps necessary to address the problems you’ve identified.
    • Think about the order in which steps should occur, and why.
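One reasonable ordering of the cleaning steps, sketched as pseudocode comments (the case study that follows walks through roughly this sequence):

# 1. Import the raw csv and inspect the distinct values in each column
# 2. Standardize column names (lower case, trim spaces, consistent separators)
# 3. Standardize values within columns (case, white space, stray punctuation)
# 4. Recode inconsistent response levels (e.g., GAD-7 items) to a common scheme
# 5. Convert text "NA" strings to true missing values
# 6. Coerce columns to their intended types and derive new variables (scores, categories)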

Case study: Import the data

# Load packages
library(kableExtra) # For formatting table output 

# Read in the data
data <- read_csv("data/messy_data.csv", na = character())

# Take a quick look
glimpse(data)
Rows: 15
Columns: 11
$ `Participant ID`   <dbl> 22419, 22491, 22111, 22254, 22248, 22319, 22283, 21582, 21373, 22875, 21866, 21537, 21797, 22313, 226…
$ `Age (Years)`      <dbl> 53, 44, 52, 42, 27, 60, 34, 47, 44, 46, 54, 29, 48, 22, 40
$ `Race/Ethnicity`   <chr> "Asian & Pacific Islander", "Black", "white", "white", "white", "Middle Eastern", "Asian & Pacific Is…
$ `GAD 7 Question 1` <chr> "Nearly every day", "SEVERAL DAYS", "0", "Not at all", "1", "NEarly every Day", "Several days", "2", …
$ `GAD 7 Q2`         <chr> "NEarly every Day", "More than 1/2 the days", "NEarly every Day", "SEVERAL DAYS", "0", "NEarly every …
$ `GAD 7_Q3`         <dbl> 3, 1, 2, 0, 0, 1, 3, 0, 0, 1, 1, 2, 0, 3, 2
$ `GAD 7_Q_4`        <dbl> 0, 0, 3, 3, 2, 1, 0, 0, 3, 2, 2, 1, 3, 2, 0
$ GAD.7.Question.5   <dbl> 2, 3, 2, 0, 1, 1, 0, 2, 1, 2, 3, 3, 0, 0, 3
$ `GAD 7 Question 6` <dbl> 0, 3, 1, 2, 0, 2, 1, 2, 3, 1, 0, 1, 0, 3, 3
$ `GAD 7 Question 7` <dbl> 3, 2, 1, 1, 0, 3, 3, 3, 0, 2, 1, 3, 1, 2, 1
$ Medications        <chr> "none", "None", "None", "Celexa for Anxiety", "Zoloft - Depression", "None", "Zoloft - Depression", "…

Case study: Check distinct values

# Function to get a simple look at all the distinct values that appear in the data
check_distinct <- function(data) {
  # Extract existing data types for each column and store as character vector
  col_types <- map_chr(data, class)
  
  # Helper function to replace NA values and sort the distinct values
  str_clean <- function(x) str_sort(str_replace_na(unique(x), replacement = "."))
  
  # Create data frame view to summarize the messy data
  map(data, ~ str_c(str_clean(as.character(.x)), collapse = " | ")) %>% 
    bind_rows() %>% 
    pivot_longer(everything(), names_to = "col_name", values_to = "distinct_levels") %>% 
    mutate(distinct_count = str_count(distinct_levels, pattern = "(?<!\\.\\s)\\|") + 1) %>%
    add_column(.after = "col_name", type = col_types) %>% 
    kbl() %>%
    kable_styling(font_size = 13)
}

check_distinct(data)
col_name type distinct_levels distinct_count
Participant ID numeric 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
Age (Years) numeric 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
Race/Ethnicity character Asian & Pacific Islander | black | Black | Middle Eastern | NA | white 6
GAD 7 Question 1 character 0 | 1 | 2 | 3 | More than 1/2 the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 10
GAD 7 Q2 character 0 | 1 | More than 1/2 the days | More than half the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 9
GAD 7_Q3 numeric 0 | 1 | 2 | 3 4
GAD 7_Q_4 numeric 0 | 1 | 2 | 3 4
GAD.7.Question.5 numeric 0 | 1 | 2 | 3 4
GAD 7 Question 6 numeric 0 | 1 | 2 | 3 4
GAD 7 Question 7 numeric 0 | 1 | 2 | 3 4
Medications character Buspirone (Anxiety) | Celexa for Anxiety | N/A none | none | None | None Reported | Zoloft - Depression 7

Case study: Clean-up column names

One approach

# Chain together column name clean-up steps
new_data <- data %>% 
  # Convert column names to lower case and remove leading/trailing spaces
  rename_with(~ str_trim(str_to_lower(.))) %>% 
  # Remove spaces (\\s) or periods (\\.) between 'gad' and '7' (i.e., make var prefix gad7)   
  rename_with(~ str_replace_all(., "d\\s7|d\\.7", "d7")) %>% 
  # Remove parentheses, and replace spaces, periods, or forward slashes with "_"
  rename_with(~ str_replace_all(., c("\\(|\\)" = "", "\\s|\\.|\\/" = "_"))) %>% 
  # Name clean-up for specific columns
  rename_with(~ str_replace_all(., c("question_|q_" = "q", 
                                     "ethn.*" = "ethn", 
                                     "med.*" = "psych_meds", 
                                     "ye.*" = "yrs",
                                     ".*id" = "id"))) 

check_distinct(new_data)
col_name type distinct_levels distinct_count
id numeric 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs numeric 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn character Asian & Pacific Islander | black | Black | Middle Eastern | NA | white 6
gad7_q1 character 0 | 1 | 2 | 3 | More than 1/2 the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 10
gad7_q2 character 0 | 1 | More than 1/2 the days | More than half the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 9
gad7_q3 numeric 0 | 1 | 2 | 3 4
gad7_q4 numeric 0 | 1 | 2 | 3 4
gad7_q5 numeric 0 | 1 | 2 | 3 4
gad7_q6 numeric 0 | 1 | 2 | 3 4
gad7_q7 numeric 0 | 1 | 2 | 3 4
psych_meds character Buspirone (Anxiety) | Celexa for Anxiety | N/A none | none | None | None Reported | Zoloft - Depression 7

Case study: Clean-up column names

Another approach

# Write a function for column name clean-up
name_cleanup <- function(data, patterns) {
  data <- data %>%
    # Convert to lower-case and trim white space
    rename_with(~ str_trim(str_to_lower(.))) %>%
    rename_with(~ str_replace_all(., patterns))
  return(data)
}

# Patterns in a "named vector"
# format: pattern = replacement
patterns <- c("d\\s7|d\\.7" = "d7",   # gad7 prefix
              "\\(|\\)" = "",         # parentheses
              "\\s|\\.|\\/" = "_",    # spaces/periods
              "question_|q_" = "q",   # gad7 suffix
              "ethn.*" = "ethn",      # race/ethnicity
              "med.*" = "psych_meds", # medications
              "ye.*" = "yrs",         # age, years suffix
              ".*id" = "id")          # participant id

# Take our original data
data %>% 
  # Then clean-up the variable names
  name_cleanup(., patterns) %>% 
  # Then check-out the result
  check_distinct()
col_name type distinct_levels distinct_count
id numeric 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs numeric 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn character Asian & Pacific Islander | black | Black | Middle Eastern | NA | white 6
gad7_q1 character 0 | 1 | 2 | 3 | More than 1/2 the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 10
gad7_q2 character 0 | 1 | More than 1/2 the days | More than half the days | Nearly every day | NEarly every Day | Not at all | Several days | SEVERAL DAYS 9
gad7_q3 numeric 0 | 1 | 2 | 3 4
gad7_q4 numeric 0 | 1 | 2 | 3 4
gad7_q5 numeric 0 | 1 | 2 | 3 4
gad7_q6 numeric 0 | 1 | 2 | 3 4
gad7_q7 numeric 0 | 1 | 2 | 3 4
psych_meds character Buspirone (Anxiety) | Celexa for Anxiety | N/A none | none | None | None Reported | Zoloft - Depression 7

Case study: Clean up values

# Begin cleaning up values in the data
new_data <- new_data %>% 
  mutate(
    # Convert strings in character columns to lower case
    # Remove leading and trailing white space
    across(where(is.character), ~ str_trim(str_to_lower(.))),
    # Correct discrepant values in gad7_q1 and _q2 columns
    across(matches("gad.*[1-2]"), ~ str_replace_all(., "1/2", "half")), 
    # Make specific string replacements across columns
    across(everything(), ~ str_replace_all(., c("n/a" = "na", 
                                                     ".*none.*" = "none", 
                                                     "\\s-\\s|\\s\\(" = "_", 
                                                     "\\)" = "", 
                                                     "\\sfor\\s" = "_"))),
    # Replace text string 'na' values with true NA's
    across(everything(), ~ na_if(., "na"))
    ) 

check_distinct(new_data)
col_name type distinct_levels distinct_count
id character 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs character 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn character . | asian & pacific islander | black | middle eastern | white 4
gad7_q1 character 0 | 1 | 2 | 3 | more than half the days | nearly every day | not at all | several days 8
gad7_q2 character 0 | 1 | more than half the days | nearly every day | not at all | several days 6
gad7_q3 character 0 | 1 | 2 | 3 4
gad7_q4 character 0 | 1 | 2 | 3 4
gad7_q5 character 0 | 1 | 2 | 3 4
gad7_q6 character 0 | 1 | 2 | 3 4
gad7_q7 character 0 | 1 | 2 | 3 4
psych_meds character buspirone_anxiety | celexa_anxiety | none | zoloft_depression 4

Case study: Clean up more values

# Correct response values for GAD-7 questions 1 and 2
new_data <- new_data %>% 
  mutate(
    across(matches("gad.*[1-2]"), 
           ~ case_when(
           . == "not at all" ~ "0",
           . == "several days" ~ "1",
           . == "more than half the days" ~ "2",
           . == "nearly every day" ~ "3", 
           TRUE ~ .)
           )
    )

check_distinct(new_data)
col_name type distinct_levels distinct_count
id character 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs character 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn character . | asian & pacific islander | black | middle eastern | white 4
gad7_q1 character 0 | 1 | 2 | 3 4
gad7_q2 character 0 | 1 | 2 | 3 4
gad7_q3 character 0 | 1 | 2 | 3 4
gad7_q4 character 0 | 1 | 2 | 3 4
gad7_q5 character 0 | 1 | 2 | 3 4
gad7_q6 character 0 | 1 | 2 | 3 4
gad7_q7 character 0 | 1 | 2 | 3 4
psych_meds character buspirone_anxiety | celexa_anxiety | none | zoloft_depression 4

Case study: Coerce data types/derive variables

# Coerce data types and generate new variables based on existing columns
new_data <- new_data %>%   
  mutate(
    across(matches("id|age|gad"), as.numeric), 
    indication = str_extract(psych_meds, "(?<=_).*"),
    psych_meds = str_replace_all(psych_meds, "_.*", ""),
    depression = if_else(indication == "depression", 1, 0, missing = NA_real_),
    anxiety = if_else(indication == "anxiety", 1, 0, missing = NA_real_),
    gad7_raw = rowSums(across(matches("gad"))), 
    gad7_cat = 
      case_when(
        gad7_raw <= 4 ~ "minimal",
        gad7_raw %in% c(5:9) ~ "mild",
        gad7_raw %in% c(10:14) ~ "moderate",
        gad7_raw >= 15 ~ "severe"
        ),
    across(matches("race|_cat"), as.factor)
  )
    
check_distinct(new_data)

Case study: Coerce data types/derive variables

col_name type distinct_levels distinct_count
id numeric 21373 | 21537 | 21582 | 21797 | 21866 | 22111 | 22248 | 22254 | 22283 | 22313 | 22319 | 22419 | 22491 | 22652 | 22875 15
age_yrs numeric 22 | 27 | 29 | 34 | 40 | 42 | 44 | 46 | 47 | 48 | 52 | 53 | 54 | 60 14
race_ethn factor . | asian & pacific islander | black | middle eastern | white 4
gad7_q1 numeric 0 | 1 | 2 | 3 4
gad7_q2 numeric 0 | 1 | 2 | 3 4
gad7_q3 numeric 0 | 1 | 2 | 3 4
gad7_q4 numeric 0 | 1 | 2 | 3 4
gad7_q5 numeric 0 | 1 | 2 | 3 4
gad7_q6 numeric 0 | 1 | 2 | 3 4
gad7_q7 numeric 0 | 1 | 2 | 3 4
psych_meds character buspirone | celexa | none | zoloft 4
indication character . | anxiety | depression 2
depression numeric . | 0 | 1 2
anxiety numeric . | 0 | 1 2
gad7_raw numeric 12 | 13 | 14 | 4 | 7 | 8 | 9 7
gad7_cat factor mild | minimal | moderate 3

Key takeaways

  • Text strings often contain inconsistent data, and are common in research data
    • Data containing such issues must be standardized before it is used in an analysis!
  • Base R and stringr offer a range of functions for manipulating text strings
  • stringr is designed to work with dplyr and other tidyverse packages
    • Text manipulation can be performed in chained sequences with other data cleaning steps!
  • Regular expressions (regex) are a powerful tool for cleaning and preprocessing research data
    • Understanding when and how to implement regex will take some practice!

Resources