Data Dictionaries

ID 529: Data Management and Analytic Workflows in R

Dean Marengi | Tuesday, January 10th 2023

Motivation

  • We’ve learned a bit about:
    • Git and GitHub for code version control and management
    • Reproducible data analysis workflows


  • But we have not yet discussed data dictionaries and their role in reproducible research
    • Facilitate collaboration with other researchers
    • Provide clear and detailed information about the data being used
      • Data sources and considerations
      • Structure and format
      • Variable definitions and interpretation of their values

Learning objectives

  • Understand the importance of data dictionaries for data analysis
    • Ensuring consistency in the data and in how the data are used
    • Facilitating research collaboration
  • Learn about the common components of data dictionaries
    • Variable names and definitions
    • Data types
    • Value labels, and other attributes
    • Missing data codes
  • Discuss creating and maintaining data dictionaries
    • Well-described entries
    • Keeping data dictionaries up to date

What are data dictionaries?

  • Data dictionaries
    • Provide key information and metadata about variables included in a dataset
    • Serve as an important reference guide for data analysts
    • The structure, format, and contents will vary depending on the project
  • Core elements typically include
    • A description of the study or data source(s)
    • Variable names, definitions (and units, where applicable)
    • Data types
    • Value labels
    • Calculations performed (where applicable)
    • Other details relevant to a specific project
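
As a rough sketch of how these core elements might be organized, a data dictionary can be stored as a small table with one row per variable. The column names and the two example variables (`age`, `smoker`) below are illustrative assumptions, not a required layout.

```r
# A minimal data dictionary stored as a data frame: one row per variable.
# Column names and entries are hypothetical and chosen for illustration.
data_dictionary <- data.frame(
  variable     = c("age", "smoker"),
  definition   = c(
    "Age of the participant at baseline, in years",
    "Whether the participant currently smokes cigarettes"
  ),
  data_type    = c("integer", "integer"),
  units        = c("years", NA),
  value_labels = c(NA, "0 = no; 1 = yes"),
  notes        = c(
    "Self-reported on the baseline questionnaire",
    "Self-reported on the baseline questionnaire"
  ),
  stringsAsFactors = FALSE
)

# Look up the entry for a single variable
data_dictionary[data_dictionary$variable == "smoker", ]
```

A table like this could be saved next to the dataset (for example with `write.csv()`) so that it is versioned together with the analysis code.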

Dataset descriptions

  • Dataset descriptions help to contextualize the data they contain
  • Descriptions should typically include
    • An overview of the project or data origins
    • Why the data were collected
    • What data were collected
    • Who collected the data
    • When the data were collected
    • Data collection methodology and related considerations
      • e.g., data quality issues
    • Other details relevant for specific projects

Variable entries

  • Variable entries provide specific information pertaining to each variable in the data
  • Entries should include
    • Concise but descriptive variable names
      • Accurately reflect the underlying data
    • Detailed descriptions of what the variables are
    • Data type and format, including any missing data codes
    • Characteristics such as:
      • Units
      • Factor levels
      • Range of values
      • Transformations
  • You may also want to include:
    • Links to relevant documentation
    • Notes on data quality issues
    • Other relevant information helpful for use and interpretation
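
Some of these fields (variable names, stored data types, counts of missing values) can be pulled from the dataset itself, which gives a starting skeleton to fill in by hand. The sketch below uses base R only; `study_data` and its values are made up for illustration.

```r
# Hypothetical example dataset (values are made up for illustration)
study_data <- data.frame(
  age    = c(34L, 51L, NA, 47L),
  smoker = c(0L, 1L, 0L, NA),
  site   = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Build a skeleton with one row per variable. Definitions and notes are left
# blank on purpose: they have to be written by someone who knows the data.
dictionary_skeleton <- data.frame(
  variable   = names(study_data),
  data_type  = vapply(study_data, function(x) class(x)[1], character(1),
                      USE.NAMES = FALSE),
  n_missing  = vapply(study_data, function(x) sum(is.na(x)), integer(1),
                      USE.NAMES = FALSE),
  definition = NA_character_,
  notes      = NA_character_,
  row.names  = NULL,
  stringsAsFactors = FALSE
)

dictionary_skeleton
```

The skeleton only captures what R can see. Units, value labels, and data quality notes still need to come from the study documentation.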

Examples of variable entries

Bad Entry

  • Variable Name: Age
  • Definition: Age of participant
  • Data Type: Number
  • Notes: N/A


Good Entry

  • Variable Name: Age
  • Definition: The age of the patient at the time of the study, measured in years
  • Data Type: Integer
  • Notes: Age was collected via self-report on the baseline questionnaire and verified against a government-issued ID.


Bad Entry

  • Variable Name: smk
  • Definition: Smokes
  • Data Type: Yes/No
  • Notes: N/A


Good Entry

  • Variable Name: Smoking status
  • Definition: Whether or not the patient currently smokes cigarettes
  • Data Type: Dichotomous (yes=1 | no=0)
  • Notes: Smoking status was determined through self-report on the baseline questionnaire and verified at clinical study visits.
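
The good smoking status entry states the coding explicitly (yes=1 | no=0), which is what allows an analyst to label the values correctly. A minimal sketch of applying that coding in R follows. The raw values and the missing data code `-99` are assumptions made for this example; the entry above does not specify a missing data code.

```r
# Hypothetical raw values; -99 is an assumed missing data code that the
# dictionary would need to document explicitly.
smoking_raw <- c(1, 0, 0, -99, 1)

# Recode the assumed missing data code to NA before labeling
smoking_clean <- smoking_raw
smoking_clean[smoking_clean == -99] <- NA

# Apply the value labels documented in the dictionary entry (no = 0, yes = 1)
smoking_status <- factor(smoking_clean, levels = c(0, 1), labels = c("no", "yes"))

# Tabulate, showing missing values if any are present
table(smoking_status, useNA = "ifany")
```

With the bad entry (Data Type: Yes/No, no coding given), the `levels` and `labels` arguments would have to be guessed.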

Common mistakes

  • Insufficient detail
    • Inconsistent and poor variable naming conventions
    • Imprecise or vague definitions
    • Omission of pertinent information about data quality issues
  • Inaccuracies
    • Failure to keep the dictionary current as the research evolves
    • Discrepancies with the dataset
  • Poor standardization and formatting
    • Makes the dictionary difficult for others to understand
    • Inconsistencies make the dictionary difficult to maintain
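
Discrepancies between a dictionary and its dataset can often be caught with a short automated check. The sketch below assumes the dictionary is stored as a data frame with `variable` and `data_type` columns (a hypothetical layout) and compares it against a made-up dataset.

```r
# Hypothetical dataset and dictionary that have drifted out of sync
study_data <- data.frame(
  age    = c(34L, 51L),
  smoker = c(0L, 1L),
  bmi    = c(22.4, 27.9)
)
data_dictionary <- data.frame(
  variable  = c("age", "smoker", "sex"),
  data_type = c("integer", "integer", "character"),
  stringsAsFactors = FALSE
)

# Variables present in the data but not documented
undocumented <- setdiff(names(study_data), data_dictionary$variable)

# Dictionary entries with no matching column in the data
not_in_data <- setdiff(data_dictionary$variable, names(study_data))

# Documented variables whose stored type differs from the dictionary
shared <- intersect(names(study_data), data_dictionary$variable)
actual_type <- vapply(study_data[shared], function(x) class(x)[1], character(1))
documented_type <- data_dictionary$data_type[match(shared, data_dictionary$variable)]
type_mismatch <- shared[actual_type != documented_type]

undocumented
not_in_data
type_mismatch
```

A check like this could be run whenever the data or the dictionary changes, so that mismatches surface early rather than during analysis.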

Discussion questions


  • What do you look for in a data dictionary? What do you include?


  • Have you ever had a negative experience working with poorly documented data?


  • Do you follow a specific process when developing your own data dictionaries?

Key takeaways

  • Data dictionaries are important tools in research
    • Provide a clear, standardized way of describing the data being used
      • Ensure that investigators interpret the data consistently
      • Overall, help to promote transparent and reproducible research!
  • Data dictionaries should be detailed, consistent, and accurate
    • Include all variables relevant for data analysis and, for each, provide:
      • Variable names
      • Detailed variable descriptions
      • Data types
      • Notes relevant to data collection procedures, calculations, etc.
      • Any other details that will help to interpret the data

Key takeaways (cont.)

  • Data dictionaries should be kept up to date over the course of a project
    • Include new and derived variables
    • Document changes in data collection and/or analysis methods
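
One way to keep a dictionary current is to add the entry for a derived variable in the same step that creates it. The sketch below is illustrative only: the variables (`weight_kg`, `height_m`, `bmi`) and the dictionary layout are assumptions, not part of any particular study.

```r
# Hypothetical data and dictionary
study_data <- data.frame(weight_kg = c(70, 85), height_m = c(1.75, 1.80))
data_dictionary <- data.frame(
  variable   = c("weight_kg", "height_m"),
  definition = c("Body weight in kilograms", "Standing height in meters"),
  notes      = NA_character_,
  stringsAsFactors = FALSE
)

# Derive a new variable...
study_data$bmi <- study_data$weight_kg / study_data$height_m^2

# ...and document it in the same step, including how it was calculated
data_dictionary <- rbind(
  data_dictionary,
  data.frame(
    variable   = "bmi",
    definition = "Body mass index in kg/m^2",
    notes      = "Derived: weight_kg / height_m^2",
    stringsAsFactors = FALSE
  )
)

# Every column in the data should now have a dictionary entry
all(names(study_data) %in% data_dictionary$variable)
```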