Principles for Data Analysis from Start to Finish

morals

“Happy families are all alike; every unhappy family is unhappy in its own way.”
—Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
—Hadley Wickham

data cleaning

a New York Times article headline about janitor work in data science

Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

If 80% of the data scientist’s job is data cleaning, perhaps that is the job.

https://twitter.com/ChelseaParlett/status/1221251025983565824

a framework for creating quality

a framework consisting of fundamentals, verify, explore, ask, and document

from https://kbroman.org/Talk_DataCleaning/data_cleaning_notes.pdf

fundamentals

  1. don’t clean when you’re hungry or tired
    • data cleaning requires considerable concentration, and you need to allow sufficient time to do the work. if you’re in a hurry, you’ll miss things.
  2. don’t trust anyone (even yourself)

spreadsheets: a dystopian moonscape of unrecorded user actions

from Jenny Bryan’s talk on spreadsheets https://speakerdeck.com/jennybc/spreadsheets?slide=4

a dog with a sign saying they transposed sheet 4 in their workbook for no reason

from Jenny Bryan’s talk on spreadsheets https://speakerdeck.com/jennybc/spreadsheets?slide=28

https://twitter.com/JeanVAdams/status/707241263645392896

There was a schism in 2007, when a sect advocating OpenOffice created a fork of Sunday.xlsx and maintained it independently for several months. The efforts to reconcile the conflicting schedules led to the reinvention, within the cells of the spreadsheet, of modern version control.

https://xkcd.com/1667/

fundamentals

  3. think about what might have gone wrong and how it might be revealed
  4. use care in merging
  5. dates & categories suck
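
One way to "use care in merging" is to inspect the join keys before combining two tables. The sketch below is only an illustration under assumptions: the function name `check_merge_keys`, the record layout, and the sample data are all hypothetical, and it uses plain Python lists of dicts rather than any particular data-frame library.

```python
from collections import Counter


def check_merge_keys(left, right, key):
    """Report key problems to resolve before joining two lists of dict records."""
    left_keys = [row[key] for row in left]
    right_keys = [row[key] for row in right]
    return {
        # keys that appear more than once would silently multiply rows in a join
        "duplicated_in_left": sorted(k for k, n in Counter(left_keys).items() if n > 1),
        "duplicated_in_right": sorted(k for k, n in Counter(right_keys).items() if n > 1),
        # keys without a partner would be dropped or filled with missing values
        "only_in_left": sorted(set(left_keys) - set(right_keys)),
        "only_in_right": sorted(set(right_keys) - set(left_keys)),
    }


# Hypothetical records, invented for illustration.
subjects = [
    {"id": "A1", "sex": "F"},
    {"id": "A2", "sex": "M"},
    {"id": "A3", "sex": "F"},
]
weights = [
    {"id": "A1", "kg": 61.0},
    {"id": "A2", "kg": 80.5},
    {"id": "A2", "kg": 8.05},   # repeated key
    {"id": "a3", "kg": 55.2},   # same subject, different letter case
]

report = check_merge_keys(subjects, weights, "id")
for problem, keys in report.items():
    print(problem, keys)
```

Here the report flags the repeated `A2` and the `A3` / `a3` mismatch, both of which would otherwise surface only later as extra or missing rows.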

https://twitter.com/brookLYNevery1/status/954368989181902848

verify

  1. check that distinct things are distinct
  2. check that matching things match
  3. check calculations
  4. look for other instances of a problem
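
The first three checks above can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the helper names, the field names (`subject`, `visit`, `sex`, `bmi`), the tolerance, and the sample rows are hypothetical and not taken from the source notes.

```python
from collections import Counter, defaultdict


def duplicated_keys(rows, fields):
    """Distinct things are distinct: return key tuples that occur more than once."""
    counts = Counter(tuple(r[f] for f in fields) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def inconsistent_values(rows, group_field, value_field):
    """Matching things match: return groups whose rows disagree on value_field."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[group_field]].add(r[value_field])
    return sorted(g for g, values in seen.items() if len(values) > 1)


def bmi_mismatches(rows, tol=0.01):
    """Check calculations: recompute BMI and flag rows that differ by more than tol."""
    bad = []
    for r in rows:
        expected = r["weight_kg"] / r["height_m"] ** 2
        if abs(expected - r["bmi"]) > tol:
            bad.append((r["subject"], r["visit"]))
    return bad


# Hypothetical visit records, invented for illustration.
visits = [
    {"subject": "S01", "visit": 1, "sex": "F", "weight_kg": 70.0, "height_m": 1.75, "bmi": 22.86},
    {"subject": "S01", "visit": 2, "sex": "M", "weight_kg": 71.0, "height_m": 1.75, "bmi": 23.18},
    {"subject": "S02", "visit": 1, "sex": "M", "weight_kg": 80.0, "height_m": 1.80, "bmi": 24.69},
    {"subject": "S02", "visit": 1, "sex": "M", "weight_kg": 80.0, "height_m": 1.80, "bmi": 24.69},
    {"subject": "S03", "visit": 1, "sex": "F", "weight_kg": 60.0, "height_m": 1.60, "bmi": 32.44},
]

repeated = duplicated_keys(visits, ["subject", "visit"])
conflicting = inconsistent_values(visits, "subject", "sex")
miscalculated = bmi_mismatches(visits)
print(repeated, conflicting, miscalculated)
```

Once one of these checks finds a problem, the fourth item applies: run the same check across every other file and column where the same mistake could have been made.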

explore

  1. make lots of plots

https://twitter.com/patilindrajeets/status/1375006386795524098

explore

  2. look at missing value patterns
  3. with massive data, make more plots not fewer
  4. follow up all artifacts
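
As a non-graphical starting point for looking at missing value patterns, one can count which combinations of fields are missing together. This is a minimal sketch under assumptions: the function `missing_patterns`, the choice of missing-value codes (`None`, an empty string, `"NA"`), and the sample rows are all hypothetical.

```python
from collections import Counter

# Assumed missing-value codes; real data may use others (for example -999).
MISSING_CODES = (None, "", "NA")


def missing_patterns(rows, fields):
    """Count rows by the tuple of fields that are missing in that row."""
    patterns = Counter()
    for r in rows:
        pattern = tuple(f for f in fields if r.get(f) in MISSING_CODES)
        patterns[pattern] += 1
    return patterns


# Hypothetical records, invented for illustration.
rows = [
    {"id": 1, "age": 34, "weight": 70.2, "glucose": 5.1},
    {"id": 2, "age": None, "weight": 81.0, "glucose": 5.6},
    {"id": 3, "age": 51, "weight": "NA", "glucose": "NA"},
    {"id": 4, "age": 47, "weight": "", "glucose": "NA"},
    {"id": 5, "age": 29, "weight": 66.4, "glucose": 4.9},
]

patterns = missing_patterns(rows, ["age", "weight", "glucose"])
for pattern, count in patterns.most_common():
    print(count, pattern if pattern else "(complete)")
```

Fields that tend to go missing together, like `weight` and `glucose` in this made-up example, are worth asking about: they may share a source, such as a single skipped clinic visit.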

ask

  1. ask questions
  2. ask for the primary data
  3. ask for metadata
  4. ask why data are missing

document

  1. create checklists & pipelines
  2. document not just what but why
  3. expect to recheck
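
A checklist can be kept as code so that each check carries its reason and is cheap to rerun when the data change. The sketch below is one possible shape, not a prescribed one: the function `run_checklist`, the check names, the stated reasons, and the sample rows are hypothetical.

```python
def run_checklist(rows, checks):
    """Run each (name, why, check) entry and record how many rows fail it."""
    results = []
    for name, why, check in checks:
        failures = [r for r in rows if not check(r)]
        results.append({"check": name, "why": why, "n_failed": len(failures)})
    return results


# Each entry documents what is checked and why (the reasons are invented examples).
checks = [
    ("age_in_range",
     "ages outside 0-120 are almost certainly entry errors",
     lambda r: 0 <= r["age"] <= 120),
    ("weight_positive",
     "a weight of 0 may be a missing-value code rather than a measurement",
     lambda r: r["weight_kg"] > 0),
]

# Hypothetical records, invented for illustration.
rows = [
    {"age": 34, "weight_kg": 70.2},
    {"age": 340, "weight_kg": 81.0},
    {"age": 51, "weight_kg": 0.0},
]

results = run_checklist(rows, checks)
for result in results:
    print(f'{result["check"]}: {result["n_failed"]} failed ({result["why"]})')
```

Keeping the reason next to the check means a later reader, including the original analyst, can tell whether a rule still applies before rechecking.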

https://twitter.com/RLadiesChicago/status/963576859152744456

https://twitter.com/TrevorABranch/status/1157006269292507136

in sum

references