“Happy families are all alike; every unhappy family is unhappy in its own way.”
—Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
—Hadley Wickham
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
Cleaning data 🧼🧽 pic.twitter.com/MMCJkTYmgL
— Chelsea Parlett-Pelleriti (@ChelseaParlett) January 26, 2020
https://twitter.com/ChelseaParlett/status/1221251025983565824
from Jenny Bryan’s talk on spreadsheets https://speakerdeck.com/jennybc/spreadsheets?slide=4
from Jenny Bryan’s talk on spreadsheets https://speakerdeck.com/jennybc/spreadsheets?slide=28
If your collaborator asks, “In what form would you like the data?” you should respond, “In its current form.” via @kwbroman
— Jean Adams (@JeanVAdams) March 8, 2016
https://twitter.com/JeanVAdams/status/707241263645392896
Updated Turing Test concept:
A spreadsheet of dates, hand-entered by interns more than a decade ago, featuring such well-known time formats as "1996ish", "1941/xd01944", "1955?" and "WWII."
I'm not worried about AI until someone shows me the algorithm that can make sense of this. pic.twitter.com/IhzofigX2b
— Brooke Watson Madubuonwu (@brookLYNevery1) January 19, 2018
https://twitter.com/brookLYNevery1/status/954368989181902848
I am often asked why I insist on complicating a plot by overlaying a violin plot on top of a box plot when the latter already gives a good visual summary of the distribution.
This gif provides a reason.
For a more nuanced argument, see: https://t.co/GhGmqxRoLi #rstats #dataviz pic.twitter.com/Qk1lQMuBJm
— Indrajeet Patil (इंद्रजीत पाटील) (@patilindrajeets) March 25, 2021
https://twitter.com/patilindrajeets/status/1375006386795524098
"Writing documentation is all about making future you remember things that present you knows future you will forget" – @data_stephanie #rstats #Rladies
— R-Ladies Chicago (@RLadiesChicago) February 14, 2018
https://twitter.com/RLadiesChicago/status/963576859152744456
Every scientific project will be redone in its entirety about 10-20 times from start to publication. Plan your work flow accordingly, e.g. a piece of R code that takes a data file and produces the analysis and figure/s. Change the data? Just rerun the code.
— Trevor Branch (@TrevorABranch) August 1, 2019
https://twitter.com/TrevorABranch/status/1157006269292507136