Documenting Data

What are the critical components of data documentation?

  • Who collected the data
  • Why the data was collected
  • What the data is about
  • When the data was collected
  • Where the data was collected
  • How the data was generated/collected
  • Structure of the data
  • Formatting decisions in the data
  • Data validation/quality control
  • How the data can be reused/license
  • Suggested data analysis methods
  • Measurement instruments used

Documentation is Project Dependent

Project 1: Building a Shoe Print Wear Database

  • 150 pairs of shoes
  • 2 brands of shoes
  • several sizes for each brand
  • step counters used on the shoes
  • questionnaires measuring activities
  • wearer weight/height/gait

Shoe Measurements

Initial measurement period + 2-3 additional measurement periods (~6 weeks between)

  • Photos of shoe soles
  • Digital shoe sole prints
  • Powder prints
    • Film
    • Paper
    • Vinyl flooring
  • 3D scans of shoe soles

Measurements taken in the lab by research assistants.

Important Documentation?

  • See all of the metadata here
  • Probably should have recorded which research assistant was wearing the shoe, how much they weighed, their gait, their height, and so on.

Whoops.

[GIF: a child wearing a lampshade walks into an oven and falls backwards]

Documentation is Project Dependent

Project 2: Wire cuts

  • Goal is to estimate the length of sharp surfaces on all wire-cutting tools in people’s homes
  • General survey with instructions for measuring each type of tool
  • Collected data is a list of
    • tool types,
    • # blades on the tool,
    • # cutting surfaces,
    • blade length, and
    • # of that type of tool
  • Estimates are generated by adding up the total length of cutting surfaces across all reported tools
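
A minimal sketch of how that sum might be computed, and why the documentation matters for it. The field names, the example tools, and their numbers are hypothetical. The formula also rests on two assumptions the survey would need to document: that "# cutting surfaces" is counted per tool (not per blade), and that "blade length" is the length of a single cutting surface.

```python
from dataclasses import dataclass


@dataclass
class ToolRecord:
    tool_type: str
    blades: int              # number of blades on the tool (recorded, not used in the sum)
    cutting_surfaces: int    # ASSUMPTION: total cutting surfaces on one tool
    blade_length_cm: float   # ASSUMPTION: length of one cutting surface, in cm
    count: int               # how many tools of this type the participant owns


def total_cutting_length_cm(records):
    """Sum cutting-surface length over every tool in the survey response."""
    return sum(r.cutting_surfaces * r.blade_length_cm * r.count for r in records)


# Made-up survey response, for illustration only.
records = [
    ToolRecord("scissors", blades=2, cutting_surfaces=2, blade_length_cm=8.0, count=3),
    ToolRecord("wire cutters", blades=2, cutting_surfaces=2, blade_length_cm=1.5, count=1),
]

total = total_cutting_length_cm(records)
print(f"Total cutting surface: {total} cm")
```

If "# cutting surfaces" were instead recorded per blade, the formula would need an extra factor of `blades`, and the estimate would change. That is exactly the kind of decision the documentation has to pin down.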

Important Documentation?

  • What tools would the participant use to cut a wire typically?

  • What tools would they use in a pinch?

  • Is the participant a professional craftsperson? A hobbyist? Not handy at all?

  • How was blade length measured?

  • … I haven’t run this study yet. But I got over 980 cm of cutting surface in my garage/house and my dad got over 2200 cm.

Codebooks

Basic documentation that contains:

  • Variable name in the code
  • Long-form description of what was measured
  • Units of measurement
  • Acceptable values
  • Values used to indicate missingness, refusal to respond, etc.
  • Additional notes that may be relevant

Very common for government data - CDC codebooks are intense.
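
As a rough sketch, one codebook entry can be stored next to the analysis code and used to check incoming values. The class, the variable name `blade_len_cm`, the acceptable range, and the missingness codes below are all invented for illustration; a real codebook would define its own.

```python
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    name: str          # variable name in the code
    description: str   # long-form description of what was measured
    units: str         # units of measurement
    min_value: float   # smallest acceptable value
    max_value: float   # largest acceptable value
    missing_codes: dict = field(default_factory=dict)  # code -> meaning
    notes: str = ""    # additional notes that may be relevant

    def classify(self, value):
        """Return 'valid', 'out of range', or 'missing: <reason>' for a value."""
        if value in self.missing_codes:
            return f"missing: {self.missing_codes[value]}"
        if self.min_value <= value <= self.max_value:
            return "valid"
        return "out of range"


# Hypothetical entry: names, range, and codes are assumptions, not from a real study.
blade_length = CodebookEntry(
    name="blade_len_cm",
    description="Length of one cutting surface, measured along the edge",
    units="cm",
    min_value=0.1,
    max_value=100.0,
    missing_codes={-99: "not measured", -98: "refused to respond"},
    notes="Hypothetical example entry.",
)

results = [blade_length.classify(v) for v in (12.5, -99, 250.0)]
print(results)
```

Checking for missingness codes before the range check matters here: otherwise -99 would be flagged as an out-of-range measurement rather than a documented non-response.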

Data Doc Influences Analysis

  • Experimental design
  • Randomization
  • Sampling strategy
  • Random effects
  • Transformations of collected data
  • Sources of measurement error

Data Documentation

Documentation is a love letter that you write to your future self

Damian Conway

Additional Resources

  • DDI Alliance - probably overkill but in a good way

  • Data Librarians are amazing to work with