What are the critical components of data documentation?
- Who collected the data
- Why the data was collected
- What the data is about
- When the data was collected
- Where the data was collected
- How the data was generated/collected
- Structure of the data
- Formatting decisions in the data
- Data validation/quality control
- How the data can be reused/license
- Suggested data analysis methods
- Measurement instruments used
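
One way to make this checklist actionable is to treat it as a set of required fields and flag whatever is still blank. The sketch below is only an illustration: the field names and the placeholder record are assumptions, not part of any real project's metadata.

```python
# Hypothetical field names, one per checklist item above.
REQUIRED_FIELDS = [
    "who", "why", "what", "when", "where", "how",
    "structure", "formatting", "quality_control",
    "license", "suggested_analysis", "instruments",
]


def missing_fields(metadata: dict) -> list:
    """Return the checklist fields that are absent or blank in `metadata`."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(metadata.get(field) or "").strip()
    ]


# Placeholder record for illustration only (not real project metadata).
draft = {
    "who": "placeholder: lab research assistants",
    "what": "placeholder: shoe sole wear measurements",
    "when": "",  # blank entries count as missing
}

print(missing_fields(draft))
```

Running a check like this before sharing a dataset makes gaps visible while the people who know the answers are still around.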
Documentation is Project Dependent
Project 1: Building a Shoe Print Wear Database
- 150 pairs of shoes
- 2 brands of shoes
- several sizes for each brand
- step counters used on the shoes
- questionnaires measuring activities
- wearer weight/height/gait
Shoe Measurements
Initial measurement period + 2-3 additional measurement periods (~6 weeks between)
- Photos of shoe soles
- Digital shoe sole prints
- Powder prints
  - Film
  - Paper
  - Vinyl flooring
- 3D scans of shoe soles
Measurements taken in the lab by research assistants.
Important Documentation?
- See all of the metadata here
- Probably should have recorded which research assistant was wearing the shoe during measurement, along with their weight, height, gait, and so on.
Whoops.
![A gif of a child wearing a lampshade walking into an oven and falling backwards]()
Documentation is Project Dependent
Project 2: Wire cuts
- Goal is to estimate the length of sharp surfaces on all wire-cutting tools in people’s homes
- General survey with instructions for measuring each type of tool
- Collected data is a list of:
  - tool types,
  - number of blades on the tool,
  - number of cutting surfaces,
  - blade length, and
  - number of tools of that type
- Estimates are generated by adding up total length of cutting surfaces
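
A minimal sketch of that sum, under stated assumptions: the slides do not define exactly how the fields combine, so here `n_cutting_surfaces` is taken to be the total number of cutting surfaces on one tool and `blade_length_cm` the length of a single cutting surface. The class name, field names, and inventory values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolRecord:
    tool_type: str
    n_blades: int            # recorded, but not used in the sum below
    n_cutting_surfaces: int  # ASSUMPTION: total cutting surfaces per tool
    blade_length_cm: float   # ASSUMPTION: length of one cutting surface
    n_tools: int             # how many tools of this type are in the home


def total_cutting_length_cm(records):
    """Add up cutting-surface length across every tool in the inventory."""
    return sum(
        r.n_cutting_surfaces * r.blade_length_cm * r.n_tools
        for r in records
    )


# Made-up inventory for illustration only.
inventory = [
    ToolRecord("wire cutters", 2, 2, 2.0, 3),   # 2 * 2.0 * 3 = 12.0
    ToolRecord("utility knife", 1, 1, 6.0, 2),  # 1 * 6.0 * 2 = 12.0
]

print(total_cutting_length_cm(inventory))  # 24.0
```

A different reading of the fields (for example, cutting surfaces counted per blade) would change the formula, which is exactly the kind of decision the documentation needs to record.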
Important Documentation?
- What tools would the participant typically use to cut a wire?
- What tools would they use in a pinch?
- Is the participant a professional craftsperson? A hobbyist? Not handy at all?
- How was blade length measured?
- … I haven’t run this study yet. But I got over 980 cm of cutting surface in my garage/house and my dad got over 2200 cm.
Codebooks
Basic documentation that contains:
- Variable name in the code
- Long-form description of what was measured
- Units of measurement
- Acceptable values
- Values used to indicate missingness, refusal to respond, etc.
- Additional notes that may be relevant
Very common for government data - CDC codebooks are intense.
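
As a rough illustration, a codebook can also be stored in machine-readable form and used to check incoming values. Everything in this sketch is an assumption made for the example: the variable name, the range limits, and the missingness codes (-9, -8) are hypothetical, not taken from any real codebook.

```python
# One hypothetical codebook entry, with one key per codebook element above.
CODEBOOK = {
    "blade_length": {
        "description": "Length of one cutting surface on the tool",
        "units": "cm",
        "min": 0.0,    # acceptable values: lower bound (assumed)
        "max": 100.0,  # acceptable values: upper bound (assumed)
        "missing_codes": {-9: "not measured", -8: "refused"},
        "notes": "placeholder note",
    },
}


def check_value(variable, value):
    """Classify a value as ok, a documented missing code, or out of range."""
    entry = CODEBOOK[variable]
    if value in entry["missing_codes"]:
        return "missing: " + entry["missing_codes"][value]
    if entry["min"] <= value <= entry["max"]:
        return "ok"
    return "out of range"


print(check_value("blade_length", 12.5))  # ok
print(check_value("blade_length", -9))    # missing: not measured
print(check_value("blade_length", 250))   # out of range
```

Checking missing codes before the range test matters here, because the sentinel values fall outside the acceptable range by design.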
Data Doc Influences Analysis
- Experimental design
- Randomization
- Sampling strategy
- Random effects
- Transformations of collected data
- Sources of measurement error
Data Documentation
Documentation is a love letter that you write to your future self
Damian Conway