2025-04-02
Reading quiz 9 due
Focus on homework 8 on Tuesday, start working on homework 9 on Thursday
Start thinking about the project
Is the data tidy? - question of the key
the key
the whole key
and nothing but the key
A data set can be written in a rectangular form:
each observation is in one row
each variable is in a column
there is a key: one or more variables in the dataset provide a unique descriptor for each observation
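Whether a set of columns forms a key can be checked directly. A minimal sketch in base R, assuming a hypothetical helper `is_key()` and a small made-up data frame `grades` (neither comes from the course material):

```r
# Hypothetical helper: TRUE if the columns in `cols` uniquely identify each row of `df`
is_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

# Made-up example data: one row per student and course
grades <- data.frame(
  student = c("A", "A", "B", "B"),
  course  = c("stat", "math", "stat", "math"),
  grade   = c(90, 85, 70, 95)
)

is_key(grades, "student")               # FALSE: each student appears twice
is_key(grades, c("student", "course"))  # TRUE: the combination is unique
```

A check like this only shows that the columns are unique in the data at hand; whether they are a key by design depends on what the observations are.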
Example 1: no key
[1] 0.80463517 0.64060029 0.64842185 0.26700195 0.90290315 0.87371680
[7] 0.09969296 0.68885715 0.28142305 0.10479012
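One possible way to give such a vector a key is to store the position of each value explicitly. A short sketch, assuming the values are held in a vector `x` and using a hypothetical column name `id`:

```r
# Values copied from the output above; no variable identifies an observation
x <- c(0.80463517, 0.64060029, 0.64842185, 0.26700195, 0.90290315,
       0.87371680, 0.09969296, 0.68885715, 0.28142305, 0.10479012)

# Hypothetical fix: record the position as an explicit key column
keyed <- data.frame(id = seq_along(x), value = x)

anyDuplicated(keyed$id) == 0  # TRUE: id uniquely identifies each row
```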
Example 2: not rectangular
A table is in 2nd normal form when:
it is in 1st normal form
and all non-key columns depend on all parts of the key
“No split key”
Tables in 1st normal form with a single key variable are automatically in 2nd normal form
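A split key can be illustrated with a small example. The sketch below uses made-up data (the table `enrolled` and its columns are assumptions, not course data) in which the key is the combination of `student` and `course`, but `major` depends on `student` alone:

```r
library(dplyr)

# Made-up data: the key is (student, course), but `major` depends on student alone
enrolled <- tibble(
  student = c("A", "A", "B", "B"),
  course  = c("stat", "math", "stat", "math"),
  major   = c("Biology", "Biology", "Physics", "Physics"),
  grade   = c(90, 85, 70, 95)
)

# How many distinct values does each non-key column take within one student?
per_student <- enrolled %>%
  group_by(student) %>%
  summarise(n_major = n_distinct(major), n_grade = n_distinct(grade))

all(per_student$n_major == 1)  # TRUE: major is fixed by student (split key)
all(per_student$n_grade == 1)  # FALSE: grade needs the whole key

# Splitting into two tables removes the partial dependence
students <- enrolled %>% distinct(student, major)          # key: student
grades   <- enrolled %>% select(student, course, grade)    # key: student, course
```

Counting distinct values only gives evidence from the data at hand; deciding that a column really depends on part of the key still requires knowing what the variables mean.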
A table is in 3rd normal form when:
it is in 2nd normal form
and no non-key column can be determined by another non-key column
e.g.: zip code determines county and state
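The zip code example can be sketched as follows. All values are made-up placeholders, and the table names `people`, `zips`, and `people_nf` are assumptions chosen for illustration:

```r
library(dplyr)

# Made-up data: `id` is the key; county and state are determined by zip,
# which is itself a non-key column
people <- tibble(
  id     = 1:4,
  zip    = c("11111", "11111", "22222", "33333"),
  county = c("County X", "County X", "County Y", "County Z"),
  state  = c("State S", "State S", "State S", "State T")
)

# Move the zip -> (county, state) dependence into its own lookup table
zips      <- people %>% distinct(zip, county, state)  # key: zip
people_nf <- people %>% select(id, zip)               # key: id

# Joining the two tables on zip recovers the original table
rejoined <- people_nf %>% left_join(zips, by = "zip")
all.equal(as.data.frame(rejoined), as.data.frame(people))
```

After the split, county and state are stored once per zip code instead of once per person, and a join brings them back when a single table is needed.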
Process of getting table into higher normal forms is called normalization
Normalization gives a framework to organize and ensure
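Normalization is generally terrible for statistical modelling: models usually need all variables in a single table, so normalized tables have to be joined back together first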
Stricter version of the 3rd NF:
Note: a table like this is automatically in 3rd normal form.
Breed Affectionate.With.Family Good.With.Young.Children
1 Retrievers (Labrador) 5 5
2 French Bulldogs 5 5
3 German Shepherd Dogs 5 5
4 Retrievers (Golden) 5 5
5 Bulldogs 4 3
6 Poodles 5 5
Good.With.Other.Dogs Shedding.Level Coat.Grooming.Frequency Drooling.Level
1 5 4 2 2
2 4 3 1 3
3 3 4 2 2
4 5 4 2 2
5 3 3 3 3
6 3 1 4 1
Coat.Type Coat.Length Openness.To.Strangers Playfulness.Level
1 Double Short 5 5
2 Smooth Short 5 5
3 Double Medium 3 4
4 Double Medium 5 4
5 Smooth Short 4 4
6 Curly Long 5 5
Watchdog.Protective.Nature Adaptability.Level Trainability.Level Energy.Level
1 3 5 5 5
2 3 5 4 3
3 5 5 5 5
4 3 5 5 3
5 3 3 4 3
6 5 4 5 4
Barking.Level Mental.Stimulation.Needs
1 3 4
2 1 3
3 3 5
4 1 4
5 2 3
6 4 5
Data set is in “Wide Form”
Without knowing the data well, we cannot rule out that some non-key variables determine other non-key variables (which would violate 3rd normal form)
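library(tidyverse)  # assumed setup: provides %>%, select(), pivot_longer()
# Keep Breed and the first three trait columns, then stack them into Trait/Score pairs
traits %>% select(Breed, 2:4) %>%
  pivot_longer(cols = 2:4, names_to = "Trait", values_to = "Score") %>%
  head()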
# A tibble: 6 × 3
Breed Trait Score
<chr> <chr> <int>
1 Retrievers (Labrador) Affectionate.With.Family 5
2 Retrievers (Labrador) Good.With.Young.Children 5
3 Retrievers (Labrador) Good.With.Other.Dogs 5
4 French Bulldogs Affectionate.With.Family 5
5 French Bulldogs Good.With.Young.Children 5
6 French Bulldogs Good.With.Other.Dogs 4
The combination of Breed and Trait (key) uniquely determines the Score (value).
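This claim can be checked on a small piece of the data. The sketch below types in the first three breeds and traits from the output above (the name `traits_small` is an assumption), verifies that no combination of Breed and Trait repeats, and uses `pivot_wider()` to get back to the wide form:

```r
library(dplyr)
library(tidyr)

# First three breeds and traits, typed in from the output above
traits_small <- tibble(
  Breed = c("Retrievers (Labrador)", "French Bulldogs", "German Shepherd Dogs"),
  Affectionate.With.Family = c(5L, 5L, 5L),
  Good.With.Young.Children = c(5L, 5L, 5L),
  Good.With.Other.Dogs     = c(5L, 4L, 3L)
)

long <- traits_small %>%
  pivot_longer(cols = 2:4, names_to = "Trait", values_to = "Score")

# (Breed, Trait) is a key if no combination occurs more than once
n_dup <- long %>% count(Breed, Trait) %>% filter(n > 1) %>% nrow()
n_dup  # 0: no duplicated combination

# pivot_wider() reverses the reshaping, back to one row per breed
wide <- long %>% pivot_wider(names_from = Trait, values_from = Score)
```

In the long form every row is a key-value pair, so the Score column is the only non-key column.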