Reshaping Dogs

Stat 251

2025-04-02

Logistics

  • Reading quiz 9 due

  • Focus on homework 8 on Tuesday, start working on homework 9 on Thursday

  • Start thinking about the project

Reshaping

Is the data tidy? - question of the key

  • the key

  • the whole key

  • and nothing but the key

The key - First normal form

A data set is in first normal form when it can be written in a rectangular form:

  • each observation is in one row

  • each variable is in a column

  • there is a key: one or more variables in the dataset provide a unique descriptor for each observation
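A minimal sketch of how one might check the key condition in Python. The helper `is_key` and the column names are hypothetical, chosen for illustration; the scores are copied from the dog traits table later in these slides.

```python
from collections import Counter

def is_key(rows, key_cols):
    """Return True if the values in key_cols identify each row uniquely."""
    counts = Counter(tuple(row[col] for col in key_cols) for row in rows)
    return all(n == 1 for n in counts.values())

# Toy rows: one row per (breed, trait) combination
rows = [
    {"breed": "Poodles", "trait": "Shedding", "score": 1},
    {"breed": "Poodles", "trait": "Drooling", "score": 1},
    {"breed": "Bulldogs", "trait": "Shedding", "score": 3},
    {"breed": "Bulldogs", "trait": "Drooling", "score": 3},
]

breed_alone = is_key(rows, ["breed"])               # breed repeats, so it is not a key
breed_and_trait = is_key(rows, ["breed", "trait"])  # the combination is unique
print(breed_alone, breed_and_trait)
```

A key can consist of several columns: here neither breed nor trait is unique on its own, but the pair is.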

Not in First Normal Form

Example 1: no key

 [1] 0.80463517 0.64060029 0.64842185 0.26700195 0.90290315 0.87371680
 [7] 0.09969296 0.68885715 0.28142305 0.10479012

Example 2: not rectangular

The whole key - 2nd normal form

A table is in 2nd normal form when:

  • it is in 1st normal form

  • and all non-key columns depend on all parts of the key

“No split key”

Tables in 1st normal form with a single key variable are automatically in 2nd normal form.
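A rough sketch of how a "split key" could be detected in Python: look for non-key columns that are already determined by only part of the key. The helpers `determines` and `partial_dependencies` and the column names are hypothetical; the values are copied from the dog traits table later in these slides. A dependency found in a small sample is only evidence, not proof, that it holds in general.

```python
from itertools import combinations

def determines(rows, lhs, rhs):
    """True if rows that agree on the lhs columns always agree on the rhs column."""
    seen = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        if seen.setdefault(left, row[rhs]) != row[rhs]:
            return False
    return True

def partial_dependencies(rows, key_cols, non_key_cols):
    """List (key subset, column) pairs where a proper part of the key suffices."""
    found = []
    for size in range(1, len(key_cols)):
        for subset in combinations(key_cols, size):
            for col in non_key_cols:
                if determines(rows, subset, col):
                    found.append((subset, col))
    return found

# Key is (breed, trait); coat_type depends on breed alone
rows = [
    {"breed": "Poodles", "trait": "Shedding", "score": 1, "coat_type": "Curly"},
    {"breed": "Poodles", "trait": "Barking", "score": 4, "coat_type": "Curly"},
    {"breed": "Bulldogs", "trait": "Shedding", "score": 3, "coat_type": "Smooth"},
    {"breed": "Bulldogs", "trait": "Barking", "score": 2, "coat_type": "Smooth"},
]

violations = partial_dependencies(rows, ["breed", "trait"], ["score", "coat_type"])
print(violations)
```

With a single key column there are no proper subsets to try, so the list is always empty, in line with the statement above.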

Is this table in 2nd normal form?

Nothing but the key - 3rd normal form

A table is in 3rd normal form when:

  • it is in 2nd normal form

  • and no non-key column can be determined by another non-key column

E.g., zip code determines county and state, so a table that stores all three as non-key columns is not in 3rd normal form.
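A minimal sketch of how dependencies between non-key columns could be spotted in Python. The helpers `determines` and `non_key_dependencies` are hypothetical, and the zip, county, and state values are made-up placeholders, not real data.

```python
from itertools import permutations

def determines(rows, lhs, rhs):
    """True if rows with the same lhs value always have the same rhs value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

def non_key_dependencies(rows, non_key_cols):
    """List ordered pairs (a, b) of non-key columns where a determines b."""
    return [(a, b) for a, b in permutations(non_key_cols, 2)
            if determines(rows, a, b)]

# Placeholder data: id is the key; zip, county, state are non-key columns
rows = [
    {"id": 1, "zip": "Z1", "county": "County A", "state": "S1"},
    {"id": 2, "zip": "Z1", "county": "County A", "state": "S1"},
    {"id": 3, "zip": "Z2", "county": "County A", "state": "S1"},
    {"id": 4, "zip": "Z3", "county": "County B", "state": "S1"},
    {"id": 5, "zip": "Z4", "county": "County C", "state": "S2"},
]

deps = non_key_dependencies(rows, ["zip", "county", "state"])
print(deps)
```

Any pair reported here is a candidate violation of 3rd normal form. Whether the dependency is real or an accident of the sample still needs subject knowledge.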

Data Normalization

The process of bringing a table into a higher normal form is called normalization.

Normalization gives a framework for organizing data that helps to

  • reduce redundancy in data
  • improve data consistency
  • simplify the database design
  • increase the speed of data access
  • make maintenance easier
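One normalization step can be sketched in Python as follows: columns that depend on a non-key column are moved into their own lookup table, and a join restores the original wide table when needed. The helpers `split_off` and `join_back` are hypothetical and the values are made-up placeholders.

```python
def split_off(rows, determinant, dependents):
    """Move columns that depend on `determinant` into a separate lookup table."""
    lookup = {}
    main = []
    for row in rows:
        lookup[row[determinant]] = {c: row[c] for c in dependents}
        main.append({k: v for k, v in row.items() if k not in dependents})
    return main, lookup

def join_back(main, lookup, determinant):
    """Rebuild the wide table by attaching the lookup columns to each row."""
    return [{**row, **lookup[row[determinant]]} for row in main]

# Placeholder data: county and state are repeated for every row with the same zip
rows = [
    {"id": 1, "zip": "Z1", "county": "County A", "state": "S1"},
    {"id": 2, "zip": "Z1", "county": "County A", "state": "S1"},
    {"id": 3, "zip": "Z2", "county": "County B", "state": "S1"},
]

main, lookup = split_off(rows, "zip", ["county", "state"])
restored = join_back(main, lookup, "zip")
print(main)
print(lookup)
```

After the split, each zip's county and state are stored once instead of once per row, which is where the reduced redundancy comes from.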

Normalization is generally terrible for any statistical modelling

Key-Value Pairs

A stricter version of 3rd normal form:

  • the table is in 2nd normal form,
  • there is only a single non-key column in the table

Note: a table like this is automatically in 3rd normal form.

Traits of Dog Breeds

                  Breed Affectionate.With.Family Good.With.Young.Children
1 Retrievers (Labrador)                        5                        5
2       French Bulldogs                        5                        5
3  German Shepherd Dogs                        5                        5
4   Retrievers (Golden)                        5                        5
5              Bulldogs                        4                        3
6               Poodles                        5                        5
  Good.With.Other.Dogs Shedding.Level Coat.Grooming.Frequency Drooling.Level
1                    5              4                       2              2
2                    4              3                       1              3
3                    3              4                       2              2
4                    5              4                       2              2
5                    3              3                       3              3
6                    3              1                       4              1
  Coat.Type Coat.Length Openness.To.Strangers Playfulness.Level
1    Double       Short                     5                 5
2    Smooth       Short                     5                 5
3    Double      Medium                     3                 4
4    Double      Medium                     5                 4
5    Smooth       Short                     4                 4
6     Curly        Long                     5                 5
  Watchdog.Protective.Nature Adaptability.Level Trainability.Level Energy.Level
1                          3                  5                  5            5
2                          3                  5                  4            3
3                          5                  5                  5            5
4                          3                  5                  5            3
5                          3                  3                  4            3
6                          5                  4                  5            4
  Barking.Level Mental.Stimulation.Needs
1             3                        4
2             1                        3
3             3                        5
4             1                        4
5             2                        3
6             4                        5

The data set is in “wide form”.

Without knowing the data well, we cannot rule out that some non-key variables (partly) determine other non-key variables, which would violate 3rd normal form.

Traits of Dog Breeds in Long Form

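library(tidyverse)  # provides %>%, select(), and pivot_longer()

# Keep Breed and the first three trait columns, then stack them into Trait/Score pairs
traits %>% select(Breed, 2:4) %>% 
  pivot_longer(cols = 2:4, names_to = "Trait", values_to = "Score") %>% 
  head()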
# A tibble: 6 × 3
  Breed                 Trait                    Score
  <chr>                 <chr>                    <int>
1 Retrievers (Labrador) Affectionate.With.Family     5
2 Retrievers (Labrador) Good.With.Young.Children     5
3 Retrievers (Labrador) Good.With.Other.Dogs         5
4 French Bulldogs       Affectionate.With.Family     5
5 French Bulldogs       Good.With.Young.Children     5
6 French Bulldogs       Good.With.Other.Dogs         4

The combination of Breed and Trait (Key) uniquely determines the score value (Value).
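For comparison, the same wide-to-long reshaping can be sketched in plain Python. The helper `to_long` is hypothetical and written only for illustration; the two input rows are copied from the table above.

```python
def to_long(rows, id_col, names_to, values_to):
    """Stack every column except id_col into (name, value) pairs, row by row."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col != id_col:
                long_rows.append(
                    {id_col: row[id_col], names_to: col, values_to: value}
                )
    return long_rows

# Wide form: one row per breed, one column per trait
wide = [
    {"Breed": "Retrievers (Labrador)", "Affectionate.With.Family": 5,
     "Good.With.Young.Children": 5, "Good.With.Other.Dogs": 5},
    {"Breed": "French Bulldogs", "Affectionate.With.Family": 5,
     "Good.With.Young.Children": 5, "Good.With.Other.Dogs": 4},
]

long_rows = to_long(wide, "Breed", "Trait", "Score")
for r in long_rows:
    print(r)
```

In practice one would more likely use `pandas.melt` with `id_vars`, `var_name`, and `value_name`, which yields the same key-value pairs but ordered column by column rather than row by row.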

Homework 8

  • Reshape two data sets
  • Create a visual in each
  • One reshaping in Python, one in R

Visuals in homework 8

Figure 1: Trait Distributions
Figure 2: Ranks of (some) Breeds over time