Predicting Dog Breed Rankings and Traits
Preliminaries
library(rmdformats)
library(here)
library(janitor)
library(magrittr)
library(naniar)
library(gtsummary)
library(broom)
library(rsample)
library(yardstick)
library(rms)
library(vcd)
library(patchwork)
library(car)
library(MASS)
library(nnet)
library(conflicted)
library(ggrepel)
library(GGally)
library(tidyverse)
theme_set(theme_bw())
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
1 Background
A major reason dogs are abandoned or brought to shelters is excessive barking. It would be helpful if a potential dog owner could predict a dog’s barking level based on the dog’s drooling level and how good they are around young children, which are two other factors potential owners consider. Perhaps if barking level could be anticipated in advance, potential owners could avoid the heartbreak of abandonment for them and the dog.
The American Kennel Club (AKC) displays a ranking of dog breeds annually. Many factors go into this ranking such as trainability, adaptability, energy level, shedding level, and many more. If I were to rank all the dog breeds, I would rank them by their outgoing nature with other dogs and people. These are traits I regard highly. Therefore, I seek to determine how well the factors ‘openness to strangers’ and ‘good with other dogs’ can predict the AKC 2020 Breed Ranking. I chose the year 2020 because that is the most recent year reported.
2 Research Questions
I will use an ordinal logistic regression model, the Bark Model, to predict a multi-categorical outcome, barking level, in order to answer the question:
1. How effectively can drooling level and goodness with young children predict a dog breed’s barking level?
I will then use an ordinary least squares method of linear regression modeling to fit the Rank Model to predict a quantitative outcome, 2020 AKC Breed Ranking, in order to answer the question:
2. How effectively can openness to strangers and goodness with other dogs predict a dog breed’s 2020 AKC ranking?
3 My Data
I will be combining two data sets. These data are from the Tidy Tuesday archive. Both data sets were put together from the AKC website by a GitHub user named kkakey. The first is a data set of various traits of each breed. The second data set includes the ranking of each breed for each year. I am extracting the 2020 Breed Ranking column from the second data set and adding it to the first to create one tibble that includes dog breed traits and their 2020 ranking. I will narrow down the list of traits to those of interest to my research questions: barking level, drooling level, good with young children, openness to strangers, and good with other dogs.
An advantage of these data is that the AKC is a renowned source of knowledge on dog breeds and I feel I can trust their annual rankings and trait designations. A limitation of these data is that the AKC website has been updated since it was scraped to create the .csv file, and now the website displays different traits than my data set. Some of the traits remained the same but have different names. This will make it difficult to update my models in the future.
Here is a link to the data: https://github.com/kkakey/dog_traits_AKC/tree/main/data
3.1 Data Ingest
First I will ingest the data set of traits into a tibble called
dog_traits
, then the data set of annual rankings into a
tibble called rank_2020
, selecting only the breed name and
the 2020 ranking.
<- read_csv(here("data", "breed_traits.csv")) %>%
dog_traits clean_names()
<- read_csv(here("data", "breed_rank_all.csv")) %>%
rank_2020 clean_names()
<- rank_2020 %>%
rank_2020 select(breed, x2020_rank)
Now I will combine the two data sets to create my analytic tibble.
The breed names have spaces in them, and each sequential space is
recognized differently by R. I need to run the mutate
function 3 times in each tibble in order to change the spaces to
underscores because the maximum numbers of spaces in a breed name is 3.
In dog_traits
I also need to change an apostrophe to an
underscore.
<- dog_traits %>%
dog_traits mutate(breed = str_replace(breed, " ", "_")) %>%
mutate(breed = str_replace(breed, " ", "_")) %>%
mutate(breed = str_replace(breed, " ", "_")) %>%
mutate(breed = str_replace(breed, "’", "_"))
<- rank_2020 %>%
rank_2020 mutate(breed = str_replace(breed, " ", "_")) %>%
mutate(breed = str_replace(breed, " ", "_")) %>%
mutate(breed = str_replace(breed, " ", "_"))
Now that the breed names in both data sets match each other, I can
join the two data sets by breed
. I will do so into a tibble
called dogs_raw
. Then I will check the number of rows (ie
breeds) in each data set.
<- left_join(dog_traits, rank_2020, by = "breed")
dogs_raw
dim(dog_traits)
[1] 195 17
dim(rank_2020)
[1] 198 2
dim(dogs_raw)
[1] 195 18
The number of rows in each data set do not match. This is because I
joined them by breed
so the combined tibble only includes
those breeds which existed in both data sets. There were 3 breeds in
rank_2020
that were not in dog_traits
. I have
a substantial number of breeds remaining, and I am doing a complete case
analysis, so I am not worried about this.
The number of columns is what I expected. I am adding 2 columns into a tibble with 17 columns, but 1 matches, so this creates a total of 19 columns.
I will save my 3 tibbles: the set of traits, the set of 2020 rankings, and the combined set. I will remove the initial 2 data sets now that they are saved, in order to keep my work space clean.
saveRDS(dog_traits, file = here("data", "dog_traits.rds"))
saveRDS(rank_2020, file = here("data", "rank_2020.rds"))
saveRDS(dogs_raw, file = here("data", "dogs_raw.rds"))
rm(dog_traits, rank_2020)
In order to help my work go faster on subsequent sessions, I will include the following code that will ingest my data from the saved, combined .rds file.
<- readRDS(here("data", "dogs_raw.rds")) dogs_raw
3.2 Tidying, Data Cleaning and Data Management
As originally loaded, the dogs_raw
data contain 195 rows
and 18 columns. I will view my variables of interest to see what their
types are so I can start cleaning them up.
%>%
dogs_raw select(breed, drooling_level, barking_level,
good_with_young_children, openness_to_strangers, %>%
good_with_other_dogs, x2020_rank) glimpse()
Rows: 195
Columns: 7
$ breed <chr> "Retrievers_(Labrador)", "French_Bulldogs", "…
$ drooling_level <dbl> 2, 3, 2, 2, 3, 1, 1, 3, 2, 2, 1, 1, 1, 3, 4, …
$ barking_level <dbl> 3, 1, 3, 1, 2, 4, 4, 1, 3, 5, 4, 3, 4, 3, 3, …
$ good_with_young_children <dbl> 5, 5, 5, 5, 3, 5, 5, 3, 5, 3, 3, 5, 5, 5, 3, …
$ openness_to_strangers <dbl> 5, 5, 3, 5, 4, 5, 3, 3, 4, 4, 4, 3, 5, 4, 3, …
$ good_with_other_dogs <dbl> 5, 4, 3, 5, 3, 3, 5, 3, 4, 4, 4, 3, 3, 3, 3, …
$ x2020_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
All my variables are ‘dbl’ type. I need most of them to be factors,
and some to be characters. I will complete these conversions in the
following code. I will also make bark
an ordered factor so
it works well in my logistic regression model. I will rename my
variables and their categories to be more clear and concise. I will need
to collapse some of the categories of my categorical variables in order
to have each category contain at least 25 observations. I will also
relevel my categories because many are ordinal, else they will default
to alphabetical order. I will place the cleaned variables in a new
tibble called dogs
. Then I will remove the tibble of raw
data so as to keep my work space clean.
<- dogs_raw %>%
dogs rename(bark = barking_level,
drool = drooling_level,
kids = good_with_young_children,
dog_friends = good_with_other_dogs,
strangers = openness_to_strangers,
rank = x2020_rank) %>%
mutate(bark = factor(bark, ordered = TRUE),
bark = fct_recode(bark,
"Only to alert" = "1",
"Only to alert" = "2",
"Quiet" = "3",
"Loud" = "4",
"Very vocal" = "5"),
bark = fct_relevel(bark, "Only to alert", "Quiet", "Loud", "Very vocal"),
drool = factor(drool),
drool = fct_recode(drool,
"Dry" = "1",
"Regular" = "2",
"Drooler" = "3",
"Drooler" = "4",
"Drooler" = "5"),
drool = fct_relevel(drool, "Dry", "Regular", "Drooler"),
kids = factor(kids),
kids = fct_recode(kids,
"Adults only" = "1",
"Adults only" = "3",
"Kid friendly" = "4",
"Kid friendly" = "5"),
kids = fct_relevel(kids, "Adults only", "Kid friendly"),
dog_friends = factor(dog_friends),
dog_friends = fct_recode(dog_friends,
"Lone wolf" = "1",
"Lone wolf" = "2",
"Lone wolf" = "3",
"Pack animal" = "4",
"Pack animal" = "5"),
dog_friends = fct_relevel(dog_friends, "Lone wolf", "Pack animal"),
strangers = factor(strangers),
strangers = fct_recode(strangers,
"Stranger danger" = "1",
"Stranger danger" = "2",
"Stranger danger" = "3",
"Friendly" = "4",
"Outgoing" = "5"),
strangers = fct_relevel(strangers, "Stranger danger", "Friendly", "Outgoing")) %>%
select(breed, rank, bark, drool, kids, dog_friends, strangers)
rm(dogs_raw)
3.3 Missingness
I noticed in a data summary that I have an extra level in each of my
categorical variables that I did not expect, designated ‘0’. In order to
determine which breed is causing this issue, I will arbitrarily choose
bark
.
%>%
dogs filter(bark == 0)
# A tibble: 1 × 7
breed rank bark drool kids dog_friends strangers
<chr> <dbl> <ord> <fct> <fct> <fct> <fct>
1 Plott_Hounds 167 0 0 0 0 0
Plott hounds were not given scores for any of the traits in the trait data set. I will treat these as missing data, and therefore remove this row from my tibble for missing outcome data in my complete case analysis.
<- dogs %>%
dogs filter(bark != "0")
dim(dogs)
[1] 194 7
I am left with 194 rows as expected.
I will check the rest of my data for missingness.
miss_var_summary(dogs)
# A tibble: 7 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 breed 0 0
2 rank 0 0
3 bark 0 0
4 drool 0 0
5 kids 0 0
6 dog_friends 0 0
7 strangers 0 0
I have no missing data.
I will add row numbers so each breed has a unique identifier. This will not necessarily equal the ranking as I removed Plott Hounds as above. I will drop any empty categories.
<- dogs %>%
dogs mutate(id = row_number()) %>%
droplevels()
3.4 Tidied Tibble
My tibble dogs
contains 194 rows (breeds) and 8 columns
(variables). Each variable is contained in a column, and each row
represents a single breed. All variables now have appropriate types,
names, and categories.
dogs
# A tibble: 194 × 8
breed rank bark drool kids dog_friends strangers id
<chr> <dbl> <ord> <fct> <fct> <fct> <fct> <int>
1 Retrievers_(Labrador) 1 Quiet Regu… Kid … Pack animal Outgoing 1
2 French_Bulldogs 2 Only… Droo… Kid … Pack animal Outgoing 2
3 German_Shepherd_Dogs 3 Quiet Regu… Kid … Lone wolf Stranger… 3
4 Retrievers_(Golden) 4 Only… Regu… Kid … Pack animal Outgoing 4
5 Bulldogs 5 Only… Droo… Adul… Lone wolf Friendly 5
6 Poodles 6 Loud Dry Kid … Lone wolf Outgoing 6
7 Beagles 7 Loud Dry Kid … Pack animal Stranger… 7
8 Rottweilers 8 Only… Droo… Adul… Lone wolf Stranger… 8
9 Pointers_(German_Shortha… 9 Quiet Regu… Kid … Pack animal Friendly 9
10 Dachshunds 10 Very… Regu… Adul… Pack animal Friendly 10
# … with 184 more rows
I will save my tidied tibble as an R data set.
saveRDS(dogs, file = here("data", "dogs.rds"))
In order to help my work go faster on subsequent sessions, I will include the following code that will ingest my cleaned data from the saved .rds file.
<- readRDS(here("data", "dogs.rds")) dogs
4 Code Book and Clean Data Summary
The data in my complete dogs
sample consist of 194 dog
breeds on the American Kennel Club website. The AKC conveys no defined
exclusion criteria. Data are excluded from my tibble for outcome
missingness only. Of the 194 subjects, 194 have complete data on all
variables listed below. My logistic regression outcome
is bark
, which is a multi-categorical indicator of the
breed’s barking level. My ordinary least squares linear regression
outcome is rank
which is the 2020 AKC
Breed Ranking. I have a variable called id
which is a
unique identifier for each row and the variable breed
which
identifies the breed the data describe. All other variables listed below
will serve as predictors for my models.
4.1 Main Codebook Table
Variables included in my analyses are summarized in the following table. I will omit identifying variables in favor of being concise.
%>%
dogs select(-id, -breed) %>%
tbl_summary(.,
label = list(
bark = "Barking level",
drool = "Drooling level",
kids = "Good with Young Children",
strangers = "Openness to Strangers",
dog_friends = "Good with Other Dogs",
rank = "2020 American Kennel Club breed ranking"),
stat = list(all_continuous() ~
"{median} [{min} to {max}]" ))
Characteristic | N = 1941 |
---|---|
2020 American Kennel Club breed ranking | 98 [1 to 195] |
Barking level | |
Only to alert | 36 (19%) |
Quiet | 99 (51%) |
Loud | 34 (18%) |
Very vocal | 25 (13%) |
Drooling level | |
Dry | 92 (47%) |
Regular | 64 (33%) |
Drooler | 38 (20%) |
Good with Young Children | |
Adults only | 103 (53%) |
Kid friendly | 91 (47%) |
Good with Other Dogs | |
Lone wolf | 126 (65%) |
Pack animal | 68 (35%) |
Openness to Strangers | |
Stranger danger | 118 (61%) |
Friendly | 43 (22%) |
Outgoing | 33 (17%) |
1 Median [Range]; n (%) |
My data set has 194 subjects with AKC Breed Ranking ranging from 1 to 195. The breakdown of each variable (other than identifiers) is listed above and will be described in deeper detail below.
4.2 Data Distribution Description
This describes the distribution of my data.
%>% describe(.) %>% Hmisc::html() dogs
8 Variables 194 Observations
breed
n | missing | distinct |
---|---|---|
194 | 0 | 194 |
lowest : | Affenpinschers | Afghan_Hounds | Airedale_Terriers | Akitas | Alaskan_Malamutes |
highest: | Whippets | Wirehaired_Pointing_Griffons | Wirehaired_Vizslas | Xoloitzcuintli | Yorkshire_Terriers |
rank
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
194 | 0 | 194 | 1 | 97.64 | 65.25 | 10.65 | 20.30 | 49.25 | 97.50 | 145.75 | 175.70 | 185.35 |
bark
n | missing | distinct |
---|---|---|
194 | 0 | 4 |
Value Only to alert Quiet Loud Very vocal Frequency 36 99 34 25 Proportion 0.186 0.510 0.175 0.129
drool
n | missing | distinct |
---|---|---|
194 | 0 | 3 |
Value Dry Regular Drooler Frequency 92 64 38 Proportion 0.474 0.330 0.196
kids
n | missing | distinct |
---|---|---|
194 | 0 | 2 |
Value Adults only Kid friendly Frequency 103 91 Proportion 0.531 0.469
dog_friends
n | missing | distinct |
---|---|---|
194 | 0 | 2 |
Value Lone wolf Pack animal Frequency 126 68 Proportion 0.649 0.351
strangers
n | missing | distinct |
---|---|---|
194 | 0 | 3 |
Value Stranger danger Friendly Outgoing Frequency 118 43 33 Proportion 0.608 0.222 0.170
id
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
194 | 0 | 194 | 1 | 97.5 | 65 | 10.65 | 20.30 | 49.25 | 97.50 | 145.75 | 174.70 | 184.35 |
5 Analysis 1: Bark Prediction Model
I will use an ordinal logistic regression model called
Bark Model
to predict barking level, bark
,
using drooling level, drool
, and goodness with young
children, kids
.
5.1 Exploring the Data
5.1.1 My Outcome:
bark
bark
is a 4-category variable. I want to visualize what
percentage of breeds are in each category.
ggplot(dogs, aes(x = bark, fill = bark)) +
geom_bar(aes(y = (..count..)/sum(..count..)), col = "black") +
geom_text(aes(y = (..count..)/sum(..count..),
label = scales::percent((..count..) /
sum(..count..))),
stat = "count", vjust = 1.5,
color = "black", size = 5) +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "BuPu") +
guides(fill = "none") +
labs(title = "Distribution of Barking Level Across Categories",
subtitle = "Outcome Distribution for `bark_model`",
x = "Barking Level",
y = "Percentage",
caption = "194 breeds in `dogs`")
Half of the breeds have a barking level of ‘quiet’. The other half is distributed fairly evenly across the other 3 categories: ‘only to alert’, ‘loud’, and ‘very vocal’. The category ‘very vocal’ has the fewest breeds.
5.1.2 My Predictors:
drool
and kids
I will be using drooling level and goodness with young children in order to predict barking level. I want to visualize the spread of barking level within each category of my predictors.
<- ggplot(dogs,
p1 aes(x = drool, fill = bark)) +
geom_bar(position = "dodge", col = "black") +
scale_fill_brewer(palette = "GnBu") +
labs(title = "Drooling Level",
x = "Drooling Level",
y = "Proportion",
fill = "Barking Level")
<- ggplot(dogs,
p2 aes(x = kids, fill = bark)) +
geom_bar(position = "dodge", col = "black") +
scale_fill_brewer(palette = "OrRd") +
labs(title = "Goodness with Young Children",
x = "Good with Young Children",
y = "Proportion",
fill = "Barking Level")
/ p2 +
p1 plot_annotation(title = "Spread of Barking Level (Outcome) Within Each Predictor Category",
caption = "194 breeds in `dogs`")
The ratio of the barking levels visually appears consistent within
each category of drooling level. Perhaps drooling level will not be an
effective predictor of barking level. For my predictor
kids
, describing goodness with young children, the breeds
who are kid friendly have more ‘very vocal’ breeds and fewer ‘only to
alert’ breeds than the breeds who prefer adults only. This predictor may
be more effective.
I will view a table of a breakdown of how many breeds are in each category, combining both of my predictors.
%>%
dogs tabyl(drool, kids) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(rounding = "half up", digits = 0) %>%
adorn_ns() %>%
kable()
drool | Adults only | Kid friendly | Total |
---|---|---|---|
Dry | 61% (56) | 39% (36) | 100% (92) |
Regular | 42% (27) | 58% (37) | 100% (64) |
Drooler | 53% (20) | 47% (18) | 100% (38) |
Total | 53% (103) | 47% (91) | 100% (194) |
Goodness with young children is about even between the two categories: 53% of breeds prefer adults only and 47% are kid friendly. Within these categories, the spread of drooling level differs. Within the group of breeds who prefer adults only, more than half of the breeds are in the ‘dry’ category for drooling level (56 breeds out of 103 breeds). Within the group of breeds who are kid friendly, there is an even split between ‘dry’ and ‘regular’ drooling level breeds. In both, the minority of breeds are ‘droolers’.
I will check for collinearity between my predictors.
::vif(polr(bark ~ drool + kids, data = dogs, Hess = TRUE)) car
GVIF Df GVIF^(1/(2*Df))
drool 1.039467 2 1.009724
kids 1.039467 1 1.019543
These variance inflation factors are a bit higher than I would like, so I will keep collinearity in mind as I fit and evaluate my model.
5.1.3 Transformation Analysis & Assessment for Nonlinear Terms
I will not transform my outcome because it is categorical. I will not check a Spearman rho2 plot because I have only 1 option of a nonlinear term, an interaction between my 2 categorical predictors, which I will use.
5.2 Fitting the Model: A Proportional-Odds Cumulative-Logit Model
First I will ensure my outcome is an ordered factor.
is.ordered(dogs$bark)
[1] TRUE
Great - it is. Now I will fit my model using the polr
engine. I will keep in mind that the distance between each category may
not be equal, ie the difference between breeds who bark ‘only to alert’
vs ‘quiet’ breeds is not necessarily the same as the difference between
breeds who are ‘very vocal’ vs. ‘loud’.
<- polr(bark ~ drool * kids, data = dogs, Hess = TRUE)
bark_model
summary(bark_model)
Call:
polr(formula = bark ~ drool * kids, data = dogs, Hess = TRUE)
Coefficients:
Value Std. Error t value
droolRegular -0.03889 0.4446 -0.08748
droolDrooler -0.87278 0.5000 -1.74543
kidsKid friendly 0.93857 0.4085 2.29737
droolRegular:kidsKid friendly -0.45815 0.6277 -0.72988
droolDrooler:kidsKid friendly -0.34710 0.7317 -0.47435
Intercepts:
Value Std. Error t value
Only to alert|Quiet -1.4287 0.2830 -5.0477
Quiet|Loud 0.9993 0.2663 3.7521
Loud|Very vocal 2.1295 0.3094 6.8821
Residual Deviance: 461.5686
AIC: 477.5686
My model has 3 intercepts for the 4 bark
categories and
5 slopes for our binary predictor and 3-category predictor as well as
the interaction between them.
confint(bark_model)
2.5 % 97.5 %
droolRegular -0.9153908 0.8307751
droolDrooler -1.8612879 0.1060836
kidsKid friendly 0.1403315 1.7452809
droolRegular:kidsKid friendly -1.6917574 0.7727996
droolDrooler:kidsKid friendly -1.7869159 1.0872452
My model predicts that a breed with a ‘regular’ drooling level would not consistently have an impact on the outcome when compared to reference (‘dry’ drooling level) because the 95% confidence interval includes 0. Neither would a ‘drooler’ when compared to the same reference. My model predicts that goodness with young children is a trait which likely consistently impacts the outcome. Both of the interaction term slopes have confidence intervals which include 0, so it is unlikely that these meaningfully improve the predictive ability of my model.
Before I start making predictions, I will display a plot of effect
size for my model. In order to do that, I must fit the model with the
lrm
engine and in order to do that, I must set a
datadist.
<- datadist(dogs)
d options(datadist = "d")
<- lrm(bark ~ drool * kids, data = dogs, x = T, y = T)
bark_model_lrm
plot(summary(bark_model_lrm))
Similar to what I explained about the confidence intervals, the
kids
confidence interval is the one that does not include 0
and therefore would be most likely to improve the predictive ability of
the model.
5.2.1 Making Predictions
There are 6 possible predictions of my 2 variables, so I will fit the model to 6 new dogs with each of these combinations, show the new data as a tibble, and plot the predicted probabilities.
<- data.frame(name = c("Alfalfa", "Bernie", "Clancy",
dogs_new "Cooper", "Frankie", "Ted"),
drool = c("Drooler", "Drooler", "Regular",
"Regular", "Dry", "Dry"),
kids = c("Kid friendly", "Adults only", "Kid friendly",
"Adults only", "Kid friendly", "Adults only"))
tibble(dogs_new)
# A tibble: 6 × 3
name drool kids
<chr> <chr> <chr>
1 Alfalfa Drooler Kid friendly
2 Bernie Drooler Adults only
3 Clancy Regular Kid friendly
4 Cooper Regular Adults only
5 Frankie Dry Kid friendly
6 Ted Dry Adults only
predict(bark_model, dogs_new, interval = "prediction")
[1] Quiet Quiet Quiet Quiet Quiet Quiet
Levels: Only to alert Quiet Loud Very vocal
plot(predict(bark_model, dogs_new, interval = "prediction"))
Unfortunately, for every possible combination of my 2 variables, my model predicts the barking level will be ‘quiet’. This is probably the worst possible predictive ability for a categorical prediction model. I will move on to validation.
5.3 Model Validation
In order to validate my model, I will have to use the
lrm
fit. I will use the default number of bootstraps
(40).
set.seed(2022)
validate(bark_model_lrm)
index.orig training test optimism index.corrected n
Dxy 0.2310 0.2472 0.2036 0.0436 0.1874 40
R2 0.0750 0.0920 0.0571 0.0349 0.0401 40
Intercept 0.0000 0.0000 -0.1431 0.1431 -0.1431 40
Slope 1.0000 1.0000 0.8014 0.1986 0.8014 40
Emax 0.0000 0.0000 0.0753 0.0753 0.0753 40
D 0.0659 0.0832 0.0485 0.0347 0.0312 40
U -0.0103 -0.0103 -1.2004 1.1901 -1.2004 40
Q 0.0762 0.0935 1.2489 -1.1554 1.2315 40
B 0.2020 0.1978 0.2072 -0.0095 0.2114 40
g 0.5635 0.6089 0.4721 0.1368 0.4267 40
gp 0.1145 0.1211 0.0961 0.0251 0.0894 40
My R2 changes somewhat in new data, from an estimated R2 of 0.0920 across 40 training samples, to an average R2 of 0.0571, creating an optimism of 0.0349. My R2 corrected for overfitting is then 0.0401, which is quite different than my original R2 of 0.0750.
I will use Somers’ d (Dxy) to calculate the C statistic.
<- 0.5 + (0.2310/2)
initial <- 0.5 + (0.1874/2)
corrected
- corrected) (initial
[1] 0.0218
The change in the C-statistic is 0.0218. This is a small change, indicating that if my model were fit to new data, the area under the curve would be only 2 percentage points smaller.
5.4 Testing the Proportional Odds Assumption
In order to test the proportional odds assumption, I will fit my
model with the multinom
function.
<- multinom(bark ~ drool * kids, data = dogs)) (bark_model_poa
# weights: 28 (18 variable)
initial value 268.941106
iter 10 value 224.924428
iter 20 value 223.906699
iter 30 value 223.872516
final value 223.872459
converged
Call:
multinom(formula = bark ~ drool * kids, data = dogs)
Coefficients:
(Intercept) droolRegular droolDrooler kidsKid friendly
Quiet 0.96949830 -0.19640480 -0.5175377 0.3520780
Loud 0.08712077 -0.08724537 -15.1870165 0.4722902
Very vocal -1.01143305 -0.08736778 -0.2413670 1.9275416
droolRegular:kidsKid friendly droolDrooler:kidsKid friendly
Quiet 0.5331531 0.1122895
Loud -0.2489137 14.6276306
Very vocal -0.2690124 -17.5208386
Residual Deviance: 447.7449
AIC: 483.7449
I will compare the log likelihoods of the multinomial model and the proportional odds logistic regression model. The multinomial logit fits 3 intercepts and 15 slopes for a total of 18 parameters. The proportional odds logit fit 3 intercepts and 5 slopes for a total of 8 parameters. The difference in parameters between the models is 10.
<- logLik(bark_model)
LL_polr <- logLik(bark_model_poa)
LL_multinom <- -2 * (LL_polr[1] - LL_multinom[1])) (G
[1] 13.82372
pchisq(G, 10, lower.tail = FALSE)
[1] 0.181185
This result is not highly significant, so I am comfortable assuming proportional odds.
6 Analysis 2: Rank Prediction Model
I will use an ordinary least squares linear regression model called
‘Rank Model’ to predict the 2020 AKC Breed Ranking, rank
,
using openness to strangers, strangers
, and goodness with
other dogs, dog_friends
.
6.1 Exploring the Data
My outcome is a ranking, and I have included all the ranked breeds. Therefore, it is impossible for my data to be normally distributed as there is exactly 1 breed at each rank level. Instead of visualizing the distribution of my data in a Normal Q-Q plot or Density Histogram as I normally would, I will visualize the spread of my predictors across the rankings.
ggplot(data = dogs, aes(x = strangers,
y = rank,
col = dog_friends)) +
geom_boxplot() +
guides(fill=guide_legend("title")) +
labs(title = "Spread of 'Goodness with Other Dogs' and 'Openness to Strangers'",
subtitle = "Across 194 Ranked Breeds",
x = "Openness to Strangers",
y = "Rank",
col = "Good with Other Dogs",
shape = "Good with Other Dogs",
caption = "194 breeds from `dogs`")
The outcome is on the y-axis, with higher ranked breeds lower on the plot. The red color represents breeds that are ‘lone wolves’ or do not get along with other dogs. The blue color represents ‘pack animals’ or breeds who do get along well with other dogs. The x-axis is openness to strangers with the furthest left points being ‘stranger danger’ or lack of openness and the furthest right points being ‘outgoing’.
The only group with outliers is breeds who are outgoing pack animals, and these outliers are 2 breeds ranked in the lowest quartile. Within the ‘stranger danger’ category, the mean rank of pack animals is lower than that of the lone wolves, but both groups range almost the full length of the rankings. Within the ‘friendly’ and ‘outgoing’ categories, the mean rank of lone wolves is lower than that of the pack animals. The range of friendly breeds’ rankings is similar to the range of ‘stranger danger’ breeds’ rankings, but the range of outgoing breeds rankings is smaller.
6.1.1 Visualizing in Groups
I will take a closer look at only the top 10 breeds to see if I can see a pattern there.
<- dogs %>%
dogs_top10 arrange(rank) %>%
filter(rank == 1:10)
ggplot(data = dogs_top10, aes(x = strangers,
y = rank,
col = dog_friends)) +
geom_point(aes(shape = dog_friends),
alpha = 0.6,
size = 5) +
geom_text_repel(label= dogs_top10$breed,
col = "black",
size = 4) +
guides(fill=guide_legend("title")) +
labs(title = "Spread of 'Goodness with Other Dogs' and 'Openness to Strangers'",
subtitle = "Across Top 10 Dog Breeds",
x = "Openness to Strangers",
y = "Rank",
col = "Good with Other Dogs",
shape = "Good with Other Dogs",
caption = "breeds ranked 1-10 in `dogs`")
The red circles represent long wolves and the blue triangles represent pack animals. There is really no pattern here. Four of the breeds are ‘lone wolves’ and 6 are ‘pack animals’. There are 4 breeds in the ‘outgoing’ category and 3 breeds in each of the ‘stranger danger’ and ‘friendly’ categories of openness to strangers. This doesn’t answer much for me.
I will now try taking a look at the highest quartile ranked breeds and compare them to the lowest quartile.
194 / 4
[1] 48.5
Each quartile contains 48.5 breeds. I would rather underestimate than overestimate, so I will sample the highest ranked 48 breeds and the lowest ranked 48 breeds.
<- dogs %>%
dogs_top48 arrange(rank) %>%
filter(rank == 1:48)
<- dogs %>%
dogs_low48 arrange(rank) %>%
top_n(48, rank)
<- ggplot(data = dogs_top48, aes(x = strangers,
p3 y = rank,
col = dog_friends)) +
geom_boxplot() +
guides(col = "none") +
labs(title = "Highest Ranked Quartile of Breeds",
x = "Openness to Strangers",
y = "Rank",
col = "Good with Other Dogs",
shape = "Good with Other Dogs")
<- ggplot(data = dogs_low48, aes(x = strangers,
p4 y = rank,
col = dog_friends)) +
geom_boxplot() +
labs(title = "Lowest Ranked Quartile of Breeds",
x = "Openness to Strangers",
y = "Rank",
col = "Good with Other Dogs",
shape = "Good with Other Dogs")
+ p4 +
p3 plot_annotation(title = "Spread of 'Goodness with Other Dogs' and 'Openness to Strangers'",
subtitle = "Across Highest and Lowest Quartiles of Ranked Breeds",
caption = "96 breeds ranked 1-48 and 147-195 in `dogs`")
The only group with an outlier is ‘stranger danger’ pack animals, who
have an outlier ranked higher than the rest of that group. In the
highest quartile of rankings, within the ‘stranger danger’ category, the
pack animals are ranked higher than the lone wolves. The opposite is
true for friendlt and outgoing dogs in the highest quartile. In the
lowest quartile, lone wolves are ranked higher than pack animals in all
of the categories of openness to strangers
.
6.1.2 My Predictors
I want to take a look at the spread of breeds across each category of my 2 predictors.
%>%
dogs tabyl(strangers, dog_friends) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(rounding = "half up", digits = 0) %>%
adorn_ns() %>%
kable()
strangers | Lone wolf | Pack animal | Total |
---|---|---|---|
Stranger danger | 72% (85) | 28% (33) | 100% (118) |
Friendly | 58% (25) | 42% (18) | 100% (43) |
Outgoing | 48% (16) | 52% (17) | 100% (33) |
Total | 65% (126) | 35% (68) | 100% (194) |
More breeds are ‘lone wolves’ (approximately 2/3) than are ‘pack animals’. Within the ‘lone wolf’ category of goodness with other dogs, many more breeds are in the ‘stranger danger’ category of openness to strangers as compared to the ‘friendly’ and ‘outgoing’ categories, which makes sense. Within the ‘pack animal’ category of goodness with other dogs, the breeds are more evenly split, but still most are in the ‘stranger danger’ category of openness to strangers.
I need to check collinearity between my predictors. I have already set a datadist for my data set.
::vif(ols(rank ~ strangers * dog_friends, data = dogs, x = T, y = T)) car
GVIF Df GVIF^(1/(2*Df))
strangers 14.68864 0 Inf
dog_friends 14.68864 0 Inf
strangers:dog_friends 14.68864 0 Inf
These variance inflation factors are high, so I will keep collinearity in mind as I fit and evaluate my model.
6.1.3 Transformation Analysis & Assessment for Nonlinear Terms
A transformation is unlikely to improve the distribution of my outcome because it is of uniform distribution and I do not think a Box-Cox assessment/plot would change my mind here. I have only 1 option of nonlinear term to add, so I will include an interaction between my 2 categorical predictors.
6.2 Fitting the Model: A Linear Regression Model using Ordinary Least Squares
I will fit my model using the ols
engine. I have already
set a datadist for my data set.
<- ols(data = dogs, rank ~ strangers * dog_friends,
(rank_model x = TRUE, y = TRUE))
Linear Regression Model
ols(formula = rank ~ strangers * dog_friends, data = dogs, x = TRUE,
y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 194 LR chi2 15.92 R2 0.079
sigma54.8111 d.f. 5 R2 adj 0.054
d.f. 188 Pr(> chi2) 0.0071 g 16.607
Residuals
Min 1Q Median 3Q Max
-111.091 -43.818 1.806 39.276 127.118
Coef S.E. t Pr(>|t|)
Intercept 103.6941 5.9451 17.44 <0.0001
strangers=Friendly -9.6941 12.4705 -0.78 0.4379
strangers=Outgoing -12.8816 14.9369 -0.86 0.3896
dog_friends=Pack animal 14.3968 11.2420 1.28 0.2019
strangers=Friendly * dog_friends=Pack animal -30.9523 20.3336 -1.52 0.1296
strangers=Outgoing * dog_friends=Pack animal -44.3269 22.1556 -2.00 0.0469
The g statistic is 16.607, which indicates that if I randomly selected 2 breeds, this would be the average difference in their predicted rankings. In a ranked list of 194 breeds, this is not very good.
summary(rank_model)
Effects Response : rank
Factor Low High Diff. Effect S.E. Lower 0.95
strangers - Friendly:Stranger danger 1 2 NA -9.6941 12.471 -34.2940
strangers - Outgoing:Stranger danger 1 3 NA -12.8820 14.937 -42.3470
dog_friends - Pack animal:Lone wolf 1 2 NA 14.3970 11.242 -7.7798
Upper 0.95
14.906
16.584
36.573
Adjusted to: strangers=Stranger danger dog_friends=Lone wolf
This summary tells me that if I had 2 dog breeds, Cooper (a black Labrador) and Ted (a Cavalier King Charles Spaniel) who are both ‘lone wolves’ when it comes to goodness with other dogs, but Cooper is ‘outgoing’ and Ted’s category of openness to strangers is ‘stranger danger’, then my model predicts that Cooper’s 2020 AKC Breed Ranking will be 12.9 points lower than Ted’s. Here is a visualization of these effect sizes.
plot(summary(rank_model))
All of the confidence intervals in this model include 0, showing that these predictors probably have poor predictive ability.
6.3 Residual Diagnostics
I will check which points may be influential, and which predictors might have more influential points.
which.influence(rank_model)
$Intercept
[1] 3
$strangers
[1] 5 6 13 14 18 32 155 162 163 165 175 180 190
$dog_friends
[1] 7 20 27 28 31 36 191
$`strangers * dog_friends`
[1] 6 13 160 163 164 169 187 189 193
This is interesting. There are about half the amount of influential
point in dog_friends
as there are in strangers
and most of the influential points in ‘strangers’ are in very low ranked
breeds. If I were to drop some of these rows, it would impact the
estimates of the intercept for strangers
.
I will check the residuals for my model and visualize a plot.
<- data.frame(area = 1:194)
temp $resid <- residuals(rank_model, type = "ordinary")
temp
ggplot(temp, aes(x = area, y = resid)) +
geom_point() +
geom_line()
The residuals range from about -100 to 100. The most accurate predictions occur in the breeds ranked about 130-155. The higher ranked points appear to have larger residuals.
6.4 Model Validation
I will validate my model using the default number of bootstraps.
set.seed(2022)
validate(rank_model)
index.orig training test optimism index.corrected n
R-square 0.0788 0.1118 0.0469 0.0650 0.0138 40
MSE 2911.3372 2776.2205 3012.2478 -236.0273 3147.3645 40
g 16.6071 18.3806 14.2807 4.0999 12.5072 40
Intercept 0.0000 0.0000 17.7570 -17.7570 17.7570 40
Slope 1.0000 1.0000 0.8238 0.1762 0.8238 40
My R2 changes substantially in new data, from 0.1118 across 40 training samples to an estimated 0.0469 in a test sample, creating an optimism of 0.065. My R2 corrected for overfitting is 0.0138, which is much lower than my original of 0.0788. My corrected mean squared error is 3147, much higher than 2911.
7 Conclusions
In this data science project, I sought to answer two research questions: 1) “How effectively can drooling level and goodness with young children predict a dog breed’s barking level?”, for which I fit an ordinal logistic regression model to predict this multi-categorical outcome, and 2) “how effectively can openness to strangers and goodness with other dogs predict a dog breed’s 2020 AKC ranking?”, for which I used an ordinary least squares method to fit a linear regression model.
In exploring my data, fitting the models, making predictions, visualizing effect sizes, and interpreting summary statistics, I have learned valuable insight into the predictive ability of my models. Firstly, drooling level and goodness with young children are not very effective predictors of barking level within my data set nor within new data. Secondly, openness to strangers and goodness with other dogs were not very effective predictors of 2020 AKC Breed Ranking within my data set nor within new data. I have learned why the AKC takes so many other traits into consideration when ranking dog breeds, because the traits I have investigated here do not correlate well with each other.
A major limitation is that the AKC website has updated their traits search algorithm and no longer uses the same phrasing nor the same measures as when this data was scraped. Therefore, I am not able to easily update my models with each annual ranking. Another limitation is that many of the variables in the raw data set had very high variance inflation factors when I added more than 2 variables to the model. Some were predictably collinear, such as coat length and grooming frequency, or energy level and mental stimulation needs. However, most were surprising to me, such as grooming frequency and energy level, or trainability and coat length. I felt that this limited me in predictor choice, but I was still able to answer my research questions.
A next step would be to scrape the AKC website again to obtain the more updated measures, fit prediction models, and compare those models to my models fit here. This would help determine if the changes to the website affected the predictive ability of these variables. Another interesting next step would be to adjust for age or life stage of the dogs. Puppies, young dogs, middle aged dogs, and elderly dogs all behave differently, regardless of their breed. For instance, puppies usually sleep more than young or middle aged dogs but eat less. Puppies go through ‘fear phases’ where their openness to strangers or goodness with young children or other dogs may be much lower than an older dog of the same breed who has more experience.
The impact of this data science project helps me understand which predictors do (or do not, rather) weigh strongly in the AKC’s annual breed rankings, and taught me that breeds who drool more do not necessarily bark more, which is something I used to think. For someone contemplating getting a dog, learning about these traits and how they interact could be vital to the lifelong success and happiness of the both the owner and their new pup.
As a result of this data science project, I have learned about data selection and variable selection. In choosing a data set to analyze, I wish I had run some simple toy models in a subset to better appreciate possible collinearity or high correlations between predictors. If I knew how closely the variables in the data sets I used correlated, I might have combined a third data set with some different measures, like perhaps breeds’ life expectancies or inclinations toward illness. Furthermore in my variable selection, I would have liked to add a third or fourth predictor to one or both of my models, but collinearity concerns prevented me from doing so. This has driven the point home that data selection and cleaning is the vast majority of the workload when it comes to fitting models to answer research questions.
8 References and Acknowledgments
Here is a link to the data: https://github.com/kkakey/dog_traits_AKC/tree/main/data
Thank you to Dr. Love and the CRSP 432 Teaching Assistants for the conveying the knowledge and insight necessary to complete this project and answer my research questions.
9 Session Information
::session_info() xfun
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
Package version:
abind_1.4-5 askpass_1.1 assertthat_0.2.1
backports_1.4.1 base64enc_0.1-3 bit_4.0.4
bit64_4.0.5 bitops_1.0.7 blob_1.2.3
bookdown_0.26 boot_1.3.28 brio_1.1.3
broom_0.8.0 broom.helpers_1.7.0 bslib_0.3.1
cachem_1.0.6 callr_3.7.0 car_3.0-12
carData_3.0-5 cellranger_1.1.0 checkmate_2.1.0
cli_3.2.0 clipr_0.8.0 cluster_2.1.3
codetools_0.2-18 colorspace_2.0-3 commonmark_1.8.0
compiler_4.1.2 conflicted_1.1.0 cpp11_0.4.2
crayon_1.5.1 curl_4.3.2 data.table_1.14.2
DBI_1.1.2 dbplyr_2.1.1 desc_1.4.1
diffobj_0.3.5 digest_0.6.29 dplyr_1.0.8
dtplyr_1.2.1 ellipsis_0.3.2 evaluate_0.15
fansi_1.0.3 farver_2.1.0 fastmap_1.1.0
forcats_0.5.1 foreign_0.8-82 Formula_1.2-4
fs_1.5.2 furrr_0.2.3 future_1.25.0
gargle_1.2.0 generics_0.1.2 GGally_2.1.2
ggplot2_3.3.5 ggrepel_0.9.1 globals_0.14.0
glue_1.6.2 googledrive_2.0.0 googlesheets4_1.0.0
graphics_4.1.2 grDevices_4.1.2 grid_4.1.2
gridExtra_2.3 gt_0.5.0 gtable_0.3.0
gtsummary_1.5.2 haven_2.5.0 here_1.0.1
highr_0.9 Hmisc_4.7-0 hms_1.1.1
htmlTable_2.4.0 htmltools_0.5.2 htmlwidgets_1.5.4
httr_1.4.2 ids_1.0.1 isoband_0.2.5
janitor_2.1.0 jpeg_0.1-9 jquerylib_0.1.4
jsonlite_1.8.0 knitr_1.38 labeling_0.4.2
labelled_2.9.0 lattice_0.20-45 latticeExtra_0.6-29
lifecycle_1.0.1 listenv_0.8.0 lme4_1.1.29
lmtest_0.9-40 lubridate_1.8.0 magrittr_2.0.3
maptools_1.1.4 MASS_7.3-57 Matrix_1.4-1
MatrixModels_0.5-0 memoise_2.0.1 methods_4.1.2
mgcv_1.8.40 mime_0.12 minqa_1.2.4
modelr_0.1.8 multcomp_1.4-18 munsell_0.5.0
mvtnorm_1.1-3 naniar_0.6.1 nlme_3.1-157
nloptr_2.0.0 nnet_7.3-17 norm_1.0.10.0
numDeriv_2016.8.1.1 openssl_2.0.0 parallel_4.1.2
parallelly_1.31.1 patchwork_1.1.1 pbkrtest_0.5.1
pillar_1.7.0 pkgconfig_2.0.3 pkgload_1.2.4
plyr_1.8.7 png_0.1-7 polspline_1.1.19
praise_1.0.0 prettyunits_1.1.1 pROC_1.18.0
processx_3.5.3 progress_1.2.2 ps_1.7.0
purrr_0.3.4 quantreg_5.88 R6_2.5.1
rappdirs_0.3.3 RColorBrewer_1.1-3 Rcpp_1.0.8.3
RcppEigen_0.3.3.9.2 readr_2.1.2 readxl_1.4.0
rematch_1.0.1 rematch2_2.1.2 reprex_2.0.1
reshape_0.8.9 rlang_1.0.2 rmarkdown_2.14
rmdformats_1.0.3 rms_6.3-0 rpart_4.1.16
rprojroot_2.0.3 rsample_0.1.1 rstudioapi_0.13
rvest_1.0.2 sandwich_3.0-1 sass_0.4.1
scales_1.2.0 selectr_0.4.2 slider_0.2.2
snakecase_0.11.0 sp_1.4.7 SparseM_1.81
splines_4.1.2 stats_4.1.2 stringi_1.7.6
stringr_1.4.0 survival_3.3-1 sys_3.4
testthat_3.1.3 TH.data_1.1-0 tibble_3.1.6
tidyr_1.2.0 tidyselect_1.1.2 tidyverse_1.3.1
tinytex_0.38 tools_4.1.2 tzdb_0.3.0
UpSetR_1.4.0 utf8_1.2.2 utils_4.1.2
uuid_1.1.0 vcd_1.4-9 vctrs_0.4.1
viridis_0.6.2 viridisLite_0.4.0 visdat_0.5.3
vroom_1.5.7 waldo_0.4.0 warp_0.2.0
withr_2.5.0 xfun_0.30 xml2_1.3.3
yaml_2.3.5 yardstick_0.0.9 zoo_1.8-10