Factors Influencing Calories and Whipped Cream in Starbucks Beverages
R Packages and Setup
I am loading the R packages I will use in this project, including the caret package, which I use to make a more detailed summary from my confusion matrices.
library(knitr)
library(rmdformats)
library(here)
library(janitor)
library(magrittr)
library(rms)
library(broom)
library(naniar)
library(car)
library(GGally)
library(caret)
library(ROCR)
library(patchwork)
library(ggridges)
library(tidyverse)
theme_set(theme_bw())
1 Data Source
https://github.com/PythonCoderUnicorn/starbucks
This data set is from the Tidy Tuesday archive. It was assembled from the PDF Starbucks Coffee Company Beverage Nutrition Information by a GitHub user named PythonCoderUnicorn.
2 The Subjects
The subjects are Starbucks beverages, of which there are 1147 unique beverages. In the raw data, each beverage is identified by name, and further by row number because there are some duplicate names. The raw data set includes information on the beverage size, milk type, addition of whipped cream, serving size in mL, number of calories, total fat in grams, saturated fat in grams, trans fat in grams, cholesterol in milligrams, sodium in milligrams, total carbohydrates in grams, fiber in grams, sugar in grams, and caffeine in milligrams. I plan to include values for beverage name, beverage size, addition of whipped cream, sugar in grams, caffeine in milligrams, milk type, and number of calories in my tidy tibble.
3 Loading and Tidying the Data
3.1 Loading the Raw Data
I will ingest my data.
starbucks_raw <- read_csv(here("data", "starbucks-nutrition.csv")) %>%
    mutate(id = row_number()) %>%
clean_names()
starbucks_raw
# A tibble: 1,147 × 16
product_name size milk whip serv_size_m_l calories total_fat_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 brewed coffee - dark ro… short 0 0 236 3 0.1
2 brewed coffee - dark ro… tall 0 0 354 4 0.1
3 brewed coffee - dark ro… gran… 0 0 473 5 0.1
4 brewed coffee - dark ro… venti 0 0 591 5 0.1
5 brewed coffee - decaf p… short 0 0 236 3 0.1
6 brewed coffee - decaf p… tall 0 0 354 4 0.1
7 brewed coffee - decaf p… gran… 0 0 473 5 0.1
8 brewed coffee - decaf p… venti 0 0 591 5 0.1
9 brewed coffee - medium … short 0 0 236 3 0.1
10 brewed coffee - medium … tall 0 0 354 4 0.1
# … with 1,137 more rows, and 9 more variables: saturated_fat_g <dbl>,
# trans_fat_g <chr>, cholesterol_mg <dbl>, sodium_mg <dbl>,
# total_carbs_g <dbl>, fiber_g <chr>, sugar_g <dbl>, caffeine_mg <dbl>,
# id <int>
3.2 Cleaning the Data
I will check the variable types and check for implausible values in my variables of interest.
starbucks_raw %>%
    select(product_name, size, milk, whip, calories, sugar_g, caffeine_mg, id) %>%
summary()
product_name size milk whip
Length:1147 Length:1147 Min. :0.000 Min. :0.0000
Class :character Class :character 1st Qu.:1.000 1st Qu.:0.0000
Mode :character Mode :character Median :2.000 Median :0.0000
Mean :2.513 Mean :0.2467
3rd Qu.:4.000 3rd Qu.:0.0000
Max. :5.000 Max. :1.0000
calories sugar_g caffeine_mg id
Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 1.0
1st Qu.:130.0 1st Qu.:18.00 1st Qu.: 30.00 1st Qu.: 287.5
Median :220.0 Median :34.00 Median : 75.00 Median : 574.0
Mean :228.4 Mean :34.99 Mean : 91.86 Mean : 574.0
3rd Qu.:320.0 3rd Qu.:49.00 3rd Qu.:150.00 3rd Qu.: 860.5
Max. :640.0 Max. :89.00 Max. :475.00 Max. :1147.0
I need to change the variable type for some of these, as well as recode and relevel them. There are no implausible values.
starbucks <- starbucks_raw %>%
    select(id, product_name, size, milk, whip, calories, sugar_g, caffeine_mg) %>%
mutate(id = as.character(id),
size = factor(size),
size = fct_relevel(size, "tall",
"grande",
"venti"),
milk = factor(milk),
milk = fct_recode(milk, "no_milk" = "0",
"nonfat" = "1",
"2_pct" = "2",
"soy" = "3",
"coconut" = "4",
"whole" = "5"),
milk = fct_relevel(milk, "no_milk",
"nonfat",
"2_pct",
"soy",
"coconut",
"whole"),
whip = factor(whip),
whip = fct_recode(whip, "whip" = "1",
"no_whip" = "0"),
whip = fct_relevel(whip, "whip",
"no_whip"),
sug = sugar_g,
caff = caffeine_mg,
name = product_name,
cal = calories) %>%
select(-product_name, -calories, -sugar_g, -caffeine_mg) %>%
droplevels()
summary(starbucks)
id size milk whip sug
Length:1147 grande :334 no_milk:165 whip :283 Min. : 0.00
Class :character venti :320 nonfat :222 no_whip:864 1st Qu.:18.00
Mode :character tall :318 2_pct :190 Median :34.00
short :123 soy :190 Mean :34.99
trenta : 21 coconut:190 3rd Qu.:49.00
doppio : 7 whole :190 Max. :89.00
(Other): 24
caff name cal
Min. : 0.00 Length:1147 Min. : 0.0
1st Qu.: 30.00 Class :character 1st Qu.:130.0
Median : 75.00 Mode :character Median :220.0
Mean : 91.86 Mean :228.4
3rd Qu.:150.00 3rd Qu.:320.0
Max. :475.00 Max. :640.0
This looks better. `size`, `milk`, and `whip` are factor variables, and `id` and `name` are characters.
I need to collapse the categories for `size`.
summary(starbucks$size) %>%
kable()
size | n |
---|---|
tall | 318 |
grande | 334 |
venti | 320 |
1 scoop | 2 |
1 shot | 1 |
doppio | 7 |
quad | 7 |
short | 123 |
solo | 7 |
trenta | 21 |
triple | 7 |
I will combine all the categories except tall, grande, and venti into a category called ‘other’ because the current categories are very small relative to the primary three.
starbucks <- starbucks %>%
    mutate(size = fct_recode(size, "other" = "short",
"other" = "1 scoop",
"other" = "1 shot",
"other" = "doppio",
"other" = "quad",
"other" = "solo",
"other" = "trenta",
"other" = "triple")) %>%
droplevels()
summary(starbucks$size) %>%
kable()
size | n |
---|---|
tall | 318 |
grande | 334 |
venti | 320 |
other | 175 |
This looks better.
3.3 Missingness Check
I will check for missing values in my data set.
miss_var_summary(starbucks)
# A tibble: 8 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 id 0 0
2 size 0 0
3 milk 0 0
4 whip 0 0
5 sug 0 0
6 caff 0 0
7 name 0 0
8 cal 0 0
I have no missing values.
4 The Tidy Tibble
4.1 Listing the Tibble
I will list my tibble.
starbucks
# A tibble: 1,147 × 8
id size milk whip sug caff name cal
<chr> <fct> <fct> <fct> <dbl> <dbl> <chr> <dbl>
1 1 other no_milk no_whip 0 130 brewed coffee - dark roast 3
2 2 tall no_milk no_whip 0 193 brewed coffee - dark roast 4
3 3 grande no_milk no_whip 0 260 brewed coffee - dark roast 5
4 4 venti no_milk no_whip 0 340 brewed coffee - dark roast 5
5 5 other no_milk no_whip 0 15 brewed coffee - decaf pike pl… 3
6 6 tall no_milk no_whip 0 20 brewed coffee - decaf pike pl… 4
7 7 grande no_milk no_whip 0 25 brewed coffee - decaf pike pl… 5
8 8 venti no_milk no_whip 0 30 brewed coffee - decaf pike pl… 5
9 9 other no_milk no_whip 0 155 brewed coffee - medium roast 3
10 10 tall no_milk no_whip 0 235 brewed coffee - medium roast 4
# … with 1,137 more rows
4.2 Size and Identifiers
My tibble has 1147 rows and 8 columns. The name of the variable that provides the identifier for each row is id
, which is a character variable. I will demonstrate that each row has a unique value for that variable here.
n_distinct(starbucks$id)
[1] 1147
4.3 Saving the R data set
I will save the data set as an .Rds file.
# saveRDS(starbucks, file = here("data", "starbucks"))
5 The Code Book
5.1 Defining the Variables
Variable | Role | Type | Description |
---|---|---|---|
id | identifier | - | Character code for subjects |
name | identifier | - | Name of the beverage |
size | input | 4-cat | Size of the beverage: tall, grande, venti, or other (short, 1 scoop, 1 shot, doppio, quad, solo, trenta, or triple) |
milk | input | 6-cat | Type of milk: none (no_milk), nonfat, 2% (2_pct), soy, coconut, or whole |
whip | outcome | 2-cat | Beverage has whipped cream? Yes (whip) or No (no_whip) |
cal | outcome | quant | Number of calories in the beverage |
sug | input | quant | Grams of sugar in the beverage |
caff | input | quant | Milligrams of caffeine in the beverage |
5.2 Numerical Description
Here is a numerical description of my tibble.
Hmisc::describe(starbucks)
starbucks
8 Variables 1147 Observations
--------------------------------------------------------------------------------
id
n missing distinct
1147 0 1147
lowest : 1 10 100 1000 1001, highest: 995 996 997 998 999
--------------------------------------------------------------------------------
size
n missing distinct
1147 0 4
Value tall grande venti other
Frequency 318 334 320 175
Proportion 0.277 0.291 0.279 0.153
--------------------------------------------------------------------------------
milk
n missing distinct
1147 0 6
lowest : no_milk nonfat 2_pct soy coconut
highest: nonfat 2_pct soy coconut whole
Value no_milk nonfat 2_pct soy coconut whole
Frequency 165 222 190 190 190 190
Proportion 0.144 0.194 0.166 0.166 0.166 0.166
--------------------------------------------------------------------------------
whip
n missing distinct
1147 0 2
Value whip no_whip
Frequency 283 864
Proportion 0.247 0.753
--------------------------------------------------------------------------------
sug
n missing distinct Info Mean Gmd .05 .10
1147 0 89 0.999 34.99 25.6 0 4
.25 .50 .75 .90 .95
18 34 49 67 76
lowest : 0 1 2 3 4, highest: 85 86 87 88 89
--------------------------------------------------------------------------------
caff
n missing distinct Info Mean Gmd .05 .10
1147 0 60 0.994 91.86 84 0 0
.25 .50 .75 .90 .95
30 75 150 190 225
lowest : 0 10 15 20 25, highest: 410 425 445 470 475
--------------------------------------------------------------------------------
name
n missing distinct
1147 0 93
lowest : Blended Strawberry Lemonade brewed coffee - dark roast brewed coffee - decaf pike place roast brewed coffee - medium roast brewed coffee - True North Blend Blonde roast
highest: Vanilla Sweet Cream Cold Brew Very Berry Hibiscus Starbucks Refreshers White Chocolate Mocha White Hot Chocolate Youthberry Brewed Tea
--------------------------------------------------------------------------------
cal
n missing distinct Info Mean Gmd .05 .10
1147 0 71 0.999 228.4 156.3 5 50
.25 .50 .75 .90 .95
130 220 320 420 470
lowest : 0 2 3 4 5, highest: 590 600 620 630 640
--------------------------------------------------------------------------------
6 Linear Regression Plans
6.1 Question
The question I hope to answer with the proposed linear model is:
How effectively does milk type, amount of sugar, amount of caffeine, and beverage size predict the number of calories in a Starbucks beverage?
6.2 My Quantitative Outcome
My outcome is `cal`. I am interested in this because I enjoy Starbucks and tend to watch my caloric intake.
I will display a count of the number of rows in `starbucks` with complete information on `cal`.
starbucks %>%
    filter(complete.cases(.)) %>%
select(cal) %>%
nrow()
[1] 1147
Here is a histogram of `cal`.
ggplot(starbucks, aes(x = cal)) +
geom_histogram(aes(y = stat(density)),
bins = 15,
col = "white",
fill = "darkgreen") +
stat_function(fun = dnorm,
args = list(mean = mean(starbucks$cal),
sd = sd(starbucks$cal)),
col = "tan3", lwd = 1.5) +
labs(title = "Distribution of `cal` with minor right skew",
x = "# of calories",
y = "subjects (density)",
caption = "1147 beverages in `starbucks`")
While the distribution roughly approximates a normal distribution, there is a minor right skew. The outcome is fairly continuous, and I have displayed it in 15 bins. I considered using the natural log, but it produced a left skew much more severe than the minor right skew in this untransformed outcome.
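For reference, this is a minimal sketch of the kind of check behind that statement (my own code, not from the original analysis; since `cal` can be 0, I use log(cal + 1)):
ggplot(starbucks, aes(x = log(cal + 1))) +
    geom_histogram(bins = 15, col = "white", fill = "darkgreen") +
    labs(title = "log(cal + 1) shows a pronounced left skew",
         x = "log(calories + 1)", y = "count")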
Here I will show that there are at least 10 distinct, ordered values.
n_distinct(starbucks$cal)
[1] 71
mosaic::favstats(starbucks$cal) %>%
    kable()
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
0 | 130 | 220 | 320 | 640 | 228.3932 | 137.6709 | 1147 | 0 |
There are 71 distinct values, ranging from 0 to 640 calories, with a mean of 228 calories.
6.3 My Planned Predictors (Linear Model)
The predictors I will use are `size`, `sug`, `caff`, and `milk`.
I will demonstrate that I have a quantitative input, `sug`, with at least 10 different, ordered, observed values.
n_distinct(starbucks$sug)
[1] 89
mosaic::favstats(starbucks$sug) %>%
    kable()
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
0 | 18 | 34 | 49 | 89 | 34.99477 | 22.45884 | 1147 | 0 |
`sug` has 89 distinct values ranging from 0 to 89 grams, with a mean of 35 grams.
I will demonstrate that I have a categorical input which has between 3 and 6 categories, that will be used as a factor, and that has at least 30 observations in each level of the factor.
n_distinct(starbucks$milk)
[1] 6
tabyl(starbucks$milk) %>%
kable(digits = 3)
starbucks$milk | n | percent |
---|---|---|
no_milk | 165 | 0.144 |
nonfat | 222 | 0.194 |
2_pct | 190 | 0.166 |
soy | 190 | 0.166 |
coconut | 190 | 0.166 |
whole | 190 | 0.166 |
glimpse(starbucks$milk)
Factor w/ 6 levels "no_milk","nonfat",..: 1 1 1 1 1 1 1 1 1 1 ...
`milk` has 6 categories with at least 30 observations in each. `2_pct`, `soy`, `coconut`, and `whole` each have 190 beverages, `nonfat` contains 222 beverages, and `no_milk` contains 165 beverages. The variable is a factor.
The number of candidate predictors should be no more than 4 + (N - 100)/100 regression inputs, rounding down, where N is the sample size. With N = 1147:
4 + (1147 - 100)/100 = 4 + 10.47 = 14.47
Rounding down gives a limit of 14 candidate predictors, so my 4 planned predictors are well within it.
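Here is the same arithmetic as a quick check in R (a sketch; `N` is my own name for the sample size):
# Candidate predictor limit, computed by hand
N <- 1147
floor(4 + (N - 100) / 100)   # 14, well above the 4 predictors I plan to use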
7 Logistic Regression Plans
7.1 Question
The question I hope to answer with the proposed logistic model is:
How effectively does beverage size, type of milk, and amount of sugar and caffeine predict whether a beverage has whipped cream?
7.2 My Binary Outcome
I will use my logistic regression model to explain `whip`. I am interested in examining this because I sometimes like whipped cream in my coffee and I am curious about how well these factors predict whether a beverage has whipped cream.
I will display a count of the number of rows with each of the two possible values of this outcome and a missingness check.
tabyl(starbucks$whip) %>%
kable(digits = 3)
starbucks$whip | n | percent |
---|---|---|
whip | 283 | 0.247 |
no_whip | 864 | 0.753 |
%>% select(whip) %>%
starbucks miss_var_summary() %>%
kable()
variable | n_miss | pct_miss |
---|---|---|
whip | 0 | 0 |
283 of the beverages, or 25%, have whipped cream and 864 beverages, or 75%, do not have whipped cream. I do not have any missing values.
7.3 My Planned Predictors (Logistic Model)
The predictors I will use in this model are `size`, `sug`, `caff`, and `milk`. In Section 6.3 I demonstrated that I have a quantitative input with at least 10 different, ordered, observed values and a categorical input with between 3 and 6 categories that is a factor and has at least 30 observations in each level. The sample size and number of predictors are the same.
8 Linear Regression Modeling
8.1 Missingness
My sample has no missing data.
8.2 Outcome Transformation
My quantitative outcome is fairly normally distributed. There is a minor right skew as shown in this histogram.
ggplot(starbucks, aes(x = cal)) +
geom_histogram(aes(y = stat(density)),
bins = 15,
col = "white",
fill = "darkgreen") +
stat_function(fun = dnorm,
args = list(mean = mean(starbucks$cal),
sd = sd(starbucks$cal)),
col = "tan3", lwd = 1.5) +
labs(title = "Distribution of `cal` has minor right skew",
x = "# of calories",
y = "subjects (density)",
caption = "1147 beverages from `starbucks`")
My outcome can be 0, so I will add 1 to every subject's `cal` value in order to do my transformation assessment.
starbucks %$%
    boxCox((cal + 1) ~ sug + caff + milk + size)
The estimated transformation parameter is very close to 1, so the untransformed outcome approximates a normal distribution well. I considered a logarithm transformation, but it caused a left skew more substantial than my current minor right skew, so I will not transform my outcome.
8.3 Scatterplot Matrix and Collinearity
I will create a scatterplot matrix to evaluate the associations of my predictors and outcome.
ggpairs(starbucks %>%
select(size, sug, caff, milk, cal))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It looks like `sug` is very closely correlated with `cal`, which makes sense because sugar is known to increase the calories of a beverage. `caff` does not have a strong relationship with `cal`, which also makes sense, as caffeine contains no calories. The relationship between the `milk` categories and `cal` shows that beverages with no milk have far fewer calories than the other categories. The `size` category 'other' has the fewest calories of all the categories, which makes sense because this category consists of smaller beverages such as an espresso shot.
I will look at each of my predictors' ability to predict `cal`.
p_sug <- ggplot(starbucks, aes(x = sug, y = cal)) +
geom_point(alpha = 0.3,
col = "forestgreen") +
labs(title = "Sugar",
x = "sugar (g)",
y = "# calories")
p_size <- ggplot(starbucks, aes(x = size, y = cal, fill = size)) +
geom_bar(stat = "identity") +
labs(title = "Size",
x = "size",
y = "# calories")
p_caff <- ggplot(starbucks, aes(x = caff, y = cal)) +
geom_point(alpha = 0.6,
col = "goldenrod") +
labs(title = "Caffeine",
x = "caffeine (mg)",
y = "# calories")
p_milk <- ggplot(starbucks, aes(x = milk, y = cal, fill = milk)) +
geom_bar(stat = "identity") +
labs(title = "Milk",
x = "milk",
y = "# calories")
(p_sug + p_size) / (p_caff + p_milk) +
    plot_annotation(title = "Sugar is Most Predictive for Number of Calories",
subtitle = "",
caption = "1147 beverages from `starbucks`")
This plot confirms what I saw in my scatterplot matrix, in a more readable format. Sugar in grams is highly correlated with calories. Caffeine does not have a linear association with calories. The number of calories increases with the size of the beverage. Milk type shows little association, but beverages with any milk do look different from beverages with none.
I will look at variance inflation factors for collinearity.
vif(lm(cal ~ size + sug + caff + milk, data = starbucks)) %>%
kable()
term | GVIF | Df | GVIF^(1/(2*Df)) |
---|---|---|---|
size | 1.745631 | 3 | 1.097300 |
sug | 1.851949 | 1 | 1.360864 |
caff | 1.248897 | 1 | 1.117541 |
milk | 1.213989 | 5 | 1.019580 |
All of the variance inflation factors are below 2, so I am not worried about collinearity between predictors.
8.4 Model A
8.4.1 Main Effects Linear Model
I will fit my main effects linear regression model using `lm`.
modA_lm <- lm(cal ~ size + sug + caff + milk,
              data = starbucks)
This is a tidied table of regression coefficients.
tidy(modA_lm, conf.int = TRUE, conf.level = 0.90) %>%
select(term, estimate, conf.low, conf.high, std.error, p.value) %>%
kable(digits = c(0,2,2,2,2,3))
term | estimate | conf.low | conf.high | std.error | p.value |
---|---|---|---|---|---|
(Intercept) | -15.36 | -24.02 | -6.70 | 5.26 | 0.004 |
sizegrande | 14.58 | 7.50 | 21.66 | 4.30 | 0.001 |
sizeventi | 24.41 | 16.12 | 32.70 | 5.04 | 0.000 |
sizeother | -6.42 | -14.69 | 1.85 | 5.03 | 0.202 |
sug | 4.87 | 4.72 | 5.03 | 0.09 | 0.000 |
caff | 0.05 | 0.02 | 0.09 | 0.02 | 0.020 |
milknonfat | 39.94 | 30.53 | 49.34 | 5.71 | 0.000 |
milk2_pct | 74.89 | 65.10 | 84.68 | 5.95 | 0.000 |
milksoy | 76.31 | 66.75 | 85.88 | 5.81 | 0.000 |
milkcoconut | 58.81 | 49.15 | 68.47 | 5.87 | 0.000 |
milkwhole | 96.00 | 86.20 | 105.79 | 5.95 | 0.000 |
These are key summary statistics.
glance(modA_lm) %>%
select(r.squared, adj.r.squared, AIC, BIC) %>%
kable(digits = c(4,4,0,0))
r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.8578 | 0.8565 | 12339 | 12399 |
Model A accounts for about 85.78% of variation in the outcome.
I will develop and display a validated R-square statistic using the `validate` function on an `ols` fit. First I will have to create an `ols` model.
d <- datadist(starbucks)
options(datadist = "d")

modA_ols <- ols(cal ~ sug + size + caff + milk,
                data = starbucks, x = TRUE, y = TRUE)
set.seed(2022)
validate(modA_ols)
index.orig training test optimism index.corrected n
R-square 0.8578 0.8585 0.8566 0.0019 0.8559 40
MSE 2693.0025 2655.2618 2715.8641 -60.6023 2753.6048 40
g 145.1467 144.4762 145.0659 -0.5896 145.7363 40
Intercept 0.0000 0.0000 -0.0310 0.0310 -0.0310 40
Slope 1.0000 1.0000 1.0015 -0.0015 1.0015 40
The estimated R-squared across n = 40 training samples was 0.8585, and was similar in the resulting test samples at 0.8566, suggesting an optimism of 0.0019. The model's R-squared, corrected for overfitting, is 0.8559.
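The correction itself is simple arithmetic: the corrected index is the original index minus the optimism. A quick hand check using the values above:
# Optimism correction by hand, from the validate() output above
0.8578 - 0.0019   # = 0.8559, the corrected R-square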
8.4.2 Residual Diagnostics: Main Effect Model
I will display the 4 key diagnostic plots of residuals.
par(mfrow=c(2,2))
plot(modA_lm)
The first plot is the Residuals vs. Fitted values plot which checks for non-linearity and non-constant variance. It is not the ideal ‘fuzzy football’ shape. There is a slight fan shape but the line is fairly straight. It’s possible that linearity is an issue here. The second plot is the Normal Q-Q plot which checks for non-normality. This shows our data are quite normal, with no points much farther than + 3 or - 2 standard deviations. The third is the Scale-Location plot which checks for non-constant variance. There is a general trend upwards, shown by the slope of the line. There may be an issue with non-constant variance in my model. The fourth is the Residuals vs. Leverage plot which checks for highly leveraged or influential points. There are no points outside the Cook’s distance contours here, so there are no highly influential points, and there are no highly leveraged points.
Overall, I feel comfortable assuming normality and I feel reassured that there are no highly influential points. However, I have some concerns about linearity and constant variance.
8.5 Non-Linearity
I will assess whether a nonlinear term might improve my model by starting with a Spearman rho-squared plot.
sp <- spearman2(cal ~ size + sug + caff + milk,
                data = starbucks)
plot(sp)
The Spearman rho-squared plot suggests the strongest predictor is `sug`. Since this is a quantitative variable, I will spend my degrees of freedom on a restricted cubic spline with 5 knots for `sug`, which adds 3 degrees of freedom beyond the linear term. I want to use about 6 degrees of freedom beyond the main effects to account for non-linearity, per the project requirements, so I will also create an interaction term between `size` and `caff`, which adds another 3 degrees of freedom. I chose `size` for the interaction because it has the second-largest rho-squared. I do not want to create an interaction with `milk`, as that would add far too many degrees of freedom (15 df). `caff` has a low rho-squared, but I need a variable that contributes 1 degree of freedom and I am already using `sug` in a nonlinear term, so I will use `caff` in my interaction term. The df bookkeeping is sketched below.
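For bookkeeping, here is that degrees-of-freedom arithmetic (a restricted cubic spline with k knots uses k - 1 degrees of freedom, and the interaction of a quantitative term with a 4-level factor uses 3):
# Extra df beyond the main effects (a hand check)
(5 - 1 - 1) + 3   # rcs(sug, 5) adds 3 df; size:caff adds 3 df; total = 6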
8.6 Model B
8.6.1 Augmented Linear Model
I will fit my augmented model with `ols`, using the nonlinear terms I discussed: a restricted cubic spline with 5 knots for `sug` and an interaction term between `size` and `caff`. I described my reasoning for these terms in the section above. I will present a table with effect sizes.
modB_ols <- ols(cal ~ rcs(sug, 5) + size * caff + milk,
                data = starbucks, x = TRUE, y = TRUE)
summary(modB_ols) %>%
kable(digits = c(2))
term | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 | Type |
---|---|---|---|---|---|---|---|---|
sug | 18 | 49 | 31 | 155.79 | 4.95 | 146.07 | 165.51 | 1 |
caff | 30 | 150 | 120 | 8.40 | 4.83 | -1.08 | 17.88 | 1 |
size - tall:grande | 2 | 1 | NA | -14.49 | 4.45 | -23.21 | -5.76 | 1 |
size - venti:grande | 2 | 3 | NA | 9.84 | 4.80 | 0.42 | 19.26 | 1 |
size - other:grande | 2 | 4 | NA | -17.61 | 5.50 | -28.41 | -6.81 | 1 |
milk - no_milk:nonfat | 2 | 1 | NA | -39.21 | 6.36 | -51.69 | -26.72 | 1 |
milk - 2_pct:nonfat | 2 | 3 | NA | 34.87 | 5.15 | 24.76 | 44.98 | 1 |
milk - soy:nonfat | 2 | 4 | NA | 36.44 | 5.16 | 26.32 | 46.56 | 1 |
milk - coconut:nonfat | 2 | 5 | NA | 18.87 | 5.15 | 8.76 | 28.97 | 1 |
milk - whole:nonfat | 2 | 6 | NA | 55.97 | 5.15 | 45.86 | 66.09 | 1 |
This is a plot of the effects of this model.
plot(summary(modB_ols))
This is a nomogram of this model.
plot(nomogram(modB_ols))
Per the table of effect sizes, the effects plot, and the nomogram above, I conclude that the amount of sugar has a much greater impact on the number of calories than the amount of caffeine does. Breaking caffeine's impact down by beverage size does not make much of a difference, and milk type also does not make much of a difference in the number of calories.
8.6.2 Model Comparison
Here is an ANOVA comparison of Models A (main effects) and B (augmented).
anova(modB_ols)
Analysis of Variance          Response: cal

Factor                                     d.f. Partial SS   MS          F      P
sug                                           4  7409886.678 1852471.669 682.68 <.0001
 Nonlinear                                    3     8838.133    2946.044   1.09 0.3541
size (Factor+Higher Order Factors)            6   105837.593   17639.599   6.50 <.0001
 All Interactions                             3    11710.417    3903.472   1.44 0.2299
caff (Factor+Higher Order Factors)            4    29977.029    7494.257   2.76 0.0265
 All Interactions                             3    11710.417    3903.472   1.44 0.2299
milk                                          5   765638.136  153127.627  56.43 <.0001
size * caff (Factor+Higher Order Factors)     3    11710.417    3903.472   1.44 0.2299
TOTAL NONLINEAR + INTERACTION                 6    22581.446    3763.574   1.39 0.2165
REGRESSION                                   16 18654155.290 1165884.706 429.66 <.0001
ERROR                                      1130  3066292.377    2713.533
The nonlinear terms don’t meet the standard for statistical detectability at our usual alpha levels, so it does not make a statistically detectable improvement in prediction. The calorie differences are still driven by amount of sugar and size, primarily.
8.6.3 Residual Diagnostics: Augmented Model
These are the 4 key diagnostic residual plots for the model. I have to fit model B with `lm` in order to create these plots with base R.
modB_lm <- lm(cal ~ rcs(sug, 5) + size * caff + milk, data = starbucks)
par(mfrow=c(2,2))
plot(modB_lm)
The first plot is the Residuals vs. Fitted values plot, which checks for non-linearity and non-constant variance. It is not the ideal 'fuzzy football' shape; there is a slight fan shape, but the line is fairly straight, so it is possible that linearity is an issue here, as it was in the Main Effects model. The second plot is the Normal Q-Q plot, which checks for non-normality. It shows our data are quite normal, with no points much farther than +3 or -2 standard deviations, as in the Main Effects model. The third is the Scale-Location plot, which checks for non-constant variance. There is a general upward trend in the line, so there may be an issue with non-constant variance in the augmented model, similar to the Main Effects model. The fourth is the Residuals vs. Leverage plot, which checks for highly leveraged or influential points. There are no points outside the Cook's distance contours, so there are no highly influential points. Point 80 has the highest leverage, greater than 0.08.
Overall, I feel comfortable assuming normality and I feel reassured that there are no highly influential points. However, I have some concerns about linearity and constant variance. I also want to identify point 80 since it has the highest leverage.
starbucks %>%
    filter(id == "80") %>%
select(id, name, size, milk, sug, cal) %>%
kable()
id | name | size | milk | sug | cal |
---|---|---|---|---|---|
80 | Vanilla Sweet Cream Cold Brew | other | no_milk | 26 | 210 |
The beverage with the highest leverage is a Vanilla Sweet Cream Cold Brew, size in the ‘other’ category, with no milk, 26 grams of sugar, and 210 calories.
8.7 Validating the Models
I will start by partitioning my data into training and test sets.
set.seed(2022)
split_samples <- starbucks$cal %>%
    createDataPartition(p = 0.7, list = FALSE)

starb_train <- starbucks[split_samples,]
starb_test <- starbucks[-split_samples,]
dim(starb_train)
[1] 804 8
dim(starb_test)
[1] 343 8
804 beverages, or 70%, are in the training subset and 343 beverages, or 30%, are in the test subset. My 8 columns remained in each subset.
8.7.1 Validation: Main Effects Model
I will fit the main effects model A in the training sample, then create a tidied summary of coefficients.
modA_lm_train <- lm(cal ~ size + sug + caff + milk,
                    data = starb_train)
tidy(modA_lm_train) %>%
kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -13.897 | 6.134 | -2.266 | 0.024 |
sizegrande | 11.820 | 5.077 | 2.328 | 0.020 |
sizeventi | 26.041 | 5.709 | 4.561 | 0.000 |
sizeother | -7.138 | 5.799 | -1.231 | 0.219 |
sug | 4.887 | 0.107 | 45.868 | 0.000 |
caff | 0.049 | 0.025 | 1.934 | 0.053 |
milknonfat | 35.427 | 6.557 | 5.403 | 0.000 |
milk2_pct | 72.061 | 6.860 | 10.504 | 0.000 |
milksoy | 75.139 | 6.662 | 11.279 | 0.000 |
milkcoconut | 56.347 | 6.763 | 8.331 | 0.000 |
milkwhole | 103.030 | 6.796 | 15.161 | 0.000 |
I will use coefficients from the main effects model in the training sample to obtain predicted values in the test sample and display 3 of these predictions.
modA_pred <- modA_lm_train %>%
    predict(starb_test)

modA_pred[1:3]
1 2 3
13.61833 -13.41668 102.90898
These are key summaries of fit quality for the main effects model in the test subset, based on its predicted values.
postResample(modA_pred, starb_test$cal)
RMSE Rsquared MAE
55.8912927 0.8312869 44.2853299
The root mean squared error, i.e. the prediction error between the observed outcome values and the predicted outcome values, is 55.89. The R-squared is 0.8313. The mean absolute error is 44.29.
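For reference, these summaries can also be computed by hand; this is a minimal sketch using the objects defined above (caret's default R-squared is the squared correlation):
# Hand computation of the postResample() summaries
sqrt(mean((starb_test$cal - modA_pred)^2))   # RMSE
cor(starb_test$cal, modA_pred)^2             # R-squared
mean(abs(starb_test$cal - modA_pred))        # MAE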
8.7.2 Validation: Augmented Model
I will obtain these quality-of-fit summaries for the augmented model in order to compare. I will use `validate` because this is an `ols` model.
set.seed(2022)
validate(modB_ols, method = "boot", B = 40)
index.orig training test optimism index.corrected n
R-square 0.8588 0.8601 0.8569 0.0032 0.8556 40
MSE 2673.3151 2624.7217 2709.9485 -85.2268 2758.5419 40
g 145.4735 144.7860 145.2518 -0.4659 145.9394 40
Intercept 0.0000 0.0000 0.2788 -0.2788 0.2788 40
Slope 1.0000 1.0000 1.0004 -0.0004 1.0004 40
The estimated R-squared across n = 40 training samples was 0.8601, and was similar in the resulting test samples at 0.8569, suggesting an optimism of 0.0032. The augmented model's R-squared, corrected for overfitting, is 0.8556. This optimism index is higher than the main effects model's, indicating this model may be overfitting more.
8.7.3 Validated Model Comparison
I will show the comparison of the validated models in a table.
starb_linear_summaries <- tibble(
    model = c("main_effects", "augmented"),
    RMSE = c(RMSE(modA_pred, starb_test$cal), sqrt(2755.8552)),
    R2 = c(R2(modA_pred, starb_test$cal), 0.8559))
starb_linear_summaries %>%
    kable(digits = 3)
model | RMSE | R2 |
---|---|---|
main_effects | 55.891 | 0.831 |
augmented | 52.496 | 0.856 |
Although they are very similar, the main effects model has higher mean prediction error than the augmented model, and the augmented model accounts for more variation in the outcome (85.6%) than the main effects model (83.1%).
8.8 Final Model
I prefer the main effects model A. This is a list of the point estimates and confidence intervals for the model fit in the entire data set.
summary(modA_ols, conf.type=c('simultaneous')) %>%
kable(digits = 2)
term | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 | Type |
---|---|---|---|---|---|---|---|---|
sug | 18 | 49 | 31 | 151.07 | 2.89 | 144.62 | 157.51 | 1 |
caff | 30 | 150 | 120 | 6.17 | 2.64 | 0.28 | 12.07 | 1 |
size - tall:grande | 2 | 1 | NA | -14.58 | 4.30 | -23.02 | -6.14 | 1 |
size - venti:grande | 2 | 3 | NA | 9.83 | 4.36 | 1.27 | 18.39 | 1 |
size - other:grande | 2 | 4 | NA | -21.00 | 5.25 | -31.31 | -10.69 | 1 |
milk - no_milk:nonfat | 2 | 1 | NA | -39.94 | 5.71 | -51.14 | -28.73 | 1 |
milk - 2_pct:nonfat | 2 | 3 | NA | 34.95 | 5.16 | 24.83 | 45.07 | 1 |
milk - soy:nonfat | 2 | 4 | NA | 36.38 | 5.16 | 26.25 | 46.50 | 1 |
milk - coconut:nonfat | 2 | 5 | NA | 18.88 | 5.15 | 8.76 | 28.99 | 1 |
milk - whole:nonfat | 2 | 6 | NA | 56.06 | 5.16 | 45.94 | 66.18 | 1 |
These estimates are shown graphically in a plot of the effect sizes of the main effects model in the entire data set.
plot(summary(modA_ols))
The effect of `sug` on my outcome `cal` in my model is meaningful. If I have 2 beverages, Beverage A and Beverage B, that are the same size with the same amount of caffeine and the same type of milk, but Beverage A has 18 grams of sugar and Beverage B has 49 grams of sugar, then my main effects model predicts that Beverage B will have 151.07 more calories than Beverage A.
This is the validated R-squared for the main effects model.
set.seed(2022)
validate(modA_ols, method = "boot", B = 40)
index.orig training test optimism index.corrected n
R-square 0.8578 0.8585 0.8566 0.0019 0.8559 40
MSE 2693.0025 2655.2618 2715.8641 -60.6023 2753.6048 40
g 145.1467 144.4762 145.0659 -0.5896 145.7363 40
Intercept 0.0000 0.0000 -0.0310 0.0310 -0.0310 40
Slope 1.0000 1.0000 1.0015 -0.0015 1.0015 40
The estimated R-squared across n = 40 training samples was 0.8585, and was similar in the resulting test samples at 0.8566, suggesting an optimism of 0.0019. The main effects model's R-squared, corrected for overfitting, is 0.8559.
This is a nomogram for the main effects model.
plot(nomogram(modA_ols))
For a tall beverage with 50 grams of sugar, 50 mg of caffeine, and whole milk added, the total points on the nomogram would be 50 (sug) + 1 (size) + 1 (caff) + 22 (milk) = 74 total points, which corresponds to a prediction of about 325 calories.
I will calculate this more exactly and obtain a prediction interval.
new_sub <- tibble(size = "tall", sug = 50,
                  caff = 50, milk = "whole")
predict(modA_lm, newdata = new_sub,
interval = "prediction", level = 0.95)
fit lwr upr
1 326.86 224.1175 429.6025
The predicted calorie count for this beverage is 326.86, with a 95% prediction interval of 224.12 to 429.60 calories.
8.8.1 Conclusion
Although the augmented model has slightly better overall fit quality, the difference in RMSE and R-squared is very small. The augmented model had a highly leveraged point, while the main effects model did not. The adherence to assumptions as seen in residuals was similar between the 2 models. Overall, I do not think adding the nonlinear terms in the augmented model yields a worthwhile improvement. For these reasons, I prefer the main effects model.
9 Logistic Regression Modeling
9.1 Missingness
I have no missing data.
9.1.1 Outcome Assessment
I will check my outcome against each categorical predictor to ensure all possible categories have at least 1 value.
starbucks %>%
    tabyl(whip, milk) %>%
kable()
whip | no_milk | nonfat | 2_pct | soy | coconut | whole |
---|---|---|---|---|---|---|
whip | 8 | 56 | 56 | 51 | 56 | 56 |
no_whip | 157 | 166 | 134 | 139 | 134 | 134 |
starbucks %>%
    tabyl(whip, size) %>%
kable()
whip | tall | grande | venti | other |
---|---|---|---|---|
whip | 84 | 86 | 86 | 27 |
no_whip | 234 | 248 | 234 | 148 |
I have no categories in either of my categorical predictors with 0 values.
9.2 Model Y
9.2.1 Main Effects Logistic Regression Model
I will create my main effects logistic regression model using `lrm` to predict which beverages have whipped cream.
modY_lrm <- lrm((whip == "whip") ~ size + sug + caff + milk,
                data = starbucks, x = TRUE, y = TRUE)
modY_lrm
Logistic Regression Model
lrm(formula = (whip == "whip") ~ size + sug + caff + milk, data = starbucks,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1147 LR chi2 258.19 R2 0.300 C 0.801
FALSE 864 d.f. 10 g 1.782 Dxy 0.602
TRUE 283 Pr(> chi2) <0.0001 gr 5.944 gamma 0.603
max |deriv| 2e-08 gp 0.222 tau-a 0.224
Brier 0.147
Coef S.E. Wald Z Pr(>|Z|)
Intercept -4.5105 0.4986 -9.05 <0.0001
size=grande -0.8679 0.2172 -4.00 <0.0001
size=venti -1.8521 0.2872 -6.45 <0.0001
size=other 0.3195 0.2760 1.16 0.2470
sug 0.0677 0.0058 11.67 <0.0001
caff -0.0012 0.0013 -0.88 0.3769
milk=nonfat 1.3971 0.4520 3.09 0.0020
milk=2_pct 1.5123 0.4530 3.34 0.0008
milk=soy 1.6532 0.4566 3.62 0.0003
milk=coconut 1.6869 0.4542 3.71 0.0002
milk=whole 1.5123 0.4530 3.34 0.0008
The Nagelkerke R-squared is 0.3. The area under the ROC curve is 0.801.
To make a tidied table of regression coefficients, I will fit the model with `glm`.
modY_glm <- glm((whip == "whip") ~ size + sug + caff + milk,
                data = starbucks, family = binomial(link = "logit"))
tidy(modY_glm, conf.int = TRUE, conf.level = 0.90) %>%
select(term, estimate, conf.low, conf.high, std.error, p.value) %>%
kable(digits = c(0,2,2,2,2,3))
term | estimate | conf.low | conf.high | std.error | p.value |
---|---|---|---|---|---|
(Intercept) | -4.51 | -5.38 | -3.73 | 0.50 | 0.000 |
sizegrande | -0.87 | -1.23 | -0.51 | 0.22 | 0.000 |
sizeventi | -1.85 | -2.33 | -1.39 | 0.29 | 0.000 |
sizeother | 0.32 | -0.14 | 0.77 | 0.28 | 0.247 |
sug | 0.07 | 0.06 | 0.08 | 0.01 | 0.000 |
caff | 0.00 | 0.00 | 0.00 | 0.00 | 0.377 |
milknonfat | 1.40 | 0.69 | 2.18 | 0.45 | 0.002 |
milk2_pct | 1.51 | 0.80 | 2.30 | 0.45 | 0.001 |
milksoy | 1.65 | 0.94 | 2.45 | 0.46 | 0.000 |
milkcoconut | 1.69 | 0.97 | 2.48 | 0.45 | 0.000 |
milkwhole | 1.51 | 0.80 | 2.30 | 0.45 | 0.001 |
This is a plot of effect sizes of the main effects model.
plot(summary(modY_lrm))
This is a nomogram of the main effects model.
plot(nomogram(modY_lrm))
The above summaries and plots show that sugar has the greatest impact on predicting whether a beverage has whipped cream. The amount of caffeine is not useful for this prediction. A beverage with any type of milk appears more likely to have whipped cream than a beverage with no milk.
9.2.2 Confusion Matrix
In order to make a confusion matrix, I will use the `augment` function on the `glm` model to obtain fitted values.
modY_aug <- augment(modY_glm, starbucks, type.predict = "response")

modY_aug %$%
    confusionMatrix(
data = factor(.fitted >= 0.5),
reference = factor(whip == "whip"),
positive = "TRUE")
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 791 188
TRUE 73 95
Accuracy : 0.7724
95% CI : (0.7471, 0.7964)
No Information Rate : 0.7533
P-Value [Acc > NIR] : 0.06948
Kappa : 0.291
Mcnemar's Test P-Value : 1.708e-12
Sensitivity : 0.33569
Specificity : 0.91551
Pos Pred Value : 0.56548
Neg Pred Value : 0.80797
Prevalence : 0.24673
Detection Rate : 0.08282
Detection Prevalence : 0.14647
Balanced Accuracy : 0.62560
'Positive' Class : TRUE
The sensitivity of the main effects model Y is 0.33569: 33.57% of the beverages that actually had whipped cream were predicted by the model to have whipped cream. The specificity of model Y is 0.91551: 91.55% of the beverages without whipped cream were correctly predicted to have none. The positive predictive value of model Y is 0.56548: the predictions of whipped cream were correct 56.55% of the time.
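These rates follow directly from the cell counts in the confusion matrix above; here is a minimal hand check:
# Hand check of the rates, using the cell counts printed above
TP <- 95; FN <- 188; TN <- 791; FP <- 73
TP / (TP + FN)   # sensitivity = 0.33569
TN / (TN + FP)   # specificity = 0.91551
TP / (TP + FP)   # positive predictive value = 0.56548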
9.3 Non-Linearity
I will assess whether a nonlinear term might improve my model by starting with a Spearman rho-squared plot.
sp <- spearman2((whip == "whip") ~ size + sug + caff + milk,
                data = starbucks)
plot(sp)
The Spearman rho-squared plot suggests the strongest predictor is `sug`. Since this is a quantitative variable, I will spend my degrees of freedom on a restricted cubic spline with 5 knots, which will add 3 degrees of freedom.
9.4 Model Z
9.4.1 Augmented Logistic Regression Model
I will fit my augmented model with `lrm`, using the nonlinear term I discussed: a restricted cubic spline with 5 knots for `sug`. I described my reasoning for this term in the section above.
modZ_lrm <- lrm((whip == "whip") ~ rcs(sug, 5) + size + caff + milk,
                data = starbucks, x = TRUE, y = TRUE)
modZ_lrm
Logistic Regression Model
lrm(formula = (whip == "whip") ~ rcs(sug, 5) + size + caff +
milk, data = starbucks, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1147 LR chi2 271.76 R2 0.314 C 0.806
FALSE 864 d.f. 13 g 1.925 Dxy 0.611
TRUE 283 Pr(> chi2) <0.0001 gr 6.856 gamma 0.611
max |deriv| 4e-05 gp 0.226 tau-a 0.227
Brier 0.147
Coef S.E. Wald Z Pr(>|Z|)
Intercept -4.6971 0.7879 -5.96 <0.0001
sug -0.0732 0.0783 -0.93 0.3499
sug' 2.9307 1.2449 2.35 0.0186
sug'' -4.3537 1.8239 -2.39 0.0170
sug''' 2.0294 0.8774 2.31 0.0207
size=grande -0.8489 0.2246 -3.78 0.0002
size=venti -1.6380 0.2860 -5.73 <0.0001
size=other 0.6803 0.3170 2.15 0.0319
caff -0.0008 0.0013 -0.60 0.5516
milk=nonfat 1.5510 0.4682 3.31 0.0009
milk=2_pct 1.6585 0.4687 3.54 0.0004
milk=soy 1.8123 0.4736 3.83 0.0001
milk=coconut 1.8403 0.4709 3.91 <0.0001
milk=whole 1.6585 0.4687 3.54 0.0004
The Nagelkerke R-squared for the model is 0.314. The area under the ROC curve is 0.806.
To make a tidied table of regression coefficients, I will fit the model with `glm`.
modZ_glm <- glm((whip == "whip") ~ rcs(sug, 5) + size + caff + milk,
                data = starbucks, family = binomial(link = "logit"))
tidy(modZ_glm, conf.int = TRUE, conf.level = 0.90) %>%
select(term, estimate, conf.low, conf.high, std.error, p.value) %>%
kable(digits = c(0,2,2,2,2,3))
term | estimate | conf.low | conf.high | std.error | p.value |
---|---|---|---|---|---|
(Intercept) | -4.70 | -6.12 | -3.51 | 0.79 | 0.000 |
rcs(sug, 5)sug | -0.07 | -0.20 | 0.06 | 0.08 | 0.350 |
rcs(sug, 5)sug’ | 2.93 | 0.85 | 4.98 | 1.24 | 0.019 |
rcs(sug, 5)sug’’ | -4.35 | -7.37 | -1.32 | 1.82 | 0.017 |
rcs(sug, 5)sug’’’ | 2.03 | 0.60 | 3.49 | 0.88 | 0.021 |
sizegrande | -0.85 | -1.22 | -0.48 | 0.22 | 0.000 |
sizeventi | -1.64 | -2.12 | -1.17 | 0.29 | 0.000 |
sizeother | 0.68 | 0.16 | 1.21 | 0.32 | 0.032 |
caff | 0.00 | 0.00 | 0.00 | 0.00 | 0.552 |
milknonfat | 1.55 | 0.81 | 2.36 | 0.47 | 0.001 |
milk2_pct | 1.66 | 0.92 | 2.47 | 0.47 | 0.000 |
milksoy | 1.81 | 1.07 | 2.63 | 0.47 | 0.000 |
milkcoconut | 1.84 | 1.10 | 2.66 | 0.47 | 0.000 |
milkwhole | 1.66 | 0.92 | 2.47 | 0.47 | 0.000 |
This is a plot of the effect sizes for this model.
plot(summary(modZ_lrm))
This is a nomogram for this model.
plot(nomogram(modZ_lrm))
The above summaries and plots suggest that my augmented model is not much different from my main effects model at predicting whether a beverage has whipped cream. I will run an ANOVA to check.
9.4.2 Model Comparison
This is an ANOVA comparison of Model Z (augmented) to Model Y (main effects), which will show whether the addition of the nonlinear term led to a statistically detectable improvement in the model's predictions.
anova(modZ_lrm)
Wald Statistics Response: (whip == "whip")
Factor Chi-Square d.f. P
sug 134.10 4 <.0001
Nonlinear 12.17 3 0.0068
size 44.63 3 <.0001
caff 0.35 1 0.5516
milk 16.56 5 0.0054
TOTAL 152.63 13 <.0001
While the augmented model's nonlinear term may meet the standard for statistical detectability in predicting whether a beverage has whipped cream, its contribution is not nearly as strong as the main effect of sugar. Therefore, I think the main effects model is better.
9.4.3 Confusion Matrix
In order to make a confusion matrix, I will again use the `augment` function to obtain fitted values.
modZ_aug <- augment(modZ_glm, starbucks, type.predict = "response")

modZ_aug %$%
    confusionMatrix(
data = factor(.fitted >= 0.5),
reference = factor(whip == "whip"),
positive = "TRUE" )
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 793 191
TRUE 71 92
Accuracy : 0.7716
95% CI : (0.7462, 0.7956)
No Information Rate : 0.7533
P-Value [Acc > NIR] : 0.07927
Kappa : 0.2833
Mcnemar's Test P-Value : 1.955e-13
Sensitivity : 0.32509
Specificity : 0.91782
Pos Pred Value : 0.56442
Neg Pred Value : 0.80589
Prevalence : 0.24673
Detection Rate : 0.08021
Detection Prevalence : 0.14211
Balanced Accuracy : 0.62146
'Positive' Class : TRUE
The sensitivity of the augmented model Z is 0.32509: 32.51% of the beverages that actually had whipped cream were predicted to have whipped cream, compared to a sensitivity of 33.57% for the main effects model Y.
The specificity of the augmented model Z is 0.91782: 91.78% of the beverages without whipped cream were correctly predicted to have none, compared to a specificity of 91.55% for model Y.
The positive predictive value of the augmented model Z is 0.56442: the predictions of whipped cream were correct 56.44% of the time, compared to a positive predictive value of 56.55% for model Y.
9.5 Validating the Models
I will validate the Nagelkerke R-squared and the C statistic for the augmented model Z and compare them to those of the main effects model Y, using the `validate` function.
set.seed(2022)
validate(modY_lrm, method = "boot", B = 40)
index.orig training test optimism index.corrected n
Dxy 0.6024 0.6095 0.5931 0.0164 0.5859 40
R2 0.2996 0.3066 0.2905 0.0161 0.2835 40
Intercept 0.0000 0.0000 -0.0305 0.0305 -0.0305 40
Slope 1.0000 1.0000 0.9580 0.0420 0.9580 40
Emax 0.0000 0.0000 0.0147 0.0147 0.0147 40
D 0.2242 0.2300 0.2166 0.0134 0.2108 40
U -0.0017 -0.0017 0.0003 -0.0020 0.0003 40
Q 0.2260 0.2317 0.2163 0.0154 0.2106 40
B 0.1472 0.1453 0.1489 -0.0036 0.1508 40
g 1.7824 1.8144 1.7278 0.0866 1.6958 40
gp 0.2225 0.2243 0.2189 0.0054 0.2171 40
validate(modZ_lrm, method = "boot", B = 40)
index.orig training test optimism index.corrected n
Dxy 0.6112 0.6309 0.5994 0.0315 0.5798 40
R2 0.3135 0.3355 0.3021 0.0334 0.2801 40
Intercept 0.0000 0.0000 -0.0748 0.0748 -0.0748 40
Slope 1.0000 1.0000 0.9097 0.0903 0.9097 40
Emax 0.0000 0.0000 0.0342 0.0342 0.0342 40
D 0.2361 0.2556 0.2264 0.0292 0.2068 40
U -0.0017 -0.0017 0.0013 -0.0031 0.0013 40
Q 0.2378 0.2573 0.2250 0.0323 0.2055 40
B 0.1467 0.1435 0.1488 -0.0053 0.1519 40
g 1.9251 2.0877 1.8704 0.2173 1.7078 40
gp 0.2263 0.2347 0.2220 0.0127 0.2136 40
9.5.1 Validated Models Comparison
I will show the comparison of the validated models with key fit measures in a table.
starb_logistic_summaries <- tibble(
    model = c("main_effects", "augmented"),
    R2 = c(0.2835, 0.2801),
    C = c((0.5 + (0.5859/2)), (0.5 + (0.5798/2))))
starb_logistic_summaries %>%
    kable(digits = 4)
model | R2 | C |
---|---|---|
main_effects | 0.2835 | 0.7930 |
augmented | 0.2801 | 0.7899 |
Although they are very similar, the main effects model accounts for more variation in the outcome (28.35%) than the augmented model (28.01%), and the main effects model has a larger area under the ROC curve (0.793 compared to 0.790).
9.6 Final Model
I prefer the main effects model Y. Here is a list of the odds ratios and confidence intervals for the model fit in the entire data set. Since a binary logistic regression model predicts the log odds of whipped cream (whip == “whip”), I will exponentiate to interpret the odds ratios and confidence intervals.
tidy(modY_glm, exponentiate = TRUE, conf.int = TRUE) %>%
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.011 | 0.499 | -9.047 | 0.000 | 0.004 | 0.027 |
sizegrande | 0.420 | 0.217 | -3.996 | 0.000 | 0.273 | 0.641 |
sizeventi | 0.157 | 0.287 | -6.450 | 0.000 | 0.088 | 0.273 |
sizeother | 1.376 | 0.276 | 1.158 | 0.247 | 0.794 | 2.349 |
sug | 1.070 | 0.006 | 11.672 | 0.000 | 1.058 | 1.083 |
caff | 0.999 | 0.001 | -0.884 | 0.377 | 0.996 | 1.001 |
milknonfat | 4.043 | 0.452 | 3.091 | 0.002 | 1.747 | 10.437 |
milk2_pct | 4.537 | 0.453 | 3.339 | 0.001 | 1.955 | 11.726 |
milksoy | 5.224 | 0.457 | 3.621 | 0.000 | 2.237 | 13.601 |
milkcoconut | 5.403 | 0.454 | 3.714 | 0.000 | 2.325 | 14.005 |
milkwhole | 4.537 | 0.453 | 3.339 | 0.001 | 1.955 | 11.726 |
I can also run a summary of the `lrm` model to see more detail.
summary(modY_lrm) %>%
kable(digits = 3)
term | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 | Type |
---|---|---|---|---|---|---|---|---|
sug | 18 | 49 | 31 | 2.100 | 0.180 | 1.747 | 2.452 | 1 |
Odds Ratio | 18 | 49 | 31 | 8.164 | NA | 5.738 | 11.616 | 2 |
caff | 30 | 150 | 120 | -0.141 | 0.160 | -0.454 | 0.172 | 1 |
Odds Ratio | 30 | 150 | 120 | 0.868 | NA | 0.635 | 1.188 | 2 |
size - tall:grande | 2 | 1 | NA | 0.868 | 0.217 | 0.442 | 1.294 | 1 |
Odds Ratio | 2 | 1 | NA | 2.382 | NA | 1.556 | 3.646 | 2 |
size - venti:grande | 2 | 3 | NA | -0.984 | 0.233 | -1.442 | -0.527 | 1 |
Odds Ratio | 2 | 3 | NA | 0.374 | NA | 0.237 | 0.590 | 2 |
size - other:grande | 2 | 4 | NA | 1.187 | 0.309 | 0.582 | 1.792 | 1 |
Odds Ratio | 2 | 4 | NA | 3.278 | NA | 1.790 | 6.003 | 2 |
milk - no_milk:nonfat | 2 | 1 | NA | -1.397 | 0.452 | -2.283 | -0.511 | 1 |
Odds Ratio | 2 | 1 | NA | 0.247 | NA | 0.102 | 0.600 | 2 |
milk - 2_pct:nonfat | 2 | 3 | NA | 0.115 | 0.245 | -0.364 | 0.595 | 1 |
Odds Ratio | 2 | 3 | NA | 1.122 | NA | 0.695 | 1.813 | 2 |
milk - soy:nonfat | 2 | 4 | NA | 0.256 | 0.249 | -0.232 | 0.744 | 1 |
Odds Ratio | 2 | 4 | NA | 1.292 | NA | 0.793 | 2.104 | 2 |
milk - coconut:nonfat | 2 | 5 | NA | 0.290 | 0.245 | -0.191 | 0.771 | 1 |
Odds Ratio | 2 | 5 | NA | 1.336 | NA | 0.826 | 2.161 | 2 |
milk - whole:nonfat | 2 | 6 | NA | 0.115 | 0.245 | -0.364 | 0.595 | 1 |
Odds Ratio | 2 | 6 | NA | 1.122 | NA | 0.695 | 1.813 | 2 |
This is a graphical display of the effect sizes of the main effects model in the entire data set.
plot(summary(modY_lrm))
The effect of `sug` on my outcome `whip` in my model is meaningful, as shown by the odds ratio interval being entirely greater than 1. If I have 2 beverages, Beverage A and Beverage B, that are the same size with the same amount of caffeine and the same type of milk, but Beverage A has 18 grams of sugar and Beverage B has 49 grams of sugar, then my main effects model predicts that Beverage B's odds of having whipped cream will be 8.164 times Beverage A's odds. Beverage B's odds are 816.4% as large as Beverage A's.
This is a plot of the ROC curve for the main effects model in the entire data set.
prob <- predict(modY_glm, starbucks, type = "response")
pred <- prediction(prob, starbucks$whip)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure = "auc")

auc <- round(auc@y.values[[1]], 3)
roc.data <- data.frame(fpr = unlist(perf@x.values),
                       tpr = unlist(perf@y.values),
                       model = "LRM")
ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("Model Y: ROC Curve w/ AUC=", auc))
This plot shows that my model predicts whether a beverage has whipped cream better than random guessing (the dashed line).
This is the validated Nagelkerke R-square and C-statistic for the main effects model.
set.seed(2022)
validate(modY_lrm, method = "boot", B = 40)
index.orig training test optimism index.corrected n
Dxy 0.6024 0.6095 0.5931 0.0164 0.5859 40
R2 0.2996 0.3066 0.2905 0.0161 0.2835 40
Intercept 0.0000 0.0000 -0.0305 0.0305 -0.0305 40
Slope 1.0000 1.0000 0.9580 0.0420 0.9580 40
Emax 0.0000 0.0000 0.0147 0.0147 0.0147 40
D 0.2242 0.2300 0.2166 0.0134 0.2108 40
U -0.0017 -0.0017 0.0003 -0.0020 0.0003 40
Q 0.2260 0.2317 0.2163 0.0154 0.2106 40
B 0.1472 0.1453 0.1489 -0.0036 0.1508 40
g 1.7824 1.8144 1.7278 0.0866 1.6958 40
gp 0.2225 0.2243 0.2189 0.0054 0.2171 40
The optimism-corrected Nagelkerke R-squared for the main effects logistic regression model is 0.2835. The C statistic corrected for optimism is 0.5 + (0.5859/2) = 0.79295.
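The conversion uses the relationship between the C statistic and Somers' Dxy, C = 0.5 + Dxy/2:
# Converting the optimism-corrected Dxy to a C statistic
0.5 + 0.5859 / 2   # = 0.79295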
This is a nomogram for the main effects model.
plot(nomogram(modY_lrm))
There are two new subjects of interest, Beverage A and Beverage B, which are the same size and have the same type of milk and milligrams of caffeine, but Beverage A has 18 grams of sugar and Beverage B has 49 grams of sugar. Beverage B’s odds of having whipped cream will be 8.164 times Beverage A’s odds of having whipped cream. The odds of Beverage B having whipped cream is 816.4% as large as the odds of Beverage A having whipped cream.
There are two new subjects of interest, Beverage C and Beverage D, which are the same size and have the same type of milk and grams of sugar, but Beverage C has 30 mg of caffeine and Beverage D has 150 mg of caffeine. Beverage D's odds of having whipped cream will be 0.868 times Beverage C's odds. The odds of Beverage D having whipped cream are 86.8% as large as the odds of Beverage C having whipped cream.
9.6.1 Conclusion
The main effects model has slightly better overall fit quality, although the differences in Nagelkerke R-squared and the C statistic are very small. The sensitivity and PPV of the main effects model are very slightly better, and the specificity of the augmented model is very slightly better. Overall, I do not think these minor differences are worth the complication of adding the nonlinear term in the augmented model. For these reasons, I prefer the main effects model.
10 Discussion
Question: How effectively can milk type, quantity of sugar in grams, quantity of caffeine in milligrams, and beverage size predict the number of calories in a Starbucks beverage? Answer: The effect sizes of sugar and beverage size are larger than the effect sizes of caffeine and milk type. Overall, my model can effectively predict the number of calories in a Starbucks beverage.
Question: How effectively does beverage size, type of milk, and quantity of sugar and caffeine predict whether a beverage has whipped cream? Answer: Similar to my linear regression answer, sugar is the driver of the odds ratio prediction for whether a beverage has whipped cream. Type of milk and amount of caffeine have less of an effect on this prediction. Overall, my model can effectively predict whether a beverage will have whipped cream.
The most confusing part of this project for me was that I felt I was repeating myself frequently, and the summary statistics for all of my models were very similar. I had to carefully keep track of which model I was working with, what question I was trying to answer, and which outcome I was predicting. I wish that I used 2 different data sets for my linear and logistic regression modeling. This would have helped me separate these concepts better. The most useful thing I learned while doing this project is that linear regression models predict quantitative outcomes and logistic regression models predict binary outcomes. I understand this is a simple concept, but working through Project A is what drove this home for me. I am also happy that I learned how to assess for and incorporate nonlinear terms into my models.
11 Affirmation
I am certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security.
12 References
13 Session Information
xfun::session_info()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
Package version:
abind_1.4-5 askpass_1.1 assertthat_0.2.1
backports_1.4.1 base64enc_0.1-3 bit_4.0.4
bit64_4.0.5 bitops_1.0.7 blob_1.2.2
bookdown_0.24 boot_1.3.28 brio_1.1.3
broom_0.7.12 bslib_0.3.1 callr_3.7.0
car_3.0-12 carData_3.0-5 caret_6.0-90
caTools_1.18.2 cellranger_1.1.0 checkmate_2.0.0
class_7.3-20 cli_3.1.1 clipr_0.7.1
cluster_2.1.2 codetools_0.2-18 colorspace_2.0-2
compiler_4.1.2 cpp11_0.4.2 crayon_1.5.0
crosstalk_1.2.0 curl_4.3.2 data.table_1.14.2
DBI_1.1.2 dbplyr_2.1.1 desc_1.4.0
diffobj_0.3.5 digest_0.6.29 dplyr_1.0.8
dtplyr_1.2.1 e1071_1.7-9 ellipsis_0.3.2
evaluate_0.14 fansi_1.0.2 farver_2.1.0
fastmap_1.1.0 forcats_0.5.1 foreach_1.5.2
foreign_0.8-82 Formula_1.2-4 fs_1.5.2
future_1.23.0 future.apply_1.8.1 gargle_1.2.0
generics_0.1.2 GGally_2.1.2 ggdendro_0.1.22
ggforce_0.3.3 ggformula_0.10.1 ggplot2_3.3.5
ggrepel_0.9.1 ggridges_0.5.3 ggstance_0.3.5
globals_0.14.0 glue_1.6.1 googledrive_2.0.0
googlesheets4_1.0.0 gower_1.0.0 gplots_3.1.1
graphics_4.1.2 grDevices_4.1.2 grid_4.1.2
gridExtra_2.3 gtable_0.3.0 gtools_3.9.2
haven_2.4.3 here_1.0.1 highr_0.9
Hmisc_4.6-0 hms_1.1.1 htmlTable_2.4.0
htmltools_0.5.2 htmlwidgets_1.5.4 httr_1.4.2
ids_1.0.1 ipred_0.9-12 isoband_0.2.5
iterators_1.0.14 janitor_2.1.0 jpeg_0.1-9
jquerylib_0.1.4 jsonlite_1.7.3 KernSmooth_2.23.20
knitr_1.37 labeling_0.4.2 labelled_2.9.0
lattice_0.20-45 latticeExtra_0.6-29 lava_1.6.10
lazyeval_0.2.2 leaflet_2.0.4.1 leaflet.providers_1.9.0
lifecycle_1.0.1 listenv_0.8.0 lme4_1.1.28
lubridate_1.8.0 magrittr_2.0.2 maptools_1.1.2
markdown_1.1 MASS_7.3-55 Matrix_1.4-0
MatrixModels_0.5-0 methods_4.1.2 mgcv_1.8.38
mime_0.12 minqa_1.2.4 ModelMetrics_1.2.2.2
modelr_0.1.8 mosaic_1.8.3 mosaicCore_0.9.0
mosaicData_0.20.2 multcomp_1.4-18 munsell_0.5.0
mvtnorm_1.1-3 naniar_0.6.1 nlme_3.1-155
nloptr_2.0.0 nnet_7.3-17 norm_1.0.9.5
numDeriv_2016.8.1.1 openssl_1.4.6 parallel_4.1.2
parallelly_1.30.0 patchwork_1.1.1 pbkrtest_0.5.1
pillar_1.7.0 pkgconfig_2.0.3 pkgload_1.2.4
plyr_1.8.6 png_0.1-7 polspline_1.1.19
polyclip_1.10-0 praise_1.0.0 prettyunits_1.1.1
pROC_1.18.0 processx_3.5.2 prodlim_2019.11.13
progress_1.2.2 progressr_0.10.0 proxy_0.4-26
ps_1.6.0 purrr_0.3.4 quantreg_5.88
R6_2.5.1 rappdirs_0.3.3 raster_3.5.15
RColorBrewer_1.1-2 Rcpp_1.0.8 RcppEigen_0.3.3.9.1
readr_2.1.2 readxl_1.3.1 recipes_0.1.17
rematch_1.0.1 rematch2_2.1.2 reprex_2.0.1
reshape_0.8.8 reshape2_1.4.4 rlang_1.0.1
rmarkdown_2.11 rmdformats_1.0.3 rms_6.2-0
ROCR_1.0-11 rpart_4.1.16 rprojroot_2.0.2
rstudioapi_0.13 rvest_1.0.2 sandwich_3.0-1
sass_0.4.0 scales_1.1.1 selectr_0.4.2
snakecase_0.11.0 sp_1.4.6 SparseM_1.81
splines_4.1.2 SQUAREM_2021.1 stats_4.1.2
stats4_4.1.2 stringi_1.7.6 stringr_1.4.0
survival_3.2-13 sys_3.4 terra_1.5.17
testthat_3.1.2 TH.data_1.1-0 tibble_3.1.6
tidyr_1.2.0 tidyselect_1.1.1 tidyverse_1.3.1
timeDate_3043.102 tinytex_0.36 tools_4.1.2
tweenr_1.0.2 tzdb_0.2.0 UpSetR_1.4.0
utf8_1.2.2 utils_4.1.2 uuid_1.0.3
vctrs_0.3.8 viridis_0.6.2 viridisLite_0.4.0
visdat_0.5.3 vroom_1.5.7 waldo_0.3.1
withr_2.4.3 xfun_0.29 xml2_1.3.3
yaml_2.2.2 zoo_1.8-9