Impact of Glycohemoglobin on Total Sleep Time in US Adults
1 Setup and Data Ingest
1.1 Package Loading
I will load in my necessary packages.
library(janitor)
library(magrittr)
library(naniar)
library(ggrepel)
library(equatiomatic)
library(patchwork)
library(car)
library(broom)
library(nhanesA)
library(haven)
library(mosaic)
library(knitr)
library(GGally)
library(tidyverse)
$set(comment=NA)
opts_chunk$set(width=75)
opts_knit
theme_set(theme_bw())
options(dplyr.summarise.inform = FALSE)
1.2 Data Ingest
I will ingest each of the 3 nhanesA
data sets I will be using into a raw data set tibble, remove labels from the variables, and save each data set as a new file on my computer for subsequent sessions.
# demo_raw <- nhanes('DEMO_J') %>% zap_label() %>% tibble()
# saveRDS(demo_raw, "data/DEMO_J.Rds")
<- readRDS("data/DEMO_J.Rds") demo_raw
# sleep_raw <- nhanes('SLQ_J') %>% zap_label() %>% tibble()
# saveRDS(sleep_raw, "data/SLQ_J.Rds")
<- readRDS("data/SLQ_J.Rds") sleep_raw
# a1c_raw <- nhanes('GHB_J') %>% zap_label() %>% tibble()
# saveRDS(a1c_raw, "data/GHB_J.Rds")
<- readRDS("data/GHB_J.Rds") a1c_raw
2 Cleaning the Data
2.1 Selecting the Variables
I will select the variables of interest from each raw data set. I need SEQN
from each in order to combine the data sets next. I need SLD012
, SLQ030
, and SLQ120
from the sleep_raw
data, LBXGH
from the a1c_raw
data, and RIAGENDR
, RIDAGEYR
, and RIDSTATR
from the demo_raw
data set.
<- demo_raw %>%
demo_data select(SEQN, RIAGENDR, RIDAGEYR, RIDSTATR)
<- sleep_raw %>%
sleep_data select(SEQN, SLD012, SLQ030, SLQ120)
<- a1c_raw %>%
a1c_data select(SEQN, LBXGH)
2.2 Merging the Data
I will merge the 3 raw data sets first by joining sleep_data
to demo_data
, then by adding in a1c_data
. I will join them by the unique subject identifier SEQN
to create 1 row per subject.
<- left_join(demo_data, sleep_data, by = "SEQN")
temp
<- left_join(temp, a1c_data, by = "SEQN")
raw_data
dim(raw_data)
[1] 9254 8
I have 9254 rows (subjects) and 8 columns (variables) as expected.
I will check that the number of rows equals the number of unique subjects to ensure I avoided any errors in merging.
identical(raw_data %$% n_distinct(SEQN),
%>% nrow()) raw_data
[1] TRUE
They do.
I will remove data sets I will no longer be using to keep my work space clear.
rm(temp, demo_raw, a1c_raw, sleep_raw, demo_data, a1c_data, sleep_data)
2.3 Names and Types
I will begin to clean my data by make each variable the type I want. My categorical variables (RIDSTATR
, RIAGENDR
, SLQ030
, SLQ120
) will be factors. SEQN
will be a character variable, as its numbers have no numerical quality. LBXGH
, SLD012
, and RIDAGEYR
are my quantitative variables, so they will be numerical or ‘dbl’. Lastly, I will remove any levels of a factor variable that have 0 subjects.
dim(raw_data)
[1] 9254 8
<- raw_data %>%
clean_data mutate(SEQN = as.character(SEQN),
LBXGH = as.numeric(LBXGH),
SLD012 = as.double(SLD012),
RIDAGEYR = as.double(RIDAGEYR),
RIDSTATR = factor(RIDSTATR),
RIAGENDR = factor(RIAGENDR),
SLQ030 = factor(SLQ030),
SLQ120 = factor(SLQ120)) %>%
droplevels()
dim(clean_data)
[1] 9254 8
I did not lose any columns or rows as I did this.
glimpse(clean_data)
Rows: 9,254
Columns: 8
$ SEQN <chr> "93703", "93704", "93705", "93706", "93707", "93708", "93709"~
$ RIAGENDR <fct> 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1~
$ RIDAGEYR <dbl> 2, 2, 66, 18, 13, 66, 75, 0, 56, 18, 67, 54, 71, 61, 22, 45, ~
$ RIDSTATR <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
$ SLD012 <dbl> NA, NA, 8.0, 10.5, NA, 8.0, 7.0, NA, 7.0, 7.5, 5.5, 7.0, 5.0,~
$ SLQ030 <fct> NA, NA, 2, 1, NA, 9, 1, NA, 2, 1, 0, 3, 3, 3, 2, 0, NA, NA, 0~
$ SLQ120 <fct> NA, NA, 0, 1, NA, 2, 1, NA, 3, 2, 2, 3, 4, 3, 1, 2, NA, NA, 0~
$ LBXGH <dbl> NA, NA, 6.2, 5.2, 5.6, 6.2, 6.3, NA, 5.7, 5.4, 5.6, 12.7, 6.2~
My variable types are as expected and described above.
2.4 Variables
2.4.1 Total Sleep Time: Outcome
Total sleep time in hours is a quantitative variable and will be my outcome. I will start with a numerical summary.
df_stats(~ SLD012, data = clean_data)
My range of values is 2-14 hours of sleep per night on weekdays and weekends. These are plausible values. The mean and median being close to each and around 7.5-8 hours is reassuring. There are 3141 subjects with missing data. I will rename this to a more descriptive name and filter to complete cases.
<- clean_data %>%
clean_data rename(tst = SLD012) %>%
filter(complete.cases(tst))
dim(clean_data) # should be 9254 - 3141 = 6113
[1] 6113 8
I am left with 6113 rows (subjects) and 8 columns (variables) as expected.
2.4.2 Glycohemoglobin: Key Predictor
Glycohemoglobin percentage is a quantitative variable and will be my key predictor. I will start with a numerical summary.
df_stats( ~ LBXGH, data = clean_data)
There are 615 subjects with missing values. The range of values is 3.8-16.2%. Levels below 4% and above 12% are nearly implausible and rarely seen in clinical practice. Furthermore, the clinical difference between 3.8% and 4% as well as between 12% and 16.2% are extremely minimal and would be unlikely to affect the clinical significance of the findings of the following analyses. For example, a glycohemoglobin level of 12% already indicates very poor long term blood glucose control, and an increase to 16.2% does not make a meaningful difference at that point. On the other hand, these outliers will meaningfully affect the quality of my model’s predictions. For these reasons, I will create inclusion criteria of glycohemoglobin level 4-12% by telling R to consider values outside this range as missing. I will rename this to a more descriptive name.
<- clean_data %>%
clean_data rename(a1c = LBXGH) %>%
mutate(a1c = replace(a1c, a1c < 4 | a1c > 12, NA))
df_stats(~ a1c, data = clean_data)
The range of glycohemoglobin levels is now, more appropriately, 4.1-12%. There were no values of 4.0% in the raw data set. There are now 637 missing values. I will filter to complete cases.
<- clean_data %>%
clean_data filter(complete.cases(a1c))
dim(clean_data) # should be 6113 - 637 = 5476
[1] 5476 8
I am left with 5476 rows (subjects) and 8 columns (variables) as expected.
2.4.3 Age: Secondary Predictor
Age in years is a quantitative variable and will be a secondary predictor. I will start with a numerical summary.
df_stats(~ RIDAGEYR, data = clean_data)
There are no missing data. The age range of my subjects is 16-80 years old. I wish to study adults age 20-79 years in my analyses, so I will create these inclusion criteria now, as well as rename this variable to a more descriptive name. My upper limit is 79 years because subjects demarcated as 80 years old are, in actuality, 80 years old or greater. This ceiling effect will alter the accuracy of my model’s predictions, so I will remove them.
<- clean_data %>%
clean_data rename(age = RIDAGEYR) %>%
mutate(age = replace(age, age < 20 | age > 79, NA))
df_stats(~ age, data = clean_data)
The age range is now 20-79 years. I have 879 subjects with missing data. I will filter to complete cases.
<- clean_data %>%
clean_data filter(complete.cases(age))
dim(clean_data) # should be 5476 - 879 = 4597
[1] 4597 8
I am left with 4597 rows (subjects) and 8 columns (variables) as expected.
2.4.4 Sex: Secondary Predictor
Sex is a binary categorical variable and will be a secondary predictor. I will start with a numerical summary.
%>% tabyl(RIAGENDR) %>% kable(digits = 3) clean_data
RIAGENDR | n | percent |
---|---|---|
1 | 2206 | 0.48 |
2 | 2391 | 0.52 |
My data set comprises 2206 males (48%) and 2391 females (52%). I will rename this variable to a more descriptive name and recode it so that I can see ‘Male’ and ‘Female’ instead of 1 and 2.
# gender recode
<- clean_data %>%
clean_data rename(sex = RIAGENDR) %>%
mutate(sex = fct_recode(sex,
"Male" = "1",
"Female" = "2"))
%>% tabyl(sex) %>% kable(digits = 3) clean_data
sex | n | percent |
---|---|---|
Male | 2206 | 0.48 |
Female | 2391 | 0.52 |
dim(clean_data) # should be 4597
[1] 4597 8
I am left with 4597 rows (subjects) and 8 columns (variables) as expected.
2.4.5 Snore: Secondary Predictor
Snore is a binary categorical variable and will be a secondary predictor. I will start with a numerical summary.
%>% tabyl(SLQ030) %>% kable(digits = 3) clean_data
SLQ030 | n | percent |
---|---|---|
0 | 1058 | 0.230 |
1 | 1057 | 0.230 |
2 | 866 | 0.188 |
3 | 1302 | 0.283 |
7 | 5 | 0.001 |
9 | 309 | 0.067 |
My data set comprises 1058 subjects (23.0%) who never snore, 1057 subjects (23.0%) who rarely snore, 866 subjects (18.8%) who occasionally snore, 1302 subjects (28.3%) who frequently snore, 5 subjects who refused to answer, and 9 subjects who did not know. I do not have any missing data. I will rename this variable to a more descriptive name, tell R to consider “refused” and “don’t know” as missing values, and recode it to a binary variable so those who never or rarely snore will be in ‘Does not snore’ and those who occasionally or frequently snore will be in ‘Snores’.
<- clean_data %>%
clean_data rename(snore = SLQ030) %>%
mutate(snore = na_if(snore, 7),
snore = na_if(snore, 9)) %>%
mutate(snore = fct_recode(snore,
"Does not snore" = "0",
"Does not snore" = "1",
"Snores" = "2",
"Snores" = "3"))
%>% tabyl(snore) %>% kable() clean_data
snore | n | percent | valid_percent |
---|---|---|---|
Does not snore | 2115 | 0.4600827 | 0.4938127 |
Snores | 2168 | 0.4716119 | 0.5061873 |
7 | 0 | 0.0000000 | 0.0000000 |
9 | 0 | 0.0000000 | 0.0000000 |
NA | 314 | 0.0683054 | NA |
This looks good. I have created a binary variable and moved all the ‘Refused’ or ‘Don’t know’ responses to missing values. I will drop those missing values and those rows that do not contain any subjects.
<- clean_data %>%
clean_data filter(complete.cases(snore)) %>%
droplevels()
%>% tabyl(snore) %>% kable(digits = 3) clean_data
snore | n | percent |
---|---|---|
Does not snore | 2115 | 0.494 |
Snores | 2168 | 0.506 |
dim(clean_data) # should be 4597 - 314 = 4283
[1] 4283 8
I am left with 4283 rows (subjects) and 8 columns (variables) as expected.
2.4.6 Sleepy: Secondary Predictor
Sleepy is a 5-level multi-categorical variable and will be a secondary predictor. I will start with a numerical summary.
%>% tabyl(SLQ120) %>% kable(digits = 3) clean_data
SLQ120 | n | percent |
---|---|---|
0 | 704 | 0.164 |
1 | 1035 | 0.242 |
2 | 1464 | 0.342 |
3 | 730 | 0.170 |
4 | 346 | 0.081 |
9 | 4 | 0.001 |
My data set comprises 704 subjects (16.4%) who are never sleepy, 1035 subjects (24.2%) who are rarely sleepy, 1464 subjects (34.2%) who are sometimes sleepy, 730 subjects (17.0%) who are often sleepy, 346 (8.1%) who are always sleepy, and 4 subjects who did not know their frequency of sleepiness. I do not have any missing data. I will rename this variable to a more descriptive name, tell R to consider “Don’t know” as a missing value, and rename the levels more descriptively.
# recode to multi-categorical variable
<- clean_data %>%
clean_data rename(sleepy = SLQ120) %>%
mutate(sleepy = na_if(sleepy, 7),
sleepy = na_if(sleepy, 9)) %>%
mutate(sleepy = fct_recode(sleepy,
"never" = "0",
"rarely" = "1",
"sometimes" = "2",
"often" = "3",
"always" = "4"))
%>% tabyl(sleepy) %>% kable() clean_data
sleepy | n | percent | valid_percent |
---|---|---|---|
never | 704 | 0.1643708 | 0.1645244 |
rarely | 1035 | 0.2416530 | 0.2418789 |
sometimes | 1464 | 0.3418165 | 0.3421360 |
often | 730 | 0.1704413 | 0.1706006 |
always | 346 | 0.0807845 | 0.0808600 |
9 | 0 | 0.0000000 | 0.0000000 |
NA | 4 | 0.0009339 | NA |
I have moved all the ‘Don’t know’ responses to missing values and these level names look good. I will drop those missing values and the row that does not contain any subjects.
<- clean_data %>%
clean_data filter(complete.cases(sleepy)) %>%
droplevels()
dim(clean_data) # should be 4283 - 4 = 4279
[1] 4279 8
I am left with 4279 rows (subjects) and 8 columns (variables) as expected.
2.4.7 RIDSTATR: Indicator Variable
RIDSTATR
is a variable which designates whether a subject partook in both the questionnaire and the examination, for which they are given a value of 2. I will start by seeing if all the subjects in my data set have a value of 2.
%>% tabyl(RIDSTATR) %>% kable() clean_data
RIDSTATR | n | percent |
---|---|---|
2 | 4279 | 1 |
They do. I will drop the level 1 from my data set as it contains no subjects.
<- clean_data %>%
clean_data droplevels()
%>% tabyl(RIDSTATR) %>% kable() clean_data
RIDSTATR | n | percent |
---|---|---|
2 | 4279 | 1 |
dim(clean_data) # should be 4279
[1] 4279 8
I am left with 4279 rows (subjects) and 8 columns (variables) as expected.
2.5 Missingness Check
I will check that I was successful in removing the missing data as I cleaned each variable.
miss_var_summary(clean_data)
I was.
2.6 Creating My Tibble
I will create a tibble called study2
from the clean data, selecting only those variables I will use in my analyses. I will rename the unique subject identifier to a more descriptive name. Lastly, I will save my tibble in my data folder.
<- clean_data %>%
study2 select(SEQN, a1c, tst, age, sex, snore, sleepy) %>%
rename(sub_id = SEQN)
rm(raw_data, clean_data)
saveRDS(study2, "data/study2.Rds")
3 Codebook and Data Description
These data are from the 2017-2018 National Health and Nutrition Examination Survey (NHANES) which tracks the health of the population of the United States by collecting questionnaire and examination data. NHANES is run by the Center for Disease Control and Prevention (CDC). Their target population is the non-institutionalized civilian resident population of the United States. They create their sample population through random selection using statistical processes on US Census information. For the 2017-2018 sampling, 16,211 people were selected. Of these, 9,254 completed the interview and 8,704 completed the exams. The following analyses include subjects between 20-79 years old who have complete data on the variables of interest. My goal is to build a prediction model for total sleep time in hours with the key predictor being glycohemoglobin percentage, adjusted for age, sex, presence of snoring, and level of sleepiness.
3.1 Codebook
Variable | Type | Description | Raw Name |
---|---|---|---|
sub_id |
Chr | A unique subject identifier within each NHANES data set | SEQN |
tst |
Quant | Outcome This is in the Questionnaire category and describes estimated total sleep time on weekdays or workdays, in hours, ranging from 2 to 14 hours. | SLD012 |
a1c |
Quant | Key Predictor This is in the Laboratory category and describes the percentage of glycosylated hemoglobin in the blood, a measure of long-term blood glucose control. A level of 6.5% or higher is diagnostic for diabetes, with higher levels indicating poorer disease status. Inclusion criteria are values 4.1% to 12.0%. | LBXGH |
age |
Quant | This is in the Demographics category and describes age in years. Inclusion criteria ages 20 to 79 years. | RIDAGEYR |
sex |
Binary | This is in the Demographics category and describes sex as male or female. These data comprise 48.2% males and 51.8% females. | RIAGENDR |
snore |
Binary | This is in the Questionnaire category and describes subjective snoring frequency in the NHANES data. In converting this to binary for these analyses, the meaning has become presence of snoring, thus the categories are ‘Snores’ and ‘Does not snore’. ‘Snores’ includes the questionnaire answers “Occasionally - 3-4 nights a week” or “Frequently - 5 or more nights a week”. ‘Does not snore’ includes the questionnaire answers “Never” or “Rarely - 1-2 nights a week”. In these data, 50.6% admit to snoring, and 49.4% report never or rarely snoring. | SLQ030 |
sleepy |
5-Cat | This is in the Questionnaire category and describes subjective frequency of feeling overly sleepy. The categories are ‘never’, ‘rarely’, ‘sometimes’, ‘often’, and ‘always’. ‘Never’ corresponds to a questionnaire response of ‘never’, ‘rarely’ to ‘rarely - 1 time a month, ’sometimes’ to ‘sometimes - 2-4 times a month’, ‘often’ to ‘often - 5-15 times a month’, and ‘always’ to ‘always - 16-30 times a month’. In these data, 16.5% are never sleepy, 24.2% are rarely sleepy, 34.2% are sometimes sleepy, 17.1% are often sleepy, and 8.1% are always sleepy. | SLQ120 |
3.2 Analytic Tibble
Here is my tibble.
study2
As anticipated, I have 4279 rows (subjects) and 7 variables in my final data set.
I will demonstrate that study2
is, in fact, a tibble.
is_tibble(study2)
[1] TRUE
3.3 Data Summary
I would like to see a summary of my data now that it has been cleaned up.
%>%
study2 select(-sub_id) %>%
::describe(.) Hmisc
.
6 Variables 4279 Observations
--------------------------------------------------------------------------------
a1c
n missing distinct Info Mean Gmd .05 .10
4279 0 78 0.996 5.81 0.9007 4.9 5.0
.25 .50 .75 .90 .95
5.3 5.6 6.0 6.8 7.8
lowest : 4.1 4.2 4.3 4.4 4.5, highest: 11.5 11.6 11.7 11.8 12.0
--------------------------------------------------------------------------------
tst
n missing distinct Info Mean Gmd .05 .10
4279 0 23 0.985 7.549 1.78 5.0 5.5
.25 .50 .75 .90 .95
6.5 7.5 8.5 9.5 10.0
lowest : 2.0 3.0 3.5 4.0 4.5, highest: 11.5 12.0 12.5 13.0 14.0
--------------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10
4279 0 60 1 48.87 18.82 23 26
.25 .50 .75 .90 .95
35 50 62 70 74
lowest : 20 21 22 23 24, highest: 75 76 77 78 79
--------------------------------------------------------------------------------
sex
n missing distinct
4279 0 2
Value Male Female
Frequency 2063 2216
Proportion 0.482 0.518
--------------------------------------------------------------------------------
snore
n missing distinct
4279 0 2
Value Does not snore Snores
Frequency 2113 2166
Proportion 0.494 0.506
--------------------------------------------------------------------------------
sleepy
n missing distinct
4279 0 5
lowest : never rarely sometimes often always
highest: never rarely sometimes often always
Value never rarely sometimes often always
Frequency 704 1035 1464 730 346
Proportion 0.165 0.242 0.342 0.171 0.081
--------------------------------------------------------------------------------
4 My Research Question
How well can an adult’s (age 20-79 years) long term blood glucose control i.e., glycohemoglobin percentage, predict their total sleep time on workdays/weekdays, after adjusting for age, sex, presence of snoring, and frequency of sleepiness?
5 Partitioning the Data
I will set a seed so that I can replicate my findings later. I will create a training subset comprising 80% of subjects from study2
, and a test subset that comprises 20%.
set.seed(1228)
<- study2 %>%
study2_train slice_sample(., prop = .80)
<- anti_join(study2, study2_train, by = "sub_id")
study2_test
dim(study2)
[1] 4279 7
dim(study2_train)
[1] 3423 7
dim(study2_test)
[1] 856 7
This looks good because 3423 is 80% of 4279, and the difference of these is 856.
6 Transforming the Outcome
6.1 Visualizing the Outcome Distribution
<- ggplot(study2_train, aes(sample = tst)) +
p1 geom_qq(col = "black", alpha = 0.5, size = 0.5) +
geom_qq_line(col = "red") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot",
x = "Theoretical Quantities",
y = "TST (hrs)")
<- ggplot(study2_train, aes(x = tst)) +
p2 geom_histogram(aes(y = stat(density)),
bins = 12, col = "slateblue4",
fill = "lightskyblue") +
stat_function(fun = dnorm,
args = list(mean = mean(study2_train$tst),
sd = sd(study2_train$tst)),
col = 'red', lwd = 0.8) +
labs(title = "Density Histogram",
x = "TST (hrs)",
y = "Density")
<- ggplot(study2_train, aes(x = "", y = tst)) +
p3 geom_boxplot(fill = "lightskyblue",
outlier.color = "red",
outlier.size = 2) +
coord_flip() +
labs(title = "Box Plot",
x = "",
y = "TST (hrs)")
+ (p2 / p3 + plot_layout(heights = c(4,1))) +
p1 plot_annotation(title = "Distribution Assessment of Total Sleep Time (`tst`)",
subtitle = "abbreviated as TST",
caption = "3423 subjects from `study2_train`")
The bell-shape of the histogram density curve points toward a normal distribution, however the box plot shows there are 6 outliers, 2 on the lower scale and 4 on the higher scale. The Q-Q plot further emphasizes these findings. Because the distribution is fairly symmetric, a transformation is unlikely to be helpful.
6.2 Numerical Summaries
This is a numerical summary of my outcome, tst
.
favstats(~ tst, data = study2_train)
The mean and median being so close provides evidence of Normality.
This is a numerical summary of my predictors.
%>%
study2_train select(a1c, age, sex, snore, sleepy) %>%
::describe(.) Hmisc
.
5 Variables 3423 Observations
--------------------------------------------------------------------------------
a1c
n missing distinct Info Mean Gmd .05 .10
3423 0 77 0.996 5.816 0.9042 4.9 5.0
.25 .50 .75 .90 .95
5.3 5.6 6.0 6.8 7.8
lowest : 4.1 4.2 4.3 4.4 4.5, highest: 11.4 11.5 11.6 11.8 12.0
--------------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10
3423 0 60 1 48.82 18.84 23 26
.25 .50 .75 .90 .95
34 50 62 70 74
lowest : 20 21 22 23 24, highest: 75 76 77 78 79
--------------------------------------------------------------------------------
sex
n missing distinct
3423 0 2
Value Male Female
Frequency 1648 1775
Proportion 0.481 0.519
--------------------------------------------------------------------------------
snore
n missing distinct
3423 0 2
Value Does not snore Snores
Frequency 1695 1728
Proportion 0.495 0.505
--------------------------------------------------------------------------------
sleepy
n missing distinct
3423 0 5
lowest : never rarely sometimes often always
highest: never rarely sometimes often always
Value never rarely sometimes often always
Frequency 549 849 1152 591 282
Proportion 0.160 0.248 0.337 0.173 0.082
--------------------------------------------------------------------------------
My categorical variables have the proper number of levels and subjects within each levels, and my quantitative variables have the values I expect. There are no missing data.
6.3 Scatterplot Matrix
This is a scatterplot matrix to show the interplay of all my variables of interest.
ggpairs(study2_train %>% select(a1c, age, sex, snore, sleepy, tst),
lower = list(combo = wrap("facethist", binwidth = 0.5)),
title = "3423 subjects in `study2_train`")
My outcome does not appear like it will fit a linear model well with my key predictor. The Pearson correlation coefficient is -0.001, indicative of a weak relationship. The relationship of my outcome with age
appears even more random in the graphical representation, but the Pearson correlation coefficient states otherwise at 0.021 with a1c
predicting age
, and 0.340 with age
predicting a1c
. The distribution of total sleep time when stratified by both sex
and snore
appears to maintain normality, but this matrix does not allow me to see outliers, so I will look further into that. It is too difficult for me to interpret my outcome’s relationship with sleepy
so I will look further into that as well.
6.3.1 Relationship Between Sex and TST
ggplot(study2_train, aes(x = sex, y = tst)) +
geom_violin(aes(fill = sex)) +
geom_boxplot(width = 0.3, outlier.size = 2) +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 2, fill = "magenta") +
coord_flip() +
guides(fill = "none") +
labs(title = "Sex as a Predictor of Total Sleep Time",
subtitle = "stratified by male or female sex",
caption = "3423 subjects in `study2_train`",
x = "Sex",
y = "Total Sleep Time (hrs)")
As I suspected from my scatterplot matrix and numerical summaries, tst
for both sexes has a bell shape, but the outliers prevent it from having a normal distribution. The outliers have less of an effect on the mean in males than in females. The spread of each is very similar.
6.3.2 Relationship Between Snoring and TST
ggplot(study2_train, aes(x = snore, y = tst)) +
geom_violin(aes(fill = snore)) +
geom_boxplot(width = 0.3, outlier.size = 2) +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 2, fill = "magenta") +
coord_flip() +
guides(fill = "none") +
labs(title = "Snoring as a Predictor of Total Sleep Time",
subtitle = "stratified by snoring presence",
caption = "3423 subjects in `study2_train`",
x = "Presence of Snoring",
y = "Total Sleep Time (hrs)")
Again, as I suspected from my scatterplot matrix and numerical summaries, tst
for categories has a bell shape, but the outliers prevent it from having a normal distribution. The means are very close to the medians here. The spread of each is very similar.
6.3.3 Relatiosnip Between Sleepinessand TST
::favstats(tst ~ sleepy, data = study2_train) mosaic
The range of TST values is similar for each category, however subjects in the ‘always’ sleepy category report fewer total sleep hours with a max of 12 hours and mean of 7.23 hours. Subjects in the ‘never’ sleepy category report the most total sleep hours with a mean of 7.73 hours. The sizes of each category is unbalanced, so I will view the distributions with a density plot.
ggplot(study2_train, aes(x = tst, y = sleepy, fill = sleepy)) +
geom_density_ridges(scale = 0.9, alpha = 0.8) +
scale_fill_viridis_d() +
guides(fill = "none") +
theme_ridges() +
labs(title = "Distribution of Total Sleep Time (TST)",
subtitle = "stratified by sleepiness frequency",
x = "TST (hrs)",
y = "")
Picking joint bandwidth of 0.354
Visually, it appears the spread of values decreases from the ‘always’ sleepy to the ‘never’ sleepy group. Otherwise, each category appears to have a normal distribution aside from the likely outliers.
6.4 Collinearity Checking
Next, I will check for any collinearity of my data.
::vif(lm(tst ~ a1c + age + sex + snore + sleepy,
cardata = study2_train))
GVIF Df GVIF^(1/(2*Df))
a1c 1.142530 1 1.068892
age 1.164237 1 1.078998
sex 1.016107 1 1.008021
snore 1.064328 1 1.031663
sleepy 1.028627 4 1.003534
There do not appear to be any issues with collinearity here.
6.5 Transformation Assessment
I do not suspect an improvement in the distribution of my outcome with a transformation because the issues are the outliers. I will run a boxCox
just to be sure.
<- lm(tst ~ a1c + age + sex + snore + sleepy,
model_temp data = study2_train)
boxCox(model_temp)
powerTransform(model_temp)
Estimated transformation parameter
Y1
1.088339
This confirms my thinking that a transformation would not help my outcome.
7 The Big Model
I will fit a “kitchen sink” model that includes all my predictors (a1c
, age
, sex
, snore
, and sleepy
), including my key predictor, to predict my outcome, tst
.
<- lm(tst ~ a1c + age + sex + snore + sleepy,
mod_big data = study2_train)
summary(mod_big)
Call:
lm(formula = tst ~ a1c + age + sex + snore + sleepy, data = study2_train)
Residuals:
Min 1Q Median 3Q Max
-5.9452 -0.9086 0.0536 0.9115 6.6603
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.527319 0.183580 41.003 < 2e-16 ***
a1c -0.002432 0.030063 -0.081 0.93552
age 0.002591 0.001848 1.402 0.16094
sexFemale 0.248842 0.056430 4.410 1.07e-05 ***
snoreSnores -0.085615 0.057717 -1.483 0.13807
sleepyrarely -0.176910 0.089640 -1.974 0.04851 *
sleepysometimes -0.121373 0.085265 -1.423 0.15469
sleepyoften -0.257612 0.097612 -2.639 0.00835 **
sleepyalways -0.496874 0.120762 -4.115 3.97e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.637 on 3414 degrees of freedom
Multiple R-squared: 0.01281, Adjusted R-squared: 0.01049
F-statistic: 5.536 on 8 and 3414 DF, p-value: 5.588e-07
The model accounts for 1.28% of variation in the outcome, as per the R^2. The adjusted R^2 value is 0.0105, which I will use later to compare this model to other models. The residual standard error is 1.637, telling me to expect 95% of residuals be in the range of -3.274 to +3.274, and none to be outside the range of -4.911 to +4.911. The model has an F statistic of 5.536 on 8 and 3414 degrees of freedom that yields a p-value of <0.001, implying that this model can predict the outcome better than if the outcome were predicted by chance.
tidy(mod_big, conf.int = TRUE, conf.level = 0.90) %>%
select(term, estimate, std.error, conf.low, conf.high, p.value) %>%
kable(digits = 3)
term | estimate | std.error | conf.low | conf.high | p.value |
---|---|---|---|---|---|
(Intercept) | 7.527 | 0.184 | 7.225 | 7.829 | 0.000 |
a1c | -0.002 | 0.030 | -0.052 | 0.047 | 0.936 |
age | 0.003 | 0.002 | 0.000 | 0.006 | 0.161 |
sexFemale | 0.249 | 0.056 | 0.156 | 0.342 | 0.000 |
snoreSnores | -0.086 | 0.058 | -0.181 | 0.009 | 0.138 |
sleepyrarely | -0.177 | 0.090 | -0.324 | -0.029 | 0.049 |
sleepysometimes | -0.121 | 0.085 | -0.262 | 0.019 | 0.155 |
sleepyoften | -0.258 | 0.098 | -0.418 | -0.097 | 0.008 |
sleepyalways | -0.497 | 0.121 | -0.696 | -0.298 | 0.000 |
The 90% confidence interval for my key predictor (a1c
) includes 0, so it is unlikely that a1c
improves the effectiveness of this model’s prediction of tst
. The confidence intervals for age
and snore
also include 0, so their inclusion is also likely not improving the predictability of the model.
The predictors for which the confidence interval do not include 0 are sex
and sleepy
. The point estimate for sex
implies that female subjects have 0.249 hours more sleep time than males, after adjusting for a1c
and age
, and the 90% confidence interval for this estimate is (0.156,0.342). Compared to those who are never sleepy, those who are rarely, often, and always sleepy have a lower total sleep time. Those who are sometimes sleepy may have 0.121 fewer sleep hours, but I am not 90% confident that this difference exists in the true population.
extract_eq(mod_big, use_coefs = TRUE, coef_digits = 3,
wrap = TRUE, terms_per_line = 1)
\[ \begin{aligned} \operatorname{\widehat{tst}} &= 7.527\ - \\ &\quad 0.002(\operatorname{a1c})\ + \\ &\quad 0.003(\operatorname{age})\ + \\ &\quad 0.249(\operatorname{sex}_{\operatorname{Female}})\ - \\ &\quad 0.086(\operatorname{snore}_{\operatorname{Snores}})\ - \\ &\quad 0.177(\operatorname{sleepy}_{\operatorname{rarely}})\ - \\ &\quad 0.121(\operatorname{sleepy}_{\operatorname{sometimes}})\ - \\ &\quad 0.258(\operatorname{sleepy}_{\operatorname{often}})\ - \\ &\quad 0.497(\operatorname{sleepy}_{\operatorname{always}}) \end{aligned} \]
For every 1 percentage point increase in a1c
, I expect a decrease in TST of 0.002 hours (90% CI: -0.052, 0.047), if all other predictors remained constant. For every 1 year older age, I expect an increase in TST of 0.003 hours (90% CI: 0.000,0.006), if all other predictors remained constant. Being female results in an increase in TST of 0.249 hours (90% CI: 0.156,0.342), compared to being male, if all other predictors remained constant. The lack of snoring estimates a 0.086 hour (90% CI: -0.009,0.181) increase in TST, compared to the presence of snoring, if all other predictors remained constant. Compared to those who are never sleepy, those who are rarely, sometimes, often, and always sleepy get 0.177 (90% CI: -0.324 -0.029), 0.121 (90% CI: -0.262,0.019), 0.258 (90% CI: -0.418,-0.097), and 0.497 (90% CI: -0.696,-0.298) fewer hours of sleep on weekdays/workdays, respectively, if all other predictors remained constant.
8 The Smaller Model
I will fit a model that uses a1c
and sex
to predict total sleep time. First I will visualize the relationship I am evaluating.
ggplot(study2_train, aes(x = a1c, y = tst,
col = sex, group = sex)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
facet_wrap(~ sex) +
guides(col = "none") +
labs(title = "Total Sleep Time as a Function of Glycohemoglobin",
subtitle = "stratified by sex, with OLS",
caption = "3423 subjects from `study2_train`",
x = "Glycohemoglobin (%)",
y = "Total Sleep Time (hrs)")
It appears that in males, higher glycohemoglobin is associated with higher total sleep time. The opposite is true in females; higher glycohemoglobin is associated with lower total sleep time. Both of these relationships appear very weak.
I will fit the model.
<- lm(tst ~ a1c + sex, data = study2_train)
mod_small
summary(mod_small)
Call:
lm(formula = tst ~ a1c + sex, data = study2_train)
Residuals:
Min 1Q Median 3Q Max
-5.6796 -0.9281 0.0714 0.8295 6.5722
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.41321 0.17009 43.585 < 2e-16 ***
a1c 0.00275 0.02822 0.097 0.922
sexFemale 0.24329 0.05618 4.331 1.53e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.641 on 3420 degrees of freedom
Multiple R-squared: 0.005455, Adjusted R-squared: 0.004873
F-statistic: 9.379 on 2 and 3420 DF, p-value: 8.665e-05
The model accounts for 0.55% of variation in the outcome, as per the R^2. The adjusted R^2 value is 0.0049, which I will use later to compare this model to other models. The residual standard error is 1.641, telling me to expect 95% of residuals be in the range of -3.282 to +3.282, and none to be outside the range of -4.923 to +4.923. The model has an F statistic of 9.379 on 2 and 3420 degrees of freedom that yields a p-value of <0.001, implying that this model can predict the outcome better than if the outcome were predicted by chance.
tidy(mod_small, conf.int = TRUE, conf.level = 0.90) %>%
select(term, estimate, std.error, conf.low, conf.high, p.value) %>%
kable(dig = c(0,3,3,3,3,5))
term | estimate | std.error | conf.low | conf.high | p.value |
---|---|---|---|---|---|
(Intercept) | 7.413 | 0.170 | 7.133 | 7.693 | 0.00000 |
a1c | 0.003 | 0.028 | -0.044 | 0.049 | 0.92240 |
sexFemale | 0.243 | 0.056 | 0.151 | 0.336 | 0.00002 |
The 90% confidence interval for my key predictor (a1c
) includes 0, so it is unlikely that a1c
improves the effectiveness of this model’s prediction of tst
. The confidence interval sex
does not include 0. The point estimate for sex
implies that female subjects have 0.243 hours more sleep time than males, after adjusting for a1c
, and the 90% confidence interval for this estimate is (0.151,0.336).
extract_eq(mod_small, use_coefs = TRUE, coef_digits = 3,
terms_per_line = 1, wrap = TRUE)
\[ \begin{aligned} \operatorname{\widehat{tst}} &= 7.413\ + \\ &\quad 0.003(\operatorname{a1c})\ + \\ &\quad 0.243(\operatorname{sex}_{\operatorname{Female}}) \end{aligned} \]
For every 1 percentage point increase in a1c
, I expect a increase in TST of 0.003 hours (90% CI: -0.044,0.049), if all other predictors remained constant. Being female results in an increase in TST of 0.243 hours (90% CI: 0.151,0.336), compared to being male, if all other predictors remained constant.
9 In-Sample Comparison
9.1 Quality of Fit
<- glance(mod_big) %>% mutate(mod_n = "Big_train")
rep_big
<- glance(mod_small) %>% mutate(mod_n = "Small_train")
rep_small
<- bind_rows(rep_big, rep_small)
fit_report
%>%
fit_report select(mod_n, nobs, r.squared, adj.r.squared, sigma, AIC, BIC) %>%
kable(digits = c(0,0,4,3,3,0,0))
mod_n | nobs | r.squared | adj.r.squared | sigma | AIC | BIC |
---|---|---|---|---|---|---|
Big_train | 3423 | 0.0128 | 0.010 | 1.637 | 13097 | 13158 |
Small_train | 3423 | 0.0055 | 0.005 | 1.641 | 13111 | 13135 |
According to R^2, adjusted R^2, sigma, and AIC, the Big Model performed better. According to BIC, the smaller model performed better.
9.2 Assessing Assumptions
For my first step in developing residual plots, I will use augment
to calculate residuals.
<- augment(mod_big, data = study2_train)
aug_big <- augment(mod_small, data = study2_train) aug_small
The first plots I will look at are the residual values vs. fitted to assess linearity and variance, and normal Q-Q plots to assess normality.
<- ggplot(aug_big, aes(x = .fitted, y = .resid)) +
p1_big geom_point(alpha = 0.1) +
geom_point(data = aug_big %>%
slice_max(abs(.resid), n = 5),
col = "red", size = 2) +
geom_text_repel(data = aug_big %>%
slice_max(abs(.resid), n = 5),
aes(label = sub_id), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed") +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
labs(title = "Big Model: Residual vs. Fitted",
x = "Fitted total sleep time (hrs)", y = "Residual")
<- ggplot(aug_small, aes(x = .fitted, y = .resid)) +
p1_small geom_point(alpha = 0.1) +
geom_point(data = aug_small %>%
slice_max(abs(.resid), n = 5),
col = "red", size = 2) +
geom_text_repel(data = aug_small %>%
slice_max(abs(.resid), n = 5),
aes(label = sub_id), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed") +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
labs(title = "Small Model: Residual vs. Fitted",
x = "Fitted total sleep time (hrs)", y = "Residual")
<- ggplot(aug_big, aes(sample = .std.resid)) +
p2_big geom_qq() +
geom_qq_line(col = "red") +
labs(title = "Big Model: Normal Q-Q",
y = "Standardized Residuals",
x = "Standard Normal Quantiles")
<- ggplot(aug_small, aes(sample = .std.resid)) +
p2_small geom_qq() +
geom_qq_line(col = "red") +
labs(title = "Small Model: Normal Q-Q",
y = "Standardized Residuals",
x = "Standard Normal Quantiles")
+ p2_big) / (p1_small + p2_small) +
(p1_big plot_annotation(title = "Residual Diagnostics: Set 1",
subtitle = "3423 subjects from `study2_train`")
I can assume linearity for the big model. There is an area around a fitted value of 7.2 hours that is a bit sparse, but this is unlikely to have a large impact. Neither the line nor points have any distinct pattern. The residual vs. fitted plot for the small model is not of much use to me. It likely looks this way because the lines of best fit on the plots for each sex were nearly horizontal, so the fitted value is very similar within each group. I have higher residuals in the small model. As far as normality, both models show a symmetrical distribution with a wide spread and a few outliers, with standardized residuals up to 4. I cannot assume normality for these models.
The next plots I will look at are scale-location plots for variance and residuals vs. leverage plots for poorly fit points.
<- ggplot(aug_big, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
p3_big geom_point(alpha = 0.2) +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
labs(title = "Big Model: Scale-Location",
x = "Fitted TST",
y = "Sqrt|Std. Residual|")
<- ggplot(aug_small, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
p3_small geom_point(alpha = 0.2) +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
labs(title = "Small Model: Scale-Location",
x = "Fitted TST",
y = "Sqrt|Std. Residual|")
<- ggplot(aug_big, aes(x = .hat, y = .std.resid)) +
p4_big geom_point(alpha = 0.2) +
geom_point(data = aug_big %>% filter(.cooksd >= 0.5),
col = "red", size = 2) +
geom_text_repel(data = aug_big %>% filter(.cooksd >= 0.5),
aes(label = sub_id), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
labs(title = "Big Model: Residuals vs. Leverage",
caption = "Red points indicate Cook's d at least 0.5",
x = "Leverage", y = "Standardized Residual")
<- ggplot(aug_small, aes(x = .hat, y = .std.resid)) +
p4_small geom_point(alpha = 0.2) +
geom_point(data = aug_small %>% filter(.cooksd >= 0.5),
col = "red", size = 2) +
geom_text_repel(data = aug_small %>% filter(.cooksd >= 0.5),
aes(label = sub_id), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
labs(title = "Small Model: Residuals vs. Leverage",
caption = "Red points indicate Cook's d at least 0.5",
x = "Leverage", y = "Standardized Residual")
+ p4_big) / (p3_small + p4_small) +
(p3_big plot_annotation(
title = "Residual Diagnostics: Set 2")
There are no substantial issues with variance nor any highly influential points in either model.
Lastly I will check for collinearity in my models.
::vif(mod_big) car
GVIF Df GVIF^(1/(2*Df))
a1c 1.142530 1 1.068892
age 1.164237 1 1.078998
sex 1.016107 1 1.008021
snore 1.064328 1 1.031663
sleepy 1.028627 4 1.003534
There are no elevated variance inflation factors, so I am not concerned about collinearity. I will not run this for my small model because it contains no variables that are not in the big model.
9.3 Comparing the Models
The evidence I have so far leads me to conclude that the big model is better. It performs better on quality of fit measures, specially R^2, adjusted R^2, sigma, and AIC. The only reason BIC was better for the small model is that BIC highly takes the number of variables into account. The only regression assumption of concern is normality, and the small model has the same issue.
10 Model Validation
10.1 Calculating Prediction Errors
First I will augment the big model, create a variable mod_n
for the model name, and rename the fitted and residual variables more descriptively as tst_fit
and tst_res
. I will do this using the test data.
<- augment(mod_big, newdata = study2_test) %>%
test_big mutate(mod_n = "Big_test") %>%
rename(tst_fit = .fitted,
tst_res = .resid) %>%
select(mod_n, sub_id, tst, tst_fit, tst_res, everything())
%>% head(3) test_big
I have my fitted and residual variables here with the proper names.
I will do the same thing with the small model, also on the test data.
<- augment(mod_small, newdata = study2_test) %>%
test_small mutate(mod_n = "Small_test") %>%
rename(tst_fit = .fitted,
tst_res = .resid) %>%
select(mod_n, sub_id, tst, tst_fit, tst_res, everything())
%>% head(3) test_small
I have my fitted and residual variables here with the proper names.
I will combine the big and small model’s residual and fitted values into a tibble so that I have the fitted and residual tst
for each subject, for each model.
<- union(test_big, test_small) %>%
test_combo arrange(sub_id, mod_n)
%>% head() test_combo
10.2 Visualizing the Predictions
ggplot(test_combo, aes(x = tst_fit, y = tst_res)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", col = "blue", se = FALSE, formula = y ~ x) +
facet_wrap( ~ mod_n) +
labs(x = "Predicted TST (hrs)",
y = "Observed TST (hrs)",
title = "Observed vs. Predicted Total Sleep Time",
subtitle = "Comparing Big to Small Model in Test Sample",
caption = "Total sleep time abbreviated as TST")
The differences in each model’s errors are about the same. The loess smooth line’s slope is approximately 0 and there are equal numbers of points above and below the line with approximately the same variance.
10.3 Summarizing the Errors
%>%
test_combo group_by(mod_n) %>%
summarise(n = n(),
MAPE = mean(abs(tst_res)),
RMSPE = sqrt(mean(tst_res^2)),
max_error = max(abs(tst_res))) %>%
kable(digits = 3)
mod_n | n | MAPE | RMSPE | max_error |
---|---|---|---|---|
Big_test | 856 | 1.198 | 1.569 | 6.305 |
Small_test | 856 | 1.200 | 1.570 | 6.329 |
The big model is better in regard to predictive error. It has the lower MAPE, RMSPE, and maximum error. These models have an average error of about 1.2 hours in predicting total sleep time. Given the range of tst
values is 2-14 hours, this is a large mean error.
I would like to identify the subjects for whom the MAPE was made by each model. The observed and fitted values of tst
will also be valuable to look at.
<- test_big %>%
temp_big filter(abs(tst_res) == max(abs(tst_res)))
<- test_small %>%
temp_small filter(abs(tst_res) == max(abs(tst_res)))
bind_rows(temp_big, temp_small)
The big model had the largest predictive error on subject 95233, where it predicted 7.7 hours of sleep time but the subject reported 14 hours. This was a 21 year old female who did not snore, reported being overly sleepy sometimes, and had a glycohemoglobin of 5.7%. The small model had the largest predictive error on subject 93813, where it also predicted 7.7 hours of sleep time but the subject reported 14 hours. These are the same values (when rounded to the nearest tenth) as in the large model. This was a 42 year old female who snored, was never overly sleepy, and had a glycohemoglobin of 5.3%.
Lastly, I will validate the R^2 values for both models.
%$% cor(tst, tst_fit)^2 test_big
[1] 0.01299449
%$% cor(tst, tst_fit)^2 test_small
[1] 0.01416621
These are very similar. The big model accounted for 1.30% of variation in the outcome, which is very slightly better than the calculated R^2 earlier (1.28%), and the small model accounted for 1.42%, which is better than the calculated R^2 earlier (0.5%). Neither of these models are very impressive, but these R^2 values are at least the same or better than we initially thought.
10.4 Comparing the Models
The smaller model had larger overall prediction error statistics, but its validated R^2 is larger. Despite the R^2, I would chose the big model because I would prefer less error overall, and the R^2 value of the small model is only 0.19 percentage points better than the big model’s.
11 Discussion
11.1 Chosen Model
I chose the big model. The big model showed better prediction error statistics, e.g., lower MAPE, RMSPE, and maximum error. The big model also follows the linearity assumption better. It is harder for me to assess linearity in the small model because changes in glycohemoglobin levels do not cause much change in predicted total sleep time. This leads to the model predicting approximately the mean total sleep time for every subject. The big model’s validated R^2 is lower than the smaller model’s, but both are very low to begin with and the difference is minimal.
11.2 Answering My Question
My research question was: How well can an adult (age 20-79 years) subject’s long term blood glucose control i.e., glycohemoglobin percentage, predict their total sleep time on workdays/weekdays, after adjusting for age, sex, presence of snoring, and frequency of sleepiness?
My answer is: not well. Glycohemoglobin percentage, after adjustment for all these other predictors, only accounted for 1.3% of the variation in total sleep time on workdays/weekdays in my big model. My pre-analysis expectation was that people with higher blood glucose would have less total sleep time. Sleep deprivation interferes with insulin, insulin-like growth hormone, leptin (the satiety hormone), and ghrelin (the hunger hormone), which leads to variations in blood glucose, but not necessarily long term blood glucose, which is measured by glycohemoglobin. A limitation is that total sleep time is an estimated and subjective value. People will often over- or underestimate for various reasons that can differ day to day, and even depend on what time of day they are asked.
11.3 Next Steps
An actigraphy monitor is a watch-like device that a patient will often wear for about 2 weeks that monitors their activity levels. It is used to estimate total sleep time, among other uses. This is a much more objective measurement than a reported, estimated total sleep time on a questionnaire. Perhaps a study that used actigraphy to more objectively measure subjects’ total sleep time might see a stronger correlation with glycohemoglobin. I also wonder about how acute changes in total sleep time affect blood glucose levels. A study that measured a difference in sleep time over the past, say, 3 nights, rather than an estimated total sleep time at baseline, might be interesting as well.
11.4 Reflection
I would have chosen more quantitative variables, had I known at the start what I know now. I feel that a different mixture of quantitative variables might have strengthened my smaller model better than categorical ones might have.
12 Session Information
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 purrr_0.3.4 readr_2.1.1
[5] tidyr_1.1.4 tibble_3.1.6 tidyverse_1.3.1 GGally_2.1.2
[9] knitr_1.36 mosaic_1.8.3 ggridges_0.5.3 mosaicData_0.20.2
[13] ggformula_0.10.1 ggstance_0.3.5 dplyr_1.0.7 Matrix_1.3-4
[17] lattice_0.20-45 haven_2.4.3 nhanesA_0.6.5.3 broom_0.7.10
[21] car_3.0-12 carData_3.0-4 patchwork_1.1.1 equatiomatic_0.3.0
[25] ggrepel_0.9.1 ggplot2_3.3.5 naniar_0.6.1 magrittr_2.0.1
[29] janitor_2.1.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 ellipsis_0.3.2 visdat_0.5.3
[4] leaflet_2.0.4.1 snakecase_0.11.0 htmlTable_2.3.0
[7] fs_1.5.1 base64enc_0.1-3 ggdendro_0.1.22
[10] rstudioapi_0.13 farver_2.1.0 fansi_0.5.0
[13] lubridate_1.8.0 xml2_1.3.3 splines_4.1.2
[16] polyclip_1.10-0 Formula_1.2-4 jsonlite_1.7.2
[19] cluster_2.1.2 dbplyr_2.1.1 png_0.1-7
[22] ggforce_0.3.3 shiny_1.7.1 compiler_4.1.2
[25] httr_1.4.2 backports_1.3.0 assertthat_0.2.1
[28] fastmap_1.1.0 cli_3.1.0 later_1.3.0
[31] tweenr_1.0.2 htmltools_0.5.2 tools_4.1.2
[34] gtable_0.3.0 glue_1.5.1 Rcpp_1.0.7
[37] cellranger_1.1.0 jquerylib_0.1.4 vctrs_0.3.8
[40] nlme_3.1-153 crosstalk_1.2.0 xfun_0.28.10
[43] rvest_1.0.2 mime_0.12 lifecycle_1.0.1
[46] mosaicCore_0.9.0 MASS_7.3-54 scales_1.1.1
[49] hms_1.1.1 promises_1.2.0.1 RColorBrewer_1.1-2
[52] yaml_2.2.1 gridExtra_2.3 sass_0.4.0
[55] labelled_2.9.0 rpart_4.1-15 reshape_0.8.8
[58] latticeExtra_0.6-29 stringi_1.7.6 highr_0.9
[61] checkmate_2.0.0 rlang_0.4.11 pkgconfig_2.0.3
[64] evaluate_0.14 labeling_0.4.2 htmlwidgets_1.5.4
[67] tidyselect_1.1.1 plyr_1.8.6 bookdown_0.24
[70] R6_2.5.1 generics_0.1.1 Hmisc_4.6-0
[73] DBI_1.1.1 mgcv_1.8-38 pillar_1.6.4
[76] foreign_0.8-81 withr_2.4.3 survival_3.2-13
[79] abind_1.4-5 nnet_7.3-16 modelr_0.1.8
[82] crayon_1.4.2 utf8_1.2.2 tzdb_0.2.0
[85] rmarkdown_2.11.3 jpeg_0.1-9 grid_4.1.2
[88] readxl_1.3.1 data.table_1.14.2 rmdformats_1.0.3
[91] reprex_2.0.1 digest_0.6.29 xtable_1.8-4
[94] httpuv_1.6.3 munsell_0.5.0 viridisLite_0.4.0
[97] bslib_0.3.1