Impact of Glycohemoglobin on Total Sleep Time in US Adults

1 Setup and Data Ingest

1.1 Package Loading

I will load in my necessary packages.

library(janitor)
library(magrittr)
library(naniar)
library(ggrepel)
library(equatiomatic)
library(patchwork)
library(car)
library(broom)
library(nhanesA)
library(haven)
library(mosaic)
library(knitr)
library(GGally)
library(tidyverse)

opts_chunk$set(comment=NA)
opts_knit$set(width=75)

theme_set(theme_bw())
options(dplyr.summarise.inform = FALSE)

1.2 Data Ingest

I will ingest each of the 3 nhanesA data sets I will be using into a raw data set tibble, remove labels from the variables, and save each data set as a new file on my computer for subsequent sessions.

# demo_raw <- nhanes('DEMO_J') %>% zap_label() %>% tibble()

# saveRDS(demo_raw, "data/DEMO_J.Rds")

demo_raw <- readRDS("data/DEMO_J.Rds")

# sleep_raw <- nhanes('SLQ_J') %>% zap_label() %>% tibble()

# saveRDS(sleep_raw, "data/SLQ_J.Rds")

sleep_raw <- readRDS("data/SLQ_J.Rds")

# a1c_raw <- nhanes('GHB_J') %>% zap_label() %>% tibble()

# saveRDS(a1c_raw, "data/GHB_J.Rds")

a1c_raw <- readRDS("data/GHB_J.Rds")

2 Cleaning the Data

2.1 Selecting the Variables

I will select the variables of interest from each raw data set. I need SEQN from each in order to combine the data sets next. I need SLD012, SLQ030, and SLQ120 from the sleep_raw data, LBXGH from the a1c_raw data, and RIAGENDR, RIDAGEYR, and RIDSTATR from the demo_raw data set.

demo_data <- demo_raw %>%
  select(SEQN, RIAGENDR, RIDAGEYR, RIDSTATR) 

sleep_data <- sleep_raw %>%
  select(SEQN, SLD012, SLQ030, SLQ120)

a1c_data <- a1c_raw %>%
  select(SEQN, LBXGH)

2.2 Merging the Data

I will merge the 3 raw data sets first by joining sleep_data to demo_data, then by adding in a1c_data. I will join them by the unique subject identifier SEQN to create 1 row per subject.

temp <- left_join(demo_data, sleep_data, by = "SEQN")

raw_data <- left_join(temp, a1c_data, by = "SEQN")

dim(raw_data)

[1] 9254    8

I have 9254 rows (subjects) and 8 columns (variables) as expected.

I will check that the number of rows equals the number of unique subjects to ensure I avoided any errors in merging.

identical(raw_data %$% n_distinct(SEQN), 
          raw_data %>% nrow())

[1] TRUE

They do.

I will remove data sets I will no longer be using to keep my work space clear.

rm(temp, demo_raw, a1c_raw, sleep_raw, demo_data, a1c_data, sleep_data)

2.3 Names and Types

I will begin to clean my data by make each variable the type I want. My categorical variables (RIDSTATR, RIAGENDR, SLQ030, SLQ120) will be factors. SEQN will be a character variable, as its numbers have no numerical quality. LBXGH, SLD012, and RIDAGEYR are my quantitative variables, so they will be numerical or ‘dbl’. Lastly, I will remove any levels of a factor variable that have 0 subjects.

dim(raw_data)

[1] 9254    8

clean_data <- raw_data %>%
  mutate(SEQN = as.character(SEQN),
         LBXGH = as.numeric(LBXGH),
         SLD012 = as.double(SLD012),
         RIDAGEYR = as.double(RIDAGEYR),
         RIDSTATR = factor(RIDSTATR),
         RIAGENDR = factor(RIAGENDR),
         SLQ030 = factor(SLQ030),
         SLQ120 = factor(SLQ120)) %>%
  droplevels()

dim(clean_data)

[1] 9254    8

I did not lose any columns or rows as I did this.

glimpse(clean_data)

Rows: 9,254
Columns: 8
$ SEQN     <chr> "93703", "93704", "93705", "93706", "93707", "93708", "93709"~
$ RIAGENDR <fct> 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1~
$ RIDAGEYR <dbl> 2, 2, 66, 18, 13, 66, 75, 0, 56, 18, 67, 54, 71, 61, 22, 45, ~
$ RIDSTATR <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
$ SLD012   <dbl> NA, NA, 8.0, 10.5, NA, 8.0, 7.0, NA, 7.0, 7.5, 5.5, 7.0, 5.0,~
$ SLQ030   <fct> NA, NA, 2, 1, NA, 9, 1, NA, 2, 1, 0, 3, 3, 3, 2, 0, NA, NA, 0~
$ SLQ120   <fct> NA, NA, 0, 1, NA, 2, 1, NA, 3, 2, 2, 3, 4, 3, 1, 2, NA, NA, 0~
$ LBXGH    <dbl> NA, NA, 6.2, 5.2, 5.6, 6.2, 6.3, NA, 5.7, 5.4, 5.6, 12.7, 6.2~

My variable types are as expected and described above.

2.4 Variables

2.4.1 Total Sleep Time: Outcome

Total sleep time in hours is a quantitative variable and will be my outcome. I will start with a numerical summary.

df_stats(~ SLD012, data = clean_data)

My range of values is 2-14 hours of sleep per night on weekdays and weekends. These are plausible values. The mean and median being close to each and around 7.5-8 hours is reassuring. There are 3141 subjects with missing data. I will rename this to a more descriptive name and filter to complete cases.

clean_data <- clean_data %>%
  rename(tst = SLD012) %>%
  filter(complete.cases(tst))

dim(clean_data) # should be 9254 - 3141 = 6113

[1] 6113    8

I am left with 6113 rows (subjects) and 8 columns (variables) as expected.

2.4.2 Glycohemoglobin: Key Predictor

Glycohemoglobin percentage is a quantitative variable and will be my key predictor. I will start with a numerical summary.

df_stats( ~ LBXGH, data = clean_data)

There are 615 subjects with missing values. The range of values is 3.8-16.2%. Levels below 4% and above 12% are nearly implausible and rarely seen in clinical practice. Furthermore, the clinical difference between 3.8% and 4% as well as between 12% and 16.2% are extremely minimal and would be unlikely to affect the clinical significance of the findings of the following analyses. For example, a glycohemoglobin level of 12% already indicates very poor long term blood glucose control, and an increase to 16.2% does not make a meaningful difference at that point. On the other hand, these outliers will meaningfully affect the quality of my model’s predictions. For these reasons, I will create inclusion criteria of glycohemoglobin level 4-12% by telling R to consider values outside this range as missing. I will rename this to a more descriptive name.

clean_data <- clean_data %>%
  rename(a1c = LBXGH) %>%
  mutate(a1c = replace(a1c, a1c < 4 | a1c > 12, NA))

df_stats(~ a1c, data = clean_data)

The range of glycohemoglobin levels is now, more appropriately, 4.1-12%. There were no values of 4.0% in the raw data set. There are now 637 missing values. I will filter to complete cases.

clean_data <- clean_data %>%
  filter(complete.cases(a1c))

dim(clean_data) # should be 6113 - 637 = 5476

[1] 5476    8

I am left with 5476 rows (subjects) and 8 columns (variables) as expected.

2.4.3 Age: Secondary Predictor

Age in years is a quantitative variable and will be a secondary predictor. I will start with a numerical summary.

df_stats(~ RIDAGEYR, data = clean_data)

There are no missing data. The age range of my subjects is 16-80 years old. I wish to study adults age 20-79 years in my analyses, so I will create these inclusion criteria now, as well as rename this variable to a more descriptive name. My upper limit is 79 years because subjects demarcated as 80 years old are, in actuality, 80 years old or greater. This ceiling effect will alter the accuracy of my model’s predictions, so I will remove them.

clean_data <- clean_data %>%
  rename(age = RIDAGEYR) %>%
  mutate(age = replace(age, age < 20 | age > 79, NA))

df_stats(~ age, data = clean_data)

The age range is now 20-79 years. I have 879 subjects with missing data. I will filter to complete cases.

clean_data <- clean_data %>%
  filter(complete.cases(age))

dim(clean_data) # should be 5476 - 879 = 4597

[1] 4597    8

I am left with 4597 rows (subjects) and 8 columns (variables) as expected.

2.4.4 Sex: Secondary Predictor

Sex is a binary categorical variable and will be a secondary predictor. I will start with a numerical summary.

clean_data %>% tabyl(RIAGENDR) %>% kable(digits = 3)

RIAGENDR	n	percent
1	2206	0.48
2	2391	0.52

My data set comprises 2206 males (48%) and 2391 females (52%). I will rename this variable to a more descriptive name and recode it so that I can see ‘Male’ and ‘Female’ instead of 1 and 2.

# gender recode

clean_data <- clean_data %>%
  rename(sex = RIAGENDR) %>%
  mutate(sex = fct_recode(sex,
                          "Male" = "1",
                          "Female" = "2"))

clean_data %>% tabyl(sex) %>% kable(digits = 3)

sex	n	percent
Male	2206	0.48
Female	2391	0.52

dim(clean_data) # should be 4597

[1] 4597    8

I am left with 4597 rows (subjects) and 8 columns (variables) as expected.

2.4.5 Snore: Secondary Predictor

Snore is a binary categorical variable and will be a secondary predictor. I will start with a numerical summary.

clean_data %>% tabyl(SLQ030) %>% kable(digits = 3)

SLQ030	n	percent
0	1058	0.230
1	1057	0.230
2	866	0.188
3	1302	0.283
7	5	0.001
9	309	0.067

My data set comprises 1058 subjects (23.0%) who never snore, 1057 subjects (23.0%) who rarely snore, 866 subjects (18.8%) who occasionally snore, 1302 subjects (28.3%) who frequently snore, 5 subjects who refused to answer, and 9 subjects who did not know. I do not have any missing data. I will rename this variable to a more descriptive name, tell R to consider “refused” and “don’t know” as missing values, and recode it to a binary variable so those who never or rarely snore will be in ‘Does not snore’ and those who occasionally or frequently snore will be in ‘Snores’.

clean_data <- clean_data %>%
  rename(snore = SLQ030) %>%
  mutate(snore = na_if(snore, 7),
         snore = na_if(snore, 9)) %>%
  mutate(snore = fct_recode(snore,
                          "Does not snore" = "0",
                          "Does not snore" = "1",
                          "Snores" = "2",
                          "Snores" = "3"))

clean_data %>% tabyl(snore) %>% kable()

snore	n	percent	valid_percent
Does not snore	2115	0.4600827	0.4938127
Snores	2168	0.4716119	0.5061873
7	0	0.0000000	0.0000000
9	0	0.0000000	0.0000000
NA	314	0.0683054	NA

This looks good. I have created a binary variable and moved all the ‘Refused’ or ‘Don’t know’ responses to missing values. I will drop those missing values and those rows that do not contain any subjects.

clean_data <- clean_data %>%
  filter(complete.cases(snore)) %>%
  droplevels()

clean_data %>% tabyl(snore) %>% kable(digits = 3)

snore	n	percent
Does not snore	2115	0.494
Snores	2168	0.506

dim(clean_data) # should be 4597 - 314 = 4283

[1] 4283    8

I am left with 4283 rows (subjects) and 8 columns (variables) as expected.

2.4.6 Sleepy: Secondary Predictor

Sleepy is a 5-level multi-categorical variable and will be a secondary predictor. I will start with a numerical summary.

clean_data %>% tabyl(SLQ120) %>% kable(digits = 3)

SLQ120	n	percent
0	704	0.164
1	1035	0.242
2	1464	0.342
3	730	0.170
4	346	0.081
9	4	0.001

My data set comprises 704 subjects (16.4%) who are never sleepy, 1035 subjects (24.2%) who are rarely sleepy, 1464 subjects (34.2%) who are sometimes sleepy, 730 subjects (17.0%) who are often sleepy, 346 (8.1%) who are always sleepy, and 4 subjects who did not know their frequency of sleepiness. I do not have any missing data. I will rename this variable to a more descriptive name, tell R to consider “Don’t know” as a missing value, and rename the levels more descriptively.

# recode to multi-categorical variable

clean_data <- clean_data %>%
  rename(sleepy = SLQ120) %>%
  mutate(sleepy = na_if(sleepy, 7),
         sleepy = na_if(sleepy, 9)) %>%
  mutate(sleepy = fct_recode(sleepy,
                          "never" = "0",
                          "rarely" = "1",
                          "sometimes" = "2",
                          "often" = "3",
                          "always" = "4"))

clean_data %>% tabyl(sleepy) %>% kable()

sleepy	n	percent	valid_percent
never	704	0.1643708	0.1645244
rarely	1035	0.2416530	0.2418789
sometimes	1464	0.3418165	0.3421360
often	730	0.1704413	0.1706006
always	346	0.0807845	0.0808600
9	0	0.0000000	0.0000000
NA	4	0.0009339	NA

I have moved all the ‘Don’t know’ responses to missing values and these level names look good. I will drop those missing values and the row that does not contain any subjects.

clean_data <- clean_data %>%
  filter(complete.cases(sleepy)) %>%
  droplevels()

dim(clean_data) # should be 4283 - 4 = 4279

[1] 4279    8

I am left with 4279 rows (subjects) and 8 columns (variables) as expected.

2.4.7 RIDSTATR: Indicator Variable

RIDSTATR is a variable which designates whether a subject partook in both the questionnaire and the examination, for which they are given a value of 2. I will start by seeing if all the subjects in my data set have a value of 2.

clean_data %>% tabyl(RIDSTATR) %>% kable()

RIDSTATR	n	percent
2	4279	1

They do. I will drop the level 1 from my data set as it contains no subjects.

clean_data <- clean_data %>%
  droplevels()

clean_data %>% tabyl(RIDSTATR) %>% kable()

RIDSTATR	n	percent
2	4279	1

dim(clean_data) # should be 4279

[1] 4279    8

I am left with 4279 rows (subjects) and 8 columns (variables) as expected.

2.5 Missingness Check

I will check that I was successful in removing the missing data as I cleaned each variable.

miss_var_summary(clean_data)

I was.

2.6 Creating My Tibble

I will create a tibble called study2 from the clean data, selecting only those variables I will use in my analyses. I will rename the unique subject identifier to a more descriptive name. Lastly, I will save my tibble in my data folder.

study2 <- clean_data %>%
  select(SEQN, a1c, tst, age, sex, snore, sleepy) %>%
  rename(sub_id = SEQN)

rm(raw_data, clean_data)

saveRDS(study2, "data/study2.Rds")

3 Codebook and Data Description

These data are from the 2017-2018 National Health and Nutrition Examination Survey (NHANES) which tracks the health of the population of the United States by collecting questionnaire and examination data. NHANES is run by the Center for Disease Control and Prevention (CDC). Their target population is the non-institutionalized civilian resident population of the United States. They create their sample population through random selection using statistical processes on US Census information. For the 2017-2018 sampling, 16,211 people were selected. Of these, 9,254 completed the interview and 8,704 completed the exams. The following analyses include subjects between 20-79 years old who have complete data on the variables of interest. My goal is to build a prediction model for total sleep time in hours with the key predictor being glycohemoglobin percentage, adjusted for age, sex, presence of snoring, and level of sleepiness.

3.1 Codebook

Variable	Type	Description	Raw Name
`sub_id`	Chr	A unique subject identifier within each NHANES data set	`SEQN`
`tst`	Quant	Outcome This is in the Questionnaire category and describes estimated total sleep time on weekdays or workdays, in hours, ranging from 2 to 14 hours.	`SLD012`
`a1c`	Quant	Key Predictor This is in the Laboratory category and describes the percentage of glycosylated hemoglobin in the blood, a measure of long-term blood glucose control. A level of 6.5% or higher is diagnostic for diabetes, with higher levels indicating poorer disease status. Inclusion criteria are values 4.1% to 12.0%.	`LBXGH`
`age`	Quant	This is in the Demographics category and describes age in years. Inclusion criteria ages 20 to 79 years.	`RIDAGEYR`
`sex`	Binary	This is in the Demographics category and describes sex as male or female. These data comprise 48.2% males and 51.8% females.	`RIAGENDR`
`snore`	Binary	This is in the Questionnaire category and describes subjective snoring frequency in the NHANES data. In converting this to binary for these analyses, the meaning has become presence of snoring, thus the categories are ‘Snores’ and ‘Does not snore’. ‘Snores’ includes the questionnaire answers “Occasionally - 3-4 nights a week” or “Frequently - 5 or more nights a week”. ‘Does not snore’ includes the questionnaire answers “Never” or “Rarely - 1-2 nights a week”. In these data, 50.6% admit to snoring, and 49.4% report never or rarely snoring.	`SLQ030`
`sleepy`	5-Cat	This is in the Questionnaire category and describes subjective frequency of feeling overly sleepy. The categories are ‘never’, ‘rarely’, ‘sometimes’, ‘often’, and ‘always’. ‘Never’ corresponds to a questionnaire response of ‘never’, ‘rarely’ to ‘rarely - 1 time a month, ’sometimes’ to ‘sometimes - 2-4 times a month’, ‘often’ to ‘often - 5-15 times a month’, and ‘always’ to ‘always - 16-30 times a month’. In these data, 16.5% are never sleepy, 24.2% are rarely sleepy, 34.2% are sometimes sleepy, 17.1% are often sleepy, and 8.1% are always sleepy.	`SLQ120`

3.2 Analytic Tibble

Here is my tibble.

study2

As anticipated, I have 4279 rows (subjects) and 7 variables in my final data set.

I will demonstrate that study2 is, in fact, a tibble.

is_tibble(study2)

[1] TRUE

3.3 Data Summary

I would like to see a summary of my data now that it has been cleaned up.

study2 %>% 
  select(-sub_id) %>% 
  Hmisc::describe(.)

. 

 6  Variables      4279  Observations
--------------------------------------------------------------------------------
a1c 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    4279        0       78    0.996     5.81   0.9007      4.9      5.0 
     .25      .50      .75      .90      .95 
     5.3      5.6      6.0      6.8      7.8 

lowest :  4.1  4.2  4.3  4.4  4.5, highest: 11.5 11.6 11.7 11.8 12.0
--------------------------------------------------------------------------------
tst 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    4279        0       23    0.985    7.549     1.78      5.0      5.5 
     .25      .50      .75      .90      .95 
     6.5      7.5      8.5      9.5     10.0 

lowest :  2.0  3.0  3.5  4.0  4.5, highest: 11.5 12.0 12.5 13.0 14.0
--------------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    4279        0       60        1    48.87    18.82       23       26 
     .25      .50      .75      .90      .95 
      35       50       62       70       74 

lowest : 20 21 22 23 24, highest: 75 76 77 78 79
--------------------------------------------------------------------------------
sex 
       n  missing distinct 
    4279        0        2 
                        
Value        Male Female
Frequency    2063   2216
Proportion  0.482  0.518
--------------------------------------------------------------------------------
snore 
       n  missing distinct 
    4279        0        2 
                                        
Value      Does not snore         Snores
Frequency            2113           2166
Proportion          0.494          0.506
--------------------------------------------------------------------------------
sleepy 
       n  missing distinct 
    4279        0        5 

lowest : never     rarely    sometimes often     always   
highest: never     rarely    sometimes often     always   
                                                            
Value          never    rarely sometimes     often    always
Frequency        704      1035      1464       730       346
Proportion     0.165     0.242     0.342     0.171     0.081
--------------------------------------------------------------------------------

4 My Research Question

How well can an adult’s (age 20-79 years) long term blood glucose control i.e., glycohemoglobin percentage, predict their total sleep time on workdays/weekdays, after adjusting for age, sex, presence of snoring, and frequency of sleepiness?

5 Partitioning the Data

I will set a seed so that I can replicate my findings later. I will create a training subset comprising 80% of subjects from study2, and a test subset that comprises 20%.

set.seed(1228)

study2_train <- study2 %>% 
  slice_sample(., prop = .80)
study2_test <- anti_join(study2, study2_train, by = "sub_id")

dim(study2)

[1] 4279    7

dim(study2_train)

[1] 3423    7

dim(study2_test)

[1] 856   7

This looks good because 3423 is 80% of 4279, and the difference of these is 856.

6 Transforming the Outcome

6.1 Visualizing the Outcome Distribution

p1 <- ggplot(study2_train, aes(sample = tst)) +
  geom_qq(col = "black", alpha = 0.5, size = 0.5) +
  geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(title = "Normal Q-Q plot",
       x = "Theoretical Quantities",
       y = "TST (hrs)")

p2 <- ggplot(study2_train, aes(x = tst)) +
  geom_histogram(aes(y = stat(density)),
                 bins = 12, col = "slateblue4", 
                 fill = "lightskyblue") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(study2_train$tst), 
                            sd = sd(study2_train$tst)),
                col = 'red', lwd = 0.8) +
  labs(title = "Density Histogram",
       x = "TST (hrs)",
       y = "Density")

p3 <- ggplot(study2_train, aes(x = "", y = tst)) +
  geom_boxplot(fill = "lightskyblue", 
               outlier.color = "red",
               outlier.size = 2) + 
  coord_flip() +
  labs(title = "Box Plot",
       x = "",
       y = "TST (hrs)")

p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
  plot_annotation(title = "Distribution Assessment of Total Sleep Time (`tst`)",
                  subtitle = "abbreviated as TST",
                  caption = "3423 subjects from `study2_train`")

The bell-shape of the histogram density curve points toward a normal distribution, however the box plot shows there are 6 outliers, 2 on the lower scale and 4 on the higher scale. The Q-Q plot further emphasizes these findings. Because the distribution is fairly symmetric, a transformation is unlikely to be helpful.

6.2 Numerical Summaries

This is a numerical summary of my outcome, tst.

favstats(~ tst, data = study2_train)

The mean and median being so close provides evidence of Normality.

This is a numerical summary of my predictors.

study2_train %>%
  select(a1c, age, sex, snore, sleepy) %>%
  Hmisc::describe(.)

. 

 5  Variables      3423  Observations
--------------------------------------------------------------------------------
a1c 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    3423        0       77    0.996    5.816   0.9042      4.9      5.0 
     .25      .50      .75      .90      .95 
     5.3      5.6      6.0      6.8      7.8 

lowest :  4.1  4.2  4.3  4.4  4.5, highest: 11.4 11.5 11.6 11.8 12.0
--------------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    3423        0       60        1    48.82    18.84       23       26 
     .25      .50      .75      .90      .95 
      34       50       62       70       74 

lowest : 20 21 22 23 24, highest: 75 76 77 78 79
--------------------------------------------------------------------------------
sex 
       n  missing distinct 
    3423        0        2 
                        
Value        Male Female
Frequency    1648   1775
Proportion  0.481  0.519
--------------------------------------------------------------------------------
snore 
       n  missing distinct 
    3423        0        2 
                                        
Value      Does not snore         Snores
Frequency            1695           1728
Proportion          0.495          0.505
--------------------------------------------------------------------------------
sleepy 
       n  missing distinct 
    3423        0        5 

lowest : never     rarely    sometimes often     always   
highest: never     rarely    sometimes often     always   
                                                            
Value          never    rarely sometimes     often    always
Frequency        549       849      1152       591       282
Proportion     0.160     0.248     0.337     0.173     0.082
--------------------------------------------------------------------------------

My categorical variables have the proper number of levels and subjects within each levels, and my quantitative variables have the values I expect. There are no missing data.

6.3 Scatterplot Matrix

This is a scatterplot matrix to show the interplay of all my variables of interest.

ggpairs(study2_train %>% select(a1c, age, sex, snore, sleepy, tst), 
        lower = list(combo = wrap("facethist", binwidth = 0.5)), 
        title = "3423 subjects in `study2_train`")

My outcome does not appear like it will fit a linear model well with my key predictor. The Pearson correlation coefficient is -0.001, indicative of a weak relationship. The relationship of my outcome with age appears even more random in the graphical representation, but the Pearson correlation coefficient states otherwise at 0.021 with a1c predicting age, and 0.340 with age predicting a1c. The distribution of total sleep time when stratified by both sex and snore appears to maintain normality, but this matrix does not allow me to see outliers, so I will look further into that. It is too difficult for me to interpret my outcome’s relationship with sleepy so I will look further into that as well.

6.3.1 Relationship Between Sex and TST

ggplot(study2_train, aes(x = sex, y = tst)) +
    geom_violin(aes(fill = sex)) +
    geom_boxplot(width = 0.3, outlier.size = 2) +
    stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 2, fill = "magenta") +
    coord_flip() +
    guides(fill = "none") +
    labs(title = "Sex as a Predictor of Total Sleep Time",
         subtitle = "stratified by male or female sex",
         caption = "3423 subjects in `study2_train`",
         x = "Sex",
         y = "Total Sleep Time (hrs)")

As I suspected from my scatterplot matrix and numerical summaries, tst for both sexes has a bell shape, but the outliers prevent it from having a normal distribution. The outliers have less of an effect on the mean in males than in females. The spread of each is very similar.

6.3.2 Relationship Between Snoring and TST

ggplot(study2_train, aes(x = snore, y = tst)) +
    geom_violin(aes(fill = snore)) +
    geom_boxplot(width = 0.3, outlier.size = 2) +
    stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 2, fill = "magenta") +
    coord_flip() +
    guides(fill = "none") +
    labs(title = "Snoring as a Predictor of Total Sleep Time",
         subtitle = "stratified by snoring presence",
         caption = "3423 subjects in `study2_train`",
         x = "Presence of Snoring",
         y = "Total Sleep Time (hrs)")

Again, as I suspected from my scatterplot matrix and numerical summaries, tst for categories has a bell shape, but the outliers prevent it from having a normal distribution. The means are very close to the medians here. The spread of each is very similar.

6.3.3 Relatiosnip Between Sleepinessand TST

mosaic::favstats(tst ~ sleepy, data = study2_train)

The range of TST values is similar for each category, however subjects in the ‘always’ sleepy category report fewer total sleep hours with a max of 12 hours and mean of 7.23 hours. Subjects in the ‘never’ sleepy category report the most total sleep hours with a mean of 7.73 hours. The sizes of each category is unbalanced, so I will view the distributions with a density plot.

ggplot(study2_train, aes(x = tst, y = sleepy, fill = sleepy)) +
  geom_density_ridges(scale = 0.9, alpha = 0.8) +
  scale_fill_viridis_d() +
  guides(fill = "none") + 
  theme_ridges() +
  labs(title = "Distribution of Total Sleep Time (TST)",
       subtitle = "stratified by sleepiness frequency",
       x = "TST (hrs)",
       y = "")

Picking joint bandwidth of 0.354

Visually, it appears the spread of values decreases from the ‘always’ sleepy to the ‘never’ sleepy group. Otherwise, each category appears to have a normal distribution aside from the likely outliers.

6.4 Collinearity Checking

Next, I will check for any collinearity of my data.

car::vif(lm(tst ~ a1c + age + sex + snore + sleepy, 
               data = study2_train))

           GVIF Df GVIF^(1/(2*Df))
a1c    1.142530  1        1.068892
age    1.164237  1        1.078998
sex    1.016107  1        1.008021
snore  1.064328  1        1.031663
sleepy 1.028627  4        1.003534

There do not appear to be any issues with collinearity here.

6.5 Transformation Assessment

I do not suspect an improvement in the distribution of my outcome with a transformation because the issues are the outliers. I will run a boxCox just to be sure.

model_temp <- lm(tst ~ a1c + age + sex + snore + sleepy, 
               data = study2_train)

boxCox(model_temp)

powerTransform(model_temp)

Estimated transformation parameter 
      Y1 
1.088339

This confirms my thinking that a transformation would not help my outcome.

7 The Big Model

I will fit a “kitchen sink” model that includes all my predictors (a1c, age, sex, snore, and sleepy), including my key predictor, to predict my outcome, tst.

mod_big <- lm(tst ~ a1c + age + sex + snore + sleepy, 
               data = study2_train)

summary(mod_big)


Call:
lm(formula = tst ~ a1c + age + sex + snore + sleepy, data = study2_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9452 -0.9086  0.0536  0.9115  6.6603 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.527319   0.183580  41.003  < 2e-16 ***
a1c             -0.002432   0.030063  -0.081  0.93552    
age              0.002591   0.001848   1.402  0.16094    
sexFemale        0.248842   0.056430   4.410 1.07e-05 ***
snoreSnores     -0.085615   0.057717  -1.483  0.13807    
sleepyrarely    -0.176910   0.089640  -1.974  0.04851 *  
sleepysometimes -0.121373   0.085265  -1.423  0.15469    
sleepyoften     -0.257612   0.097612  -2.639  0.00835 ** 
sleepyalways    -0.496874   0.120762  -4.115 3.97e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.637 on 3414 degrees of freedom
Multiple R-squared:  0.01281,   Adjusted R-squared:  0.01049 
F-statistic: 5.536 on 8 and 3414 DF,  p-value: 5.588e-07

The model accounts for 1.28% of variation in the outcome, as per the R^2. The adjusted R^2 value is 0.0105, which I will use later to compare this model to other models. The residual standard error is 1.637, telling me to expect 95% of residuals be in the range of -3.274 to +3.274, and none to be outside the range of -4.911 to +4.911. The model has an F statistic of 5.536 on 8 and 3414 degrees of freedom that yields a p-value of <0.001, implying that this model can predict the outcome better than if the outcome were predicted by chance.

tidy(mod_big, conf.int = TRUE, conf.level = 0.90) %>%
  select(term, estimate, std.error, conf.low, conf.high, p.value) %>%
  kable(digits = 3)

term	estimate	std.error	conf.low	conf.high	p.value
(Intercept)	7.527	0.184	7.225	7.829	0.000
a1c	-0.002	0.030	-0.052	0.047	0.936
age	0.003	0.002	0.000	0.006	0.161
sexFemale	0.249	0.056	0.156	0.342	0.000
snoreSnores	-0.086	0.058	-0.181	0.009	0.138
sleepyrarely	-0.177	0.090	-0.324	-0.029	0.049
sleepysometimes	-0.121	0.085	-0.262	0.019	0.155
sleepyoften	-0.258	0.098	-0.418	-0.097	0.008
sleepyalways	-0.497	0.121	-0.696	-0.298	0.000

The 90% confidence interval for my key predictor (a1c) includes 0, so it is unlikely that a1c improves the effectiveness of this model’s prediction of tst. The confidence intervals for age and snore also include 0, so their inclusion is also likely not improving the predictability of the model.

The predictors for which the confidence interval do not include 0 are sex and sleepy. The point estimate for sex implies that female subjects have 0.249 hours more sleep time than males, after adjusting for a1c and age, and the 90% confidence interval for this estimate is (0.156,0.342). Compared to those who are never sleepy, those who are rarely, often, and always sleepy have a lower total sleep time. Those who are sometimes sleepy may have 0.121 fewer sleep hours, but I am not 90% confident that this difference exists in the true population.

extract_eq(mod_big, use_coefs = TRUE, coef_digits = 3, 
           wrap = TRUE, terms_per_line = 1)

\[ \begin{aligned} \operatorname{\widehat{tst}} &= 7.527\ - \\ &\quad 0.002(\operatorname{a1c})\ + \\ &\quad 0.003(\operatorname{age})\ + \\ &\quad 0.249(\operatorname{sex}_{\operatorname{Female}})\ - \\ &\quad 0.086(\operatorname{snore}_{\operatorname{Snores}})\ - \\ &\quad 0.177(\operatorname{sleepy}_{\operatorname{rarely}})\ - \\ &\quad 0.121(\operatorname{sleepy}_{\operatorname{sometimes}})\ - \\ &\quad 0.258(\operatorname{sleepy}_{\operatorname{often}})\ - \\ &\quad 0.497(\operatorname{sleepy}_{\operatorname{always}}) \end{aligned} \]

For every 1 percentage point increase in a1c, I expect a decrease in TST of 0.002 hours (90% CI: -0.052, 0.047), if all other predictors remained constant. For every 1 year older age, I expect an increase in TST of 0.003 hours (90% CI: 0.000,0.006), if all other predictors remained constant. Being female results in an increase in TST of 0.249 hours (90% CI: 0.156,0.342), compared to being male, if all other predictors remained constant. The lack of snoring estimates a 0.086 hour (90% CI: -0.009,0.181) increase in TST, compared to the presence of snoring, if all other predictors remained constant. Compared to those who are never sleepy, those who are rarely, sometimes, often, and always sleepy get 0.177 (90% CI: -0.324 -0.029), 0.121 (90% CI: -0.262,0.019), 0.258 (90% CI: -0.418,-0.097), and 0.497 (90% CI: -0.696,-0.298) fewer hours of sleep on weekdays/workdays, respectively, if all other predictors remained constant.

8 The Smaller Model

I will fit a model that uses a1c and sex to predict total sleep time. First I will visualize the relationship I am evaluating.

ggplot(study2_train, aes(x = a1c, y = tst,
                         col = sex, group = sex)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + 
    facet_wrap(~ sex) +
    guides(col = "none") +
    labs(title = "Total Sleep Time as a Function of Glycohemoglobin",
         subtitle = "stratified by sex, with OLS",
         caption = "3423 subjects from `study2_train`",
         x = "Glycohemoglobin (%)",
         y = "Total Sleep Time (hrs)")

It appears that in males, higher glycohemoglobin is associated with higher total sleep time. The opposite is true in females; higher glycohemoglobin is associated with lower total sleep time. Both of these relationships appear very weak.

I will fit the model.

mod_small <- lm(tst ~ a1c + sex, data = study2_train)

summary(mod_small)


Call:
lm(formula = tst ~ a1c + sex, data = study2_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6796 -0.9281  0.0714  0.8295  6.5722 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.41321    0.17009  43.585  < 2e-16 ***
a1c          0.00275    0.02822   0.097    0.922    
sexFemale    0.24329    0.05618   4.331 1.53e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.641 on 3420 degrees of freedom
Multiple R-squared:  0.005455,  Adjusted R-squared:  0.004873 
F-statistic: 9.379 on 2 and 3420 DF,  p-value: 8.665e-05

The model accounts for 0.55% of variation in the outcome, as per the R^2. The adjusted R^2 value is 0.0049, which I will use later to compare this model to other models. The residual standard error is 1.641, telling me to expect 95% of residuals be in the range of -3.282 to +3.282, and none to be outside the range of -4.923 to +4.923. The model has an F statistic of 9.379 on 2 and 3420 degrees of freedom that yields a p-value of <0.001, implying that this model can predict the outcome better than if the outcome were predicted by chance.

tidy(mod_small, conf.int = TRUE, conf.level = 0.90) %>% 
  select(term, estimate, std.error, conf.low, conf.high, p.value) %>% 
  kable(dig = c(0,3,3,3,3,5))

term	estimate	std.error	conf.low	conf.high	p.value
(Intercept)	7.413	0.170	7.133	7.693	0.00000
a1c	0.003	0.028	-0.044	0.049	0.92240
sexFemale	0.243	0.056	0.151	0.336	0.00002

The 90% confidence interval for my key predictor (a1c) includes 0, so it is unlikely that a1c improves the effectiveness of this model’s prediction of tst. The confidence interval sex does not include 0. The point estimate for sex implies that female subjects have 0.243 hours more sleep time than males, after adjusting for a1c, and the 90% confidence interval for this estimate is (0.151,0.336).

extract_eq(mod_small, use_coefs = TRUE, coef_digits = 3,
           terms_per_line = 1, wrap = TRUE)

\[ \begin{aligned} \operatorname{\widehat{tst}} &= 7.413\ + \\ &\quad 0.003(\operatorname{a1c})\ + \\ &\quad 0.243(\operatorname{sex}_{\operatorname{Female}}) \end{aligned} \]

For every 1 percentage point increase in a1c, I expect a increase in TST of 0.003 hours (90% CI: -0.044,0.049), if all other predictors remained constant. Being female results in an increase in TST of 0.243 hours (90% CI: 0.151,0.336), compared to being male, if all other predictors remained constant.

9 In-Sample Comparison

9.1 Quality of Fit

rep_big <- glance(mod_big) %>% mutate(mod_n = "Big_train")

rep_small <- glance(mod_small) %>% mutate(mod_n = "Small_train")

fit_report <- bind_rows(rep_big, rep_small)

fit_report %>%
  select(mod_n, nobs, r.squared, adj.r.squared, sigma, AIC, BIC) %>% 
  kable(digits = c(0,0,4,3,3,0,0))

mod_n	nobs	r.squared	adj.r.squared	sigma	AIC	BIC
Big_train	3423	0.0128	0.010	1.637	13097	13158
Small_train	3423	0.0055	0.005	1.641	13111	13135

According to R^2, adjusted R^2, sigma, and AIC, the Big Model performed better. According to BIC, the smaller model performed better.

9.2 Assessing Assumptions

For my first step in developing residual plots, I will use augment to calculate residuals.

aug_big <- augment(mod_big, data = study2_train)
aug_small <- augment(mod_small, data = study2_train)

The first plots I will look at are the residual values vs. fitted to assess linearity and variance, and normal Q-Q plots to assess normality.

p1_big <- ggplot(aug_big, aes(x = .fitted, y = .resid)) +
    geom_point(alpha = 0.1) + 
    geom_point(data = aug_big %>% 
                   slice_max(abs(.resid), n = 5),
                   col = "red", size = 2) +
    geom_text_repel(data = aug_big %>% 
                        slice_max(abs(.resid), n = 5),
                        aes(label = sub_id), col = "red") +
    geom_abline(intercept = 0, slope = 0, lty = "dashed") +
    geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
    labs(title = "Big Model: Residual vs. Fitted",
         x = "Fitted total sleep time (hrs)", y = "Residual") 

p1_small <- ggplot(aug_small, aes(x = .fitted, y = .resid)) +
    geom_point(alpha = 0.1) + 
    geom_point(data = aug_small %>% 
                   slice_max(abs(.resid), n = 5),
                   col = "red", size = 2) +
    geom_text_repel(data = aug_small %>% 
                        slice_max(abs(.resid), n = 5),
                        aes(label = sub_id), col = "red") +
    geom_abline(intercept = 0, slope = 0, lty = "dashed") +
    geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
    labs(title = "Small Model: Residual vs. Fitted",
         x = "Fitted total sleep time (hrs)", y = "Residual")

p2_big <- ggplot(aug_big, aes(sample = .std.resid)) +
    geom_qq() + 
    geom_qq_line(col = "red") +
    labs(title = "Big Model: Normal Q-Q",
         y = "Standardized Residuals", 
         x = "Standard Normal Quantiles")

p2_small <- ggplot(aug_small, aes(sample = .std.resid)) +
    geom_qq() + 
    geom_qq_line(col = "red") +
    labs(title = "Small Model: Normal Q-Q",
         y = "Standardized Residuals", 
         x = "Standard Normal Quantiles")

(p1_big + p2_big) / (p1_small + p2_small) +
    plot_annotation(title = "Residual Diagnostics: Set 1",
                    subtitle = "3423 subjects from `study2_train`")

I can assume linearity for the big model. There is an area around a fitted value of 7.2 hours that is a bit sparse, but this is unlikely to have a large impact. Neither the line nor points have any distinct pattern. The residual vs. fitted plot for the small model is not of much use to me. It likely looks this way because the lines of best fit on the plots for each sex were nearly horizontal, so the fitted value is very similar within each group. I have higher residuals in the small model. As far as normality, both models show a symmetrical distribution with a wide spread and a few outliers, with standardized residuals up to 4. I cannot assume normality for these models.

The next plots I will look at are scale-location plots for variance and residuals vs. leverage plots for poorly fit points.

p3_big <- ggplot(aug_big, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
  labs(title = "Big Model: Scale-Location",
       x = "Fitted TST", 
       y = "Sqrt|Std. Residual|")

p3_small <- ggplot(aug_small, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
  labs(title = "Small Model: Scale-Location",
       x = "Fitted TST", 
       y = "Sqrt|Std. Residual|")

p4_big <- ggplot(aug_big, aes(x = .hat, y = .std.resid)) +
  geom_point(alpha = 0.2) + 
  geom_point(data = aug_big %>% filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_big %>% filter(.cooksd >= 0.5),
               aes(label = sub_id), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Big Model: Residuals vs. Leverage",
       caption = "Red points indicate Cook's d at least 0.5",
       x = "Leverage", y = "Standardized Residual")

p4_small <- ggplot(aug_small, aes(x = .hat, y = .std.resid)) +
  geom_point(alpha = 0.2) + 
  geom_point(data = aug_small %>% filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_small %>% filter(.cooksd >= 0.5),
               aes(label = sub_id), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, span = 2, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Small Model: Residuals vs. Leverage",
       caption = "Red points indicate Cook's d at least 0.5",
       x = "Leverage", y = "Standardized Residual")

(p3_big + p4_big) / (p3_small + p4_small) +
    plot_annotation(
        title = "Residual Diagnostics: Set 2")

There are no substantial issues with variance nor any highly influential points in either model.

Lastly I will check for collinearity in my models.

car::vif(mod_big)

           GVIF Df GVIF^(1/(2*Df))
a1c    1.142530  1        1.068892
age    1.164237  1        1.078998
sex    1.016107  1        1.008021
snore  1.064328  1        1.031663
sleepy 1.028627  4        1.003534

There are no elevated variance inflation factors, so I am not concerned about collinearity. I will not run this for my small model because it contains no variables that are not in the big model.

9.3 Comparing the Models

The evidence I have so far leads me to conclude that the big model is better. It performs better on quality of fit measures, specially R^2, adjusted R^2, sigma, and AIC. The only reason BIC was better for the small model is that BIC highly takes the number of variables into account. The only regression assumption of concern is normality, and the small model has the same issue.

10 Model Validation

10.1 Calculating Prediction Errors

First I will augment the big model, create a variable mod_n for the model name, and rename the fitted and residual variables more descriptively as tst_fit and tst_res. I will do this using the test data.

test_big <- augment(mod_big, newdata = study2_test) %>% 
  mutate(mod_n = "Big_test") %>%
  rename(tst_fit = .fitted,
         tst_res = .resid) %>%
  select(mod_n, sub_id, tst, tst_fit, tst_res, everything())

test_big %>% head(3)

I have my fitted and residual variables here with the proper names.

I will do the same thing with the small model, also on the test data.

test_small <- augment(mod_small, newdata = study2_test) %>% 
  mutate(mod_n = "Small_test") %>%
  rename(tst_fit = .fitted,
         tst_res = .resid) %>%
  select(mod_n, sub_id, tst, tst_fit, tst_res, everything())

test_small %>% head(3)

I have my fitted and residual variables here with the proper names.

I will combine the big and small model’s residual and fitted values into a tibble so that I have the fitted and residual tst for each subject, for each model.

test_combo <- union(test_big, test_small) %>%
  arrange(sub_id, mod_n)

test_combo %>% head()

10.2 Visualizing the Predictions

ggplot(test_combo, aes(x = tst_fit, y = tst_res)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", col = "blue", se = FALSE, formula = y ~ x) +
  facet_wrap( ~ mod_n) +
  labs(x = "Predicted TST (hrs)",
       y = "Observed TST (hrs)",
       title = "Observed vs. Predicted Total Sleep Time",
       subtitle = "Comparing Big to Small Model in Test Sample",
       caption = "Total sleep time abbreviated as TST")

The differences in each model’s errors are about the same. The loess smooth line’s slope is approximately 0 and there are equal numbers of points above and below the line with approximately the same variance.

10.3 Summarizing the Errors

test_combo %>% 
    group_by(mod_n) %>%
    summarise(n = n(),
            MAPE = mean(abs(tst_res)), 
            RMSPE = sqrt(mean(tst_res^2)),
            max_error = max(abs(tst_res))) %>%
  kable(digits = 3)

mod_n	n	MAPE	RMSPE	max_error
Big_test	856	1.198	1.569	6.305
Small_test	856	1.200	1.570	6.329

The big model is better in regard to predictive error. It has the lower MAPE, RMSPE, and maximum error. These models have an average error of about 1.2 hours in predicting total sleep time. Given the range of tst values is 2-14 hours, this is a large mean error.

I would like to identify the subjects for whom the MAPE was made by each model. The observed and fitted values of tst will also be valuable to look at.

temp_big <- test_big %>%
  filter(abs(tst_res) == max(abs(tst_res)))

temp_small <- test_small %>%
  filter(abs(tst_res) == max(abs(tst_res)))

bind_rows(temp_big, temp_small)

The big model had the largest predictive error on subject 95233, where it predicted 7.7 hours of sleep time but the subject reported 14 hours. This was a 21 year old female who did not snore, reported being overly sleepy sometimes, and had a glycohemoglobin of 5.7%. The small model had the largest predictive error on subject 93813, where it also predicted 7.7 hours of sleep time but the subject reported 14 hours. These are the same values (when rounded to the nearest tenth) as in the large model. This was a 42 year old female who snored, was never overly sleepy, and had a glycohemoglobin of 5.3%.

Lastly, I will validate the R^2 values for both models.

test_big %$% cor(tst, tst_fit)^2

[1] 0.01299449

test_small %$% cor(tst, tst_fit)^2

[1] 0.01416621

These are very similar. The big model accounted for 1.30% of variation in the outcome, which is very slightly better than the calculated R^2 earlier (1.28%), and the small model accounted for 1.42%, which is better than the calculated R^2 earlier (0.5%). Neither of these models are very impressive, but these R^2 values are at least the same or better than we initially thought.

10.4 Comparing the Models

The smaller model had larger overall prediction error statistics, but its validated R^2 is larger. Despite the R^2, I would chose the big model because I would prefer less error overall, and the R^2 value of the small model is only 0.19 percentage points better than the big model’s.

11 Discussion

11.1 Chosen Model

I chose the big model. The big model showed better prediction error statistics, e.g., lower MAPE, RMSPE, and maximum error. The big model also follows the linearity assumption better. It is harder for me to assess linearity in the small model because changes in glycohemoglobin levels do not cause much change in predicted total sleep time. This leads to the model predicting approximately the mean total sleep time for every subject. The big model’s validated R^2 is lower than the smaller model’s, but both are very low to begin with and the difference is minimal.

11.2 Answering My Question

My research question was: How well can an adult (age 20-79 years) subject’s long term blood glucose control i.e., glycohemoglobin percentage, predict their total sleep time on workdays/weekdays, after adjusting for age, sex, presence of snoring, and frequency of sleepiness?

My answer is: not well. Glycohemoglobin percentage, after adjustment for all these other predictors, only accounted for 1.3% of the variation in total sleep time on workdays/weekdays in my big model. My pre-analysis expectation was that people with higher blood glucose would have less total sleep time. Sleep deprivation interferes with insulin, insulin-like growth hormone, leptin (the satiety hormone), and ghrelin (the hunger hormone), which leads to variations in blood glucose, but not necessarily long term blood glucose, which is measured by glycohemoglobin. A limitation is that total sleep time is an estimated and subjective value. People will often over- or underestimate for various reasons that can differ day to day, and even depend on what time of day they are asked.

11.3 Next Steps

An actigraphy monitor is a watch-like device that a patient will often wear for about 2 weeks that monitors their activity levels. It is used to estimate total sleep time, among other uses. This is a much more objective measurement than a reported, estimated total sleep time on a questionnaire. Perhaps a study that used actigraphy to more objectively measure subjects’ total sleep time might see a stronger correlation with glycohemoglobin. I also wonder about how acute changes in total sleep time affect blood glucose levels. A study that measured a difference in sleep time over the past, say, 3 nights, rather than an estimated total sleep time at baseline, might be interesting as well.

11.4 Reflection

I would have chosen more quantitative variables, had I known at the start what I know now. I feel that a different mixture of quantitative variables might have strengthened my smaller model better than categorical ones might have.

12 Session Information

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1      stringr_1.4.0      purrr_0.3.4        readr_2.1.1       
 [5] tidyr_1.1.4        tibble_3.1.6       tidyverse_1.3.1    GGally_2.1.2      
 [9] knitr_1.36         mosaic_1.8.3       ggridges_0.5.3     mosaicData_0.20.2 
[13] ggformula_0.10.1   ggstance_0.3.5     dplyr_1.0.7        Matrix_1.3-4      
[17] lattice_0.20-45    haven_2.4.3        nhanesA_0.6.5.3    broom_0.7.10      
[21] car_3.0-12         carData_3.0-4      patchwork_1.1.1    equatiomatic_0.3.0
[25] ggrepel_0.9.1      ggplot2_3.3.5      naniar_0.6.1       magrittr_2.0.1    
[29] janitor_2.1.0     

loaded via a namespace (and not attached):
 [1] colorspace_2.0-2    ellipsis_0.3.2      visdat_0.5.3       
 [4] leaflet_2.0.4.1     snakecase_0.11.0    htmlTable_2.3.0    
 [7] fs_1.5.1            base64enc_0.1-3     ggdendro_0.1.22    
[10] rstudioapi_0.13     farver_2.1.0        fansi_0.5.0        
[13] lubridate_1.8.0     xml2_1.3.3          splines_4.1.2      
[16] polyclip_1.10-0     Formula_1.2-4       jsonlite_1.7.2     
[19] cluster_2.1.2       dbplyr_2.1.1        png_0.1-7          
[22] ggforce_0.3.3       shiny_1.7.1         compiler_4.1.2     
[25] httr_1.4.2          backports_1.3.0     assertthat_0.2.1   
[28] fastmap_1.1.0       cli_3.1.0           later_1.3.0        
[31] tweenr_1.0.2        htmltools_0.5.2     tools_4.1.2        
[34] gtable_0.3.0        glue_1.5.1          Rcpp_1.0.7         
[37] cellranger_1.1.0    jquerylib_0.1.4     vctrs_0.3.8        
[40] nlme_3.1-153        crosstalk_1.2.0     xfun_0.28.10       
[43] rvest_1.0.2         mime_0.12           lifecycle_1.0.1    
[46] mosaicCore_0.9.0    MASS_7.3-54         scales_1.1.1       
[49] hms_1.1.1           promises_1.2.0.1    RColorBrewer_1.1-2 
[52] yaml_2.2.1          gridExtra_2.3       sass_0.4.0         
[55] labelled_2.9.0      rpart_4.1-15        reshape_0.8.8      
[58] latticeExtra_0.6-29 stringi_1.7.6       highr_0.9          
[61] checkmate_2.0.0     rlang_0.4.11        pkgconfig_2.0.3    
[64] evaluate_0.14       labeling_0.4.2      htmlwidgets_1.5.4  
[67] tidyselect_1.1.1    plyr_1.8.6          bookdown_0.24      
[70] R6_2.5.1            generics_0.1.1      Hmisc_4.6-0        
[73] DBI_1.1.1           mgcv_1.8-38         pillar_1.6.4       
[76] foreign_0.8-81      withr_2.4.3         survival_3.2-13    
[79] abind_1.4-5         nnet_7.3-16         modelr_0.1.8       
[82] crayon_1.4.2        utf8_1.2.2          tzdb_0.2.0         
[85] rmarkdown_2.11.3    jpeg_0.1-9          grid_4.1.2         
[88] readxl_1.3.1        data.table_1.14.2   rmdformats_1.0.3   
[91] reprex_2.0.1        digest_0.6.29       xtable_1.8-4       
[94] httpuv_1.6.3        munsell_0.5.0       viridisLite_0.4.0  
[97] bslib_0.3.1