Exploring Sleep Time, Snoring, and Sleepiness Between the Sexes Among US Adults

1 Setup and Data Ingest

1.1 Package Loading

I will load in my necessary packages.

library(knitr)
library(rmdformats)
library(janitor)
library(magrittr)
library(naniar)
library(broom)
library(patchwork)
library(readxl)
library(Epi)
library(Hmisc)
library(GGally)
library(ggridges)
library(mosaic)
library(nhanesA)
library(haven)
library(tidyverse) 

source("data/Love-boost.R.rmd")

opts_chunk$set(comment=NA)
opts_knit$set(width=75)

theme_set(theme_bw())
options(dplyr.summarise.inform = FALSE)

1.2 Data Ingest

I will ingest the 2 nhanesA data sets I will be using into a raw data set tibble, remove labels from the variables, and save each data set as a new file on my computer for subsequent sessions.

sleep_raw <- nhanes('SLQ_J') %>% zap_label() %>% tibble()
Processing SAS dataset SLQ_J     ..
# saveRDS(sleep_raw, "data/SLQ_J.Rds")

# sleep_raw <- readRDS("data/SLQ_J.Rds")
demo_raw <- nhanes('DEMO_J') %>% zap_label() %>% tibble()
Processing SAS dataset DEMO_J    ..
# saveRDS(demo_raw, "data/DEMO_J.Rds")

# demo_raw <- readRDS("data/DEMO_J.Rds")

2 Cleaning the Data

2.1 Variable Selection

I will select the variables of interest from each raw data set. I need SEQN from each in order to combine the data sets next. I need SLD012, SLQ030, and SLQ120 from the sleep_raw data and RIAGENDR, RIDAGEYR, and RIDSTATR from the demo_raw data set.

sleep_raw <- sleep_raw %>%
  select(SEQN, SLD012, SLQ030, SLQ120)

demo_raw <- demo_raw %>%
  select(SEQN, RIAGENDR, RIDAGEYR, RIDSTATR) 

2.2 Merging the Data

I will merge the raw data sets by joining sleep_data to demo_data. I will join them by the unique subject identifier SEQN to create 1 row per subject.

raw_data <- left_join(demo_raw, sleep_raw, by = "SEQN")

dim(raw_data)
[1] 9254    7

I have 9254 rows (subjects) and 8 columns (variables) as expected.

I will check that the number of rows equals the number of unique subjects to ensure I avoided any errors in merging.

identical(raw_data %$% n_distinct(SEQN), 
          raw_data %>% nrow())
[1] TRUE

I will remove data sets I will no longer be using to keep my work space clear.

rm(demo_raw, sleep_raw)

2.3 Names and Types

I will begin to clean my data by make each variable the type I want. My categorical variables (RIDSTATR, RIAGENDR, SLQ030, SLQ120) will be factors. SEQN will be a character variable, as its numbers have no numerical quality. SLD012, and RIDAGEYR are my quantitative variables, so they will be of ‘dbl’ type. Lastly, I will remove any levels of a factor variable that have 0 subjects.

dim(raw_data) 
[1] 9254    7
clean_data <- raw_data %>%
  mutate(SEQN = as.character(SEQN),
         SLD012 = as.double(SLD012),
         RIDAGEYR = as.double(RIDAGEYR),
         RIDSTATR = factor(RIDSTATR),
         RIAGENDR = factor(RIAGENDR),
         SLQ030 = factor(SLQ030),
         SLQ120 = factor(SLQ120)) %>%
  droplevels()

dim(clean_data)
[1] 9254    7

I did not lose any columns or rows as I did this.

glimpse(clean_data)
Rows: 9,254
Columns: 7
$ SEQN     <chr> "93703", "93704", "93705", "93706", "93707", "93708", "93709"~
$ RIAGENDR <fct> 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1~
$ RIDAGEYR <dbl> 2, 2, 66, 18, 13, 66, 75, 0, 56, 18, 67, 54, 71, 61, 22, 45, ~
$ RIDSTATR <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
$ SLD012   <dbl> NA, NA, 8.0, 10.5, NA, 8.0, 7.0, NA, 7.0, 7.5, 5.5, 7.0, 5.0,~
$ SLQ030   <fct> NA, NA, 2, 1, NA, 9, 1, NA, 2, 1, 0, 3, 3, 3, 2, 0, NA, NA, 0~
$ SLQ120   <fct> NA, NA, 0, 1, NA, 2, 1, NA, 3, 2, 2, 3, 4, 3, 1, 2, NA, NA, 0~

My variable types are as expected and described above.

2.4 Variables

2.4.1 Total Sleep Time

I will start with a numerical summary.

df_stats(~ SLD012, data = clean_data)

My range of values is 2-14 hours of sleep per night on weekdays and weekends. These are plausible values. The mean and median being close to each and around 7.5-8 hours is reassuring. There are 3141 subjects with missing data. I will rename this to a more descriptive name and filter to complete cases.

clean_data <- clean_data %>%
  rename(tst = SLD012) %>%
  filter(complete.cases(tst))

dim(clean_data) # should be 9254 - 3141 = 6113
[1] 6113    7

I am left with 6113 rows (subjects) and 7 columns (variables) as expected.

2.4.2 Age

I will start with a numerical summary.

df_stats(~ RIDAGEYR, data = clean_data)

There are no missing data. The age range of my subjects is 16-80 years old. I wish to study adults age 20-79 years in my analyses, so I will create these inclusion criteria now, as well as rename this variable to a more descriptive name. My upper limit is 79 years because subjects demarcated as 80 years old are, in actuality, 80 years old or greater. This ceiling effect will alter the accuracy of my model’s predictions, so I will remove them.

clean_data <- clean_data %>%
  rename(age = RIDAGEYR) %>%
  mutate(age = replace(age, age < 20 | age > 79, NA))

df_stats(~ age, data = clean_data)

The age range is now 20-79 years. I have 1011 subjects with missing data. I will filter to complete cases.

clean_data <- clean_data %>%
  filter(complete.cases(age))

dim(clean_data) # should be 6113 - 1011 = 5102
[1] 5102    7

I am left with 5102 rows (subjects) and 7 columns (variables) as expected.

2.4.3 Sex

I will start with a numerical summary.

clean_data %>% tabyl(RIAGENDR) %>% kable(digits = 3)
RIAGENDR n percent
1 2472 0.485
2 2630 0.515

My data set comprises 2472 males (48.5%) and 2630 females (51.5%). I will rename this variable to a more descriptive name, recode it so that I can see ‘Male’ and ‘Female’ instead of 1 and 2, and relevel it so it comes up appropriately in my tables later on.

# gender recode

clean_data <- clean_data %>%
  rename(sex = RIAGENDR) %>%
  mutate(sex = fct_recode(sex,
                          "Male" = "1",
                          "Female" = "2")) %>%
  mutate(sex = fct_relevel(sex, "Male", "Female"))

clean_data %>% tabyl(sex) %>% kable(digits = 3)
sex n percent
Male 2472 0.485
Female 2630 0.515
dim(clean_data) # should be 5102
[1] 5102    7

I am left with 5102 rows (subjects) and 7 columns (variables) as expected.

2.4.4 Snore

I will start with a numerical summary.

clean_data %>% tabyl(SLQ030) %>% kable(digits = 3)
SLQ030 n percent
0 1211 0.237
1 1167 0.229
2 936 0.183
3 1426 0.279
7 5 0.001
9 357 0.070

My data set comprises 1211 subjects (23.7%) who never snore, 1167 subjects (22.9%) who rarely snore, 936 subjects (18.3%) who occasionally snore, 1426 subjects (27.9%) who frequently snore, 5 subjects who refused to answer, and 357 subjects who did not know. I do not have any missing data. I will rename this variable to a more descriptive name, tell R to consider “Refused” and “Don’t know” as missing values, recode it to a binary variable so those who never or rarely snore will be in ‘Does not snore’ and those who occasionally or frequently snore will be in ‘Snores’, and relevel it so it comes up properly in tables later.

clean_data <- clean_data %>%
  rename(snore = SLQ030) %>%
  mutate(snore = na_if(snore, 7),
         snore = na_if(snore, 9)) %>%
  mutate(snore = fct_recode(snore,
                          "Does not snore" = "0",
                          "Does not snore" = "1",
                          "Snores" = "2",
                          "Snores" = "3")) %>%
  mutate(snore = fct_relevel(snore, 
                          "Snores", "Does not snore"))

clean_data %>% tabyl(snore) %>% kable(digits = 3)
snore n percent valid_percent
Snores 2362 0.463 0.498
Does not snore 2378 0.466 0.502
7 0 0.000 0.000
9 0 0.000 0.000
NA 362 0.071 NA

This looks good. I have created a binary variable and moved all the ‘Refused’ or ‘Don’t know’ responses to missing values. I will drop those missing values and those rows that do not contain any subjects.

clean_data <- clean_data %>%
  filter(complete.cases(snore)) %>%
  droplevels()

clean_data %>% tabyl(snore) %>% kable(digits = 3)
snore n percent
Snores 2362 0.498
Does not snore 2378 0.502
dim(clean_data) # should be 5102 - 362 = 4740
[1] 4740    7

I am left with 4740 rows (subjects) and 7 columns (variables) as expected.

2.4.5 Sleepy

I will start with a numerical summary.

clean_data %>% tabyl(SLQ120) %>% kable(digits = 3)
SLQ120 n percent
0 794 0.168
1 1147 0.242
2 1599 0.337
3 806 0.170
4 390 0.082
9 4 0.001

My data set comprises 750 subjects (16.6%) who are never sleepy, 1084 subjects (24.1%) who are rarely sleepy, 1525 subjects (33.9%) who are sometimes sleepy, 769 subjects (17.1%) who are often sleepy, 373 (8.3%) who are always sleepy, and 4 subjects who did not know their frequency of sleepiness. I do not have any missing data. I will rename this variable to a more descriptive name, tell R to consider “Don’t know” as a missing value, and rename the levels more descriptively.

# recode to multi-categorical variable

clean_data <- clean_data %>%
  rename(sleepy = SLQ120) %>%
  mutate(sleepy = na_if(sleepy, 7),
         sleepy = na_if(sleepy, 9)) %>%
  mutate(sleepy = fct_recode(sleepy,
                          "never" = "0",
                          "rarely" = "1",
                          "sometimes" = "2",
                          "often" = "3",
                          "always" = "4"))

clean_data %>% tabyl(sleepy) %>% kable(digits = 3)
sleepy n percent valid_percent
never 794 0.168 0.168
rarely 1147 0.242 0.242
sometimes 1599 0.337 0.338
often 806 0.170 0.170
always 390 0.082 0.082
9 0 0.000 0.000
NA 4 0.001 NA

I have moved all the ‘Don’t know’ responses to missing values and these level names look good. I will drop those missing values and the row that does not contain any subjects.

clean_data <- clean_data %>%
  filter(complete.cases(sleepy)) %>%
  droplevels()

dim(clean_data) # should be 4740 - 4 = 4736
[1] 4736    7

I am left with 4736 rows (subjects) and 7 columns (variables) as expected.

2.4.6 RIDSTATR: Indicator Variable

RIDSTATR is a variable which designates whether a subject partook in both the questionnaire and the examination, for which they are given a value of 2. I will start by seeing if all the subjects in my data set have a value of 2.

clean_data %>% tabyl(RIDSTATR) %>% kable(digits = 3)
RIDSTATR n percent
1 235 0.05
2 4501 0.95

I have 235 with a value of 1 that I will need to remove from my data set. I will tell R to count these values as missing. Then I will filter to complete cases and drop the empty row from my data set.

clean_data <- clean_data %>%
  mutate(RIDSTATR = na_if(RIDSTATR, 1)) %>%
  filter(complete.cases(.)) %>%
  droplevels()

clean_data %>% tabyl(RIDSTATR) %>% kable(digits = 3)
RIDSTATR n percent
2 4501 1
dim(clean_data) # should be 4736 - 235 = 4501
[1] 4501    7

I am left with 4501 rows (subjects) and 7 columns (variables) as expected.

2.5 Missingness Checks

I will check that I was successful in removing the missing data as I cleaned each variable.

miss_var_summary(clean_data)

I was.

2.6 Creating My Tibble

I will create a tibble called study1 from the clean data, selecting only those variables I will use in my analyses. I will rename the unique subject identifier to a more descriptive name. Lastly, I will save my tibble in my data folder.

study1 <- clean_data %>%
  select(SEQN, tst, age, sex, snore, sleepy) %>%
  rename(sub_id = SEQN)

rm(raw_data, clean_data)

saveRDS(study1, "data/study1.Rds")

3 Codebook and Data Description

These data are from the 2017-2018 National Health and Nutrition Examination Survey (NHANES) which tracks the health of the population of the United States by collecting questionnaire and examination data. NHANES is run by the Center for Disease Control and Prevention (CDC). Their target population is the non-institutionalized civilian resident population of the United States. They create their sample population through random selection using statistical processes on US Census information. For the 2017-2018 sampling, 16,211 people were selected. Of these, 9,254 completed the interview and 8,704 completed the exams. The following analyses include subjects between 20-79 years old who have complete data on the variables of interest.

3.1 Codebook

Variable Type Description Raw Name
sub_id Chr A unique subject identifier within each NHANES data set SEQN
tst Quant Outcome This is in the Questionnaire category and describes estimated total sleep time on weekdays or workdays, in hours, ranging from 2 to 14 hours. SLD012
age Quant This is in the Demographics category and describes age in years. Inclusion criteria ages 20 to 79 years. RIDAGEYR
sex Binary This is in the Demographics category and describes sex as male or female. These data comprise 48.2% males and 51.8% females. RIAGENDR
snore Binary This is in the Questionnaire category and describes subjective snoring frequency in the NHANES data. In converting this to binary for these analyses, the meaning has become presence of snoring, thus the categories are ‘Snores’ and ‘Does not snore’. ‘Snores’ includes the questionnaire answers “Occasionally - 3-4 nights a week” or “Frequently - 5 or more nights a week”. ‘Does not snore’ includes the questionnaire answers “Never” or “Rarely - 1-2 nights a week”. In these data, 50.6% admit to snoring, and 49.4% report never or rarely snoring. SLQ030
sleepy 5-Cat This is in the Questionnaire category and describes subjective frequency of feeling overly sleepy. The categories are ‘never’, ‘rarely’, ‘sometimes’, ‘often’, and ‘always’. ‘Never’ corresponds to a questionnaire response of ‘never’, ‘rarely’ to ‘rarely - 1 time a month, ’sometimes’ to ‘sometimes - 2-4 times a month’, ‘often’ to ‘often - 5-15 times a month’, and ‘always’ to ‘always - 16-30 times a month’. In these data, 16.5% are never sleepy, 24.2% are rarely sleepy, 34.2% are sometimes sleepy, 17.1% are often sleepy, and 8.1% are always sleepy. SLQ120

3.2 Analytic Tibble

Here is my tibble.

study1

As anticipated, I have 4501 rows (subjects) and 6 variables in my final data set.

I will demonstrate that study1 is, in fact, a tibble.

is_tibble(study1)
[1] TRUE

3.3 Data Summary

I would like to see a summary of my data now that it has been cleaned up.

study1 %>% 
  select(-sub_id) %>% 
  Hmisc::describe(.)
. 

 5  Variables      4501  Observations
--------------------------------------------------------------------------------
tst 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    4501        0       23    0.985    7.553    1.779      5.0      5.5 
     .25      .50      .75      .90      .95 
     6.5      7.5      8.5      9.5     10.0 

lowest :  2.0  3.0  3.5  4.0  4.5, highest: 11.5 12.0 12.5 13.0 14.0
--------------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    4501        0       60        1     48.7    18.81       23       26 
     .25      .50      .75      .90      .95 
      34       50       62       70       74 

lowest : 20 21 22 23 24, highest: 75 76 77 78 79
--------------------------------------------------------------------------------
sex 
       n  missing distinct 
    4501        0        2 
                        
Value        Male Female
Frequency    2181   2320
Proportion  0.485  0.515
--------------------------------------------------------------------------------
snore 
       n  missing distinct 
    4501        0        2 
                                        
Value              Snores Does not snore
Frequency            2265           2236
Proportion          0.503          0.497
--------------------------------------------------------------------------------
sleepy 
       n  missing distinct 
    4501        0        5 

lowest : never     rarely    sometimes often     always   
highest: never     rarely    sometimes often     always   
                                                            
Value          never    rarely sometimes     often    always
Frequency        750      1084      1525       769       373
Proportion     0.167     0.241     0.339     0.171     0.083
--------------------------------------------------------------------------------

4 Analysis B

In this analysis, I will compare two means of independent samples. My samples will be 2017-2018 NHANES respondents, who were randomly selected and meet the aforementioned inclusion criteria, split into 2 groups: a snoring group and a non-snoring group.

4.1 The Question

How well can the presence of snoring predict sleep time on workdays/weekdays, in a randomly selected sample of the adult population in the US?

4.2 Describing the Data

I will compare the mean estimated total sleep time on workdays/weekdays in subjects who snore to subjects who do not snore. First, here is a graphical summary of tst within each group.

ggplot(study1, aes(x = snore, y = tst)) +
    geom_violin(aes(fill = snore)) +
    geom_boxplot(width = 0.3, outlier.size = 2) +
    stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 2, fill = "magenta") +
    coord_flip() +
    guides(fill = "none") +
    labs(title = "Sleep Time Follows Symmetric Distribution, with Outliers",
         subtitle = "stratified by snoring presence, with mean in magenta",
         caption = "4501 subjects in `study1`",
         x = "",
         y = "Total Sleep Time (hrs)")

The distribution of tst in both the ‘Snores’ group and the ‘Does not snore’ group are fairly symmetric, with means close to medians, the ‘Snores’ group more so. Both groups have just slightly more outliers on the higher scale than lower. The shape of the distributions shows that subjects had a preference to choose a whole number as their estimated hours of sleep, as opposed to by the half hour.

Here is a numerical summary.

mosaic::favstats(tst ~ snore, data = study1) %>% kable(digits = 2)
snore min Q1 median Q3 max mean sd n missing
Snores 2 6.5 7.5 8.5 14 7.50 1.65 2265 0
Does not snore 2 7.0 7.5 8.5 14 7.61 1.61 2236 0

I have no missing data because I took care of that in my data cleaning. The range of values in both groups is 2-14 hours. The mean sleep time in the ‘Snores’ group is 7.5 hours and in the ‘Does not snore’ group is 7.61 hours, so the mean difference is 0.11 hours. I will need to determine if I can be confident this difference is more than coincidence at a 10% significance level. It is not reasonable to assume normality here due to the outliers, although the distributions are symmetric and have the same range.

4.3 Main Analysis

I will use confidence level of 90% and a two-sided comparison. Because I can’t assume normality here, but my sample was randomly selected, I will use a bootstrap approach to determine a confidence interval around the mean difference.

set.seed(1228)

study1 %$% bootdif(tst, snore, conf.level = 0.90)
Mean Difference            0.05            0.95 
     0.10977423      0.03010331      0.19284301 

My model implies that the population mean sleep time in hours on workdays/weekdays in those who do not snore is 0.11 hours higher than the mean sleep time in those who snore based on the subjects in my sample. My 90% confidence interval of (0.03,0.19) provides evidence that the difference between the groups is statistically detectable.

4.4 Conclusions

Based on my sample of randomly selected US adults aged 20-79 years with complete data on tst, age, sex, snore, and sleepy, there is a statistically detectable difference in the population mean number of sleep hours on workdays/weekdays between people who snore and people who do not snore. I have come to this conclusion by using a bootstrap to compare the mean sleep hours of people who snore with the mean sleep hours of people who do not snore. I chose a bootstrap due to the lack of normality of my data (due to outliers) and the fact they were randomly selected. A limitation of this analysis is that it is difficult for people to know if they snore or not. People are often inclined to underestimate the frequency at which they snore, or even deny it despite their bed partner reporting otherwise. I need to take this subjectivity into account when discussing my results. Estimated total sleep time is also subjective. Taking 2 subjective measures and comparing them may affect the replicability and accuracy of the results. Some appropriate next steps would be to use sleep study data to gain objective evidence of snoring, as the standard sleep study montage includes a snore monitor, whether microphone or vibratory sensor.

To answer my question, the presence of snoring is not fantastic at predicting total sleep time on workdays/weekdays. Although there is a statistically detectable difference, it is very minor.

5 Analysis C

In this analysis, I will compare means of 5 groups. My samples will be 2017-2018 NHANES respondents, who were randomly selected and meet the aforementioned inclusion criteria, split into 5 groups based on how frequency they feel overly sleepy.

5.1 The Question

How well can a subject’s estimated frequency of feeling overly sleepy predict total sleep time on workdays/weekdays, in a randomly selected sample of the adult population in the US?

5.2 Describing the Data

I will compare the mean estimated total sleep time on workdays/weekdays in subjects who are never, rarely, sometimes, often, or always sleepy. First, here is a graphical summary of tst within each group. I will use a ridgeline density plot to account for differences in size of each group.

ggplot(study1, aes(x = tst, y = sleepy, fill = sleepy)) +
  geom_density_ridges(scale = 0.9, alpha = 0.8) +
  scale_fill_viridis_d() +
  guides(fill = "none") + 
  theme_ridges() +
  labs(title = "Fairly Symmetric Distribution of Total Sleep Time",
       subtitle = "stratified by sleepiness frequency",
       caption = "4501 subjects in `study1`",
       x = "Total Sleep Time (hrs)",
       y = "Sleepiness Frequency")
Picking joint bandwidth of 0.319

The spread of data is widest in the always sleepy group and most clustered toward the center the never sleepy group. The centers of the distributions look very similar, but the always sleepy group looks a bit left skewed. Without being able to see outliers, these data look normally distributed. I will look at box plots to check for outliers.

ggplot(study1, aes(x = sleepy, y = tst, fill = sleepy)) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  coord_flip() +
  guides(fill = "none") +
  labs(title = "Outlier Assessment of Total Sleep Time",
       subtitle = "stratified by sleepiness frequency",
       caption = "4501 subjects in `study1`",
       y = "Comfort with 431 Materials (0-100)",
       x = "Sleepiness Frequency")

There are outliers in each of my groups, mostly in the never sleepy group. This is evidence against normality in the groups.

I will also look at a numerical summary.

mosaic::favstats(tst ~ sleepy, data = study1) %>% kable(digits = 3)
sleepy min Q1 median Q3 max mean sd n missing
never 2 7.0 8.0 8.5 14 7.677 1.624 750 0
rarely 2 7.0 7.5 8.5 13 7.565 1.496 1084 0
sometimes 2 6.5 7.5 8.5 14 7.600 1.547 1525 0
often 2 6.5 7.5 8.5 14 7.467 1.755 769 0
always 2 6.0 7.5 8.0 14 7.253 2.027 373 0

There are 750 never sleepy, 1084 rarely sleepy, 1525 sometimes sleepy, 769 often sleepy, and 373 always sleepy subjects in my sample. The range for all but the rarely sleepy group are 2-14 hours. The rarely sleepy group ranges from 2-13 hours. The mean is the lowest for the always sleepy group and the highest for the never sleepy group. The medians of all but the never sleepy group are 7.5 hours, but the never sleepy group is 8 hours.

5.3 Main Analysis

I will run a Kruskal-Wallis test to assess the mean differences between my groups because each group does not follow a normal distribution, and the sample sizes vary across the groups, so I cannot meet the assumptions of an ANOVA.

kruskal.test(data = study1, tst ~ sleepy)

    Kruskal-Wallis rank sum test

data:  tst by sleepy
Kruskal-Wallis chi-squared = 19.861, df = 4, p-value = 0.000532

There is a statistically detectable difference at a 10% significance level between the population mean sleep times for the 5 sleepiness groups. Regarding assumptions: The samples are independent and were randomly selected from the population of interest.

My next step is to determine between which groups the difference(s) lie. I will start by running a Tukey’s pairwise comparison.

TukeyHSD(aov(data = study1, tst ~ sleepy), conf.level = 0.90)
  Tukey multiple comparisons of means
    90% family-wise confidence level

Fit: aov(formula = tst ~ sleepy, data = study1)

$sleepy
                        diff        lwr          upr     p adj
rarely-never     -0.11183518 -0.3022337  0.078563353 0.5983814
sometimes-never  -0.07766120 -0.2564475  0.101125143 0.8226388
often-never      -0.21049328 -0.4162217 -0.004764864 0.0869610
always-never     -0.42398213 -0.6779706 -0.169993640 0.0003927
sometimes-rarely  0.03417398 -0.1250823  0.193430296 0.9845203
often-rarely     -0.09865810 -0.2876613  0.090345072 0.7009862
always-rarely    -0.31214695 -0.5527881 -0.071505785 0.0124146
often-sometimes  -0.13283208 -0.3101317  0.044467545 0.3487371
always-sometimes -0.34632092 -0.5778833 -0.114758565 0.0021997
always-often     -0.21348885 -0.4664330  0.039455320 0.2304195

By looking at the confidence intervals, I can see that the statistically detectable differences lie between the often and never sleep groups, always and never sleepy groups, always and rarely sleep groups, and always and sometimes sleep groups. All of these differences are negative. This will be shown better in a visualization.

mar.default <- c(5,0,4,2) + 0.1

par(mar = mar.default + c(0, 8, 0, 0)) 

plot(TukeyHSD(aov(tst ~ sleepy, data = study1), 
              conf.level = 0.90), las = 2)

This visualization shows the differences between groups mentioned above. It shows that the often sleepy group gets significantly less sleep than the never sleepy group, and the always sleep group gets significantly less sleep than the never sleepy, the rarely sleepy, and the sometimes sleep groups.

5.4 Conclusions

Based on my sample of randomly selected US adults aged 20-79 years with complete data on tst, age, sex, snore, and sleepy, there is a statistically detectable difference in the population mean number of sleep hours on workdays/weekdays between people of these 5 different sleepiness categories: never, rarely, sometimes, often, always. I have come to this conclusion by using a Kruksal-Wallis test and visualizing the pairwise comparisons. I chose a Kruskal-Wallis test due to the lack of normality of my data (due to outliers) and the differences in sample sizes between groups. A limitation of this analysis is, similar to Analysis B, both the outcome and the predictor are subjective. People tend to overestimate the frequency of which they are overly sleepy and underestimate the hours the slept. An appropriate next step would be to use Multiple Sleep Latency Test (an objective test of sleepiness) results to determine sleepiness level and actigraphy (activity monitoring) to determine objective total sleep time, and use these in a similar model to this one.

To answer my question, a subject’s frequency of feeling overly sleepy is not a great predictor of their total sleep time on workdays/weekdays. If they are in the ‘always sleepy’ group, there predictive accuracy is best, but the difference remains minimal.

6 Analysis D

In this analysis, I will create and analyze a 2x2 table to evaluate the association of sex and presence of snoring. I will compare the presence of snoring (outcome) across both sexes (exposure).

6.1 The Question

Do more men snore than women, in a randomly selected sample of the adult population in the US?

6.2 Describing the Data

I will start by creating a 2x2 table.

study1 %>% tabyl(sex, snore) %>% 
  adorn_totals() %>%
  adorn_percentages(denom = "row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns(position = "front") %>%
  kable()
sex Snores Does not snore
Male 1216 (55.8%) 965 (44.2%)
Female 1049 (45.2%) 1271 (54.8%)
Total 2265 (50.3%) 2236 (49.7%)

It looks like the highest frequency cells are males who snore and females who don’t snore, hinting that perhaps the answer to my question is “yes”.

6.3 Main Analysis

I will use twobytwo with a Bayesian augmentation to create a more descriptive table that provides me statistics on the relationships.

twobytwo(1216+2, 965+2, 1049+2, 1271+2, 
         "Male", "Female", "Snores", "Does not snore",
         conf.level = 0.90)
2 by 2 table analysis: 
------------------------------------------------------ 
Outcome   : Snores 
Comparing : Male vs. Female 

       Snores Does not snore    P(Snores) 90% conf. interval
Male     1218            967       0.5574    0.5399   0.5748
Female   1051           1273       0.4522    0.4353   0.4693

                                   90% conf. interval
             Relative Risk: 1.2326    1.1738   1.2944
         Sample Odds Ratio: 1.5256    1.3824   1.6837
Conditional MLE Odds Ratio: 1.5254    1.3799   1.6866
    Probability difference: 0.1052    0.0808   0.1295

             Exact P-value: 0.0000 
        Asymptotic P-value: 0.0000 
------------------------------------------------------

The probability of snoring in males is 0.557 with a 90% confidence interval of (0.540,0.575). The probability of snoring in females is 0.452 with a 90% confidence interval of (0.435,0.469). These do not overlap, providing evidence that there is a difference in snoring between the sexes. The relative risk is 1.233 (1.174,1.294), which is entirely greater than 1, indicating the likelihood of snoring is higher for males than for females. The odds ratio is 1.526 (1.382,1.684), which is entirely greater than 1, indicating that the odds of snoring are higher in males than in females. The risk difference is 0.105 (0.081,0.130), which is entirely greater than 0, indicating that the probability of snoring is higher for males than for females. These difference are large enough to rule out chance with 90% confidence because the p-value is less than 0.10.

6.4 Conclusions

Based on my sample of randomly selected US adults aged 20-79 years with complete data on tst, age, sex, snore, and sleepy, there is a statistically detectable difference in the presence of snoring between the sexes. I have come to this conclusion by analyzing a 2x2 table to determine a relative risk, odds ratio, and risk difference, all of which point to males being more likely to snore than females. A limitation of this analysis is, similar to Analysis B & C, snoring is subjective, and people are likely to underestimate their snoring or not know they snore. An appropriate next step would be to add age as predictor and see how this affects snoring between the sexes.

To answer my question: yes, more men snore than women.

7 Analysis E

In this analysis, I will create and analyze a 2x5 table to analyse the relationship between snoring and sleepiness.

7.1 The Question

Are people who snore overly sleepy more frequently,in a randomly selected sample of the adult population in the US?

7.2 Describing the Data

I will start off by putting the data into a table.

study1 %>%
  tabyl(sleepy, snore) %>%
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns() %>%
  kable()
sleepy Snores Does not snore Total
never 43% (321) 57% (429) 100% (750)
rarely 45% (492) 55% (592) 100% (1084)
sometimes 52% (800) 48% (725) 100% (1525)
often 55% (424) 45% (345) 100% (769)
always 61% (228) 39% (145) 100% (373)
Total 50% (2265) 50% (2236) 100% (4501)

This table shows that out of 750 subjects who are never sleepy, 321 snore; 1084 subjects who are rarely sleepy, 492 snore; 1525 subjects who are sometimes sleepy, 800 snore; 769 subjects who are often sleepy, 424 snore; and 373 subjects who are always sleepy, 228 snore. We have about the same number in the ‘Snores’ and ‘Does not snore’ group, and both groups have the most subjects in the sometimes sleepy group.

My data meets the Cochran conditions.

7.3 Main Analysis

I will create a data set that contains only the sleepy and snore variables and put them in a table. Then I will run a chi-square test on this data set.

sleep_tab <- study1 %$% table(sleepy, snore)

chisq.test(sleep_tab)

    Pearson's Chi-squared test

data:  sleep_tab
X-squared = 54.866, df = 4, p-value = 3.466e-11

My null hypothesis is that the rows and columns are independent. My alternative hypothesis is that there is an association between the rows and columns. The chi-square test statistic is 52 on 4 degrees of freedom, yielding p < 0.0001, providing evidence that the rows and columns are different.

I would like to see how they are different, so I will plot them.

assocplot(sleep_tab, col = c("lightskyblue", "slateblue"), 
          main = "Association of Snoring and Sleepiness", 
          xlab = "Sleepiness frequency", 
          ylab = "Presence of Snoring")

It looks like within the never and rarely sleepy groups, there are more subjects who do not snore. Conversely, within the sometimes, often, and always sleepy groups, there are more subjects who do snore. I am unable to state causation here. The largest residuals are in the never sleepy and always sleepy groups, and the smallest residual is in the sometimes sleepy group. Within the ‘Snores’ group, the observed frequency is less than expected in the never and rarely sleepy groups, and greater than expected for the sometimes, often, and always sleepy groups. The opposite is true for the ‘Does not snore’ group. This is in line with what I suspected prior to the analyses.

7.4 Conclusions

Based on my sample of randomly selected US adults aged 20-79 years with complete data on tst, age, sex, snore, and sleepy, there is a statistically detectable difference in the presence of snoring between groups of differing sleepiness frequency. I have come to this conclusion by analyzing a 2x5 table and running a chi-squared test, which provided a p-value that provides evidence that there is a statistically detectable difference in the groups. The plot shows that within the ‘Snores’ group, there are more subjects who are more frequently sleepy. Conversely, in the ‘Does not snore’ group, there are more subjects who are less frequently sleepy. A limitation of this analysis is, similar to Analysis B, C, & D, snoring and sleepiness are both subjective, and people are likely to underestimate their snoring or not know they snore, and overestimate their sleepiness.

To answer my question: yes, people who snore are overly sleepy more frequently than people who do not snore.

8 Session Information

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1     stringr_1.4.0     purrr_0.3.4       readr_2.1.1      
 [5] tidyr_1.1.4       tibble_3.1.6      tidyverse_1.3.1   haven_2.4.3      
 [9] nhanesA_0.6.5.3   mosaic_1.8.3      mosaicData_0.20.2 ggformula_0.10.1 
[13] ggstance_0.3.5    dplyr_1.0.7       Matrix_1.3-4      ggridges_0.5.3   
[17] GGally_2.1.2      Hmisc_4.6-0       ggplot2_3.3.5     Formula_1.2-4    
[21] survival_3.2-13   lattice_0.20-45   Epi_2.44          readxl_1.3.1     
[25] patchwork_1.1.1   broom_0.7.10      naniar_0.6.1      magrittr_2.0.1   
[29] janitor_2.1.0     rmdformats_1.0.3  knitr_1.36       

loaded via a namespace (and not attached):
 [1] colorspace_2.0-2    ellipsis_0.3.2      visdat_0.5.3       
 [4] leaflet_2.0.4.1     snakecase_0.11.0    htmlTable_2.3.0    
 [7] base64enc_0.1-3     ggdendro_0.1.22     fs_1.5.1           
[10] rstudioapi_0.13     farver_2.1.0        ggrepel_0.9.1      
[13] fansi_0.5.0         lubridate_1.8.0     xml2_1.3.3         
[16] splines_4.1.2       polyclip_1.10-0     jsonlite_1.7.2     
[19] cluster_2.1.2       dbplyr_2.1.1        png_0.1-7          
[22] ggforce_0.3.3       compiler_4.1.2      httr_1.4.2         
[25] backports_1.3.0     assertthat_0.2.1    fastmap_1.1.0      
[28] cli_3.1.0           tweenr_1.0.2        htmltools_0.5.2    
[31] tools_4.1.2         gtable_0.3.0        glue_1.5.1         
[34] Rcpp_1.0.7          cellranger_1.1.0    jquerylib_0.1.4    
[37] vctrs_0.3.8         nlme_3.1-153        crosstalk_1.2.0    
[40] xfun_0.28.10        rvest_1.0.2         lifecycle_1.0.1    
[43] mosaicCore_0.9.0    MASS_7.3-54         zoo_1.8-9          
[46] scales_1.1.1        hms_1.1.1           parallel_4.1.2     
[49] RColorBrewer_1.1-2  yaml_2.2.1          gridExtra_2.3      
[52] sass_0.4.0          labelled_2.9.0      rpart_4.1-15       
[55] reshape_0.8.8       latticeExtra_0.6-29 stringi_1.7.6      
[58] highr_0.9           checkmate_2.0.0     rlang_0.4.11       
[61] pkgconfig_2.0.3     evaluate_0.14       labeling_0.4.2     
[64] htmlwidgets_1.5.4   cmprsk_2.2-10       tidyselect_1.1.1   
[67] plyr_1.8.6          bookdown_0.24       R6_2.5.1           
[70] generics_0.1.1      DBI_1.1.1           pillar_1.6.4       
[73] foreign_0.8-81      withr_2.4.3         mgcv_1.8-38        
[76] nnet_7.3-16         etm_1.1.1           modelr_0.1.8       
[79] crayon_1.4.2        utf8_1.2.2          tzdb_0.2.0         
[82] rmarkdown_2.11.3    jpeg_0.1-9          grid_4.1.2         
[85] data.table_1.14.2   reprex_2.0.1        digest_0.6.29      
[88] numDeriv_2016.8-1.1 munsell_0.5.0       viridisLite_0.4.0  
[91] bslib_0.3.1