Exploring Sleep Time, Snoring, and Sleepiness Between the Sexes Among US Adults
1 Setup and Data Ingest
1.1 Package Loading
I will load in my necessary packages.
library(knitr)
library(rmdformats)
library(janitor)
library(magrittr)
library(naniar)
library(broom)
library(patchwork)
library(readxl)
library(Epi)
library(Hmisc)
library(GGally)
library(ggridges)
library(mosaic)
library(nhanesA)
library(haven)
library(tidyverse)
source("data/Love-boost.R.rmd")
$set(comment=NA)
opts_chunk$set(width=75)
opts_knit
theme_set(theme_bw())
options(dplyr.summarise.inform = FALSE)
1.2 Data Ingest
I will ingest the 2 nhanesA
data sets I will be using into a raw data set tibble, remove labels from the variables, and save each data set as a new file on my computer for subsequent sessions.
<- nhanes('SLQ_J') %>% zap_label() %>% tibble() sleep_raw
Processing SAS dataset SLQ_J ..
# saveRDS(sleep_raw, "data/SLQ_J.Rds")
# sleep_raw <- readRDS("data/SLQ_J.Rds")
<- nhanes('DEMO_J') %>% zap_label() %>% tibble() demo_raw
Processing SAS dataset DEMO_J ..
# saveRDS(demo_raw, "data/DEMO_J.Rds")
# demo_raw <- readRDS("data/DEMO_J.Rds")
2 Cleaning the Data
2.1 Variable Selection
I will select the variables of interest from each raw data set. I need SEQN
from each in order to combine the data sets next. I need SLD012
, SLQ030
, and SLQ120
from the sleep_raw
data and RIAGENDR
, RIDAGEYR
, and RIDSTATR
from the demo_raw
data set.
<- sleep_raw %>%
sleep_raw select(SEQN, SLD012, SLQ030, SLQ120)
<- demo_raw %>%
demo_raw select(SEQN, RIAGENDR, RIDAGEYR, RIDSTATR)
2.2 Merging the Data
I will merge the raw data sets by joining sleep_data
to demo_data
. I will join them by the unique subject identifier SEQN
to create 1 row per subject.
<- left_join(demo_raw, sleep_raw, by = "SEQN")
raw_data
dim(raw_data)
[1] 9254 7
I have 9254 rows (subjects) and 8 columns (variables) as expected.
I will check that the number of rows equals the number of unique subjects to ensure I avoided any errors in merging.
identical(raw_data %$% n_distinct(SEQN),
%>% nrow()) raw_data
[1] TRUE
I will remove data sets I will no longer be using to keep my work space clear.
rm(demo_raw, sleep_raw)
2.3 Names and Types
I will begin to clean my data by make each variable the type I want. My categorical variables (RIDSTATR
, RIAGENDR
, SLQ030
, SLQ120
) will be factors. SEQN
will be a character variable, as its numbers have no numerical quality. SLD012
, and RIDAGEYR
are my quantitative variables, so they will be of ‘dbl’ type. Lastly, I will remove any levels of a factor variable that have 0 subjects.
dim(raw_data)
[1] 9254 7
<- raw_data %>%
clean_data mutate(SEQN = as.character(SEQN),
SLD012 = as.double(SLD012),
RIDAGEYR = as.double(RIDAGEYR),
RIDSTATR = factor(RIDSTATR),
RIAGENDR = factor(RIAGENDR),
SLQ030 = factor(SLQ030),
SLQ120 = factor(SLQ120)) %>%
droplevels()
dim(clean_data)
[1] 9254 7
I did not lose any columns or rows as I did this.
glimpse(clean_data)
Rows: 9,254
Columns: 7
$ SEQN <chr> "93703", "93704", "93705", "93706", "93707", "93708", "93709"~
$ RIAGENDR <fct> 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1~
$ RIDAGEYR <dbl> 2, 2, 66, 18, 13, 66, 75, 0, 56, 18, 67, 54, 71, 61, 22, 45, ~
$ RIDSTATR <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
$ SLD012 <dbl> NA, NA, 8.0, 10.5, NA, 8.0, 7.0, NA, 7.0, 7.5, 5.5, 7.0, 5.0,~
$ SLQ030 <fct> NA, NA, 2, 1, NA, 9, 1, NA, 2, 1, 0, 3, 3, 3, 2, 0, NA, NA, 0~
$ SLQ120 <fct> NA, NA, 0, 1, NA, 2, 1, NA, 3, 2, 2, 3, 4, 3, 1, 2, NA, NA, 0~
My variable types are as expected and described above.
2.4 Variables
2.4.1 Total Sleep Time
I will start with a numerical summary.
df_stats(~ SLD012, data = clean_data)
My range of values is 2-14 hours of sleep per night on weekdays and weekends. These are plausible values. The mean and median being close to each and around 7.5-8 hours is reassuring. There are 3141 subjects with missing data. I will rename this to a more descriptive name and filter to complete cases.
<- clean_data %>%
clean_data rename(tst = SLD012) %>%
filter(complete.cases(tst))
dim(clean_data) # should be 9254 - 3141 = 6113
[1] 6113 7
I am left with 6113 rows (subjects) and 7 columns (variables) as expected.
2.4.2 Age
I will start with a numerical summary.
df_stats(~ RIDAGEYR, data = clean_data)
There are no missing data. The age range of my subjects is 16-80 years old. I wish to study adults age 20-79 years in my analyses, so I will create these inclusion criteria now, as well as rename this variable to a more descriptive name. My upper limit is 79 years because subjects demarcated as 80 years old are, in actuality, 80 years old or greater. This ceiling effect will alter the accuracy of my model’s predictions, so I will remove them.
<- clean_data %>%
clean_data rename(age = RIDAGEYR) %>%
mutate(age = replace(age, age < 20 | age > 79, NA))
df_stats(~ age, data = clean_data)
The age range is now 20-79 years. I have 1011 subjects with missing data. I will filter to complete cases.
<- clean_data %>%
clean_data filter(complete.cases(age))
dim(clean_data) # should be 6113 - 1011 = 5102
[1] 5102 7
I am left with 5102 rows (subjects) and 7 columns (variables) as expected.
2.4.3 Sex
I will start with a numerical summary.
%>% tabyl(RIAGENDR) %>% kable(digits = 3) clean_data
RIAGENDR | n | percent |
---|---|---|
1 | 2472 | 0.485 |
2 | 2630 | 0.515 |
My data set comprises 2472 males (48.5%) and 2630 females (51.5%). I will rename this variable to a more descriptive name, recode it so that I can see ‘Male’ and ‘Female’ instead of 1 and 2, and relevel it so it comes up appropriately in my tables later on.
# gender recode
<- clean_data %>%
clean_data rename(sex = RIAGENDR) %>%
mutate(sex = fct_recode(sex,
"Male" = "1",
"Female" = "2")) %>%
mutate(sex = fct_relevel(sex, "Male", "Female"))
%>% tabyl(sex) %>% kable(digits = 3) clean_data
sex | n | percent |
---|---|---|
Male | 2472 | 0.485 |
Female | 2630 | 0.515 |
dim(clean_data) # should be 5102
[1] 5102 7
I am left with 5102 rows (subjects) and 7 columns (variables) as expected.
2.4.4 Snore
I will start with a numerical summary.
%>% tabyl(SLQ030) %>% kable(digits = 3) clean_data
SLQ030 | n | percent |
---|---|---|
0 | 1211 | 0.237 |
1 | 1167 | 0.229 |
2 | 936 | 0.183 |
3 | 1426 | 0.279 |
7 | 5 | 0.001 |
9 | 357 | 0.070 |
My data set comprises 1211 subjects (23.7%) who never snore, 1167 subjects (22.9%) who rarely snore, 936 subjects (18.3%) who occasionally snore, 1426 subjects (27.9%) who frequently snore, 5 subjects who refused to answer, and 357 subjects who did not know. I do not have any missing data. I will rename this variable to a more descriptive name, tell R to consider “Refused” and “Don’t know” as missing values, recode it to a binary variable so those who never or rarely snore will be in ‘Does not snore’ and those who occasionally or frequently snore will be in ‘Snores’, and relevel it so it comes up properly in tables later.
<- clean_data %>%
clean_data rename(snore = SLQ030) %>%
mutate(snore = na_if(snore, 7),
snore = na_if(snore, 9)) %>%
mutate(snore = fct_recode(snore,
"Does not snore" = "0",
"Does not snore" = "1",
"Snores" = "2",
"Snores" = "3")) %>%
mutate(snore = fct_relevel(snore,
"Snores", "Does not snore"))
%>% tabyl(snore) %>% kable(digits = 3) clean_data
snore | n | percent | valid_percent |
---|---|---|---|
Snores | 2362 | 0.463 | 0.498 |
Does not snore | 2378 | 0.466 | 0.502 |
7 | 0 | 0.000 | 0.000 |
9 | 0 | 0.000 | 0.000 |
NA | 362 | 0.071 | NA |
This looks good. I have created a binary variable and moved all the ‘Refused’ or ‘Don’t know’ responses to missing values. I will drop those missing values and those rows that do not contain any subjects.
<- clean_data %>%
clean_data filter(complete.cases(snore)) %>%
droplevels()
%>% tabyl(snore) %>% kable(digits = 3) clean_data
snore | n | percent |
---|---|---|
Snores | 2362 | 0.498 |
Does not snore | 2378 | 0.502 |
dim(clean_data) # should be 5102 - 362 = 4740
[1] 4740 7
I am left with 4740 rows (subjects) and 7 columns (variables) as expected.
2.4.5 Sleepy
I will start with a numerical summary.
%>% tabyl(SLQ120) %>% kable(digits = 3) clean_data
SLQ120 | n | percent |
---|---|---|
0 | 794 | 0.168 |
1 | 1147 | 0.242 |
2 | 1599 | 0.337 |
3 | 806 | 0.170 |
4 | 390 | 0.082 |
9 | 4 | 0.001 |
My data set comprises 750 subjects (16.6%) who are never sleepy, 1084 subjects (24.1%) who are rarely sleepy, 1525 subjects (33.9%) who are sometimes sleepy, 769 subjects (17.1%) who are often sleepy, 373 (8.3%) who are always sleepy, and 4 subjects who did not know their frequency of sleepiness. I do not have any missing data. I will rename this variable to a more descriptive name, tell R to consider “Don’t know” as a missing value, and rename the levels more descriptively.
# recode to multi-categorical variable
<- clean_data %>%
clean_data rename(sleepy = SLQ120) %>%
mutate(sleepy = na_if(sleepy, 7),
sleepy = na_if(sleepy, 9)) %>%
mutate(sleepy = fct_recode(sleepy,
"never" = "0",
"rarely" = "1",
"sometimes" = "2",
"often" = "3",
"always" = "4"))
%>% tabyl(sleepy) %>% kable(digits = 3) clean_data
sleepy | n | percent | valid_percent |
---|---|---|---|
never | 794 | 0.168 | 0.168 |
rarely | 1147 | 0.242 | 0.242 |
sometimes | 1599 | 0.337 | 0.338 |
often | 806 | 0.170 | 0.170 |
always | 390 | 0.082 | 0.082 |
9 | 0 | 0.000 | 0.000 |
NA | 4 | 0.001 | NA |
I have moved all the ‘Don’t know’ responses to missing values and these level names look good. I will drop those missing values and the row that does not contain any subjects.
<- clean_data %>%
clean_data filter(complete.cases(sleepy)) %>%
droplevels()
dim(clean_data) # should be 4740 - 4 = 4736
[1] 4736 7
I am left with 4736 rows (subjects) and 7 columns (variables) as expected.
2.4.6 RIDSTATR: Indicator Variable
RIDSTATR
is a variable which designates whether a subject partook in both the questionnaire and the examination, for which they are given a value of 2. I will start by seeing if all the subjects in my data set have a value of 2.
%>% tabyl(RIDSTATR) %>% kable(digits = 3) clean_data
RIDSTATR | n | percent |
---|---|---|
1 | 235 | 0.05 |
2 | 4501 | 0.95 |
I have 235 with a value of 1 that I will need to remove from my data set. I will tell R to count these values as missing. Then I will filter to complete cases and drop the empty row from my data set.
<- clean_data %>%
clean_data mutate(RIDSTATR = na_if(RIDSTATR, 1)) %>%
filter(complete.cases(.)) %>%
droplevels()
%>% tabyl(RIDSTATR) %>% kable(digits = 3) clean_data
RIDSTATR | n | percent |
---|---|---|
2 | 4501 | 1 |
dim(clean_data) # should be 4736 - 235 = 4501
[1] 4501 7
I am left with 4501 rows (subjects) and 7 columns (variables) as expected.
2.5 Missingness Checks
I will check that I was successful in removing the missing data as I cleaned each variable.
miss_var_summary(clean_data)
I was.
2.6 Creating My Tibble
I will create a tibble called study1
from the clean data, selecting only those variables I will use in my analyses. I will rename the unique subject identifier to a more descriptive name. Lastly, I will save my tibble in my data folder.
<- clean_data %>%
study1 select(SEQN, tst, age, sex, snore, sleepy) %>%
rename(sub_id = SEQN)
rm(raw_data, clean_data)
saveRDS(study1, "data/study1.Rds")
3 Codebook and Data Description
These data are from the 2017-2018 National Health and Nutrition Examination Survey (NHANES) which tracks the health of the population of the United States by collecting questionnaire and examination data. NHANES is run by the Center for Disease Control and Prevention (CDC). Their target population is the non-institutionalized civilian resident population of the United States. They create their sample population through random selection using statistical processes on US Census information. For the 2017-2018 sampling, 16,211 people were selected. Of these, 9,254 completed the interview and 8,704 completed the exams. The following analyses include subjects between 20-79 years old who have complete data on the variables of interest.
3.1 Codebook
Variable | Type | Description | Raw Name |
---|---|---|---|
sub_id |
Chr | A unique subject identifier within each NHANES data set | SEQN |
tst |
Quant | Outcome This is in the Questionnaire category and describes estimated total sleep time on weekdays or workdays, in hours, ranging from 2 to 14 hours. | SLD012 |
age |
Quant | This is in the Demographics category and describes age in years. Inclusion criteria ages 20 to 79 years. | RIDAGEYR |
sex |
Binary | This is in the Demographics category and describes sex as male or female. These data comprise 48.2% males and 51.8% females. | RIAGENDR |
snore |
Binary | This is in the Questionnaire category and describes subjective snoring frequency in the NHANES data. In converting this to binary for these analyses, the meaning has become presence of snoring, thus the categories are ‘Snores’ and ‘Does not snore’. ‘Snores’ includes the questionnaire answers “Occasionally - 3-4 nights a week” or “Frequently - 5 or more nights a week”. ‘Does not snore’ includes the questionnaire answers “Never” or “Rarely - 1-2 nights a week”. In these data, 50.6% admit to snoring, and 49.4% report never or rarely snoring. | SLQ030 |
sleepy |
5-Cat | This is in the Questionnaire category and describes subjective frequency of feeling overly sleepy. The categories are ‘never’, ‘rarely’, ‘sometimes’, ‘often’, and ‘always’. ‘Never’ corresponds to a questionnaire response of ‘never’, ‘rarely’ to ‘rarely - 1 time a month, ’sometimes’ to ‘sometimes - 2-4 times a month’, ‘often’ to ‘often - 5-15 times a month’, and ‘always’ to ‘always - 16-30 times a month’. In these data, 16.5% are never sleepy, 24.2% are rarely sleepy, 34.2% are sometimes sleepy, 17.1% are often sleepy, and 8.1% are always sleepy. | SLQ120 |
3.2 Analytic Tibble
Here is my tibble.
study1
As anticipated, I have 4501 rows (subjects) and 6 variables in my final data set.
I will demonstrate that study1
is, in fact, a tibble.
is_tibble(study1)
[1] TRUE
3.3 Data Summary
I would like to see a summary of my data now that it has been cleaned up.
%>%
study1 select(-sub_id) %>%
::describe(.) Hmisc
.
5 Variables 4501 Observations
--------------------------------------------------------------------------------
tst
n missing distinct Info Mean Gmd .05 .10
4501 0 23 0.985 7.553 1.779 5.0 5.5
.25 .50 .75 .90 .95
6.5 7.5 8.5 9.5 10.0
lowest : 2.0 3.0 3.5 4.0 4.5, highest: 11.5 12.0 12.5 13.0 14.0
--------------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10
4501 0 60 1 48.7 18.81 23 26
.25 .50 .75 .90 .95
34 50 62 70 74
lowest : 20 21 22 23 24, highest: 75 76 77 78 79
--------------------------------------------------------------------------------
sex
n missing distinct
4501 0 2
Value Male Female
Frequency 2181 2320
Proportion 0.485 0.515
--------------------------------------------------------------------------------
snore
n missing distinct
4501 0 2
Value Snores Does not snore
Frequency 2265 2236
Proportion 0.503 0.497
--------------------------------------------------------------------------------
sleepy
n missing distinct
4501 0 5
lowest : never rarely sometimes often always
highest: never rarely sometimes often always
Value never rarely sometimes often always
Frequency 750 1084 1525 769 373
Proportion 0.167 0.241 0.339 0.171 0.083
--------------------------------------------------------------------------------
4 Analysis B
In this analysis, I will compare two means of independent samples. My samples will be 2017-2018 NHANES respondents, who were randomly selected and meet the aforementioned inclusion criteria, split into 2 groups: a snoring group and a non-snoring group.
4.1 The Question
How well can the presence of snoring predict sleep time on workdays/weekdays, in a randomly selected sample of the adult population in the US?
4.2 Describing the Data
I will compare the mean estimated total sleep time on workdays/weekdays in subjects who snore to subjects who do not snore. First, here is a graphical summary of tst
within each group.
ggplot(study1, aes(x = snore, y = tst)) +
geom_violin(aes(fill = snore)) +
geom_boxplot(width = 0.3, outlier.size = 2) +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 2, fill = "magenta") +
coord_flip() +
guides(fill = "none") +
labs(title = "Sleep Time Follows Symmetric Distribution, with Outliers",
subtitle = "stratified by snoring presence, with mean in magenta",
caption = "4501 subjects in `study1`",
x = "",
y = "Total Sleep Time (hrs)")
The distribution of tst
in both the ‘Snores’ group and the ‘Does not snore’ group are fairly symmetric, with means close to medians, the ‘Snores’ group more so. Both groups have just slightly more outliers on the higher scale than lower. The shape of the distributions shows that subjects had a preference to choose a whole number as their estimated hours of sleep, as opposed to by the half hour.
Here is a numerical summary.
::favstats(tst ~ snore, data = study1) %>% kable(digits = 2) mosaic
snore | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
Snores | 2 | 6.5 | 7.5 | 8.5 | 14 | 7.50 | 1.65 | 2265 | 0 |
Does not snore | 2 | 7.0 | 7.5 | 8.5 | 14 | 7.61 | 1.61 | 2236 | 0 |
I have no missing data because I took care of that in my data cleaning. The range of values in both groups is 2-14 hours. The mean sleep time in the ‘Snores’ group is 7.5 hours and in the ‘Does not snore’ group is 7.61 hours, so the mean difference is 0.11 hours. I will need to determine if I can be confident this difference is more than coincidence at a 10% significance level. It is not reasonable to assume normality here due to the outliers, although the distributions are symmetric and have the same range.
4.3 Main Analysis
I will use confidence level of 90% and a two-sided comparison. Because I can’t assume normality here, but my sample was randomly selected, I will use a bootstrap approach to determine a confidence interval around the mean difference.
set.seed(1228)
%$% bootdif(tst, snore, conf.level = 0.90) study1
Mean Difference 0.05 0.95
0.10977423 0.03010331 0.19284301
My model implies that the population mean sleep time in hours on workdays/weekdays in those who do not snore is 0.11 hours higher than the mean sleep time in those who snore based on the subjects in my sample. My 90% confidence interval of (0.03,0.19) provides evidence that the difference between the groups is statistically detectable.
4.4 Conclusions
Based on my sample of randomly selected US adults aged 20-79 years with complete data on tst
, age
, sex
, snore
, and sleepy
, there is a statistically detectable difference in the population mean number of sleep hours on workdays/weekdays between people who snore and people who do not snore. I have come to this conclusion by using a bootstrap to compare the mean sleep hours of people who snore with the mean sleep hours of people who do not snore. I chose a bootstrap due to the lack of normality of my data (due to outliers) and the fact they were randomly selected. A limitation of this analysis is that it is difficult for people to know if they snore or not. People are often inclined to underestimate the frequency at which they snore, or even deny it despite their bed partner reporting otherwise. I need to take this subjectivity into account when discussing my results. Estimated total sleep time is also subjective. Taking 2 subjective measures and comparing them may affect the replicability and accuracy of the results. Some appropriate next steps would be to use sleep study data to gain objective evidence of snoring, as the standard sleep study montage includes a snore monitor, whether microphone or vibratory sensor.
To answer my question, the presence of snoring is not fantastic at predicting total sleep time on workdays/weekdays. Although there is a statistically detectable difference, it is very minor.
5 Analysis C
In this analysis, I will compare means of 5 groups. My samples will be 2017-2018 NHANES respondents, who were randomly selected and meet the aforementioned inclusion criteria, split into 5 groups based on how frequency they feel overly sleepy.
5.1 The Question
How well can a subject’s estimated frequency of feeling overly sleepy predict total sleep time on workdays/weekdays, in a randomly selected sample of the adult population in the US?
5.2 Describing the Data
I will compare the mean estimated total sleep time on workdays/weekdays in subjects who are never, rarely, sometimes, often, or always sleepy. First, here is a graphical summary of tst
within each group. I will use a ridgeline density plot to account for differences in size of each group.
ggplot(study1, aes(x = tst, y = sleepy, fill = sleepy)) +
geom_density_ridges(scale = 0.9, alpha = 0.8) +
scale_fill_viridis_d() +
guides(fill = "none") +
theme_ridges() +
labs(title = "Fairly Symmetric Distribution of Total Sleep Time",
subtitle = "stratified by sleepiness frequency",
caption = "4501 subjects in `study1`",
x = "Total Sleep Time (hrs)",
y = "Sleepiness Frequency")
Picking joint bandwidth of 0.319
The spread of data is widest in the always sleepy group and most clustered toward the center the never sleepy group. The centers of the distributions look very similar, but the always sleepy group looks a bit left skewed. Without being able to see outliers, these data look normally distributed. I will look at box plots to check for outliers.
ggplot(study1, aes(x = sleepy, y = tst, fill = sleepy)) +
geom_boxplot() +
scale_fill_viridis_d() +
coord_flip() +
guides(fill = "none") +
labs(title = "Outlier Assessment of Total Sleep Time",
subtitle = "stratified by sleepiness frequency",
caption = "4501 subjects in `study1`",
y = "Comfort with 431 Materials (0-100)",
x = "Sleepiness Frequency")
There are outliers in each of my groups, mostly in the never sleepy group. This is evidence against normality in the groups.
I will also look at a numerical summary.
::favstats(tst ~ sleepy, data = study1) %>% kable(digits = 3) mosaic
sleepy | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
never | 2 | 7.0 | 8.0 | 8.5 | 14 | 7.677 | 1.624 | 750 | 0 |
rarely | 2 | 7.0 | 7.5 | 8.5 | 13 | 7.565 | 1.496 | 1084 | 0 |
sometimes | 2 | 6.5 | 7.5 | 8.5 | 14 | 7.600 | 1.547 | 1525 | 0 |
often | 2 | 6.5 | 7.5 | 8.5 | 14 | 7.467 | 1.755 | 769 | 0 |
always | 2 | 6.0 | 7.5 | 8.0 | 14 | 7.253 | 2.027 | 373 | 0 |
There are 750 never sleepy, 1084 rarely sleepy, 1525 sometimes sleepy, 769 often sleepy, and 373 always sleepy subjects in my sample. The range for all but the rarely sleepy group are 2-14 hours. The rarely sleepy group ranges from 2-13 hours. The mean is the lowest for the always sleepy group and the highest for the never sleepy group. The medians of all but the never sleepy group are 7.5 hours, but the never sleepy group is 8 hours.
5.3 Main Analysis
I will run a Kruskal-Wallis test to assess the mean differences between my groups because each group does not follow a normal distribution, and the sample sizes vary across the groups, so I cannot meet the assumptions of an ANOVA.
kruskal.test(data = study1, tst ~ sleepy)
Kruskal-Wallis rank sum test
data: tst by sleepy
Kruskal-Wallis chi-squared = 19.861, df = 4, p-value = 0.000532
There is a statistically detectable difference at a 10% significance level between the population mean sleep times for the 5 sleepiness groups. Regarding assumptions: The samples are independent and were randomly selected from the population of interest.
My next step is to determine between which groups the difference(s) lie. I will start by running a Tukey’s pairwise comparison.
TukeyHSD(aov(data = study1, tst ~ sleepy), conf.level = 0.90)
Tukey multiple comparisons of means
90% family-wise confidence level
Fit: aov(formula = tst ~ sleepy, data = study1)
$sleepy
diff lwr upr p adj
rarely-never -0.11183518 -0.3022337 0.078563353 0.5983814
sometimes-never -0.07766120 -0.2564475 0.101125143 0.8226388
often-never -0.21049328 -0.4162217 -0.004764864 0.0869610
always-never -0.42398213 -0.6779706 -0.169993640 0.0003927
sometimes-rarely 0.03417398 -0.1250823 0.193430296 0.9845203
often-rarely -0.09865810 -0.2876613 0.090345072 0.7009862
always-rarely -0.31214695 -0.5527881 -0.071505785 0.0124146
often-sometimes -0.13283208 -0.3101317 0.044467545 0.3487371
always-sometimes -0.34632092 -0.5778833 -0.114758565 0.0021997
always-often -0.21348885 -0.4664330 0.039455320 0.2304195
By looking at the confidence intervals, I can see that the statistically detectable differences lie between the often and never sleep groups, always and never sleepy groups, always and rarely sleep groups, and always and sometimes sleep groups. All of these differences are negative. This will be shown better in a visualization.
<- c(5,0,4,2) + 0.1
mar.default
par(mar = mar.default + c(0, 8, 0, 0))
plot(TukeyHSD(aov(tst ~ sleepy, data = study1),
conf.level = 0.90), las = 2)
This visualization shows the differences between groups mentioned above. It shows that the often sleepy group gets significantly less sleep than the never sleepy group, and the always sleep group gets significantly less sleep than the never sleepy, the rarely sleepy, and the sometimes sleep groups.
5.4 Conclusions
Based on my sample of randomly selected US adults aged 20-79 years with complete data on tst
, age
, sex
, snore
, and sleepy
, there is a statistically detectable difference in the population mean number of sleep hours on workdays/weekdays between people of these 5 different sleepiness categories: never, rarely, sometimes, often, always. I have come to this conclusion by using a Kruksal-Wallis test and visualizing the pairwise comparisons. I chose a Kruskal-Wallis test due to the lack of normality of my data (due to outliers) and the differences in sample sizes between groups. A limitation of this analysis is, similar to Analysis B, both the outcome and the predictor are subjective. People tend to overestimate the frequency of which they are overly sleepy and underestimate the hours the slept. An appropriate next step would be to use Multiple Sleep Latency Test (an objective test of sleepiness) results to determine sleepiness level and actigraphy (activity monitoring) to determine objective total sleep time, and use these in a similar model to this one.
To answer my question, a subject’s frequency of feeling overly sleepy is not a great predictor of their total sleep time on workdays/weekdays. If they are in the ‘always sleepy’ group, there predictive accuracy is best, but the difference remains minimal.
6 Analysis D
In this analysis, I will create and analyze a 2x2 table to evaluate the association of sex and presence of snoring. I will compare the presence of snoring (outcome) across both sexes (exposure).
6.1 The Question
Do more men snore than women, in a randomly selected sample of the adult population in the US?
6.2 Describing the Data
I will start by creating a 2x2 table.
%>% tabyl(sex, snore) %>%
study1 adorn_totals() %>%
adorn_percentages(denom = "row") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns(position = "front") %>%
kable()
sex | Snores | Does not snore |
---|---|---|
Male | 1216 (55.8%) | 965 (44.2%) |
Female | 1049 (45.2%) | 1271 (54.8%) |
Total | 2265 (50.3%) | 2236 (49.7%) |
It looks like the highest frequency cells are males who snore and females who don’t snore, hinting that perhaps the answer to my question is “yes”.
6.3 Main Analysis
I will use twobytwo
with a Bayesian augmentation to create a more descriptive table that provides me statistics on the relationships.
twobytwo(1216+2, 965+2, 1049+2, 1271+2,
"Male", "Female", "Snores", "Does not snore",
conf.level = 0.90)
2 by 2 table analysis:
------------------------------------------------------
Outcome : Snores
Comparing : Male vs. Female
Snores Does not snore P(Snores) 90% conf. interval
Male 1218 967 0.5574 0.5399 0.5748
Female 1051 1273 0.4522 0.4353 0.4693
90% conf. interval
Relative Risk: 1.2326 1.1738 1.2944
Sample Odds Ratio: 1.5256 1.3824 1.6837
Conditional MLE Odds Ratio: 1.5254 1.3799 1.6866
Probability difference: 0.1052 0.0808 0.1295
Exact P-value: 0.0000
Asymptotic P-value: 0.0000
------------------------------------------------------
The probability of snoring in males is 0.557 with a 90% confidence interval of (0.540,0.575). The probability of snoring in females is 0.452 with a 90% confidence interval of (0.435,0.469). These do not overlap, providing evidence that there is a difference in snoring between the sexes. The relative risk is 1.233 (1.174,1.294), which is entirely greater than 1, indicating the likelihood of snoring is higher for males than for females. The odds ratio is 1.526 (1.382,1.684), which is entirely greater than 1, indicating that the odds of snoring are higher in males than in females. The risk difference is 0.105 (0.081,0.130), which is entirely greater than 0, indicating that the probability of snoring is higher for males than for females. These difference are large enough to rule out chance with 90% confidence because the p-value is less than 0.10.
6.4 Conclusions
Based on my sample of randomly selected US adults aged 20-79 years with complete data on tst
, age
, sex
, snore
, and sleepy
, there is a statistically detectable difference in the presence of snoring between the sexes. I have come to this conclusion by analyzing a 2x2 table to determine a relative risk, odds ratio, and risk difference, all of which point to males being more likely to snore than females. A limitation of this analysis is, similar to Analysis B & C, snoring is subjective, and people are likely to underestimate their snoring or not know they snore. An appropriate next step would be to add age as predictor and see how this affects snoring between the sexes.
To answer my question: yes, more men snore than women.
7 Analysis E
In this analysis, I will create and analyze a 2x5 table to analyse the relationship between snoring and sleepiness.
7.1 The Question
Are people who snore overly sleepy more frequently,in a randomly selected sample of the adult population in the US?
7.2 Describing the Data
I will start off by putting the data into a table.
%>%
study1 tabyl(sleepy, snore) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(rounding = "half up", digits = 0) %>%
adorn_ns() %>%
kable()
sleepy | Snores | Does not snore | Total |
---|---|---|---|
never | 43% (321) | 57% (429) | 100% (750) |
rarely | 45% (492) | 55% (592) | 100% (1084) |
sometimes | 52% (800) | 48% (725) | 100% (1525) |
often | 55% (424) | 45% (345) | 100% (769) |
always | 61% (228) | 39% (145) | 100% (373) |
Total | 50% (2265) | 50% (2236) | 100% (4501) |
This table shows that out of 750 subjects who are never sleepy, 321 snore; 1084 subjects who are rarely sleepy, 492 snore; 1525 subjects who are sometimes sleepy, 800 snore; 769 subjects who are often sleepy, 424 snore; and 373 subjects who are always sleepy, 228 snore. We have about the same number in the ‘Snores’ and ‘Does not snore’ group, and both groups have the most subjects in the sometimes sleepy group.
My data meets the Cochran conditions.
7.3 Main Analysis
I will create a data set that contains only the sleepy
and snore
variables and put them in a table. Then I will run a chi-square test on this data set.
<- study1 %$% table(sleepy, snore)
sleep_tab
chisq.test(sleep_tab)
Pearson's Chi-squared test
data: sleep_tab
X-squared = 54.866, df = 4, p-value = 3.466e-11
My null hypothesis is that the rows and columns are independent. My alternative hypothesis is that there is an association between the rows and columns. The chi-square test statistic is 52 on 4 degrees of freedom, yielding p < 0.0001, providing evidence that the rows and columns are different.
I would like to see how they are different, so I will plot them.
assocplot(sleep_tab, col = c("lightskyblue", "slateblue"),
main = "Association of Snoring and Sleepiness",
xlab = "Sleepiness frequency",
ylab = "Presence of Snoring")
It looks like within the never and rarely sleepy groups, there are more subjects who do not snore. Conversely, within the sometimes, often, and always sleepy groups, there are more subjects who do snore. I am unable to state causation here. The largest residuals are in the never sleepy and always sleepy groups, and the smallest residual is in the sometimes sleepy group. Within the ‘Snores’ group, the observed frequency is less than expected in the never and rarely sleepy groups, and greater than expected for the sometimes, often, and always sleepy groups. The opposite is true for the ‘Does not snore’ group. This is in line with what I suspected prior to the analyses.
7.4 Conclusions
Based on my sample of randomly selected US adults aged 20-79 years with complete data on tst
, age
, sex
, snore
, and sleepy
, there is a statistically detectable difference in the presence of snoring between groups of differing sleepiness frequency. I have come to this conclusion by analyzing a 2x5 table and running a chi-squared test, which provided a p-value that provides evidence that there is a statistically detectable difference in the groups. The plot shows that within the ‘Snores’ group, there are more subjects who are more frequently sleepy. Conversely, in the ‘Does not snore’ group, there are more subjects who are less frequently sleepy. A limitation of this analysis is, similar to Analysis B, C, & D, snoring and sleepiness are both subjective, and people are likely to underestimate their snoring or not know they snore, and overestimate their sleepiness.
To answer my question: yes, people who snore are overly sleepy more frequently than people who do not snore.
8 Session Information
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 purrr_0.3.4 readr_2.1.1
[5] tidyr_1.1.4 tibble_3.1.6 tidyverse_1.3.1 haven_2.4.3
[9] nhanesA_0.6.5.3 mosaic_1.8.3 mosaicData_0.20.2 ggformula_0.10.1
[13] ggstance_0.3.5 dplyr_1.0.7 Matrix_1.3-4 ggridges_0.5.3
[17] GGally_2.1.2 Hmisc_4.6-0 ggplot2_3.3.5 Formula_1.2-4
[21] survival_3.2-13 lattice_0.20-45 Epi_2.44 readxl_1.3.1
[25] patchwork_1.1.1 broom_0.7.10 naniar_0.6.1 magrittr_2.0.1
[29] janitor_2.1.0 rmdformats_1.0.3 knitr_1.36
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 ellipsis_0.3.2 visdat_0.5.3
[4] leaflet_2.0.4.1 snakecase_0.11.0 htmlTable_2.3.0
[7] base64enc_0.1-3 ggdendro_0.1.22 fs_1.5.1
[10] rstudioapi_0.13 farver_2.1.0 ggrepel_0.9.1
[13] fansi_0.5.0 lubridate_1.8.0 xml2_1.3.3
[16] splines_4.1.2 polyclip_1.10-0 jsonlite_1.7.2
[19] cluster_2.1.2 dbplyr_2.1.1 png_0.1-7
[22] ggforce_0.3.3 compiler_4.1.2 httr_1.4.2
[25] backports_1.3.0 assertthat_0.2.1 fastmap_1.1.0
[28] cli_3.1.0 tweenr_1.0.2 htmltools_0.5.2
[31] tools_4.1.2 gtable_0.3.0 glue_1.5.1
[34] Rcpp_1.0.7 cellranger_1.1.0 jquerylib_0.1.4
[37] vctrs_0.3.8 nlme_3.1-153 crosstalk_1.2.0
[40] xfun_0.28.10 rvest_1.0.2 lifecycle_1.0.1
[43] mosaicCore_0.9.0 MASS_7.3-54 zoo_1.8-9
[46] scales_1.1.1 hms_1.1.1 parallel_4.1.2
[49] RColorBrewer_1.1-2 yaml_2.2.1 gridExtra_2.3
[52] sass_0.4.0 labelled_2.9.0 rpart_4.1-15
[55] reshape_0.8.8 latticeExtra_0.6-29 stringi_1.7.6
[58] highr_0.9 checkmate_2.0.0 rlang_0.4.11
[61] pkgconfig_2.0.3 evaluate_0.14 labeling_0.4.2
[64] htmlwidgets_1.5.4 cmprsk_2.2-10 tidyselect_1.1.1
[67] plyr_1.8.6 bookdown_0.24 R6_2.5.1
[70] generics_0.1.1 DBI_1.1.1 pillar_1.6.4
[73] foreign_0.8-81 withr_2.4.3 mgcv_1.8-38
[76] nnet_7.3-16 etm_1.1.1 modelr_0.1.8
[79] crayon_1.4.2 utf8_1.2.2 tzdb_0.2.0
[82] rmarkdown_2.11.3 jpeg_0.1-9 grid_4.1.2
[85] data.table_1.14.2 reprex_2.0.1 digest_0.6.29
[88] numDeriv_2016.8-1.1 munsell_0.5.0 viridisLite_0.4.0
[91] bslib_0.3.1