Impact of Mental Health and Unemployment on Violent Crime in Select US Counties
1 Preliminaries
1.1 My R Packages
I will load my packages here.
library(janitor)
library(magrittr)
library(naniar)
library(ggrepel)
library(equatiomatic)
library(patchwork)
library(car)
library(broom)
library(tidyverse)
source("Love-boost.R.rmd")
theme_set(theme_bw())
1.2 Data Ingest
I will read in the raw data here.
data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2021.csv"

chr_2021_raw <- read_csv(data_url, skip = 1, guess_max = 4000)
2 Data Development
2.1 Selecting My Data
The following code creates a data set called chr_2021, which contains selected parts of the chr_2021_raw data set. I will use the filter function to drop all counties that are not ranked, and to drop counties in all states except my states of interest: New Jersey, New York, Ohio, Pennsylvania, and Vermont. I will select my variables of interest, then rename them to more descriptive names. My variables are fipscode, state, county, v043_rawvalue (renamed to violent_crime), v042_rawvalue (renamed to poor_mental_health_days), v049_rawvalue (renamed to excess_drinking), v023_rawvalue (renamed to unemployment), and v063_rawvalue (renamed to med_household_income). I then change med_household_income from dollars to thousands of dollars by dividing by 1000, and unemployment and excess_drinking from proportions to percentages by multiplying by 100.
chr_2021 <- chr_2021_raw %>%
    filter(county_ranked == 1) %>%
    filter(state %in% c("OH", "NY", "PA", "NJ", "VT")) %>%
    select(fipscode, state, county,
           v043_rawvalue, v042_rawvalue, v049_rawvalue,
           v023_rawvalue, v063_rawvalue) %>%
    rename(violent_crime = v043_rawvalue,
           poor_mental_health_days = v042_rawvalue,
           excess_drinking = v049_rawvalue,
           unemployment = v023_rawvalue,
           med_household_income = v063_rawvalue) %>%
    mutate(med_household_income = med_household_income / 1000,
           unemployment = 100*unemployment,
           excess_drinking = 100*excess_drinking)
2.2 Repairing the fipscode and factoring the state
I will change the variable fipscode from a numerical variable to a character variable, because its digits carry no numerical meaning, and I will pad any code shorter than 5 characters with leading zeroes ("0"). I will change the variable state from a character variable to a factor.
chr_2021 <- chr_2021 %>%
    mutate(fipscode = str_pad(fipscode, 5, pad = "0"),
           state = factor(state))
2.2.1 Checking Initial Work
I will use the glimpse
function to check that I have the correct number of rows (which should equal my number of counties), that fipscode
is a character, and that state
is a factor.
glimpse(chr_2021)
Rows: 252
Columns: 8
$ fipscode <chr> "34001", "34003", "34005", "34007", "34009", "~
$ state <fct> NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ~
$ county <chr> "Atlantic County", "Bergen County", "Burlingto~
$ violent_crime <dbl> 373.18731, 84.25050, 162.75355, 464.90041, 235~
$ poor_mental_health_days <dbl> 4.715148, 3.997601, 4.450502, 4.595566, 4.5451~
$ excess_drinking <dbl> 17.18567, 18.14703, 19.78728, 17.48468, 21.408~
$ unemployment <dbl> 5.057650, 2.923421, 3.269887, 4.047373, 7.1538~
$ med_household_income <dbl> 62.678, 107.971, 88.443, 73.168, 66.565, 54.17~
The number of rows equals the expected number of counties (252), fipscode
is a character variable, and state
is a factor variable.
I will check to see that each state
has the number of counties I expect.
chr_2021 %>% tabyl(state) %>% adorn_pct_formatting()
state n percent
NJ 21 8.3%
NY 62 24.6%
OH 88 34.9%
PA 67 26.6%
VT 14 5.6%
They do. New Jersey has 21 counties which make up 8.3% of my data, New York has 62 counties which make up 24.6% of my data, Ohio has 88 counties which make up 34.9% of my data, Pennsylvania has 67 counties which make up 26.6% of my data, and Vermont has 14 counties which make up 5.6% of my data.
2.3 Creating My Binary Categorical Variable
2.3.1 Split into two categories based on the median
I will use the mutate function to create a temporary-named variable that is a binary factor variable, separating unemployment at its median. I will call all values below the median "low" and all remaining values "high". Then I will run a numerical summary using favstats to check that I have created 2 equally sized categories containing values below and above the median.
chr_2021 <- chr_2021 %>%
    mutate(temp1_ms = case_when(
        unemployment < median(unemployment) ~ "low",
        TRUE ~ "high"),
        temp1_ms = factor(temp1_ms))

mosaic::favstats(unemployment ~ temp1_ms, data = chr_2021) %>%
    kable(digits = 3)
temp1_ms | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
high | 4.281 | 4.652 | 4.957 | 5.457 | 8.269 | 5.138 | 0.692 | 126 | 0 |
low | 1.792 | 3.264 | 3.686 | 3.943 | 4.278 | 3.570 | 0.514 | 126 | 0 |
There are equal numbers of counties in each category and the ranges of values do not overlap, so I know I have split this variable at the median: I have created 2 mutually exclusive, collectively exhaustive categories. I will keep this variable split at the median, rather than at some other specific value, because the median is a meaningful value here. In order to compare county to county in my states of interest, I want to know whether a given county is in the upper or lower half of counties for unemployment rate.
2.3.2 Cleaning up
I will rename the above temporary-named variable to unemp_cat
as this is a more meaningful name for my categorized unemployment variable.
chr_2021 <- chr_2021 %>%
    rename(unemp_cat = temp1_ms)
I will check that I now have 9 variables and they all have the names I expect.
names(chr_2021)
[1] "fipscode" "state"
[3] "county" "violent_crime"
[5] "poor_mental_health_days" "excess_drinking"
[7] "unemployment" "med_household_income"
[9] "unemp_cat"
They do.
I will check that I still have the correct number of rows in my data, which corresponds to my number of counties (252).
nrow(chr_2021)
[1] 252
I do.
2.4 Creating My Multi-Category Variable
2.4.1 Creating a 5-category variable based on quintiles
I will use the mutate
function to create 5 categories of med_household_income
that are split into equal, ordered categories. I will then use favstats
to check my work.
chr_2021 <- chr_2021 %>%
    mutate(temp3 = factor(Hmisc::cut2(med_household_income, g = 5)))

mosaic::favstats(med_household_income ~ temp3, data = chr_2021) %>%
    kable(digits = 3)
temp3 | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
[41.3, 51.0) | 41.267 | 47.046 | 48.826 | 50.154 | 51.023 | 48.239 | 2.543 | 51 | 0 |
[51.0, 55.8) | 51.038 | 51.908 | 52.781 | 54.224 | 55.664 | 53.090 | 1.372 | 50 | 0 |
[55.8, 61.0) | 55.778 | 56.922 | 58.441 | 60.192 | 60.918 | 58.555 | 1.615 | 51 | 0 |
[61.0, 68.6) | 61.036 | 62.577 | 64.319 | 65.662 | 68.364 | 64.280 | 2.062 | 50 | 0 |
[68.6,117.8] | 68.611 | 73.625 | 83.593 | 98.270 | 117.767 | 86.671 | 14.764 | 50 | 0 |
I have split med_household_income into 5 mutually exclusive, collectively exhaustive quintiles. I chose 5 groups because, per the numerical summary, I suspect my data will have a right skew, so I want 5 categories (as opposed to 3 or 4) to create a more stratified comparison. I prefer to split into quintiles rather than at specified cutoff points because income varies from state to state, especially among my states of interest, and there are no standardized cutoff points that hold more meaning than the quintile cutoffs.
I will rename these quintiles more meaningfully. In order from lowest to highest, they will be quant_1, quant_2, quant_3, quant_4, and quant_5. I will then check my work by running a numerical summary using favstats.
chr_2021 <- chr_2021 %>%
    mutate(med_income_cat = fct_recode(temp3,
                                       quant_1 = "[41.3, 51.0)",
                                       quant_2 = "[51.0, 55.8)",
                                       quant_3 = "[55.8, 61.0)",
                                       quant_4 = "[61.0, 68.6)",
                                       quant_5 = "[68.6,117.8]"))

mosaic::favstats(med_household_income ~ med_income_cat, data = chr_2021) %>%
    kable(digits = 3)
med_income_cat | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
quant_1 | 41.267 | 47.046 | 48.826 | 50.154 | 51.023 | 48.239 | 2.543 | 51 | 0 |
quant_2 | 51.038 | 51.908 | 52.781 | 54.224 | 55.664 | 53.090 | 1.372 | 50 | 0 |
quant_3 | 55.778 | 56.922 | 58.441 | 60.192 | 60.918 | 58.555 | 1.615 | 51 | 0 |
quant_4 | 61.036 | 62.577 | 64.319 | 65.662 | 68.364 | 64.280 | 2.062 | 50 | 0 |
quant_5 | 68.611 | 73.625 | 83.593 | 98.270 | 117.767 | 86.671 | 14.764 | 50 | 0 |
This worked as expected.
2.4.2 Cleaning up
I will remove the temporary variable that I am no longer using.
chr_2021 <- chr_2021 %>%
    select(-temp3)
2.5 Structure of My Tibble
This is a tibble which shows the dimensions of my data, 252 rows (counties) by 10 columns (variables). This is the number of rows (counties) I expect. I have my three required variables, fipscode
(a character variable), county
(a character variable), and state
(a factor variable). My five selected variables have the correct names and are all numerical. My two categorical variables are each factor variables as expected, and have the correct number of categories each: 2 categories for my binary variable and 5 for my multi-categorical variable.
str(chr_2021)
tibble [252 x 10] (S3: tbl_df/tbl/data.frame)
$ fipscode : chr [1:252] "34001" "34003" "34005" "34007" ...
$ state : Factor w/ 5 levels "NJ","NY","OH",..: 1 1 1 1 1 1 1 1 1 1 ...
$ county : chr [1:252] "Atlantic County" "Bergen County" "Burlington County" "Camden County" ...
$ violent_crime : num [1:252] 373.2 84.3 162.8 464.9 235.9 ...
$ poor_mental_health_days: num [1:252] 4.72 4 4.45 4.6 4.55 ...
$ excess_drinking : num [1:252] 17.2 18.1 19.8 17.5 21.4 ...
$ unemployment : num [1:252] 5.06 2.92 3.27 4.05 7.15 ...
$ med_household_income : num [1:252] 62.7 108 88.4 73.2 66.6 ...
$ unemp_cat : Factor w/ 2 levels "high","low": 1 2 2 2 1 1 1 2 2 2 ...
$ med_income_cat : Factor w/ 5 levels "quant_1","quant_2",..: 4 5 5 5 4 2 4 5 5 5 ...
3 Codebook
Variable | Description |
---|---|
fipscode | FIPS code |
state | State: my five states are NY, NJ, OH, PA, VT |
county | County Name |
violent_crime | (v043) Violent Crime Rate, which will be my outcome |
poor_mental_health_days | (v042) Number of Poor Mental Health Days (age-adjusted) |
excess_drinking | (v049) Excessive Drinking Percentage |
unemployment | (v023) Unemployment Rate |
med_household_income | (v063) Median Household Income in Thousands of Dollars |
unemp_cat | 2 levels based on the median: low = unemployment rate below the median of 4.28%, high = otherwise |
med_income_cat | 5 levels based on quintiles of med_household_income: quant_1 = first quintile (41.27-51.02 thousand dollars), quant_2 = second quintile (51.04-55.66 thousand dollars), quant_3 = third quintile (55.78-60.92 thousand dollars), quant_4 = fourth quintile (61.04-68.36 thousand dollars), or quant_5 = fifth quintile (68.61-117.77 thousand dollars) |
3.1 Detailed Descriptions
Variable | Detailed Description |
---|---|
violent_crime |
This was originally v043_rawvalue . This will be my outcome variable. It is listed in the Community Safety subcategory under the Social & Economic Factors category of the County Health Rankings. It describes the number of reported violent crime offenses per 100,000 of the county’s population. Violent crimes are defined as offenses that involve face-to-face confrontation between a victim and a perpetrator, which are counted in the precinct in which they occur. This measure only includes crimes the police precincts report to the FBI. Caution should be used when comparing this variable between states due to bias across jurisdictions. There is the potential for under-reporting due to victims’ willingness to report, law enforcement response differences, and obstacles to FBI reporting. It is based on data from the County-Level Detailed Arrest and Offense Data report from the Uniform Crime Reporting Program from 2014 and 2016. |
poor_mental_health_days |
This was originally v042_rawvalue. It is listed in the Quality of Life subcategory under the Health Outcomes category in County Health Rankings. It is an age-adjusted rate that describes the average number of mentally unhealthy days reported in the past 30 days. It is based on data on the non-institutionalized adult population (age 18 and older) from the Behavioral Risk Factor Surveillance System from 2018. These survey-collected data are then re-created using statistical modeling, which requires that certain assumptions be made. |
excess_drinking |
This was originally v049_rawvalue. It is listed in the Alcohol Use section of the Health Behaviors subcategory under the Health Factors category in County Health Rankings. It describes the percentage of adults who report binge drinking or heavy drinking in the past 30 days. It is based on data from the Behavioral Risk Factor Surveillance System from 2018, which contains survey-based data on the non-institutionalized adult population (age 18 and older). |
unemployment |
This was originally v023_rawvalue. It is listed in the Employment section of the Social & Economic Factors subcategory under the Health Factors category in County Health Rankings. It describes the percentage of the county's civilian labor force who are age 16 or older and unemployed but seeking work. It does not include the population that is unemployed but not seeking work, and it does not differentiate between those who cannot find any work and those who cannot find work at their preferred wage. It is based on data from the Local Area Unemployment Statistics program of the Bureau of Labor Statistics from 2019. These data are created using statistical modeling, which requires that certain assumptions be made. |
med_household_income |
This was originally v063_rawvalue. It is listed in the Income section of the Social & Economic Factors subcategory under the Health Factors category in County Health Rankings. It describes the income for which half the county's households earn more and half earn less. It includes all sources of usual income, and excludes unexpected or one-time monies, such as lump-sum receipts. It is based on data from the Small Area Income and Poverty Estimates program of the US Census Bureau from 2019. These data are created using statistical modeling, which requires that certain assumptions be made. These model-based data have been shown to provide more consistent and reliable estimates than survey estimates. |
3.2 Proposal Requirement 1
I selected New Jersey which has 21 counties, New York which has 62 counties, Ohio which has 88 counties, Pennsylvania which has 67 counties, and Vermont which has 14 counties. My total number of counties is 252. There are two reasons I chose these states: personal and practical. The personal reasons are that I just moved to Ohio from New York and, in doing so, have become intimately familiar with driving back and forth through Pennsylvania and New Jersey, and my best friend lives in Vermont. The practical reason is that these states together comprise 252 counties, which falls within the project's required range of 200 to 400, and they have very little missing data for the variables I am looking to study.
chr_2021 %>% tabyl(state) %>%
    adorn_pct_formatting() %>%
    adorn_totals() %>%
    kable()
state | n | percent |
---|---|---|
NJ | 21 | 8.3% |
NY | 62 | 24.6% |
OH | 88 | 34.9% |
PA | 67 | 26.6% |
VT | 14 | 5.6% |
Total | 252 | - |
3.3 Proposal Requirement 2
I selected 5 variables: Violent Crime Rate (violent_crime, previously v043_rawvalue), Number of Poor Mental Health Days (poor_mental_health_days, previously v042_rawvalue), Excessive Drinking Percentage (excess_drinking, previously v049_rawvalue), Unemployment Rate (unemployment, previously v023_rawvalue), and Median Household Income (med_household_income, previously v063_rawvalue). I changed the unit of my Median Household Income variable to thousands of dollars by dividing by 1000. I changed the units of my Unemployment and Excessive Drinking variables from proportions to percentages by multiplying by 100.
My outcome variable is Violent Crime Rate. My two quantitative predictor variables are Poor Mental Health Days and Excessive Drinking. My binary categorical variable is Unemployment which is split at the median value. I chose Unemployment as my binary variable because at the individual level it is binary (employed vs. unemployed), so it makes logical sense to make it binary at the county level. I split it at the median because this makes the most sense when comparing my data as I can ask the question: Is this county’s unemployment rate more or less than the median unemployment rate for the counties I’m studying? My multi-categorical variable is Median Household Income which I split into 5 categories by quantile. I chose Median Household Income as my multi-categorical variable because small differences are not that meaningful for this variable, the data has a wide range, and we are accustomed to seeing income in categories in the real world, e.g. tax brackets.
My motivation for choosing violent_crime
as my outcome variable is that I was recently the victim of a violent crime that occurred only a few months after moving to Ohio. I had never been a victim when I lived in New York. I am curious about whether there is a difference in violent crime rates between counties in these two states and some of the states surrounding them, and if so, I would like to explore some predictor variables that could explain this.
My motivation for choosing my predictor variables is as follows. I chose poor_mental_health_days
because I suspect a quality of life measure could predict violent crime, and mental health likely contributes more than physical health given the impulsive nature of many violent crimes. I chose excess_drinking
because I suspect a lifestyle/behavior measure could predict violent crime, and this measure affects impulsiveness and decision-making. I chose unemployment
and med_household_income
because I suspect social & economic factors could predict violent crime; particularly those with a wide range, stratified across differing socioeconomic status, which is why I chose to categorize these variables. Unemployment may cause people to have more available time and use dangerous ways to obtain money to live. Median household income may similarly cause people to feel the need to obtain money in dangerous ways if they are in one of the lower quantiles, and it may be an indirect measure of lower socioeconomic status which will be useful to ascertain for each county.
3.4 Proposal Requirement 3
This is a tibble of my data, chr_2021.
chr_2021
# A tibble: 252 x 10
fipscode state county violent_crime poor_mental_health~ excess_drinking
<chr> <fct> <chr> <dbl> <dbl> <dbl>
1 34001 NJ Atlantic Co~ 373. 4.72 17.2
2 34003 NJ Bergen Coun~ 84.3 4.00 18.1
3 34005 NJ Burlington ~ 163. 4.45 19.8
4 34007 NJ Camden Coun~ 465. 4.60 17.5
5 34009 NJ Cape May Co~ 236. 4.55 21.4
6 34011 NJ Cumberland ~ 516. 4.93 18.5
7 34013 NJ Essex County 606. 4.37 17.3
8 34015 NJ Gloucester ~ 116. 4.51 18.9
9 34017 NJ Hudson Coun~ 342. 4.28 18.1
10 34019 NJ Hunterdon C~ 42.3 4.02 20.7
# ... with 242 more rows, and 4 more variables: unemployment <dbl>,
# med_household_income <dbl>, unemp_cat <fct>, med_income_cat <fct>
3.5 Proposal Requirement 4
Here are numerical summaries for each of my variables.
Hmisc::describe(chr_2021)
chr_2021
10 Variables 252 Observations
--------------------------------------------------------------------------------
fipscode
n missing distinct
252 0 252
lowest : 34001 34003 34005 34007 34009, highest: 50019 50021 50023 50025 50027
--------------------------------------------------------------------------------
state
n missing distinct
252 0 5
lowest : NJ NY OH PA VT, highest: NJ NY OH PA VT
Value NJ NY OH PA VT
Frequency 21 62 88 67 14
Proportion 0.083 0.246 0.349 0.266 0.056
--------------------------------------------------------------------------------
county
n missing distinct
252 0 200
lowest : Adams County Addison County Albany County Allegany County Allegheny County
highest: Wood County Wyandot County Wyoming County Yates County York County
--------------------------------------------------------------------------------
violent_crime
n missing distinct Info Mean Gmd .05 .10
248 4 248 1 190.1 133.1 57.43 72.15
.25 .50 .75 .90 .95
106.52 149.06 224.88 369.73 452.51
lowest : 0.00000 16.33720 19.05950 20.56649 36.14752
highest: 597.89467 606.18998 645.99459 824.39143 1001.47354
--------------------------------------------------------------------------------
poor_mental_health_days
n missing distinct Info Mean Gmd .05 .10
252 0 252 1 4.798 0.5013 4.009 4.216
.25 .50 .75 .90 .95
4.523 4.812 5.122 5.319 5.473
lowest : 3.555941 3.670392 3.690357 3.696267 3.744690
highest: 5.664264 5.719813 5.808883 5.814942 5.817593
--------------------------------------------------------------------------------
excess_drinking
n missing distinct Info Mean Gmd .05 .10
252 0 252 1 19.91 2.284 16.72 17.17
.25 .50 .75 .90 .95
18.31 20.24 21.41 22.14 22.81
lowest : 15.32379 15.51032 15.65669 15.85072 15.87386
highest: 24.09274 24.13436 24.33990 24.58178 24.90131
--------------------------------------------------------------------------------
unemployment
n missing distinct Info Mean Gmd .05 .10
252 0 252 1 4.354 1.111 2.932 3.175
.25 .50 .75 .90 .95
3.687 4.280 4.955 5.584 5.964
lowest : 1.791881 2.096567 2.252619 2.309637 2.339545
highest: 6.839601 6.852522 6.944904 7.153895 8.269411
--------------------------------------------------------------------------------
med_household_income
n missing distinct Info Mean Gmd .05 .10
252 0 251 1 62.1 15.1 47.05 48.84
.25 .50 .75 .90 .95
51.90 58.43 65.59 82.86 96.40
lowest : 41.267 41.470 43.145 43.621 43.704
highest: 110.252 112.722 116.328 117.275 117.767
--------------------------------------------------------------------------------
unemp_cat
n missing distinct
252 0 2
Value high low
Frequency 126 126
Proportion 0.5 0.5
--------------------------------------------------------------------------------
med_income_cat
n missing distinct
252 0 5
lowest : quant_1 quant_2 quant_3 quant_4 quant_5
highest: quant_1 quant_2 quant_3 quant_4 quant_5
Value quant_1 quant_2 quant_3 quant_4 quant_5
Frequency 51 50 51 50 50
Proportion 0.202 0.198 0.202 0.198 0.198
--------------------------------------------------------------------------------
3.6 Three Important Checks
3.6.1 Each of my variables must have data for 75+% of counties in each state I plan to study
I will create a tibble which summarizes any missing data.
chr_2021 %>%
    miss_var_summary()
# A tibble: 10 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 violent_crime 4 1.59
2 fipscode 0 0
3 state 0 0
4 county 0 0
5 poor_mental_health_days 0 0
6 excess_drinking 0 0
7 unemployment 0 0
8 med_household_income 0 0
9 unemp_cat 0 0
10 med_income_cat 0 0
I see that violent_crime is missing for 4 counties (1.6% of counties). I will run favstats to determine the distribution of my missing data across states, and I will use mutate to compute the percentage of counties with available data in each state.
mosaic::favstats(violent_crime ~ state, data = chr_2021) %>%
    select(state, n, missing) %>%
    mutate(pct_available = 100*(n - missing)/n) %>%
    kable()
state | n | missing | pct_available |
---|---|---|---|
NJ | 21 | 0 | 100.0000 |
NY | 62 | 0 | 100.0000 |
OH | 84 | 4 | 95.2381 |
PA | 67 | 0 | 100.0000 |
VT | 14 | 0 | 100.0000 |
All 4 of the counties that are missing violent_crime are in Ohio. I have these data available for 95.2% of counties in Ohio and 100% of counties in my other 4 states (NJ, NY, PA, VT).
3.6.2 Each of my 5 selected variables must have 10+ distinct non-missing values
I will create a tibble which shows the number of distinct non-missing values for each variable.
chr_2021 %>%
    summarize(across(violent_crime:med_household_income, ~ n_distinct(.)))
# A tibble: 1 x 5
violent_crime poor_mental_healt~ excess_drinking unemployment med_household_i~
<int> <int> <int> <int> <int>
1 249 252 252 252 251
I have at least 10 distinct values per variable.
3.6.3 Each of my categorical variables must include 10+ distinct counties in every category
I will create a table to show the number of distinct counties in each category of my two categorical variables.
chr_2021 %>% tabyl(unemp_cat)
unemp_cat n percent
high 126 0.5
low 126 0.5
chr_2021 %>% tabyl(med_income_cat)
med_income_cat n percent
quant_1 51 0.2023810
quant_2 50 0.1984127
quant_3 51 0.2023810
quant_4 50 0.1984127
quant_5 50 0.1984127
I have at least 10 counties in each category for each of the categorical variables that I created.
3.7 Saving the Tibble
Here I will save my tibble as a file.
saveRDS(chr_2021, file = "chr_2021_Catherine_Heinzinger.Rds")
3.8 Proposal Requirement 5
The most challenging part of this work so far has been determining exactly what each line of code means. Throughout the data ingestion and tidying, I have had to keep in mind which variable was my outcome measure and which were my predictors. I had to think creatively about variables that may differ across these states and counties and could potentially predict violent crime. In order to overcome this difficulty, I used Dr. Love’s example, but I rewrote each chunk of code myself and I wrote one line at a time to see how each line changed the output. I made sure to say each line of code as an English sentence so I could start to really understand what it would do to the data and output. This took a lot of extra reading and internet searches, but made this proposal an incredible learning experience.
4 Analysis 1: Linear Regression Model
In this section, I will build a linear regression model to predict a quantitative outcome, violent crime rate, using a quantitative predictor, number of poor mental health days.
4.1 The Variables
My outcome variable is violent_crime and my predictor variable is poor_mental_health_days. violent_crime is a rate that describes the number of reported violent crime offenses per 100,000 of the county's population. In this data set, a violent crime is an offense that involves face-to-face confrontation between a victim and a perpetrator, and includes only those crimes reported to the FBI. My outcome variable is missing in 4 counties, all of which are in Ohio. poor_mental_health_days is an age-adjusted rate that describes the average number of mentally unhealthy days reported in the past 30 days among the non-institutionalized adult population. My predictor variable is not missing for any counties in my states of interest, which are New Jersey, New York, Ohio, Pennsylvania, and Vermont.
I will restrict my data set to the complete cases of these two variables and create a tibble called chr_a1. This data set will contain values for violent_crime, poor_mental_health_days, county, and state. violent_crime will be renamed to outcome_crime and poor_mental_health_days will be renamed to predictor_mhealth. I will add a new variable called id to give each row a unique identifier, to help with my analyses later on.
chr_a1 <- chr_2021 %>%
    filter(complete.cases(violent_crime, poor_mental_health_days)) %>%
    select(violent_crime, poor_mental_health_days, county, state) %>%
    rename(outcome_crime = violent_crime,
           predictor_mhealth = poor_mental_health_days) %>%
    mutate(id = row_number()) %>%
    clean_names()
chr_a1
# A tibble: 248 x 5
outcome_crime predictor_mhealth county state id
<dbl> <dbl> <chr> <fct> <int>
1 373. 4.72 Atlantic County NJ 1
2 84.3 4.00 Bergen County NJ 2
3 163. 4.45 Burlington County NJ 3
4 465. 4.60 Camden County NJ 4
5 236. 4.55 Cape May County NJ 5
6 516. 4.93 Cumberland County NJ 6
7 606. 4.37 Essex County NJ 7
8 116. 4.51 Gloucester County NJ 8
9 342. 4.28 Hudson County NJ 9
10 42.3 4.02 Hunterdon County NJ 10
# ... with 238 more rows
The above tibble has 248 rows (counties), which I expected because 248 counties have complete data on both outcome_crime and predictor_mhealth.
One of the counties in my data set has a '0' value for violent_crime. In order to properly test transformations of my data, particularly to take the logarithm of my outcome, I need to omit this county from my data set. To do this, I will filter my data to include only counties with an outcome value greater than '0'.
chr_a1 <- chr_a1 %>%
    filter(outcome_crime > 0) %>%
    mutate(id = row_number())
nrow(chr_a1)
[1] 247
There are 247 rows (counties) remaining, which is correct.
I will specify the values of my outcome, outcome_crime, and predictor, predictor_mhealth, for Cuyahoga County in Ohio, where CWRU's campus is located.
chr_a1 %>%
    filter(county == "Cuyahoga County") %>%
    summarise(outcome_crime, predictor_mhealth, county, state) %>%
    kable(digits = 2)
outcome_crime | predictor_mhealth | county | state |
---|---|---|---|
645.99 | 4.97 | Cuyahoga County | OH |
In Ohio's Cuyahoga County, the violent crime rate is 645.99 per 100,000 people and the average number of poor mental health days in the past 30 days is 4.97.
4.2 Research Question
Do populations with a higher rate of poor mental health days also have a higher rate of violent crime, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont?
4.3 Visualizing the Data
I will visualize my data in a scatter plot using the ggplot function, with the addition of a linear model and a loess smooth curve, to assess how well predictor_mhealth predicts outcome_crime. I will identify the two furthest outliers by creating two subsets of the data: one with a violent crime rate >1000 (s1) and one with a violent crime rate >750 and <1000 (s2). In order to more easily summarize the county and state of these outliers, I will create a third subset (s3) which includes both outliers by using a standard of violent crime rate >750. Finally, I will create a data summary to determine which states my two outliers are in.
s1 <- filter(chr_a1, outcome_crime > 1000)
s2 <- filter(chr_a1, outcome_crime > 750 & outcome_crime < 1000)
s3 <- filter(chr_a1, outcome_crime > 750)
ggplot(chr_a1, aes(x = predictor_mhealth, y = outcome_crime)) +
geom_point(size = 2, alpha = 0.6) +
geom_smooth(method = "lm", formula = y ~ x,
se = TRUE, size = 1, col = "blue") +
geom_smooth(method = "loess", formula = y ~ x,
se = FALSE, size = 1, col = "red") +
geom_point(data = s1, size = 2, col = "forestgreen") +
geom_text(data = s1, label = s1$county,
vjust = 1.7, col = "forestgreen") +
geom_point(data = s2, size = 2, col = "tomato") +
geom_text(data = s2, label = s2$county,
vjust = 1.7, col = "tomato") +
labs(title = "Poor Mental Health as a Predictor of Violent Crime",
subtitle = "with OLS in blue, loess smooth in red, and outliers labeled",
caption = "247 counties in NJ, NY, OH, PA, VT from `chr_a1`",
x = "# Poor Mental Health Days in Past 30 Days",
y = "Violent Crime Rate (per 100,000 people)")
s3 %>% summarise(id, county, state) %>%
    kable()
id | county | state |
---|---|---|
128 | Lucas County | OH |
217 | Philadelphia County | PA |
The number of poor mental health days in the past 30 days does not predict the violent crime rate well in these 247 counties. The relationship is very weak and slightly negative. The fact that it is nearly a horizontal line indicates that it is essentially depicting the mean of the y axis values, which are quite randomly scattered. Although this model does not work well for any counties, I highlighted the two counties for which this model is least able to predict violent crime, by identifying the two points with the furthest vertical distance from the linear model. These are Lucas County in Ohio and Philadelphia County in Pennsylvania.
4.4 Transformation Assessment
I applied a logarithm, an inverse, and a square to my outcome and then to my predictor. None of these appeared to provide a better outcome prediction visually. I then used the boxCox
function to help me determine if a transformation is recommended, and if so, which. I used the powerTransform
function to tell me a more accurate value on the Profile Log-likelihood graph.
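As a sketch of how those candidate transformations were compared side by side (the layout and the object names p_log, p_inv, and p_sq are my own; this assumes chr_a1 and the patchwork package loaded in Section 1.1):

# Sketch: plot each candidate outcome transformation against the predictor
p_log <- ggplot(chr_a1, aes(x = predictor_mhealth, y = log(outcome_crime))) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(title = "log(outcome_crime)")

p_inv <- ggplot(chr_a1, aes(x = predictor_mhealth, y = 1/outcome_crime)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(title = "1/outcome_crime")

p_sq <- ggplot(chr_a1, aes(x = predictor_mhealth, y = outcome_crime^2)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(title = "outcome_crime^2")

p_log + p_inv + p_sq  # patchwork places the three candidates side by side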
boxCox(chr_a1$outcome_crime ~ chr_a1$predictor_mhealth)
powerTransform(chr_a1$outcome_crime ~ chr_a1$predictor_mhealth)
Estimated transformation parameter
Y1
0.04528785
This indicates that I should consider applying a logarithm to my outcome. The following code will create a scatter plot of my model with my outcome transformed with a logarithm, created using ggplot. I created new subsets to denote outliers: s4, which contains values of the logarithm of my outcome >2.9 and <3.2, and s5, which contains values of the logarithm of my outcome <2.9. I will add a linear model and a loess smooth curve.
s4 <- filter(chr_a1, log(outcome_crime) > 2.9, log(outcome_crime) < 3.2)
s5 <- filter(chr_a1, log(outcome_crime) < 2.9)
ggplot(chr_a1, aes(x = predictor_mhealth, y = log(outcome_crime))) +
geom_point(size = 2, alpha = 0.6) +
geom_smooth(method = "lm", formula = y ~ x,
se = TRUE, size = 1, col = "blue") +
geom_smooth(method = "loess", formula = y ~ x,
se = FALSE, size = 1, col = "red") +
geom_point(data = s1, size = 2, col = "slateblue") +
geom_text(data = s1, label = s1$county,
vjust = 1.7, hjust = 0.4, col = "slateblue") +
geom_point(data = s2, size = 2, col = "slateblue") +
geom_text(data = s2, label = s2$county,
vjust = -0.6, col = "slateblue") +
geom_point(data = s4, size = 2, col = "slateblue") +
geom_text(data = s4, label = s4$county,
vjust = -0.6, col = "slateblue") +
geom_point(data = s5, size = 2, col = "slateblue") +
geom_text(data = s5, label = s5$county,
vjust = 0, hjust = -0.1, col = "slateblue") +
labs(title = "Poor Mental Health as a Predictor of Log(Violent Crime)",
subtitle = "with OLS in blue, loess smooth in red, and outliers labeled",
caption = "247 counties in NJ, NY, OH, PA, VT from `chr_a1`",
x = "# Poor Mental Health Days in Past 30 Days",
y = "Log(Violent Crime Rate (per 100,000 people))")
My loess smooth curve here is quite straight and does not have substantial bend. Although still quite scattered, the outliers are not as far from the linear model as in other transformations. The log appears to fix the problem of right skew, now having some outliers on both sides of the range.
4.5 The Fitted Model
4.5.1 Prediction Equation
In the following code, I will create a transformed straight line model, fitted by least squares, called m1_log, that uses predictor_mhealth to predict the logarithm of outcome_crime. I will use the equatiomatic package to create the fitted line equation in an attractive format.
m1_log <- lm(log(outcome_crime) ~ predictor_mhealth, data = chr_a1)
extract_eq(m1_log, use_coefs = TRUE, coef_digits = 3)
\[
\operatorname{\widehat{log(outcome\_crime)}} = 5.064 - 0.003(\operatorname{predictor\_mhealth})
\]
The slope of my predictor is negative, indicating that as predictor_mhealth increases, the log of outcome_crime decreases. For every additional poor mental health day in the past 30 days, the log of the violent crime rate per 100,000 people will be lower by 0.003. This model also implies that if the number of poor mental health days in the past 30 days were '0', the log of the violent crime rate per 100,000 people would be 5.06.
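To put these coefficients back on the original scale, here is a bit of illustrative arithmetic of my own (not part of the fitted model output):

exp(5.064)   # implied rate when predictor_mhealth = 0: about 158 per 100,000 (an extrapolation)
exp(-0.003)  # multiplicative effect of one additional poor mental health day: about 0.997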
4.5.2 Tidy Summary
I will create a summary of m1_log's coefficients with a 90% confidence interval, using the tidy function, made into an attractive format using kable.
tidy(m1_log, conf.int = TRUE, conf.level = 0.90) %>%
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 5.064 | 0.445 | 11.377 | 0.000 | 4.329 | 5.798 |
predictor_mhealth | -0.003 | 0.093 | -0.033 | 0.973 | -0.156 | 0.150 |
These data show me that for every additional poor mental health day in the past 30 days, the log of the county's violent crime rate per 100,000 people will be lower by 0.003, with a 90% confidence interval of (-0.156, 0.150). The fact that my confidence interval contains '0', and that my estimate is so near '0', indicates that my model cannot distinguish this slope from zero: the data are nearly as consistent with a higher number of poor mental health days predicting lower violent crime rates as with it predicting higher rates.
4.5.3 Specified Summary Data
I will use the glance
function to view the model’s R-squared, residual standard error, and the number of observations to which the model was fit. I will use the cor
function to calculate the model’s Pearson correlation coefficient.
glance(m1_log) %>%
select(r.squared, sigma, nobs) %>%
kable(digits = 3)
r.squared | sigma | nobs |
---|---|---|
0 | 0.641 | 247 |
chr_a1 %$% cor(outcome_crime, predictor_mhealth)
[1] -0.04183222
The multiple R-squared is 0, when rounded to a reasonable number of digits. This implies that essentially none of the variation in the log of my outcome is explained by my predictor, per my model. The residual standard error is 0.641, indicating that roughly 95% of my residuals will fall between -1.28 and +1.28, which is not impressive given the range of my log(outcome) values. The number of observations is 247. The Pearson correlation coefficient is -0.042, which implies a weak correlation that is unlikely to be meaningful in this context. I used R to calculate this directly rather than taking the square root of R-squared, because R-squared rounds to 0.
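As a quick empirical check on that coverage statement (a supplement of my own, using only functions already loaded), I can compute the share of residuals falling within 2 residual standard errors:

sigma_m1 <- glance(m1_log)$sigma
mean(abs(resid(m1_log)) < 2 * sigma_m1)  # should be near 0.95 if the residuals are roughly normal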
4.6 Residual Analysis
In this section, I will analyze the residuals from my transformed linear model. I will use augment
to obtain the residuals.
m1_log_aug <- augment(m1_log, data = chr_a1)
m1_log_aug
# A tibble: 247 x 11
outcome_crime predictor_mhealth county state id .fitted .resid .hat
<dbl> <dbl> <chr> <fct> <int> <dbl> <dbl> <dbl>
1 373. 4.72 Atlantic~ NJ 1 5.05 0.873 0.00416
2 84.3 4.00 Bergen C~ NJ 2 5.05 -0.618 0.0171
3 163. 4.45 Burlingt~ NJ 3 5.05 0.0423 0.00644
4 465. 4.60 Camden C~ NJ 4 5.05 1.09 0.00483
5 236. 4.55 Cape May~ NJ 5 5.05 0.414 0.00529
6 516. 4.93 Cumberla~ NJ 6 5.05 1.20 0.00448
7 606. 4.37 Essex Co~ NJ 7 5.05 1.36 0.00764
8 116. 4.51 Gloucest~ NJ 8 5.05 -0.300 0.00571
9 342. 4.28 Hudson C~ NJ 9 5.05 0.784 0.00945
10 42.3 4.02 Hunterdo~ NJ 10 5.05 -1.31 0.0165
# ... with 237 more rows, and 3 more variables: .sigma <dbl>, .cooksd <dbl>,
# .std.resid <dbl>
The residual values are stored as .resid
for each row (county) in the above tibble.
4.6.1 Assessment of linearity and normality
I will create a pair of residual plots: a scatter plot of residuals vs. fitted values to assess linearity and a normal Q-Q plot to assess normality of the residuals. I will use ggplot
and the patchwork package and include a linear model and a loess smooth curve. I will also use slice
to identify the counties and states of the 5 outliers I have coded my plot to show. The plot is too crowded to attractively show the county names on it.
p1 <- ggplot(data = m1_log_aug, aes(x = .fitted, y = .resid)) +
    geom_point(size = 2, alpha = 0.6) +
geom_point(data = m1_log_aug %>%
slice_max(abs(.resid), n = 5),
col = "red", size = 2) +
geom_text_repel(data = m1_log_aug %>%
slice_max(abs(.resid), n = 5),
aes(label = id), col = "red") +
geom_smooth(method = "lm", formula = y ~ x,
se = FALSE, size = 1, col = "blue") +
geom_smooth(method = "loess", formula = y ~ x,
se = FALSE, size = 1, col = "red") +
theme(aspect.ratio = 0.5) +
labs(title = "Residuals vs. Fitted",
subtitle = "with OLS in blue, loess smooth in red",
x = "Fitted Values",
y = "Residuals")
p2 <- ggplot(data = m1_log_aug, aes(sample = .resid)) +
    geom_qq() +
geom_qq_line(col = "red") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot",
x = "Theoretical Quantiles",
y = "Residuals")
p1 + p2 +
    plot_annotation(title = "Assessment of Residuals",
subtitle = "for the linear model using `predictor_mhealth` to predict log(`outcome_crime`)",
caption = "247 counties in NJ, NY, OH, PA, and VT in m1_log_aug")
slice(m1_log_aug, c(119, 128, 217, 238, 242)) %>%
summarise(id, county, state, .resid) %>%
kable(digits = 2)
id | county | state | .resid |
---|---|---|---|
119 | Holmes County | OH | -2.02 |
128 | Lucas County | OH | 1.67 |
217 | Philadelphia County | PA | 1.86 |
238 | Essex County | VT | -2.26 |
242 | Orange County | VT | -2.10 |
This shows a scatter plot of residuals vs. fitted values which helps me determine linearity of my transformed model, and a normal Q-Q plot of residuals which helps me determine normality of my transformed model. A description of this pair of plots is below.
4.6.2 Residual plot interpretation
Aside from a notable number of outliers (i.e., large residual values), my scatter plot of residuals vs. fitted values does support linearity, showing a weak "fuzzy football" shape. There is no distinct pattern to it. The loess smooth curve does not have substantial bend in it. The normal Q-Q plot shows that my model's residuals follow a normal distribution well, aside from 5 outliers. I wrote my code to show me what these 5 outliers are. They are county id
119, which is Holmes County in Ohio, county id
128, which is Lucas County in Ohio, county id
217, which is Philadelphia County in Pennsylvania, county id
238, which is Essex County in Vermont, and county id
242, which is Orange County in Vermont.
4.6.3 Ability of m1 to predict violent crime rate in Cuyahoga County
I will create an untransformed model and display its prediction for violent crime in Cuyahoga County in Ohio, where CWRU’s campus is located, and compare the prediction, or fitted value, to the actual value of this outcome.
m1 <- lm(outcome_crime ~ predictor_mhealth, data = chr_a1)

m1_aug <- augment(m1, data = chr_a1)
m1_aug %>%
    filter(county == "Cuyahoga County") %>%
    summarise(id, county, state, outcome_crime, .fitted, .resid) %>%
    kable(digits = 2)
id | county | state | outcome_crime | .fitted | .resid |
---|---|---|---|---|---|
100 | Cuyahoga County | OH | 645.99 | 188.56 | 457.43 |
My model’s prediction for violent crime in Cuyahoga County is 188.56, while the actual value is 645.99, creating a residual of 457.43. This residual is much too high in this context to confidently state that my model creates a good prediction for Cuyahoga County.
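For comparison, the transformed model m1_log could also produce a prediction on the original scale by exponentiating its log-scale fitted value. This is a sketch of that check (the cuyahoga object name is mine, and a simple exp() back-transform will somewhat understate the mean on the original scale):

cuyahoga <- chr_a1 %>% filter(county == "Cuyahoga County")
exp(predict(m1_log, newdata = cuyahoga))  # m1_log's prediction, back-transformed to crimes per 100,000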
4.6.4 Identification of counties for which m1 is least successful
I will use the arrange
function to create a tibble of my untransformed data set that is arranged by absolute value of the residuals.
m1_aug %>% select(id, county, state, .resid) %>%
    arrange(desc(abs(.resid))) %>%
    head() %>%
    kable(digits = 2)
id | county | state | .resid |
---|---|---|---|
217 | Philadelphia County | PA | 821.15 |
128 | Lucas County | OH | 637.56 |
100 | Cuyahoga County | OH | 457.43 |
7 | Essex County | NJ | 409.93 |
64 | Richmond County | NY | 399.56 |
62 | Queens County | NY | 390.95 |
The 2 counties for which my model m1
is least successful at predicting the outcome are Philadelphia County in Pennsylvania and Lucas County in Ohio, as these 2 counties have the highest absolute value of the residual. The residual for Lucas County in Ohio is 637.56 and the residual for Philadelphia County in Pennsylvania is 821.15.
4.7 Conclusions and Limitations
I ran this analysis in order to determine if populations with a higher rate of poor mental health days also have a higher rate of violent crime, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont. The answer is no, counties with higher rates of poor mental health days do not have higher rates of violent crime in these counties. My model m1_log
indicates that, in these counties, more frequent poor mental health is associated with lower violent crime, albeit with a very weak correlation. In fact, my model implies that for every additional poor mental health day in the past 30 days, the log of the violent crime rate per 100,000 people will be lower by 0.003. I am 90% confident that the true slope is between (-0.156, 0.150), which contains the value ‘0’, implying that knowing the number of poor mental health days in the past 30 days would provide no meaningful information in predicting violent crime.
One major limitation of my work in Analysis 1 is that all my counties are within northeastern states, i.e. they are not randomly selected. Ohio is sometimes considered northeastern and sometimes considered midwestern, but this grey area does not add any meaningful variation to my selected counties because 4 of the 5 states are definitely northeastern (NJ, NY, PA, and VT). This affects my external validity and the application to my target population, all counties in the US. My model did not appear to violate any of the 4 key regression assumptions, which are linearity, independence, constant variance, and normality. The transformation of my data improved linearity, depicted above with a scatter plot with a loess smooth curve that did not show substantial bend and points which had no pattern when plotted against the fitted values, and improved normality, depicted above in my normal Q-Q plot. These counties were measured independently from each other and are only related by region and state. Finally, there is no fan or funnel shape in my residual plot, which would indicate non-constant variance.
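As one extra check on the constant variance claim (a supplement of my own, using base R's built-in regression diagnostics), a scale-location plot could be drawn for m1_log:

plot(m1_log, which = 3)  # scale-location plot; a roughly flat trend line supports constant variance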
5 Analysis 2: Comparing Groups on a Quantitative Outcome Using Independent Samples
In this section, I will build a linear regression model to predict violent crime rate using unemployment rate, a binary variable, in the counties in NJ, NY, OH, PA, and VT.
5.1 The Variables
My outcome variable is violent_crime and my predictor variable is unemp_cat, a binary categorical variable derived from the variable unemployment. violent_crime is a rate that describes the number of reported violent crime offenses per 100,000 of the county's population. In this data set, a violent crime is an offense that involves face-to-face confrontation between a victim and a perpetrator, and includes only those crimes reported to the FBI. My outcome variable is missing in 4 counties, all of which are in Ohio. unemployment describes the percentage of the county's civilian labor force who are age 16 or older and unemployed but seeking work; unemp_cat is the result of splitting that variable into two categories at its median, with counties designated "low" if their unemployment percentage is below the median of 4.28%, and "high" otherwise. My predictor variable is not missing for any counties in my states of interest, which are New Jersey, New York, Ohio, Pennsylvania, and Vermont.
I will restrict my data set to the complete cases and create a tibble called chr_a2. I will rename unemp_cat to predictor_unemp and violent_crime to outcome_crime.
chr_a2 <- chr_2021 %>%
    filter(complete.cases(violent_crime, unemp_cat)) %>%
    select(violent_crime, unemp_cat, county, state) %>%
    rename(outcome_crime = violent_crime,
           predictor_unemp = unemp_cat) %>%
    clean_names()
chr_a2
# A tibble: 248 x 4
outcome_crime predictor_unemp county state
<dbl> <fct> <chr> <fct>
1 373. high Atlantic County NJ
2 84.3 low Bergen County NJ
3 163. low Burlington County NJ
4 465. low Camden County NJ
5 236. high Cape May County NJ
6 516. high Cumberland County NJ
7 606. high Essex County NJ
8 116. low Gloucester County NJ
9 342. low Hudson County NJ
10 42.3 low Hunterdon County NJ
# ... with 238 more rows
The above tibble has 248 rows (counties), which I expected because 248 counties have complete data on both violent_crime, my outcome, and unemp_cat, my predictor, now predictor_unemp. My predictor is listed appropriately as a factor variable.
I will specify the values of my outcome, outcome_crime, and predictor, predictor_unemp, for Cuyahoga County in Ohio, where CWRU's campus is located.
chr_a2 %>%
    filter(county == "Cuyahoga County") %>%
    summarise(outcome_crime, predictor_unemp, county, state) %>%
    kable(digits = 2)
outcome_crime | predictor_unemp | county | state |
---|---|---|---|
645.99 | low | Cuyahoga County | OH |
In Ohio’s Cuyahoga County, the rate of violent crime is 645.99 per 100,000 people and the unemployment percentage is low, i.e. below the median for the chr_2021
data set.
5.2 Research Question
Do populations with a percentage of unemployed residents above the median have a higher rate of violent crime, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont?
5.3 Visualizing the Data
I will visualize my data in a box plot with violins using the ggplot
function, with the addition of a point to designate the mean for each category. I will identify the two furthest outliers similarly to my method in analysis 1, by creating two subsets of the data. One subset comprises the county with a violent crime rate >1000 (s1) and the other comprises the counties with a violent crime rate >750 and <1000 (s2). In order to more easily summarize the county and state of these outliers, I will create a third subset which includes both outliers by using a standard of violent crime rate >750. Finally, I will create a data summary to determine which states my two outliers are in.
s1 <- filter(chr_a2, outcome_crime > 1000)
s2 <- filter(chr_a2, outcome_crime > 750 & outcome_crime < 1000)
s3 <- filter(chr_a2, outcome_crime > 750)
ggplot(chr_a2, aes(x = predictor_unemp, y = outcome_crime)) +
geom_violin(aes(fill = predictor_unemp)) +
geom_boxplot(width = 0.3, outlier.size = 2, notch = T) +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 3, fill = "slateblue2") +
coord_flip() +
guides(fill = "none") +
scale_fill_manual(values = c("plum", "lightskyblue")) +
geom_point(data = s1, size = 2, col = "forestgreen") +
geom_text(data = s1, label = s1$county,
vjust = -1, hjust = 0.8, col = "forestgreen") +
geom_point(data = s2, size = 2, col = "tomato") +
geom_text(data = s2, label = s2$county,
vjust = 1.7, col = "tomato") +
labs(title = "Unemployment as a Predictor of Violent Crime",
subtitle = "stratified by unemployment percentage above and below the median, with outliers labeled",
caption = "248 counties in NJ, NY, OH, PA, VT from `chr_a2`",
x = "Unemployment Percentage",
y = "Violent Crime Rate (per 100,000 people)")
s3 %>% summarise(county, state) %>%
    kable()
county | state |
---|---|
Lucas County | OH |
Philadelphia County | PA |
This plot of unemployment category vs. violent crime rate shows a right skew both for counties with unemployment percentages below the median ("low") and for those above it ("high"), in these 248 counties. The center of the data for both groups is pushed toward the lower end of the violent crime rate scale. The counties with a high unemployment percentage have a wider range of violent crime rates and a skew that is further right. For both unemployment categories, the mean is higher than the median, because the mean is more affected by the right skew than the median is. I highlighted the two furthest outliers: Lucas County in Ohio and Philadelphia County in Pennsylvania. These are the same outliers as in analysis 1, where I compared poor mental health to violent crime, which makes sense.
5.4 Transformation Assessment
One of the counties in my data set has a '0' value for violent_crime. In order to properly test transformations of my data, particularly to take the logarithm of my outcome, I need to omit this county from my data set. To do this, I will create a data set that filters my data to include only counties with an outcome value greater than '0'.
chr_a2 <- chr_a2 %>%
    filter(outcome_crime > 0) %>%
    mutate(id = row_number())

nrow(chr_a2)
[1] 247
There are 247 rows (counties) remaining, which is correct.
I applied a logarithm, an inverse, and a square to my outcome (my predictor is categorical, so transforming it is not meaningful). None of these visually improved my plots. I then used the boxCox
function to help me determine if a transformation is recommended, and if so, which. I used the powerTransform
function to tell me a more accurate value on the Profile Log-likelihood graph.
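As a sketch of how those candidate outcome transformations were compared across the two unemployment groups (the layout and the object names p_log, p_inv, and p_sq are my own; this assumes chr_a2 and the patchwork package):

# Sketch: boxplots of each candidate transformation, by unemployment group
p_log <- ggplot(chr_a2, aes(x = predictor_unemp, y = log(outcome_crime))) +
    geom_boxplot() + labs(title = "log(outcome_crime)")
p_inv <- ggplot(chr_a2, aes(x = predictor_unemp, y = 1/outcome_crime)) +
    geom_boxplot() + labs(title = "1/outcome_crime")
p_sq <- ggplot(chr_a2, aes(x = predictor_unemp, y = outcome_crime^2)) +
    geom_boxplot() + labs(title = "outcome_crime^2")
p_log + p_inv + p_sq  # patchwork places the three candidates side by side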
boxCox(chr_a2$outcome_crime ~ chr_a2$predictor_unemp)
powerTransform(chr_a2$outcome_crime ~ chr_a2$predictor_unemp)
Estimated transformation parameter
Y1
0.0268951
This indicates that I should consider applying a logarithm to my outcome. The following code will create a box plot with violins of my model with my outcome transformed with a logarithm, created using ggplot.
ggplot(chr_a2, aes(x = predictor_unemp, y = log(outcome_crime))) +
geom_violin(aes(fill = predictor_unemp)) +
geom_boxplot(width = 0.3, outlier.size = 2, notch = T) +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 3, fill = "slateblue2") +
coord_flip() +
guides(fill = "none") +
scale_fill_manual(values = c("plum", "lightskyblue")) +
labs(title = "Unemployment as a Predictor of Log(Violent Crime)",
subtitle = "stratified by unemployment percentage above and below the median",
caption = "247 counties in NJ, NY, OH, PA, VT from `chr_a2`",
x = "Unemployment Percentage",
y = "Violent Crime Rate (per 100,000 people)")
The first change I notice here is that the means are closer to the medians for both categories, indicating there is less skew and my data are likely more normally distributed. The ranges are more similar for the two categories. The low unemployment category has 3 outliers to the left and the high unemployment category has 2 outliers to the right; however, all of the outliers sit closer to the centers of the data. The log appears to have improved the right skew.
5.5 The Fitted Model
5.5.1 Prediction Equation
In the following code, I will create a transformed linear model, fitted by least squares, called m2_log, that uses predictor_unemp to predict the logarithm of outcome_crime. I will use the equatiomatic package to create the fitted line equation in an attractive format.
m2_log <- lm(log(outcome_crime) ~ predictor_unemp, data = chr_a2)
extract_eq(m2_log, use_coefs = TRUE, coef_digits = 3)
\[
\operatorname{\widehat{log(outcome\_crime)}} = 5.153 - 0.206(\operatorname{predictor\_unemp}_{\operatorname{low}})
\]
The coefficient on the indicator for the "low" unemployment group is negative, indicating that counties with unemployment below the median have, on average, a log violent crime rate that is 0.206 lower than counties with unemployment at or above the median. The model also implies that the average log violent crime rate for "high" unemployment counties is 5.153.
5.5.2 Tidy Summary
I will create a summary of m2_log's coefficients with a 90% confidence interval, using the tidy function, made into an attractive format using kable.
tidy(m2_log, conf.int = TRUE, conf.level = 0.90) %>%
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 5.153 | 0.057 | 90.008 | 0.000 | 5.058 | 5.247 |
predictor_unemplow | -0.206 | 0.080 | -2.556 | 0.011 | -0.339 | -0.073 |
These data show me that counties with unemployment below the median have a log violent crime rate that is, on average, 0.206 lower than counties at or above the median, with a 90% confidence interval of (-0.339, -0.073). The fact that my confidence interval does not contain '0' indicates that this difference is statistically detectable at the 90% confidence level.
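Because the model is on the log scale, this group difference is multiplicative on the original scale. A bit of illustrative back-transformation arithmetic (my own supplement, not part of the tidy output):

exp(c(estimate = -0.206, conf.low = -0.339, conf.high = -0.073))
# exp(-0.206) is about 0.81, so "low" unemployment counties have violent crime
# rates about 19% lower than "high" counties, in multiplicative terms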
5.5.3 Specified Summary Data
I will use the glance
function to view the model’s R-squared, residual standard error, and the number of observations to which the model was fit.
glance(m2_log) %>%
select(r.squared, sigma, nobs) %>%
kable(digits = 3)
r.squared | sigma | nobs |
---|---|---|
0.026 | 0.632 | 247 |
The multiple R-squared is 0.026, implying that 2.6% of the variation in log(outcome_crime) is explained by predictor_unemp, per my model. The residual standard error is 0.632 and the number of observations is 247.
5.6 Prediction Analysis
I will start this section by using mutate
to recode my binary variable with more descriptive names, “below median” and “above median”. Then I will run a count of how many rows (counties) in each category.
chr_a2 <- chr_a2 %>%
    mutate(unemp2 = fct_recode(predictor_unemp,
                               "below median" = "low",
                               "above median" = "high"))

chr_a2 %>% count(predictor_unemp, unemp2) %>%
    kable()
predictor_unemp | unemp2 | n |
---|---|---|
high | above median | 122 |
low | below median | 125 |
There are 122 rows (counties) with an unemployment percentage above the median and 125 rows (counties) with an unemployment percentage below the median. These category sizes differ, despite the split at the median, because I omitted 4 counties with missing outcome data and 1 county with an outcome value of '0'.
5.6.1 Normality Assessment
In order to plot the residuals against the categorical predictor and assess normality of the residuals within each category, I will create an augmented data set m2_log_aug using augment, which stores the residuals and fitted values for each row (county).
m2_log_aug <- augment(m2_log, data = chr_a2)
m2_log_aug
# A tibble: 247 x 12
outcome_crime predictor_unemp county state id unemp2 .fitted .resid
<dbl> <fct> <chr> <fct> <int> <fct> <dbl> <dbl>
1 373. high Atlantic C~ NJ 1 above m~ 5.15 0.769
2 84.3 low Bergen Cou~ NJ 2 below m~ 4.95 -0.513
3 163. low Burlington~ NJ 3 below m~ 4.95 0.145
4 465. low Camden Cou~ NJ 4 below m~ 4.95 1.19
5 236. high Cape May C~ NJ 5 above m~ 5.15 0.311
6 516. high Cumberland~ NJ 6 above m~ 5.15 1.09
7 606. high Essex Coun~ NJ 7 above m~ 5.15 1.25
8 116. low Gloucester~ NJ 8 below m~ 4.95 -0.198
9 342. low Hudson Cou~ NJ 9 below m~ 4.95 0.887
10 42.3 low Hunterdon ~ NJ 10 below m~ 4.95 -1.20
# ... with 237 more rows, and 4 more variables: .hat <dbl>, .sigma <dbl>,
# .cooksd <dbl>, .std.resid <dbl>
The residual values are stored as `.resid` for each row (county) in the above tibble.
I will create a pair of residual plots: a normal Q-Q plot of the residuals for the "low" unemployment category and a normal Q-Q plot of the residuals for the "high" unemployment category. This will allow me to assess normality of the residuals within each category. In order to do this, I will create 2 temporary data sets: `m2_log_aug_low`, which contains all the rows (counties) with an unemployment percentage below the median, and `m2_log_aug_high`, which contains all the rows (counties) with an unemployment percentage above the median. I will use `ggplot` and the patchwork package.
m2_log_aug_low <- m2_log_aug %>%
  filter(predictor_unemp == "low")

m2_log_aug_high <- m2_log_aug %>%
  filter(predictor_unemp == "high")

p1 <- ggplot(data = m2_log_aug_low, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       subtitle = "low unemployment category",
       x = "Theoretical Quantiles",
       y = "Residuals")

p2 <- ggplot(data = m2_log_aug_high, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       subtitle = "high unemployment category",
       x = "Theoretical Quantiles",
       y = "Residuals")

p1 + p2 +
  plot_annotation(title = "Assessment of Residuals",
                  subtitle = "for the linear model using `predictor_unemp` to predict log(`outcome_crime`)",
                  caption = "247 counties in NJ, NY, OH, PA, and VT in `m2_log_aug`")
The residuals for the high unemployment category appear closer to a normal distribution than those for the low unemployment category. The low unemployment category shows 3 outliers at the lower end of the scale, and the high unemployment category shows 2 outliers at the higher end, consistent with the two groups sitting on opposite sides of the median split.
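As a numeric complement to the Q-Q plots, here is a small sketch (my addition, using only the augmented data above) summarizing the residuals within each category; a mean near the median with a similar spread in each group supports the normality assumption.

# Summarize residuals within each unemployment category.
m2_log_aug %>%
  group_by(predictor_unemp) %>%
  summarise(n = n(),
            mean_resid = mean(.resid),
            median_resid = median(.resid),
            sd_resid = sd(.resid)) %>%
  kable(digits = 3)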
5.6.2 Ability of the model to predict violent crime rate in Cuyahoga County
I will display an untransformed model's prediction for violent crime in Cuyahoga County in Ohio, where CWRU's campus is located, and compare the prediction, or fitted value, to the actual value of this outcome. To do this, I will create an untransformed model called `m2` that shows `outcome_crime` as a function of `predictor_unemp` from the `chr_a2` data set. Then I will augment this model to show me the residuals, filter to Cuyahoga County, and select the columns I am interested in.
m2 <- lm(outcome_crime ~ predictor_unemp, data = chr_a2)

m2_aug <- augment(m2, data = chr_a2)

m2_aug %>%
  filter(county == "Cuyahoga County") %>%
  select(id, county, state, outcome_crime, .fitted, .resid) %>%
  kable(digits = 2)
id | county | state | outcome_crime | .fitted | .resid |
---|---|---|---|---|---|
100 | Cuyahoga County | OH | 645.99 | 180.77 | 465.22 |
My model's prediction for violent crime in Cuyahoga County is 180.77 crimes per 100,000 people, while the actual value is 645.99, leaving a residual of 465.22. A residual this large means the model does not produce a good prediction for Cuyahoga County.
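For comparison, here is a small sketch (my addition) of what the transformed model `m2_log` predicts for Cuyahoga County once its fitted value is back-transformed with `exp()`:

# Back-transform the log-scale fitted value to crimes per 100,000;
# the prediction remains far below the observed 645.99.
m2_log_aug %>%
  filter(county == "Cuyahoga County") %>%
  mutate(fitted_crime = exp(.fitted)) %>%
  select(county, state, outcome_crime, fitted_crime) %>%
  kable(digits = 2)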
5.6.3 Identification of counties for which m2 is least successful
I will use the `arrange` function to create a tibble of my untransformed data set that is arranged by the absolute value of the residuals.
m2_aug %>% select(id, county, state, .resid) %>%
  arrange(desc(abs(.resid))) %>%
  head() %>%
  kable(digits = 2)
id | county | state | .resid |
---|---|---|---|
217 | Philadelphia County | PA | 800.22 |
128 | Lucas County | OH | 623.14 |
100 | Cuyahoga County | OH | 465.22 |
64 | Richmond County | NY | 417.12 |
62 | Queens County | NY | 417.01 |
52 | New York County | NY | 405.64 |
The 2 counties for which my model `m2` is least successful at predicting the outcome are Philadelphia County in Pennsylvania and Lucas County in Ohio, as these 2 counties have the largest absolute residuals. The residual for Philadelphia County is 800.22 and the residual for Lucas County is 623.14.
5.7 Conclusions and Limitations
I ran these analyses in order to determine whether counties with a percentage of unemployed residents above the median have a higher rate of violent crime, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont. The answer appears to be yes. My model estimates that counties with below-median unemployment have a log violent crime rate 0.206 lower than counties with above-median unemployment; equivalently, above-median counties are predicted to have higher violent crime rates. I am 90% confident that the true difference on the log scale is between (-0.339, -0.073), an interval that is entirely negative and does not contain the value ‘0’.
The logarithm transformation addressed the right skew in the outcome and improved the normality of the residuals, as shown in the Q-Q plots above; with a binary predictor, linearity is not a concern for this model. The major limitation of my work in Analysis 2 is that all my counties are within northeastern states, i.e. they are not randomly selected. Ohio is sometimes considered northeastern and sometimes considered midwestern, but this grey area does not add any meaningful variation to my selected counties because 4 of the 5 states are definitely northeastern (NJ, NY, PA, and VT). This limits my external validity and the application of my results to my target population, all counties in the US.
6 Analysis 3: Adjusting for the Impact of State on my Linear Model
In this section, I will build a linear regression model to predict violent crime using number of poor mental health days in the past 30 days and the state in which the county is located.
6.1 The Variables
My outcome variable is `violent_crime` and my predictor variables are `poor_mental_health_days` and `state`. `violent_crime` is a rate that describes the number of reported violent crime offenses per 100,000 of the county's population. In this data set, a violent crime is an offense that involves face-to-face confrontation between a victim and a perpetrator, and includes only those crimes reported to the FBI. My outcome variable is missing in 4 counties, all of which are in Ohio. `poor_mental_health_days` is an age-adjusted rate that describes the average number of mentally unhealthy days reported in the past 30 days among the non-institutionalized adult population. My predictor variable is not missing for any counties in my states of interest, which are New Jersey, New York, Ohio, Pennsylvania, and Vermont. Finally, `state` describes the US state in which the county is located.
I will restrict my data set to the complete cases of these three variables and create a tibble called `chr_a3`. This data set will contain values for `violent_crime`, `poor_mental_health_days`, `county`, and `state`. `violent_crime` will be renamed to `outcome_crime` and `poor_mental_health_days` will be renamed to `predictor_mhealth`. I will add a new variable called `id` to give each row a unique identifier, to help with my analyses later on.
chr_a3 <- chr_2021 %>%
  filter(complete.cases(violent_crime, poor_mental_health_days, state)) %>%
  select(violent_crime, poor_mental_health_days, county, state) %>%
  rename(outcome_crime = violent_crime,
         predictor_mhealth = poor_mental_health_days) %>%
  mutate(state = fct_relevel(state, "OH")) %>%
  mutate(id = row_number()) %>%
  clean_names()

chr_a3
# A tibble: 248 x 5
outcome_crime predictor_mhealth county state id
<dbl> <dbl> <chr> <fct> <int>
1 373. 4.72 Atlantic County NJ 1
2 84.3 4.00 Bergen County NJ 2
3 163. 4.45 Burlington County NJ 3
4 465. 4.60 Camden County NJ 4
5 236. 4.55 Cape May County NJ 5
6 516. 4.93 Cumberland County NJ 6
7 606. 4.37 Essex County NJ 7
8 116. 4.51 Gloucester County NJ 8
9 342. 4.28 Hudson County NJ 9
10 42.3 4.02 Hunterdon County NJ 10
# ... with 238 more rows
The above tibble has 248 rows (counties), which I expected because 248 counties have complete data on `violent_crime`, my outcome, and on `poor_mental_health_days` and `state`, my predictors.
One of the counties in my data set has a ‘0’ value for `violent_crime`. In order to be able to properly test transformations of my data, particularly to take the logarithm of my outcome (the logarithm of 0 is undefined), I need to omit this county from my data set. To do this, I will filter my data to include only counties with an outcome value greater than ‘0’.
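As a one-line illustration of why this row must go (my addition; note that R returns negative infinity rather than an error here):

log(0)
# [1] -Inf  -- an infinite response value would make lm() fail on the log scale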
chr_a3 <- chr_a3 %>%
  filter(outcome_crime > 0)

nrow(chr_a3)
[1] 247
There are 247 rows (counties) remaining, which is correct.
I will specify the values of my outcome, `outcome_crime`, and my predictor, `predictor_mhealth`, for Cuyahoga County in Ohio, where CWRU's campus is located.
chr_a3 %>%
  filter(county == "Cuyahoga County") %>%
  select(outcome_crime, predictor_mhealth, county, state) %>%
  kable(digits = 2)
outcome_crime | predictor_mhealth | county | state |
---|---|---|---|
645.99 | 4.97 | Cuyahoga County | OH |
In Ohio's Cuyahoga County, the rate of violent crime is 645.99 per 100,000 people and the average number of poor mental health days in the past 30 days is 4.97.
6.2 Research Question
Do populations with a higher rate of poor mental health days also have a higher rate of violent crime, adjusting for the impact of state, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont?
6.3 Visualizing the Data
I will visualize my data in a scatter plot using the `ggplot` function, adding a separate linear fit for each state to assess how well `predictor_mhealth` predicts the log of `outcome_crime`, with results faceted by `state`. I will use the same transformation I used in analysis 1, which is the logarithm of my outcome.
ggplot(data = chr_a3, aes(x = predictor_mhealth, y = log(outcome_crime), col = state, group = state)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
facet_wrap(~ state) +
labs(title = "Poor Mental Health as a Predictor of Log(Violent Crime)",
subtitle = "separated by state, with OLS",
caption = "247 counties in NJ, NY, OH, PA, VT from `chr_a3`",
x = "# Poor Mental Health Days in Past 30 Days",
y = "Log(Violent Crime Rate (per 100,000 people))")
The number of poor mental health days in the past 30 days appears to predict the violent crime rate better in some states than others in these 247 counties. The relationship appears strongest in Pennsylvania, where the points cluster most tightly around the fitted line. The relationship appears weakest in Ohio and New York, where the points show little pattern around the fitted line. New Jersey's panel has the steepest slope, suggesting that poor mental health days add the most predictive value to the outcome in that state compared to the other states.
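To attach a number to these visual impressions, here is a brief sketch (my addition) computing the within-state correlation between the predictor and the log of the outcome:

# Pearson correlation between poor mental health days and log(violent crime),
# computed separately within each state.
chr_a3 %>%
  group_by(state) %>%
  summarise(n = n(),
            correlation = cor(predictor_mhealth, log(outcome_crime))) %>%
  kable(digits = 3)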
6.4 The Fitted Model
6.4.1 Prediction Equation
In this section, I will create a model called `m3_log` that uses `predictor_mhealth` to predict the log of `outcome_crime`, after accounting for `state`. I will use the equatiomatic package to create the fitted line equation in an attractive format.
m3_log <- lm(log(outcome_crime) ~ predictor_mhealth * state, data = chr_a3)

extract_eq(m3_log, use_coefs = TRUE, coef_digits = 3,
           wrap = TRUE, terms_per_line = 1)
\[ \begin{aligned} \operatorname{\widehat{log(outcome\_crime)}} &= 4.069\ + \\ &\quad 0.158(\operatorname{predictor\_mhealth})\ - \\ &\quad 3.878(\operatorname{state}_{\operatorname{NJ}})\ + \\ &\quad 1.558(\operatorname{state}_{\operatorname{NY}})\ - \\ &\quad 0.046(\operatorname{state}_{\operatorname{PA}})\ + \\ &\quad 0.976(\operatorname{state}_{\operatorname{VT}})\ + \\ &\quad 0.975(\operatorname{predictor\_mhealth} \times \operatorname{state}_{\operatorname{NJ}})\ - \\ &\quad 0.242(\operatorname{predictor\_mhealth} \times \operatorname{state}_{\operatorname{NY}})\ + \\ &\quad 0.072(\operatorname{predictor\_mhealth} \times \operatorname{state}_{\operatorname{PA}})\ - \\ &\quad 0.298(\operatorname{predictor\_mhealth} \times \operatorname{state}_{\operatorname{VT}}) \end{aligned} \] The state I will discuss is New York. For New York, the equation will be as follows:
log(outcome_crime) = 4.069 + 0.158(predictor_mhealth) − 3.878(0) + 1.558(1) − 0.046(0) + 0.976(0) + 0.975(0) − 0.242(predictor_mhealth × 1) + 0.072(0) − 0.298(0)
or more simply:
log(outcome_crime)
= (4.069 + 1.558) + (0.158 − 0.242)(predictor_mhealth)
= 5.627 − 0.084(predictor_mhealth)
The overall slope of the predictor for counties in New York is negative, indicating that as `predictor_mhealth` increases, the log of `outcome_crime` decreases, and vice versa. Within counties in the state of New York, for every additional poor mental health day in the past 30 days, the log of the violent crime rate will be lower by 0.084 (the Ohio baseline slope of 0.158 plus the New York interaction of −0.242). This model also implies that if the number of poor mental health days in the past 30 days in a New York county were ‘0’, the log of the violent crime rate would be 5.63.
6.4.2 Tidy Summary
I will create a summary of `m3_log`'s coefficients with a 90% confidence interval using the `tidy` function, made into an attractive format using `kable`. I made Ohio my baseline state when I set up my data set.
tidy(m3_log, conf.int = TRUE, conf.level = 0.90) %>%
select(term, estimate, std.error, conf.low, conf.high) %>%
kable(digits = 3)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 4.069 | 0.875 | 2.624 | 5.514 |
predictor_mhealth | 0.158 | 0.174 | -0.128 | 0.445 |
stateNJ | -3.878 | 1.910 | -7.032 | -0.725 |
stateNY | 1.558 | 1.370 | -0.705 | 3.821 |
statePA | -0.046 | 1.519 | -2.555 | 2.463 |
stateVT | 0.976 | 2.887 | -3.791 | 5.743 |
predictor_mhealth:stateNJ | 0.975 | 0.423 | 0.277 | 1.673 |
predictor_mhealth:stateNY | -0.242 | 0.295 | -0.729 | 0.245 |
predictor_mhealth:statePA | 0.072 | 0.302 | -0.427 | 0.571 |
predictor_mhealth:stateVT | -0.298 | 0.627 | -1.334 | 0.737 |
These data show me that for every additional poor mental health day in the past 30 days, counties in Ohio (my baseline state) are predicted to have a log violent crime rate higher by 0.158, with a 90% confidence interval of (-0.128, 0.445). Because this interval contains ‘0’, the data are consistent with either a positive or a negative association in Ohio. The only interaction term whose 90% confidence interval does not contain ‘0’ is `predictor_mhealth:stateNJ`, with a positive estimate of 0.975 (0.277, 1.673). This implies that the slope for counties in New Jersey is meaningfully larger than the Ohio slope, and that the combined New Jersey slope (0.158 + 0.975 = 1.133) is positive.
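Because the New Jersey slope combines two coefficients, a confidence interval for it needs the covariance of those estimates. Here is a base-R sketch (my addition) of that calculation:

# 90% CI for the combined New Jersey slope (baseline slope + NJ interaction),
# using the estimated coefficient covariance matrix.
b <- coef(m3_log)
V <- vcov(m3_log)
est <- b["predictor_mhealth"] + b["predictor_mhealth:stateNJ"]
se <- sqrt(V["predictor_mhealth", "predictor_mhealth"] +
           V["predictor_mhealth:stateNJ", "predictor_mhealth:stateNJ"] +
           2 * V["predictor_mhealth", "predictor_mhealth:stateNJ"])
est + c(-1, 1) * qt(0.95, df.residual(m3_log)) * se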
6.4.3 Specified Summary Data
I will use the `glance` function to view the model's R-squared, residual standard error, and the number of observations to which the model was fit.
glance(m3_log) %>%
select(r.squared, sigma, nobs) %>%
kable(digits = 3)
r.squared | sigma | nobs |
---|---|---|
0.163 | 0.596 | 247 |
The multiple R-squared is 0.163, which implies that 16.3% of the variation in the log of `outcome_crime` is explained by `predictor_mhealth` and `state` together, as per my model. The residual standard error is 0.596, so if the residuals are roughly normal I would anticipate that about 95% of my model's errors fall between -1.192 and +1.192 on the log scale. The number of observations is 247.
6.4.4 Making Predictions at the median number of mental health days, for each state
I will determine the median value of my predictor in my data set. Then I will create a new data set in which each of my states of interest has a predictor value equal to that median.
median(chr_a3$predictor_mhealth)
[1] 4.789679
new_data <- tibble(predictor_mhealth = c(4.79, 4.79, 4.79, 4.79, 4.79),
                   state = c("NJ", "NY", "OH", "PA", "VT"))

new_data %>% kable(digits = 2)
predictor_mhealth | state |
---|---|
4.79 | NJ |
4.79 | NY |
4.79 | OH |
4.79 | PA |
4.79 | VT |
The median predictor value for the data set is 4.79. This table shows each state paired with this value of the predictor. Now, I will use `augment` to calculate the fitted values.
augment(m3_log, newdata = new_data) %>%
kable(digits = 2)
predictor_mhealth | state | .fitted |
---|---|---|
4.79 | NJ | 5.62 |
4.79 | NY | 5.23 |
4.79 | OH | 4.83 |
4.79 | PA | 5.13 |
4.79 | VT | 4.37 |
At the same (median) predictor value, the fitted values of log violent crime differ substantially by state: New Jersey (5.62) and New York (5.23) are the highest, while Vermont (4.37) is the lowest. This spread suggests that a model ignoring `state` would be least successful at predicting counties in the states whose fitted values sit farthest from the overall average, such as New Jersey and New York.
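Back-transforming these fitted values with `exp()` puts the state-to-state differences on the original crimes-per-100,000 scale; a sketch (my addition):

# Predicted violent crime rates (per 100,000) at the median number of
# poor mental health days, by state.
augment(m3_log, newdata = new_data) %>%
  mutate(fitted_crime = exp(.fitted)) %>%
  kable(digits = 1)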
6.5 Residual Analysis
In this section, I will analyze the residuals from my transformed linear model. I will use `augment` to obtain the residuals.
m3_log_aug <- augment(m3_log, data = chr_a3)

m3_log_aug
# A tibble: 247 x 11
outcome_crime predictor_mhealth county state id .fitted .resid .hat
<dbl> <dbl> <chr> <fct> <int> <dbl> <dbl> <dbl>
1 373. 4.72 Atlantic C~ NJ 1 5.54 0.387 0.0914
2 84.3 4.00 Bergen Cou~ NJ 2 4.72 -0.288 0.113
3 163. 4.45 Burlington~ NJ 3 5.24 -0.143 0.0491
4 465. 4.60 Camden Cou~ NJ 4 5.40 0.742 0.0650
5 236. 4.55 Cape May C~ NJ 5 5.34 0.121 0.0575
6 516. 4.93 Cumberland~ NJ 6 5.78 0.464 0.170
7 606. 4.37 Essex Coun~ NJ 7 5.15 1.26 0.0477
8 116. 4.51 Gloucester~ NJ 8 5.30 -0.550 0.0532
9 342. 4.28 Hudson Cou~ NJ 9 5.04 0.791 0.0528
10 42.3 4.02 Hunterdon ~ NJ 10 4.74 -0.999 0.106
# ... with 237 more rows, and 3 more variables: .sigma <dbl>, .cooksd <dbl>,
# .std.resid <dbl>
The residual values are stored as `.resid` for each row (county) in the above tibble.
6.5.1 Assessment of linearity and normality
I will create a pair of residual plots: a scatter plot of residuals vs. fitted values to assess linearity and a normal Q-Q plot to assess normality of the residuals. I will use ggplot and the patchwork package and include a linear model and a loess smooth curve.
p1 <- ggplot(data = m3_log_aug, aes(x = .fitted, y = .resid)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE, size = 1, col = "blue") +
  geom_smooth(method = "loess", formula = y ~ x,
              se = FALSE, size = 1, col = "red") +
  labs(title = "Residuals vs. Fitted",
       subtitle = "with OLS (blue), loess (red)",
       x = "Fitted Values",
       y = "Residuals")

p2 <- ggplot(data = m3_log_aug, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       x = "Theoretical Quantiles",
       y = "Residuals")

p1 + p2 +
  plot_annotation(title = "Assessment of Residuals",
                  subtitle = "using mental health to predict the log of violent crime with the impact of state",
                  caption = "247 counties in NJ, NY, OH, PA, and VT in m3_log_aug")
This shows a scatterplot of residuals vs. fitted values which helps me determine linearity of my transformed model, and a normal Q-Q plot of residuals which helps me determine normality of my transformed model. A description of this pair of plots is below.
6.5.2 Residual plot interpretation
My scatter plot of residuals vs. fitted values shows 2 mild clusters of points around fitted values of roughly 4.9 and 5.25, but no substantial bend in the loess curve, so I can assume linearity of my transformed model. The normal Q-Q plot on the right indicates that I can assume normality of the residuals; there are 2-4 mild outliers, but nothing severe enough to affect my model meaningfully.
6.5.3 Ability of m3 to predict violent crime rate in Cuyahoga County
I will create an untransformed model and display its prediction for violent crime in Cuyahoga County in Ohio, where CWRU's campus is located, and compare the prediction, or fitted value, to the actual value of this outcome.
m3 <- lm(outcome_crime ~ predictor_mhealth * state, data = chr_a3)

m3_aug <- augment(m3, data = chr_a3)

m3_aug %>%
  filter(county == "Cuyahoga County") %>%
  select(id, county, state, outcome_crime, .fitted, .resid) %>%
  kable(digits = 2)
id | county | state | outcome_crime | .fitted | .resid |
---|---|---|---|---|---|
100 | Cuyahoga County | OH | 645.99 | 160.15 | 485.84 |
After adjusting for state, the model becomes less successful at this prediction for Cuyahoga County. My model's prediction for violent crime here is 160.15 per 100,000 people, while the actual value is 645.99, leaving a residual of 485.84, larger than in the model that did not account for the impact of state.
6.5.4 Identification of counties for which m3 is least successful
I will use the `arrange` function to create a tibble of my untransformed data set that is arranged by the absolute value of the residuals.
m3_aug %>% select(id, county, state, .resid) %>%
  arrange(desc(abs(.resid))) %>%
  head() %>%
  kable(digits = 2)
id | county | state | .resid |
---|---|---|---|
218 | Philadelphia County | PA | 752.84 |
128 | Lucas County | OH | 662.92 |
100 | Cuyahoga County | OH | 485.84 |
7 | Essex County | NJ | 373.14 |
24 | Bronx County | NY | 364.87 |
64 | Richmond County | NY | 359.40 |
The 2 counties for which my model `m3` is least successful at predicting the outcome are Philadelphia County in Pennsylvania and Lucas County in Ohio, as these 2 counties have the largest absolute residuals. The residual for Philadelphia County is 752.84 and the residual for Lucas County is 662.92.
6.6 Conclusion and Limitations
I sought to determine whether populations with a higher rate of poor mental health days also have a higher rate of violent crime after adjusting for the impact of state, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont. The answer is yes for counties in New Jersey, and inconclusive for counties in New York, Pennsylvania, Ohio, and Vermont. In New Jersey counties, a higher number of poor mental health days is associated with a higher violent crime rate, with a 90% confidence interval that does not contain ‘0’. The corresponding confidence intervals for the other states do contain ‘0’.
Again, the major limitation here is the lack of randomization in my sample: all my counties are within northeastern states. Ohio is sometimes considered northeastern and sometimes considered midwestern, but this grey area does not add any meaningful variation to my selected counties. This limits my external validity and the application of my results to my target population, all counties in the US. As the residuals vs. fitted plot and normal Q-Q plot above show, normality and linearity were not an issue in my transformed model.
7 Session Information
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggridges_0.5.3 mosaicData_0.20.2 ggformula_0.10.1 ggstance_0.3.5
[5] Matrix_1.3-4 lattice_0.20-44 forcats_0.5.1 stringr_1.4.0
[9] dplyr_1.0.7 purrr_0.3.4 readr_2.0.2 tidyr_1.1.4
[13] tibble_3.1.4 tidyverse_1.3.1 broom_0.7.9 car_3.0-11
[17] carData_3.0-4 patchwork_1.1.1 equatiomatic_0.3.0 ggrepel_0.9.1
[21] ggplot2_3.3.5 naniar_0.6.1 magrittr_2.0.1 janitor_2.1.0
[25] rmdformats_1.0.3 knitr_1.36
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 ellipsis_0.3.2 rio_0.5.27
[4] visdat_0.5.3 leaflet_2.0.4.1 htmlTable_2.3.0
[7] snakecase_0.11.0 base64enc_0.1-3 ggdendro_0.1.22
[10] fs_1.5.0 rstudioapi_0.13 farver_2.1.0
[13] bit64_4.0.5 fansi_0.5.0 lubridate_1.7.10
[16] xml2_1.3.2 mosaic_1.8.3 splines_4.1.1
[19] polyclip_1.10-0 Formula_1.2-4 jsonlite_1.7.2
[22] cluster_2.1.2 dbplyr_2.1.1 png_0.1-7
[25] ggforce_0.3.3 shiny_1.7.1 compiler_4.1.1
[28] httr_1.4.2 backports_1.2.1 assertthat_0.2.1
[31] fastmap_1.1.0 cli_3.0.1 later_1.3.0
[34] tweenr_1.0.2 htmltools_0.5.2 tools_4.1.1
[37] gtable_0.3.0 glue_1.4.2 Rcpp_1.0.7
[40] cellranger_1.1.0 jquerylib_0.1.4 vctrs_0.3.8
[43] nlme_3.1-152 crosstalk_1.1.1 xfun_0.26
[46] openxlsx_4.2.4 rvest_1.0.2 mime_0.12
[49] lifecycle_1.0.1 mosaicCore_0.9.0 MASS_7.3-54
[52] scales_1.1.1 vroom_1.5.5 hms_1.1.1
[55] promises_1.2.0.1 parallel_4.1.1 RColorBrewer_1.1-2
[58] yaml_2.2.1 curl_4.3.2 gridExtra_2.3
[61] sass_0.4.0 rpart_4.1-15 labelled_2.9.0
[64] latticeExtra_0.6-29 stringi_1.7.4 highr_0.9
[67] checkmate_2.0.0 zip_2.2.0 rlang_0.4.11
[70] pkgconfig_2.0.3 evaluate_0.14 labeling_0.4.2
[73] htmlwidgets_1.5.4 bit_4.0.4 tidyselect_1.1.1
[76] plyr_1.8.6 bookdown_0.24 R6_2.5.1
[79] generics_0.1.1 Hmisc_4.6-0 DBI_1.1.1
[82] mgcv_1.8-36 pillar_1.6.4 haven_2.4.3
[85] foreign_0.8-81 withr_2.4.2 nnet_7.3-16
[88] survival_3.2-11 abind_1.4-5 modelr_0.1.8
[91] crayon_1.4.2 utf8_1.2.2 tzdb_0.1.2
[94] rmarkdown_2.11 jpeg_0.1-9 grid_4.1.1
[97] readxl_1.3.1 data.table_1.14.2 reprex_2.0.1
[100] digest_0.6.28
[ reached getOption("max.print") -- omitted 4 entries ]