Impact of Mental Health and Unemployment on Violent Crime in Select US Counties

1 Preliminaries

1.1 My R Packages

I will load my packages here.

library(janitor)
library(knitr)
library(magrittr)
library(naniar)
library(ggrepel)
library(equatiomatic)
library(patchwork)
library(car)
library(broom)
library(tidyverse)

source("Love-boost.R")

theme_set(theme_bw())

1.2 Data Ingest

I will read in the raw data here.

data_url <- 
  "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2021.csv"
chr_2021_raw <- read_csv(data_url, skip = 1, guess_max = 4000)

2 Data Development

2.1 Selecting My Data

The following code creates a data set called chr_2021 which contains selected parts of the chr_2021_raw data set. I will use the filter function to drop all counties that are not ranked, and to keep only the counties in my states of interest: New Jersey, New York, Ohio, Pennsylvania, and Vermont. I will then select my variables of interest and rename them more descriptively: I keep fipscode, state, and county, and rename v043_rawvalue to violent_crime, v042_rawvalue to poor_mental_health_days, v049_rawvalue to excess_drinking, v023_rawvalue to unemployment, and v063_rawvalue to med_household_income. Finally, I will change med_household_income from dollars to thousands of dollars by dividing by 1000, and change unemployment and excess_drinking from proportions to percentages by multiplying by 100.

chr_2021 <- chr_2021_raw %>%
    filter(county_ranked == 1) %>%
    filter(state %in% c("OH", "NY", "PA", "NJ", "VT")) %>%
    select(fipscode, state, county, 
           v043_rawvalue, v042_rawvalue, v049_rawvalue, 
           v023_rawvalue, v063_rawvalue) %>%
    rename(violent_crime = v043_rawvalue,
           poor_mental_health_days = v042_rawvalue,
           excess_drinking = v049_rawvalue,
           unemployment = v023_rawvalue,
           med_household_income = v063_rawvalue) %>%
    mutate(med_household_income = med_household_income / 1000,
           unemployment = 100*unemployment,
           excess_drinking = 100*excess_drinking)

2.2 Repairing the fipscode and factoring the state

I will change the variable fipscode from a numerical variable to a character variable, because it carries no numerical meaning, and I will pad any code with fewer than 5 characters with leading zeros. I will change the variable state from a character variable to a factor.

chr_2021 <- chr_2021 %>%
    mutate(fipscode = str_pad(fipscode, 5, pad = "0"),
           state = factor(state))

2.2.1 Checking Initial Work

I will use the glimpse function to check that I have the correct number of rows (which should equal my number of counties), that fipscode is a character, and that state is a factor.

glimpse(chr_2021)
Rows: 252
Columns: 8
$ fipscode                <chr> "34001", "34003", "34005", "34007", "34009", "~
$ state                   <fct> NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ, NJ~
$ county                  <chr> "Atlantic County", "Bergen County", "Burlingto~
$ violent_crime           <dbl> 373.18731, 84.25050, 162.75355, 464.90041, 235~
$ poor_mental_health_days <dbl> 4.715148, 3.997601, 4.450502, 4.595566, 4.5451~
$ excess_drinking         <dbl> 17.18567, 18.14703, 19.78728, 17.48468, 21.408~
$ unemployment            <dbl> 5.057650, 2.923421, 3.269887, 4.047373, 7.1538~
$ med_household_income    <dbl> 62.678, 107.971, 88.443, 73.168, 66.565, 54.17~

The number of rows equals the expected number of counties (252), fipscode is a character variable, and state is a factor variable.

I will check to see that each state has the number of counties I expect.

chr_2021 %>% tabyl(state) %>% adorn_pct_formatting() 
 state  n percent
    NJ 21    8.3%
    NY 62   24.6%
    OH 88   34.9%
    PA 67   26.6%
    VT 14    5.6%

They do. New Jersey has 21 counties which make up 8.3% of my data, New York has 62 counties which make up 24.6% of my data, Ohio has 88 counties which make up 34.9% of my data, Pennsylvania has 67 counties which make up 26.6% of my data, and Vermont has 14 counties which make up 5.6% of my data.

2.3 Creating My Binary Categorical Variable

2.3.1 Split into two categories based on the median

I will use the mutate function to create a temporary-named variable that is a binary factor, splitting unemployment at the median. I will label all values below the median “low” and all values at or above the median “high”. Then I will run a numerical summary using favstats to check that I have created 2 equally sized categories that contain values below and above the median.

chr_2021 <- chr_2021 %>%
    mutate(temp1_ms = case_when(
                   unemployment < median(unemployment) ~ "low",
                   TRUE ~ "high"),
           temp1_ms = factor(temp1_ms))

mosaic::favstats(unemployment ~ temp1_ms, data = chr_2021) %>% 
    kable(digits = 3)
temp1_ms min Q1 median Q3 max mean sd n missing
high 4.281 4.652 4.957 5.457 8.269 5.138 0.692 126 0
low 1.792 3.264 3.686 3.943 4.278 3.570 0.514 126 0

There are equal numbers of counties in each category and the ranges of values do not overlap, so I know I have split this variable at the median. I have created 2 mutually exclusive, collectively exhaustive categories. I will keep this variable split at the median, as opposed to some other specific value, because the median is a meaningful value here: in order to compare county to county in my states of interest, I want to know whether a county is in the upper or lower half of counties for unemployment rate.
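One edge case is worth noting: in my case_when, a county whose unemployment exactly equaled the median would be labeled “high”. A quick sketch of a check that no county sits exactly at the median (consistent with the clean 126/126 split above):

# Count counties whose unemployment equals the median exactly;
# this should return 0, matching the 126/126 split shown above.
sum(chr_2021$unemployment == median(chr_2021$unemployment))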

2.3.2 Cleaning up

I will rename the above temporary-named variable to unemp_cat as this is a more meaningful name for my categorized unemployment variable.

chr_2021 <- chr_2021 %>%
    rename(unemp_cat = temp1_ms)

I will check that I now have 9 variables and they all have the names I expect.

names(chr_2021)
[1] "fipscode"                "state"                  
[3] "county"                  "violent_crime"          
[5] "poor_mental_health_days" "excess_drinking"        
[7] "unemployment"            "med_household_income"   
[9] "unemp_cat"              

They do.

I will check that I still have the correct number of rows in my data, which corresponds to my number of counties (252).

nrow(chr_2021)
[1] 252

I do.

2.4 Creating My Multi-Category Variable

2.4.1 Creating a 5-category variable based on quantile

I will use the mutate function to cut med_household_income into 5 ordered categories of roughly equal size, split at the quantiles. I will then use favstats to check my work.

chr_2021 <- chr_2021 %>%
    mutate(temp3 = factor(Hmisc::cut2(med_household_income, g = 5)))

mosaic::favstats(med_household_income ~ temp3, data = chr_2021) %>% 
    kable(digits = 3)
temp3 min Q1 median Q3 max mean sd n missing
[41.3, 51.0) 41.267 47.046 48.826 50.154 51.023 48.239 2.543 51 0
[51.0, 55.8) 51.038 51.908 52.781 54.224 55.664 53.090 1.372 50 0
[55.8, 61.0) 55.778 56.922 58.441 60.192 60.918 58.555 1.615 51 0
[61.0, 68.6) 61.036 62.577 64.319 65.662 68.364 64.280 2.062 50 0
[68.6,117.8] 68.611 73.625 83.593 98.270 117.767 86.671 14.764 50 0

I have split med_household_income into 5 mutually exclusive, collectively exhaustive quantiles. I chose 5 quantiles because, per the numerical summary, I suspect my data will have a right skew, so I want to use 5 categories as opposed to 3 or 4 to create a more stratified comparison. I prefer to split into quantiles as opposed to specified cutoff points because income varies state to state, especially between my states of interest, and there are no standardized cutoff points that hold more meaning than the quantile cutoffs.
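As a cross-check on the cut points chosen by Hmisc::cut2, I could compute the boundaries directly with quantile; this sketch is an illustration, and its percentiles should closely match the interval endpoints in the table above.

# The 20th, 40th, 60th, and 80th percentiles are the quintile boundaries.
quantile(chr_2021$med_household_income, probs = seq(0, 1, by = 0.2))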

I will rename these quantiles more meaningfully. In order from lowest to highest, they will be quant_1, quant_2, quant_3, quant_4, and quant_5. I will then check my work by running a numerical summary using favstats.

chr_2021 <- chr_2021 %>%
    mutate(med_income_cat = fct_recode(temp3,
                                   quant_1 = "[41.3, 51.0)",
                                   quant_2 = "[51.0, 55.8)",
                                   quant_3 = "[55.8, 61.0)",
                                   quant_4 = "[61.0, 68.6)",
                                   quant_5 = "[68.6,117.8]"))

mosaic::favstats(med_household_income ~ med_income_cat, data = chr_2021) %>% 
    kable(digits = 3)
med_income_cat min Q1 median Q3 max mean sd n missing
quant_1 41.267 47.046 48.826 50.154 51.023 48.239 2.543 51 0
quant_2 51.038 51.908 52.781 54.224 55.664 53.090 1.372 50 0
quant_3 55.778 56.922 58.441 60.192 60.918 58.555 1.615 51 0
quant_4 61.036 62.577 64.319 65.662 68.364 64.280 2.062 50 0
quant_5 68.611 73.625 83.593 98.270 117.767 86.671 14.764 50 0

This worked as expected.

2.4.2 Cleaning up

I will remove the temporary name that I am no longer using.

chr_2021 <- chr_2021 %>%
    select(-c(temp3))

2.5 Structure of My Tibble

The structure shown below gives the dimensions of my data: 252 rows (counties) by 10 columns (variables). This is the number of rows (counties) I expect. I have my three required variables: fipscode (a character variable), county (a character variable), and state (a factor variable). My five selected variables have the correct names and are all numerical. My two categorical variables are each factor variables as expected, and have the correct number of categories: 2 for my binary variable and 5 for my multi-categorical variable.

str(chr_2021)
tibble [252 x 10] (S3: tbl_df/tbl/data.frame)
 $ fipscode               : chr [1:252] "34001" "34003" "34005" "34007" ...
 $ state                  : Factor w/ 5 levels "NJ","NY","OH",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ county                 : chr [1:252] "Atlantic County" "Bergen County" "Burlington County" "Camden County" ...
 $ violent_crime          : num [1:252] 373.2 84.3 162.8 464.9 235.9 ...
 $ poor_mental_health_days: num [1:252] 4.72 4 4.45 4.6 4.55 ...
 $ excess_drinking        : num [1:252] 17.2 18.1 19.8 17.5 21.4 ...
 $ unemployment           : num [1:252] 5.06 2.92 3.27 4.05 7.15 ...
 $ med_household_income   : num [1:252] 62.7 108 88.4 73.2 66.6 ...
 $ unemp_cat              : Factor w/ 2 levels "high","low": 1 2 2 2 1 1 1 2 2 2 ...
 $ med_income_cat         : Factor w/ 5 levels "quant_1","quant_2",..: 4 5 5 5 4 2 4 5 5 5 ...

3 Codebook

Variable Description
fipscode FIPS code
state State: my five states are NY, NJ, OH, PA, VT
county County Name
violent_crime (v043) Violent Crime Rate, which will be my outcome
poor_mental_health_days (v042) Number of Poor Mental Health Days (age-adjusted)
excess_drinking (v049) Excessive Drinking Percentage
unemployment (v023) Unemployment Rate
med_household_income (v063) Median Household Income in Thousands of Dollars
unemp_cat 2 levels based on the median: low = unemployment below the median (about 4.28%), high = unemployment at or above the median
med_income_cat 5 levels based on quantiles: quant_1 = first quantile (41.27-51.02 thousand dollars), quant_2 = second quantile (51.04-55.66 thousand dollars), quant_3 = third quantile (55.78-60.92 thousand dollars), quant_4 = fourth quantile (61.04-68.36 thousand dollars), or quant_5 = fifth quantile (68.61-117.77 thousand dollars) of med_household_income

3.1 Detailed Descriptions

Variable Detailed Description
violent_crime This was originally v043_rawvalue. This will be my outcome variable. It is listed in the Community Safety subcategory under the Social & Economic Factors category of the County Health Rankings. It describes the number of reported violent crime offenses per 100,000 of the county’s population. Violent crimes are defined as offenses that involve face-to-face confrontation between a victim and a perpetrator, which are counted in the precinct in which they occur. This measure only includes crimes the police precincts report to the FBI. Caution should be used when comparing this variable between states due to bias across jurisdictions. There is the potential for under-reporting due to victims’ willingness to report, law enforcement response differences, and obstacles to FBI reporting. It is based on data from the County-Level Detailed Arrest and Offense Data report from the Uniform Crime Reporting Program from 2014 and 2016.
poor_mental_health_days This was originally v042_rawvalue. It is listed in the Quality of Life subcategory under the Health Outcomes category in County Health Rankings. It is an age-adjusted rate that describes the average number of mentally unhealthy days reported in the past 30 days. It is based on data of the non-institutionalized adult population (greater than 18 years of age) from the Behavioral Risk Factor Surveillance System from 2018. These survey-collected data are then re-created using statistical modeling, requiring that certain assumptions be made.
excess_drinking This was originally v049_rawvalue. It is listed in the Alcohol Use section of the Health Behaviors subcategory under the Health Factors category in County Health Rankings. It describes the percentage of adults who report binge drinking or heavy drinking in the past 30 days. It is based on data from the Behavioral Risk Factor Surveillance System from 2018, which contains survey-based data from the non-institutionalized adult population (greater than 18 years of age).
unemployment This was originally v023_rawvalue. It is listed in the Employment section of the Social & Economic Factors subcategory under the Health Factors category in County Health Rankings. It describes the percentage of the county’s civilian labor force who are age 16 or older and unemployed but seeking work. It does not include the population that is unemployed but not seeking work, and it does not differentiate between those who cannot find any work and those who cannot find work at their preferred wage. It is based on data from the Local Area Unemployment Statistics program of the Bureau of Labor Statistics from 2019. These data are created using statistical modeling, requiring that certain assumptions be made.
med_household_income This was originally v063_rawvalue. It is listed in the Income section of the Social & Economic Factors subcategory under the Health Factors category in County Health Rankings. It describes the income for which half the county’s households earn more and half earn less. It includes all sources of usual income, and excludes unexpected or one-time monies, such as lump-sum receipts. It is based on data from the Small Area Income and Poverty Estimates program of the US Census Bureau from 2019. These data are created using statistical modeling, requiring that certain assumptions be made. These model-based data have been shown to provide more consistent and reliable estimates than survey estimates.

3.2 Proposal Requirement 1

I selected New Jersey, which has 21 counties; New York, which has 62; Ohio, which has 88; Pennsylvania, which has 67; and Vermont, which has 14, for a total of 252 counties. There are two reasons I chose these states: personal and practical. The personal reasons are that I just moved to Ohio from New York and, in doing so, have become intimately familiar with driving back and forth through Pennsylvania and New Jersey, and my best friend lives in Vermont. The practical reason is that these states together comprise between 200 and 400 counties, which meets the project requirement, and have very little missing data for the variables I am looking to study.

chr_2021 %>% tabyl(state) %>% 
  adorn_pct_formatting() %>% 
  adorn_totals %>%
  kable()
state n percent
NJ 21 8.3%
NY 62 24.6%
OH 88 34.9%
PA 67 26.6%
VT 14 5.6%
Total 252 -

3.3 Proposal Requirement 2

I selected 5 variables: Violent Crime Rate (violent_crime, previously v043_rawvalue), Number of Poor Mental Health Days (poor_mental_health_days, previously v042_rawvalue), Proportion of Excessive Drinking (excess_drinking, previously v049_rawvalue), Unemployment Rate (unemployment, previously v023_rawvalue), and Median Household Income in dollars (med_household_income, previously v063_rawvalue). I changed the unit of my Median Household Income variable to thousands of dollars by dividing by 1000. I changed the units of my Unemployment and Excessive Drinking variables from proportions to percentages by multiplying by 100.

My outcome variable is Violent Crime Rate. My two quantitative predictor variables are Poor Mental Health Days and Excessive Drinking. My binary categorical variable is Unemployment which is split at the median value. I chose Unemployment as my binary variable because at the individual level it is binary (employed vs. unemployed), so it makes logical sense to make it binary at the county level. I split it at the median because this makes the most sense when comparing my data as I can ask the question: Is this county’s unemployment rate more or less than the median unemployment rate for the counties I’m studying? My multi-categorical variable is Median Household Income which I split into 5 categories by quantile. I chose Median Household Income as my multi-categorical variable because small differences are not that meaningful for this variable, the data has a wide range, and we are accustomed to seeing income in categories in the real world, e.g. tax brackets.

My motivation for choosing violent_crime as my outcome variable is that I was recently the victim of a violent crime that occurred only a few months after moving to Ohio. I had never been a victim when I lived in New York. I am curious about whether there is a difference in violent crime rates between counties in these two states and some of the states surrounding them, and if so, I would like to explore some predictor variables that could explain this.

My motivation for choosing my predictor variables is as follows. I chose poor_mental_health_days because I suspect a quality-of-life measure could predict violent crime, and mental health likely contributes more than physical health given the impulsive nature of many violent crimes. I chose excess_drinking because I suspect a lifestyle/behavior measure could predict violent crime, and this measure affects impulsiveness and decision-making. I chose unemployment and med_household_income because I suspect social & economic factors could predict violent crime, particularly factors with a wide range stratified across differing socioeconomic status, which is why I chose to categorize these two variables. Unemployment may leave people with more available time and push them toward dangerous ways of obtaining money to live. Median household income may similarly drive people in the lower quantiles to obtain money in dangerous ways, and it may serve as an indirect measure of lower socioeconomic status, which will be useful to ascertain for each county.

3.4 Proposal Requirement 3

This is a tibble of my data chr_2021.

chr_2021
# A tibble: 252 x 10
   fipscode state county       violent_crime poor_mental_health~ excess_drinking
   <chr>    <fct> <chr>                <dbl>               <dbl>           <dbl>
 1 34001    NJ    Atlantic Co~         373.                 4.72            17.2
 2 34003    NJ    Bergen Coun~          84.3                4.00            18.1
 3 34005    NJ    Burlington ~         163.                 4.45            19.8
 4 34007    NJ    Camden Coun~         465.                 4.60            17.5
 5 34009    NJ    Cape May Co~         236.                 4.55            21.4
 6 34011    NJ    Cumberland ~         516.                 4.93            18.5
 7 34013    NJ    Essex County         606.                 4.37            17.3
 8 34015    NJ    Gloucester ~         116.                 4.51            18.9
 9 34017    NJ    Hudson Coun~         342.                 4.28            18.1
10 34019    NJ    Hunterdon C~          42.3                4.02            20.7
# ... with 242 more rows, and 4 more variables: unemployment <dbl>,
#   med_household_income <dbl>, unemp_cat <fct>, med_income_cat <fct>

3.5 Proposal Requirement 4

Here are numerical summaries for each of my variables.

Hmisc::describe(chr_2021)
chr_2021 

 10  Variables      252  Observations
--------------------------------------------------------------------------------
fipscode 
       n  missing distinct 
     252        0      252 

lowest : 34001 34003 34005 34007 34009, highest: 50019 50021 50023 50025 50027
--------------------------------------------------------------------------------
state 
       n  missing distinct 
     252        0        5 

lowest : NJ NY OH PA VT, highest: NJ NY OH PA VT
                                        
Value         NJ    NY    OH    PA    VT
Frequency     21    62    88    67    14
Proportion 0.083 0.246 0.349 0.266 0.056
--------------------------------------------------------------------------------
county 
       n  missing distinct 
     252        0      200 

lowest : Adams County     Addison County   Albany County    Allegany County  Allegheny County
highest: Wood County      Wyandot County   Wyoming County   Yates County     York County     
--------------------------------------------------------------------------------
violent_crime 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     248        4      248        1    190.1    133.1    57.43    72.15 
     .25      .50      .75      .90      .95 
  106.52   149.06   224.88   369.73   452.51 

lowest :    0.00000   16.33720   19.05950   20.56649   36.14752
highest:  597.89467  606.18998  645.99459  824.39143 1001.47354
--------------------------------------------------------------------------------
poor_mental_health_days 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     252        0      252        1    4.798   0.5013    4.009    4.216 
     .25      .50      .75      .90      .95 
   4.523    4.812    5.122    5.319    5.473 

lowest : 3.555941 3.670392 3.690357 3.696267 3.744690
highest: 5.664264 5.719813 5.808883 5.814942 5.817593
--------------------------------------------------------------------------------
excess_drinking 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     252        0      252        1    19.91    2.284    16.72    17.17 
     .25      .50      .75      .90      .95 
   18.31    20.24    21.41    22.14    22.81 

lowest : 15.32379 15.51032 15.65669 15.85072 15.87386
highest: 24.09274 24.13436 24.33990 24.58178 24.90131
--------------------------------------------------------------------------------
unemployment 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     252        0      252        1    4.354    1.111    2.932    3.175 
     .25      .50      .75      .90      .95 
   3.687    4.280    4.955    5.584    5.964 

lowest : 1.791881 2.096567 2.252619 2.309637 2.339545
highest: 6.839601 6.852522 6.944904 7.153895 8.269411
--------------------------------------------------------------------------------
med_household_income 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     252        0      251        1     62.1     15.1    47.05    48.84 
     .25      .50      .75      .90      .95 
   51.90    58.43    65.59    82.86    96.40 

lowest :  41.267  41.470  43.145  43.621  43.704
highest: 110.252 112.722 116.328 117.275 117.767
--------------------------------------------------------------------------------
unemp_cat 
       n  missing distinct 
     252        0        2 
                    
Value      high  low
Frequency   126  126
Proportion  0.5  0.5
--------------------------------------------------------------------------------
med_income_cat 
       n  missing distinct 
     252        0        5 

lowest : quant_1 quant_2 quant_3 quant_4 quant_5
highest: quant_1 quant_2 quant_3 quant_4 quant_5
                                                  
Value      quant_1 quant_2 quant_3 quant_4 quant_5
Frequency       51      50      51      50      50
Proportion   0.202   0.198   0.202   0.198   0.198
--------------------------------------------------------------------------------

3.6 Three Important Checks

3.6.1 Each of my variables must have data for 75+% of counties in each state I plan to study

I will create a tibble which summarizes any missing data.

chr_2021 %>% 
    miss_var_summary()
# A tibble: 10 x 3
   variable                n_miss pct_miss
   <chr>                    <int>    <dbl>
 1 violent_crime                4     1.59
 2 fipscode                     0     0   
 3 state                        0     0   
 4 county                       0     0   
 5 poor_mental_health_days      0     0   
 6 excess_drinking              0     0   
 7 unemployment                 0     0   
 8 med_household_income         0     0   
 9 unemp_cat                    0     0   
10 med_income_cat               0     0   

I see that violent_crime is missing from 4 counties (1.6% of counties). I will use favstats to determine how my missing data are distributed across states, and mutate to compute the percentage of counties with available data in each state.

mosaic::favstats(violent_crime ~ state, data = chr_2021) %>%
    select(state, n, missing) %>%
    mutate(pct_available = 100 * n / (n + missing)) %>%
    kable()
state n missing pct_available
NJ 21 0 100.0000
NY 62 0 100.0000
OH 84 4 95.4545
PA 67 0 100.0000
VT 14 0 100.0000

All 4 of the counties that are missing violent_crime are in the state of Ohio. Since favstats reports n as the number of non-missing values, availability is n / (n + missing): I have this data available for 95.5% of counties in Ohio and 100% of counties in my other 4 states (NJ, NY, PA, VT).
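As a broader sanity check, a single summary along the following lines would compute the percentage of non-missing values for every variable within every state at once; every entry should be at least 75 to satisfy this requirement. This sketch is an illustration and its output is not shown.

# Percent of counties with non-missing data, per variable and per state.
chr_2021 %>%
    group_by(state) %>%
    summarise(across(violent_crime:med_income_cat,
                     ~ 100 * mean(!is.na(.x))))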

3.6.2 Each of my 5 selected variables must have 10+ distinct non-missing values

I will create a tibble which shows the number of distinct non-missing values for each variable.

chr_2021 %>% 
    summarize(across(violent_crime:med_household_income, ~ n_distinct(.)))
# A tibble: 1 x 5
  violent_crime poor_mental_healt~ excess_drinking unemployment med_household_i~
          <int>              <int>           <int>        <int>            <int>
1           249                252             252          252              251

I have at least 10 distinct values per variable.

3.6.3 Each of my categorical variables must include 10+ distinct counties in every category

I will create a table to show the number of distinct counties in each category of my two categorical variables.

chr_2021 %>% tabyl(unemp_cat)
 unemp_cat   n percent
      high 126     0.5
       low 126     0.5
chr_2021 %>% tabyl(med_income_cat)
 med_income_cat  n   percent
        quant_1 51 0.2023810
        quant_2 50 0.1984127
        quant_3 51 0.2023810
        quant_4 50 0.1984127
        quant_5 50 0.1984127

I have at least 10 counties in each category for each of the categorical variables that I created.

3.7 Saving the Tibble

Here I will save my tibble as a file.

saveRDS(chr_2021, file = "chr_2021_Catherine_Heinzinger.Rds")

3.8 Proposal Requirement 5

The most challenging part of this work so far has been determining exactly what each line of code means. Throughout the data ingestion and tidying, I have had to keep in mind which variable was my outcome measure and which were my predictors. I had to think creatively about variables that may differ across these states and counties and could potentially predict violent crime. In order to overcome this difficulty, I used Dr. Love’s example, but I rewrote each chunk of code myself and I wrote one line at a time to see how each line changed the output. I made sure to say each line of code as an English sentence so I could start to really understand what it would do to the data and output. This took a lot of extra reading and internet searches, but made this proposal an incredible learning experience.

4 Analysis 1: Linear Regression Model

In this section, I will build a linear regression model to predict a quantitative outcome, violent crime rate, using a quantitative predictor, number of poor mental health days.

4.1 The Variables

My outcome variable is violent_crime and my predictor variable is poor_mental_health_days. violent_crime is a rate that describes the number of reported violent crime offenses per 100,000 of the county’s population. In this data set, a violent crime is an offense that involves face-to-face confrontation between a victim and a perpetrator and includes only those crimes reported to the FBI. My outcome variable is missing in 4 counties, all of which are in Ohio. poor_mental_health_days is an age-adjusted rate that describes the average number of mentally unhealthy days reported in the past 30 days among the non-institutionalized adult population. My predictor variable is not missing for any counties in my states of interest which are New Jersey, New York, Ohio, Pennsylvania, and Vermont.

I will restrict my data set to the complete cases of these two variables and create a tibble called chr_a1. This data set will contain values for violent_crime, poor_mental_health_days, county, and state. violent_crime will be renamed to outcome_crime and poor_mental_health_days will be renamed to predictor_mhealth. I will also add a new variable called id to give each row a unique identifier, to help with my analyses later on.

chr_a1 <- chr_2021 %>%
  filter(complete.cases(violent_crime, poor_mental_health_days)) %>%
  select(violent_crime, poor_mental_health_days, county, state) %>%
  rename(outcome_crime = violent_crime,
         predictor_mhealth = poor_mental_health_days) %>%
  mutate(id = row_number()) %>%
  clean_names()

chr_a1
# A tibble: 248 x 5
   outcome_crime predictor_mhealth county            state    id
           <dbl>             <dbl> <chr>             <fct> <int>
 1         373.               4.72 Atlantic County   NJ        1
 2          84.3              4.00 Bergen County     NJ        2
 3         163.               4.45 Burlington County NJ        3
 4         465.               4.60 Camden County     NJ        4
 5         236.               4.55 Cape May County   NJ        5
 6         516.               4.93 Cumberland County NJ        6
 7         606.               4.37 Essex County      NJ        7
 8         116.               4.51 Gloucester County NJ        8
 9         342.               4.28 Hudson County     NJ        9
10          42.3              4.02 Hunterdon County  NJ       10
# ... with 238 more rows

The above tibble has 248 rows (counties) which I expected because 248 counties have complete data on both outcome_crime and predictor_mhealth.

One of the counties in my data set has a ‘0’ value for violent_crime. In order to be able to properly test transformations of my data, particularly to take the logarithm of my outcome, I need to omit this county from my data set. To do this, I will filter my data to include only counties with an outcome value greater than ‘0’.

chr_a1 <- chr_a1 %>%
  filter(outcome_crime > 0) %>%
  mutate(id = row_number())

nrow(chr_a1)
[1] 247

There are 247 rows (counties) remaining, which is correct.
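For completeness, here is a sketch of how the zero-valued county could be identified from the original tibble before filtering (output not shown):

# Which county reported a violent crime rate of exactly 0?
chr_2021 %>% filter(violent_crime == 0) %>% select(county, state)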

I will specify the values of my outcome, outcome_crime, and predictor, predictor_mhealth for Cuyahoga County in Ohio, where CWRU’s campus is located.

chr_a1 %>%
  filter(county == "Cuyahoga County") %>%
  select(outcome_crime, predictor_mhealth, county, state) %>%
  kable(digits = 2)
outcome_crime predictor_mhealth county state
645.99 4.97 Cuyahoga County OH

In Ohio’s Cuyahoga County, the rate of violent crime is 645.99 per 100,000 people and the average number of poor mental health days is 4.97 in the past 30 days.

4.2 Research Question

Do populations with a higher rate of poor mental health days also have a higher rate of violent crime, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont?

4.3 Visualizing the Data

I will visualize my data in a scatter plot using the ggplot function, with the addition of a linear model and a loess smooth curve to assess how well predictor_mhealth predicts outcome_crime. I will identify the two furthest outliers by creating two subsets of the data: one with a violent crime rate >1000 (s1) and one with a violent crime rate >750 and <1000 (s2). In order to more easily summarize the county and state of these outliers, I will create a third subset which includes both outliers by using a standard of violent crime rate >750. Finally, I will create a data summary to determine which states my two outliers are in.

s1 <- filter(chr_a1, outcome_crime > 1000)
s2 <- filter(chr_a1, outcome_crime > 750 & outcome_crime < 1000)
s3 <- filter(chr_a1, outcome_crime > 750)

ggplot(chr_a1, aes(x = predictor_mhealth, y = outcome_crime)) +
    geom_point(size = 2, alpha = 0.6) + 
    geom_smooth(method = "lm", formula = y ~ x,
                se = TRUE, size = 1, col = "blue") + 
    geom_smooth(method = "loess", formula = y ~ x, 
                se = FALSE, size = 1, col = "red") +
    geom_point(data = s1, size = 2, col = "forestgreen") +
    geom_text(data = s1, label = s1$county, 
              vjust = 1.7, col = "forestgreen") +
    geom_point(data = s2, size = 2, col = "tomato") +
    geom_text(data = s2, label = s2$county,
               vjust = 1.7, col = "tomato") +
    labs(title = "Poor Mental Health as a Predictor of Violent Crime",
         subtitle = "with OLS in blue, loess smooth in red, and outliers labeled",
         caption = "247 counties in NJ, NY, OH, PA, VT from `chr_a1`",
         x = "# Poor Mental Health Days in Past 30 Days",
         y = "Violent Crime Rate (per 100,000 people)")

s3 %>% select(id, county, state) %>%
  kable()
id county state
128 Lucas County OH
217 Philadelphia County PA

The number of poor mental health days in the past 30 days does not predict the violent crime rate well in these 247 counties. The relationship is very weak and slightly negative. The fact that it is nearly a horizontal line indicates that it is essentially depicting the mean of the y axis values, which are quite randomly scattered. Although this model does not work well for any counties, I highlighted the two counties for which this model is least able to predict violent crime, by identifying the two points with the furthest vertical distance from the linear model. These are Lucas County in Ohio and Philadelphia County in Pennsylvania.

4.4 Transformation Assessment

I applied a logarithm, an inverse, and a square to my outcome and then to my predictor. None of these appeared to provide a better outcome prediction visually (a sketch of how I drew that comparison appears below). I then used the boxCox function to help me determine whether a transformation is recommended, and if so, which one, and the powerTransform function to obtain a more precise value for the peak of the profile log-likelihood curve.
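Here is a minimal sketch of how that visual comparison can be drawn for the three outcome transformations, using the patchwork package I loaded earlier; this block is an illustration and its panels are not shown in this report.

# Compare three candidate transformations of the outcome visually.
p_log <- ggplot(chr_a1, aes(x = predictor_mhealth, y = log(outcome_crime))) +
    geom_point() + labs(y = "log(outcome_crime)")
p_inv <- ggplot(chr_a1, aes(x = predictor_mhealth, y = 1/outcome_crime)) +
    geom_point() + labs(y = "1/outcome_crime")
p_sq <- ggplot(chr_a1, aes(x = predictor_mhealth, y = outcome_crime^2)) +
    geom_point() + labs(y = "outcome_crime^2")

p_log + p_inv + p_sq  # patchwork places the three panels side by side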

boxCox(chr_a1$outcome_crime ~ chr_a1$predictor_mhealth)

powerTransform(chr_a1$outcome_crime ~ chr_a1$predictor_mhealth)
Estimated transformation parameter 
        Y1 
0.04528785 

This estimate is close to 0, which indicates that I should consider applying a logarithm to my outcome. The following code creates a scatter plot, built with ggplot, of my log-transformed outcome against my predictor, with a linear model and a loess smooth curve added. I created new subsets to label outliers: s4, which contains counties whose log(outcome) is >2.9 and <3.2, and s5, which contains counties whose log(outcome) is <2.9.

s4 <- filter(chr_a1, log(outcome_crime) > 2.9, log(outcome_crime) < 3.2)
s5 <- filter(chr_a1, log(outcome_crime) < 2.9)

ggplot(chr_a1, aes(x = predictor_mhealth, y = log(outcome_crime))) +
    geom_point(size = 2, alpha = 0.6) + 
    geom_smooth(method = "lm", formula = y ~ x,
                se = TRUE, size = 1, col = "blue") + 
    geom_smooth(method = "loess", formula = y ~ x, 
                se = FALSE, size = 1, col = "red") +
    geom_point(data = s1, size = 2, col = "slateblue") +
    geom_text(data = s1, label = s1$county, 
              vjust = 1.7, hjust = 0.4, col = "slateblue") +
    geom_point(data = s2, size = 2, col = "slateblue") +
    geom_text(data = s2, label = s2$county,
               vjust = -0.6, col = "slateblue") +
    geom_point(data = s4, size = 2, col = "slateblue") +
    geom_text(data = s4, label = s4$county,
               vjust = -0.6, col = "slateblue") +
    geom_point(data = s5, size = 2, col = "slateblue") +
    geom_text(data = s5, label = s5$county,
               vjust = 0, hjust = -0.1, col = "slateblue") +
    labs(title = "Poor Mental Health as a Predictor of Log(Violent Crime)",
         subtitle = "with OLS in blue, loess smooth in red, and outliers labeled",
         caption = "247 counties in NJ, NY, OH, PA, VT from `chr_a1`",
         x = "# Poor Mental Health Days in Past 30 Days",
         y = "Log(Violent Crime Rate (per 100,000 people))")

My loess smooth curve here is quite straight and does not have substantial bend. Although still quite scattered, the outliers are not as far from the linear model as in other transformations. The log appears to fix the problem of right skew, now having some outliers on both sides of the range.

4.5 The Fitted Model

4.5.1 Prediction Equation

In the following code, I will create a transformed straight line model fitted by least squares called m1_log that uses predictor_mhealth to predict the logarithm of outcome_crime. I will use the equatiomatic package to create the fitted line equation in an attractive format.

m1_log <- lm(log(outcome_crime) ~ predictor_mhealth, data = chr_a1)

extract_eq(m1_log, use_coefs = TRUE, coef_digits = 3)

\[ \operatorname{\widehat{log(outcome\_crime)}} = 5.064 - 0.003(\operatorname{predictor\_mhealth}) \] The slope of my predictor is negative: as predictor_mhealth increases, the predicted log of outcome_crime decreases. For every additional poor mental health day in the past 30 days, the predicted log of the violent crime rate per 100,000 people is lower by 0.003. This model also implies that if the number of poor mental health days in the past 30 days were 0 (an extrapolation well below the observed minimum of about 3.6 days), the predicted log of the violent crime rate per 100,000 people would be 5.06.
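Because the model is fit on the log scale, the slope can be back-transformed into a multiplicative effect on the original rate scale. A sketch:

# exp(slope) is the multiplicative change in the predicted violent crime
# rate per additional poor mental health day: exp(-0.003) is about 0.997,
# i.e., a decrease of roughly 0.3% per additional day.
exp(coef(m1_log)["predictor_mhealth"])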

4.5.2 Tidy Summary

I will create a summary of m1_log’s coefficients with a 90% confidence interval using the tidy function and made into an attractive format using kable.

tidy(m1_log, conf.int = TRUE, conf.level = 0.90) %>%
    kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 5.064 0.445 11.377 0.000 4.329 5.798
predictor_mhealth -0.003 0.093 -0.033 0.973 -0.156 0.150

These data show me that for every additional poor mental health day in the past 30 days, the log of the county’s violent crime rate per 100,000 people is estimated to be lower by 0.003, with a 90% confidence interval of (-0.156, 0.150). The fact that my confidence interval contains ‘0’, and my estimate is so near ‘0’, indicates that my model cannot predict the outcome well: the data are consistent with a small negative slope, a small positive slope, or no relationship at all.

4.5.3 Specified Summary Data

I will use the glance function to view the model’s R-squared, residual standard error, and the number of observations to which the model was fit. I will use the cor function to calculate the model’s Pearson correlation coefficient.

glance(m1_log) %>% 
  select(r.squared, sigma, nobs) %>%
  kable(digits = 3)
r.squared sigma nobs
0 0.641 247
chr_a1 %$% cor(outcome_crime, predictor_mhealth)
[1] -0.04183222

The multiple R-squared is 0, when rounded to a reasonable number of digits. This implies that essentially none of the variation in the log of my outcome is explained by my predictor per my model. The residual standard error is 0.641, indicating that roughly 95% of my residuals should fall between -1.28 and +1.28 (within two residual standard errors of zero), which is a wide band given the range of my log(outcome) values. The number of observations is 247. The Pearson correlation coefficient is -0.042, which implies a weak correlation that is unlikely to be meaningful in this context. I calculated this directly rather than taking the square root of R-squared, because R-squared rounds to 0 and its square root would lose the sign of the correlation in any case.

4.6 Residual Analysis

In this section, I will analyze the residuals from my transformed linear model. I will use augment to obtain the residuals.

m1_log_aug <- augment(m1_log, data = chr_a1)

m1_log_aug
# A tibble: 247 x 11
   outcome_crime predictor_mhealth county    state    id .fitted  .resid    .hat
           <dbl>             <dbl> <chr>     <fct> <int>   <dbl>   <dbl>   <dbl>
 1         373.               4.72 Atlantic~ NJ        1    5.05  0.873  0.00416
 2          84.3              4.00 Bergen C~ NJ        2    5.05 -0.618  0.0171 
 3         163.               4.45 Burlingt~ NJ        3    5.05  0.0423 0.00644
 4         465.               4.60 Camden C~ NJ        4    5.05  1.09   0.00483
 5         236.               4.55 Cape May~ NJ        5    5.05  0.414  0.00529
 6         516.               4.93 Cumberla~ NJ        6    5.05  1.20   0.00448
 7         606.               4.37 Essex Co~ NJ        7    5.05  1.36   0.00764
 8         116.               4.51 Gloucest~ NJ        8    5.05 -0.300  0.00571
 9         342.               4.28 Hudson C~ NJ        9    5.05  0.784  0.00945
10          42.3              4.02 Hunterdo~ NJ       10    5.05 -1.31   0.0165 
# ... with 237 more rows, and 3 more variables: .sigma <dbl>, .cooksd <dbl>,
#   .std.resid <dbl>

The residual values are stored as .resid for each row (county) in the above tibble.

4.6.1 Assessment of linearity and normality

I will create a pair of residual plots: a scatter plot of residuals vs. fitted values to assess linearity and a normal Q-Q plot to assess normality of the residuals. I will use ggplot and the patchwork package and include a linear model and a loess smooth curve. I will also use slice to identify the counties and states of the 5 outliers I have coded my plot to show. The plot is too crowded to attractively show the county names on it.

p1 <- ggplot(data = m1_log_aug, aes(x = .fitted, y = .resid)) +
    geom_point(size = 2, alpha = 0.6) + 
    geom_point(data = m1_log_aug %>% 
                   slice_max(abs(.resid), n = 5),
               col = "red", size = 2) +
    geom_text_repel(data = m1_log_aug %>% 
                        slice_max(abs(.resid), n = 5),
                    aes(label = id), col = "red") +
    geom_smooth(method = "lm", formula = y ~ x,
                se = FALSE, size = 1, col = "blue") + 
    geom_smooth(method = "loess", formula = y ~ x, 
                se = FALSE, size = 1, col = "red") +
    theme(aspect.ratio = 0.5) +
    labs(title = "Residuals vs. Fitted",
         subtitle = "with OLS in blue, loess smooth in red",
         x = "Fitted Values",
         y = "Residuals")

p2 <- ggplot(data = m1_log_aug, aes(sample = .resid)) +
    geom_qq() + 
    geom_qq_line(col = "red") +
    theme(aspect.ratio = 1) +
    labs(title = "Normal Q-Q plot",
         x = "Theoretical Quantiles",
         y = "Residuals")

p1 + p2 +
  plot_annotation(title = "Assessment of Residuals",
                  subtitle = "for the linear model using `predictor_mhealth` to predict log(`outcome_crime`)",
                  caption = "247 counties in NJ, NY, OH, PA, and VT in m1_log_aug")

slice(m1_log_aug, c(119, 128, 217, 238, 242)) %>%
  select(id, county, state, .resid) %>%
  kable(digits = 2)
id county state .resid
119 Holmes County OH -2.02
128 Lucas County OH 1.67
217 Philadelphia County PA 1.86
238 Essex County VT -2.26
242 Orange County VT -2.10

This shows a scatter plot of residuals vs. fitted values which helps me determine linearity of my transformed model, and a normal Q-Q plot of residuals which helps me determine normality of my transformed model. A description of this pair of plots is below.

4.6.2 Residual plot interpretation

Aside from several outliers (i.e., large residuals), my scatter plot of residuals vs. fitted values does appear somewhat linear, showing a weak “fuzzy football” shape with no distinct pattern. The loess smooth curve does not bend substantially. The normal Q-Q plot shows that my residuals follow a normal distribution well, aside from 5 outliers, which I wrote my code to identify. They are id 119 (Holmes County, OH), id 128 (Lucas County, OH), id 217 (Philadelphia County, PA), id 238 (Essex County, VT), and id 242 (Orange County, VT).
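To put a number on “several outliers”, I could also count the standardized residuals larger than 2 in absolute value; this sketch is an illustration and its output is not shown.

# Counties with standardized residuals beyond 2 in absolute value.
m1_log_aug %>% count(abs(.std.resid) > 2)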

4.6.3 Ability of m1 to predict violent crime rate in Cuyahoga County

I will create an untransformed model and display its prediction for violent crime in Cuyahoga County in Ohio, where CWRU’s campus is located, and compare the prediction, or fitted value, to the actual value of this outcome.

m1 <- lm(outcome_crime ~ predictor_mhealth, data = chr_a1)

m1_aug <- augment(m1, data = chr_a1)

m1_aug %>%
  filter(county == "Cuyahoga County") %>%
  select(id, county, state, outcome_crime, .fitted, .resid) %>%
  kable(digits = 2)
id county state outcome_crime .fitted .resid
100 Cuyahoga County OH 645.99 188.56 457.43

My model’s prediction for violent crime in Cuyahoga County is 188.56, while the actual value is 645.99, creating a residual of 457.43. This residual is much too high in this context to confidently state that my model creates a good prediction for Cuyahoga County.
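As a companion check, the transformed model m1_log can also produce a prediction for Cuyahoga County once its fitted value is back-transformed with exp. A sketch (output not shown):

# Back-transformed prediction from the log-scale model for Cuyahoga County.
cuyahoga <- chr_a1 %>% filter(county == "Cuyahoga County")
exp(predict(m1_log, newdata = cuyahoga))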

4.6.4 Identification of counties for which m1 is least successful

I will use the arrange function to create a tibble of my untransformed data set that is arranged by absolute value of the residuals.

m1_aug %>% select(id, county, state, .resid) %>%
  arrange(desc(abs(.resid))) %>%
  head() %>%
  kable(digits = 2)
id county state .resid
217 Philadelphia County PA 821.15
128 Lucas County OH 637.56
100 Cuyahoga County OH 457.43
7 Essex County NJ 409.93
64 Richmond County NY 399.56
62 Queens County NY 390.95

The 2 counties for which my model m1 is least successful at predicting the outcome are Philadelphia County in Pennsylvania and Lucas County in Ohio, as these 2 counties have the highest absolute value of the residual. The residual for Lucas County in Ohio is 637.56 and the residual for Philadelphia County in Pennsylvania is 821.15.

4.7 Conclusions and Limitations

I ran this analysis in order to determine if populations with a higher rate of poor mental health days also have a higher rate of violent crime, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont. The answer is no, counties with higher rates of poor mental health days do not have higher rates of violent crime in these counties. My model m1_log indicates that, in these counties, more frequent poor mental health is associated with lower violent crime, albeit with a very weak correlation. In fact, my model implies that for every additional poor mental health day in the past 30 days, the log of the violent crime rate per 100,000 people will be lower by 0.003. I am 90% confident that the true slope is between (-0.156, 0.150), which contains the value ‘0’, implying that knowing the number of poor mental health days in the past 30 days would provide no meaningful information in predicting violent crime.

One major limitation of my work in Analysis 1 is that all my counties are within northeastern states, i.e. they are not randomly selected. Ohio is sometimes considered northeastern and sometimes considered midwestern, but this grey area does not add any meaningful variation to my selected counties because 4 of the 5 states are definitely northeastern (NJ, NY, PA, and VT). This affects my external validity and the application to my target population, all counties in the US. My model did not show meaningful violations of the 4 key regression assumptions, which are linearity, independence, constant variance, and normality. The transformation of my data improved linearity, depicted above in a scatter plot whose loess smooth curve did not show substantial bend and whose residuals showed no pattern when plotted against the fitted values, and improved normality, depicted above in my normal Q-Q plot. These counties were measured independently from each other and are only related by region and state. Finally, there is no fan or funnel shape in my residual plot, which would indicate non-constant variance.

5 Analysis 2: Comparing Groups on a Quantitative Outcome Using Independent Samples

In this section, I will build a linear regression model to predict violent crime rate using unemployment rate, a binary variable, in the counties in NJ, NY, OH, PA, and VT.

5.1 The Variables

My outcome variable is violent_crime and my predictor variable is unemp_cat, a binary variable derived from the variable unemployment. violent_crime is a rate that describes the number of reported violent crime offenses per 100,000 of the county’s population. In this data set, a violent crime is an offense that involves face-to-face confrontation between a victim and a perpetrator and includes only those crimes reported to the FBI. My outcome variable is missing in 4 counties, all of which are in Ohio. unemployment describes the percentage of the county’s civilian labor force who are age 16 or older and unemployed but seeking work. unemp_cat is the result of splitting unemployment at its median into two categories: counties are labeled “low” if their unemployment percentage is below the median (about 4.3%) and “high” otherwise. My predictor variable is not missing for any counties in my states of interest, which are New Jersey, New York, Ohio, Pennsylvania, and Vermont.

I will restrict my data set to the complete cases and create a tibble called chr_a2. I will rename unemp_cat to predictor_unemp and violent_crime to outcome_crime.

chr_a2 <- chr_2021 %>%
  filter(complete.cases(violent_crime, unemp_cat)) %>%
  select(violent_crime, unemp_cat, county, state) %>%
  rename(outcome_crime = violent_crime,
         predictor_unemp = unemp_cat) %>%
  clean_names()

chr_a2
# A tibble: 248 x 4
   outcome_crime predictor_unemp county            state
           <dbl> <fct>           <chr>             <fct>
 1         373.  high            Atlantic County   NJ   
 2          84.3 low             Bergen County     NJ   
 3         163.  low             Burlington County NJ   
 4         465.  low             Camden County     NJ   
 5         236.  high            Cape May County   NJ   
 6         516.  high            Cumberland County NJ   
 7         606.  high            Essex County      NJ   
 8         116.  low             Gloucester County NJ   
 9         342.  low             Hudson County     NJ   
10          42.3 low             Hunterdon County  NJ   
# ... with 238 more rows

The above tibble has 248 rows (counties) which I expected because 248 counties have complete data on both violent_crime, my outcome, and unemp_cat, my predictor, now predictor_unemp. My predictor is listed appropriately as a factor variable.

I will specify the values of my outcome, outcome_crime, and predictor, predictor_unemp for Cuyahoga County in Ohio, where CWRU’s campus is located.

chr_a2 %>%
  filter(county == "Cuyahoga County") %>%
  select(outcome_crime, predictor_unemp, county, state) %>%
  kable(digits = 2)
outcome_crime predictor_unemp county state
645.99 low Cuyahoga County OH

In Ohio’s Cuyahoga County, the rate of violent crime is 645.99 per 100,000 people and the unemployment percentage is low, i.e. below the median for the chr_2021 data set.

5.2 Research Question

Do populations with a percentage of unemployed residents above the median have a higher rate of violent crime, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont?

5.3 Visualizing the Data

I will visualize my data in a box plot with violins using the ggplot function, with the addition of a point to designate the mean for each category. I will identify the two furthest outliers similarly to my method in analysis 1, by creating two subsets of the data. One subset comprises the county with a violent crime rate >1000 (s1) and the other comprises the counties with a violent crime rate >750 and <1000 (s2). In order to more easily summarize the county and state of these outliers, I will create a third subset which includes both outliers by using a standard of violent crime rate >750. Finally, I will create a data summary to determine which states my two outliers are in.

s1 <- filter(chr_a2, outcome_crime > 1000)
s2 <- filter(chr_a2, outcome_crime > 750 & outcome_crime < 1000)
s3 <- filter(chr_a2, outcome_crime > 750)

ggplot(chr_a2, aes(x = predictor_unemp, y = outcome_crime)) +
    geom_violin(aes(fill = predictor_unemp)) +
    geom_boxplot(width = 0.3, outlier.size = 2, notch = TRUE) +
    stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 3, fill = "slateblue2") +
    coord_flip() +
    guides(fill = "none") +
    scale_fill_manual(values = c("plum", "lightskyblue")) +
    geom_point(data = s1, size = 2, col = "forestgreen") +
    geom_text(data = s1, label = s1$county, 
              vjust = -1, hjust = 0.8, col = "forestgreen") +
    geom_point(data = s2, size = 2, col = "tomato") +
    geom_text(data = s2, label = s2$county,
               vjust = 1.7, col = "tomato") +
    labs(title = "Unemployment as a Predictor of Violent Crime",
         subtitle = "stratified by unemployment percentage above and below the median, with outliers labeled",
         caption = "248 counties in NJ, NY, OH, PA, VT from `chr_a2`",
         x = "Unemployment Percentage",
         y = "Violent Crime Rate (per 100,000 people)")

s3 %>% select(county, state) %>%
  kable()
county state
Lucas County OH
Philadelphia County PA

This plot of unemployment vs. violent crime rate shows a right skew both for counties with unemployment percentages below the median (“low”) and for those above it (“high”), in these 248 counties. The center of the data for both groups is pushed toward the lower end of the violent crime rate scale. The counties with a high unemployment percentage have a wider range of violent crime rates and a skew that reaches further right. For both unemployment categories, the mean is higher than the median, because the mean is more affected by the right skew than the median is. I highlighted the two furthest outliers: Lucas County in Ohio and Philadelphia County in Pennsylvania. These are the same outliers as in Analysis 1, where I compared poor mental health to violent crime, which makes sense.
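A numerical companion to this plot could be obtained with favstats, which would show the group-wise medians and means that the violins and diamonds depict. A sketch (output not shown):

# Group-wise summaries of violent crime by unemployment category.
mosaic::favstats(outcome_crime ~ predictor_unemp, data = chr_a2) %>%
    kable(digits = 2)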

5.4 Transformation Assessment

One of the counties in my data set has a ‘0’ value for violent_crime. In order to be able to properly test transformations of my data, particularly to take the logarithm of my outcome, I need to omit this county from my data set. To do this, I will create a data set that filters my data to include only counties with an outcome value greater than ‘0’.

chr_a2 <- chr_a2 %>%
  filter(outcome_crime > 0) %>%
  mutate(id = row_number())

nrow(chr_a2)
[1] 247

There are 247 rows (counties) remaining, which is correct.

I applied a logarithm, an inverse, and a square to my outcome and then to my predictor. None of these visually improved my plots. I then used the boxCox function to help me determine whether a transformation is recommended, and if so, which one, and the powerTransform function to obtain a more precise value for the peak of the profile log-likelihood curve.

boxCox(chr_a2$outcome_crime ~ chr_a2$predictor_unemp)

powerTransform(chr_a2$outcome_crime ~ chr_a2$predictor_unemp)
Estimated transformation parameter 
       Y1 
0.0268951 

This estimate is again close to 0, which indicates that I should consider applying a logarithm to my outcome. The following code creates a box plot with violins, built with ggplot, of my log-transformed outcome.

ggplot(chr_a2, aes(x = predictor_unemp, y = log(outcome_crime))) +
    geom_violin(aes(fill = predictor_unemp)) +
    geom_boxplot(width = 0.3, outlier.size = 2, notch = TRUE) +
    stat_summary(fun = "mean", geom = "point",
               shape = 23, size = 3, fill = "slateblue2") +
    coord_flip() +
    guides(fill = "none") +
    scale_fill_manual(values = c("plum", "lightskyblue")) +
    labs(title = "Unemployment as a Predictor of Log(Violent Crime)",
         subtitle = "stratified by unemployment percentage above and below the median",
         caption = "247 counties in NJ, NY, OH, PA, VT from `chr_a2`",
         x = "Unemployment Percentage",
         y = "Violent Crime Rate (per 100,000 people)")

The first change I notice here is that the means are closer to the medians in both categories, indicating there is less skew and my data are likely more normally distributed. The ranges of the two categories are more similar. The low unemployment category has 3 outliers to the left and the high unemployment category has 2 outliers to the right; however, all outliers are closer to the center of the data than before. The log appears to have reduced the right skew.
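The same kind of group-wise summary on the log scale would confirm that the means and medians have moved closer together. A sketch (output not shown):

# Group-wise summaries of log(violent crime) by unemployment category.
mosaic::favstats(log(outcome_crime) ~ predictor_unemp, data = chr_a2) %>%
    kable(digits = 3)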

5.5 The Fitted Model

5.5.1 Prediction Equation

In the following code, I will create a transformed linear model fitted by least squares called m2_log that uses predictor_unemp to predict the logarithm of outcome_crime. I will use the equatiomatic package to create the fitted line equation in an attractive format.

m2_log <- lm(log(outcome_crime) ~ predictor_unemp, data = chr_a2)

extract_eq(m2_log, use_coefs = TRUE, coef_digits = 3)

\[ \operatorname{\widehat{log(outcome\_crime)}} = 5.153 - 0.206(\operatorname{predictor\_unemp}_{\operatorname{low}}) \] The coefficient of my predictor is negative. Because predictor_unemp is a binary factor with "high" as the baseline category, this means that counties with an unemployment percentage below the median have a log violent crime rate that is lower, on average, by 0.206 than counties above the median. This is a weak association. This model also implies that the intercept, 5.153, is the average log violent crime rate for counties in the "high" category.
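
Because the predictor is binary, the fitted values are simply the two group means of log(outcome_crime); a quick check of this (a sketch; output not shown) is below. The "high" group mean should match the intercept (5.153) and the "low" group mean should equal 5.153 - 0.206 = 4.947.

# group means of the log outcome, which an OLS fit on a binary factor reproduces
chr_a2 %>%
  group_by(predictor_unemp) %>%
  summarise(mean_log_crime = mean(log(outcome_crime)))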

5.5.2 Tidy Summary

I will create a summary of m2_log’s coefficients with a 90% confidence interval using the tidy function and made into an attractive format using kable.

tidy(m2_log, conf.int = TRUE, conf.level = 0.90) %>%
    kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 5.153 0.057 90.008 0.000 5.058 5.247
predictor_unemplow -0.206 0.080 -2.556 0.011 -0.339 -0.073

These data show me that counties with an unemployment percentage below the median have a log violent crime rate lower, on average, by 0.206 than counties above the median, with a 90% confidence interval of (-0.339, -0.073). The fact that this interval does not contain '0' and is entirely negative means that, at the 90% confidence level, my model estimates the same (negative) direction for this difference across the whole interval.

5.5.3 Specified Summary Data

I will use the glance function to view the model’s R-squared, residual standard error, and the number of observations to which the model was fit.

glance(m2_log) %>%
  select(r.squared, sigma, nobs) %>%
  kable(digits = 3)
r.squared sigma nobs
0.026 0.632 247

The multiple R-squared is 0.026, implying that 2.6% of the variation in log(outcome_crime) is explained by predictor_unemp as per my model. The residual standard error is 0.632 and the number of observations is 247.
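
As a sanity check (a sketch; output not shown), the R-squared of an ordinary least squares model with an intercept equals the squared correlation between the observed and fitted values of the outcome:

# should return approximately 0.026
cor(log(chr_a2$outcome_crime), fitted(m2_log))^2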

5.6 Prediction Analysis

I will start this section by using mutate to recode my binary variable with more descriptive names, "below median" and "above median". Then I will count how many rows (counties) fall in each category.

chr_a2 <- chr_a2 %>%
    mutate(unemp2 = fct_recode(predictor_unemp,
                               "below median" = "low",
                               "above median" = "high"))

chr_a2 %>% count(predictor_unemp, unemp2) %>%
  kable()
predictor_unemp unemp2 n
high above median 122
low below median 125

There are 122 rows (counties) with an unemployment percentage above the median and 125 rows (counties) with an unemployment percentage below the median. The reason these category sizes differ, despite the split being at the median, is that I omitted 4 counties with missing outcome data and 1 county with an outcome value of '0'.

5.6.1 Normality Assessment

In order to plot the residuals against the categorical predictor and assess the normality of the residuals within each category, I will create a data set called m2_log_aug by applying augment to m2_log, which adds the residuals and fitted values for each row (county).

m2_log_aug <- augment(m2_log, data = chr_a2)

m2_log_aug
# A tibble: 247 x 12
   outcome_crime predictor_unemp county      state    id unemp2   .fitted .resid
           <dbl> <fct>           <chr>       <fct> <int> <fct>      <dbl>  <dbl>
 1         373.  high            Atlantic C~ NJ        1 above m~    5.15  0.769
 2          84.3 low             Bergen Cou~ NJ        2 below m~    4.95 -0.513
 3         163.  low             Burlington~ NJ        3 below m~    4.95  0.145
 4         465.  low             Camden Cou~ NJ        4 below m~    4.95  1.19 
 5         236.  high            Cape May C~ NJ        5 above m~    5.15  0.311
 6         516.  high            Cumberland~ NJ        6 above m~    5.15  1.09 
 7         606.  high            Essex Coun~ NJ        7 above m~    5.15  1.25 
 8         116.  low             Gloucester~ NJ        8 below m~    4.95 -0.198
 9         342.  low             Hudson Cou~ NJ        9 below m~    4.95  0.887
10          42.3 low             Hunterdon ~ NJ       10 below m~    4.95 -1.20 
# ... with 237 more rows, and 4 more variables: .hat <dbl>, .sigma <dbl>,
#   .cooksd <dbl>, .std.resid <dbl>

The residual values are stored as .resid for each row (county) in the above tibble.

I will create a pair of residual plots: a normal Q-Q plot of the residuals for the “low” unemployment category and a normal Q-Q plot of the residuals for the “high” unemployment category. This will allow me to assess normality of the residuals within each category. In order to do this I will create 2 temporary data sets: m2_log_aug_low which contains all the rows (counties) with an unemployment percentage below the median, and m2_log_aug_high which contains all the rows (counties) with an unemployment percentage above the median. I will use ggplot and the patchwork package.

m2_log_aug_low <- m2_log_aug %>%
  filter(predictor_unemp == "low")

m2_log_aug_high <- m2_log_aug %>%
  filter(predictor_unemp == "high")

p1 <- ggplot(data = m2_log_aug_low, aes(sample = .resid)) +
    geom_qq() + 
    geom_qq_line(col = "red") +
    labs(title = "Normal Q-Q plot",
         subtitle = "low unemployment category",
         x = "Theoretical Quantiles",
         y = "Residuals")

p2 <- ggplot(data = m2_log_aug_high, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       subtitle = "high unemployment category",
       x = "Theoretical Quantiles",
       y = "Residuals")

p1 + p2 +
  plot_annotation(title = "Assessment of Residuals",
                  subtitle = "for the linear model using `predictor_unemp` to predict log(`outcome_crime`)",
                  caption = "247 counties in NJ, NY, OH, PA, and VT in `m2_log_aug`")

The residuals for the high unemployment category appear closer to a normal distribution than those for the low unemployment category. The low unemployment category contains 3 outliers on the lower end of the scale and the high unemployment category contains 2 outliers on the higher end of the scale, which makes sense given that these two categories were split at the median.

5.6.2 Ability of the model to predict violent crime rate in Cuyahoga County

I will display an untransformed model’s prediction for violent crime in Cuyahoga County in Ohio, where CWRU’s campus is located, and compare the prediction, or fitted value, to the actual value of this outcome. To do this, I will create an untransformed model called m2 that shows outcome_crime as a function of predictor_unemp from the chr_a2 data set. Then I will augment this model to show me the residuals. Then I will filter to Cuyahoga County and select the columns I am interested in.

m2 <- lm(outcome_crime ~ predictor_unemp, data = chr_a2)

m2_aug <- augment(m2, data = chr_a2)

m2_aug %>%
  filter(county == "Cuyahoga County") %>%
  summarise(id, county, state, outcome_crime, .fitted, .resid) %>%
  kable(digits = 2)
id county state outcome_crime .fitted .resid
100 Cuyahoga County OH 645.99 180.77 465.22

My model's prediction for violent crime in Cuyahoga County is 180.77 crimes per 100,000 people, while the actual value is 645.99, leaving a residual of 465.22. This residual is far too large in this context for me to state confidently that my model predicts Cuyahoga County well.
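
As a simple check of how this residual is computed, it is just the observed outcome minus the fitted value:

# observed minus fitted for Cuyahoga County
645.99 - 180.77
# [1] 465.22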

5.6.3 Identification of counties for which m2 is least successful

I will use the arrange function to create a tibble of my untransformed data set that is arranged by absolute value of the residuals.

m2_aug %>% select(id, county, state, .resid) %>%
  arrange(desc(abs(.resid))) %>%
  head() %>%
  kable(digits = 2)
id county state .resid
217 Philadelphia County PA 800.22
128 Lucas County OH 623.14
100 Cuyahoga County OH 465.22
64 Richmond County NY 417.12
62 Queens County NY 417.01
52 New York County NY 405.64

The 2 counties for which my model m2 is least successful at predicting the outcome are Philadelphia County in Pennsylvania and Lucas County in Ohio, as these 2 counties have the largest absolute residuals. The residual for Philadelphia County is 800.22 and the residual for Lucas County is 623.14.

5.7 Conclusions and Limitations

I ran these analyses in order to determine if populations with a percentage of unemployed residents above the median have a higher rate of violent crime, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont. The answer is yes, although the association is weak. My model indicates that counties with an unemployment percentage below the median have a log violent crime rate per 100,000 people that is lower, on average, by 0.206 than counties above the median; equivalently, above-median counties have somewhat higher violent crime rates on the log scale. I am 90% confident that the true difference in the log-transformed model is between (-0.339, -0.073); this interval is entirely negative and does not contain the value '0'.

The logarithm transformation addressed my data's limitations with regard to normality, as it reduced the right skew present on the original scale. Linearity is not a serious concern with a binary predictor, and the Q-Q plots of the residuals showed no troubling pattern. The major limitation of my work in Analysis 2 is that all my counties are within northeastern states, i.e. they were not randomly selected. Ohio is sometimes considered northeastern and sometimes Midwestern, but this grey area does not add any meaningful variation to my selected counties because 4 of the 5 states are definitely northeastern (NJ, NY, PA, and VT). This affects my external validity and the application of my findings to my target population, all counties in the US.

6 Analysis 3: Adjusting for the Impact of State on my Linear Model

In this section, I will build a linear regression model to predict violent crime using number of poor mental health days in the past 30 days and the state in which the county is located.

6.1 The Variables

My outcome variable is violent_crime and my predictors are poor_mental_health_days and state. violent_crime is a rate that describes the number of reported violent crime offenses per 100,000 of the county's population. In this data set, a violent crime is an offense that involves face-to-face confrontation between a victim and a perpetrator, and it includes only those crimes reported to the FBI. My outcome variable is missing in 4 counties, all of which are in Ohio. poor_mental_health_days is an age-adjusted rate that describes the average number of mentally unhealthy days reported in the past 30 days among the non-institutionalized adult population. This predictor is not missing for any counties in my states of interest, which are New Jersey, New York, Ohio, Pennsylvania, and Vermont. Finally, state describes the US state in which the county is located.

I will restrict my data set to the complete cases of these three variables and create a tibble called chr_a3. This data set will contain values for violent_crime, poor_mental_health_days, county, and state. violent_crime will be renamed to outcome_crime and poor_mental_health_days will be renamed to predictor_mhealth. I will add a new variable called id to give each row a unique identifier, to help with my analyses later on.

chr_a3 <- chr_2021 %>%
  filter(complete.cases(violent_crime, poor_mental_health_days, state)) %>%
  select(violent_crime, poor_mental_health_days, county, state) %>%
  rename(outcome_crime = violent_crime,
         predictor_mhealth = poor_mental_health_days) %>%
  mutate(state = fct_relevel(state, "OH")) %>%
  mutate(id = row_number()) %>%
  clean_names()

chr_a3
# A tibble: 248 x 5
   outcome_crime predictor_mhealth county            state    id
           <dbl>             <dbl> <chr>             <fct> <int>
 1         373.               4.72 Atlantic County   NJ        1
 2          84.3              4.00 Bergen County     NJ        2
 3         163.               4.45 Burlington County NJ        3
 4         465.               4.60 Camden County     NJ        4
 5         236.               4.55 Cape May County   NJ        5
 6         516.               4.93 Cumberland County NJ        6
 7         606.               4.37 Essex County      NJ        7
 8         116.               4.51 Gloucester County NJ        8
 9         342.               4.28 Hudson County     NJ        9
10          42.3              4.02 Hunterdon County  NJ       10
# ... with 238 more rows

The above tibble has 248 rows (counties), which I expected because 248 counties have complete data on violent_crime, my outcome, and on poor_mental_health_days and state, my predictors.

One of the counties in my data set has a ‘0’ value for violent_crime. In order to be able to properly test transformations of my data, particularly to take the logarithm of my outcome, I need to omit this county from my data set. To do this, I will filter my data to include only counties with an outcome value greater than ‘0’.

chr_a3 <- chr_a3 %>%
  filter(outcome_crime > 0)

nrow(chr_a3)
[1] 247

There are 247 rows (counties) remaining, which is correct.

I will specify the values of my outcome, outcome_crime, and my predictor, predictor_mhealth, for Cuyahoga County in Ohio, where CWRU's campus is located.

chr_a3 %>%
  filter(county == "Cuyahoga County") %>%
  summarise(outcome_crime, predictor_mhealth, county, state) %>%
  kable(digits = 2)
outcome_crime predictor_mhealth county state
645.99 4.97 Cuyahoga County OH

In Ohio's Cuyahoga County, the violent crime rate is 645.99 per 100,000 people and the average number of poor mental health days reported in the past 30 days is 4.97.

6.2 Research Question

Do populations with a higher rate of poor mental health days also have a higher rate of violent crime, adjusting for the impact of state, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont?

6.3 Visualizing the Data

I will visualize my data in a scatter plot using the ggplot function, with a fitted linear model for each state, to assess how well predictor_mhealth predicts the log of outcome_crime, with the results separated by state. I will use the same transformation I used in Analysis 1, the logarithm of my outcome.

ggplot(data = chr_a3, aes(x = predictor_mhealth, y = log(outcome_crime), col = state, group = state)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + 
    facet_wrap(~ state) +
    labs(title = "Poor Mental Health as a Predictor of Log(Violent Crime)",
         subtitle = "separated by state, with OLS",
         caption = "247 counties in NJ, NY, OH, PA, VT from `chr_a3`",
         x = "# Poor Mental Health Days in Past 30 Days",
         y = "Log(Violent Crime Rate (per 100,000 people))")

The number of poor mental health days in the past 30 days appears to predict the violent crime rate better in some states than in others across these 247 counties. The relationship appears strongest in Pennsylvania, where the points cluster most tightly around the fitted line. The relationship appears weakest in Ohio and New York, where the points show little pattern around the fitted line. New Jersey's panel has the steepest slope, indicating that poor mental health carries the most predictive value for the outcome in that state compared to the others.

6.4 The Fitted Model

6.4.1 Prediction Equation

In this section, I will create a model called m3_log that uses predictor_mhealth to predict the log of outcome_crime, allowing the slope of predictor_mhealth to vary by state (an interaction model). I will use the equatiomatic package to create the fitted equation in an attractive format.

m3_log <- lm(log(outcome_crime) ~ 
               predictor_mhealth * state, data = chr_a3)

extract_eq(m3_log, use_coefs = TRUE, coef_digits = 3, 
           wrap = TRUE, terms_per_line = 1)

\[ \begin{aligned} \operatorname{\widehat{log(outcome\_crime)}} &= 4.069\ + \\ &\quad 0.158(\operatorname{predictor\_mhealth})\ - \\ &\quad 3.878(\operatorname{state}_{\operatorname{NJ}})\ + \\ &\quad 1.558(\operatorname{state}_{\operatorname{NY}})\ - \\ &\quad 0.046(\operatorname{state}_{\operatorname{PA}})\ + \\ &\quad 0.976(\operatorname{state}_{\operatorname{VT}})\ + \\ &\quad 0.975(\operatorname{predictor\_mhealth} \times \operatorname{state}_{\operatorname{NJ}})\ - \\ &\quad 0.242(\operatorname{predictor\_mhealth} \times \operatorname{state}_{\operatorname{NY}})\ + \\ &\quad 0.072(\operatorname{predictor\_mhealth} \times \operatorname{state}_{\operatorname{PA}})\ - \\ &\quad 0.298(\operatorname{predictor\_mhealth} \times \operatorname{state}_{\operatorname{VT}}) \end{aligned} \] The state I will discuss is New York. For New York, the equation will be as follows:

log(outcome_crime) = 4.069 + 0.158(predictor_mhealth) − 3.878(0) + 1.558(1) − 0.046(0) + 0.976(0) + 0.975(0) − 0.242(predictor_mhealth × 1) + 0.072(0) − 0.298(0)

or more simply:

log(outcome_crime)

= (4.069 + 1.558) + (0.158 − 0.242)(predictor_mhealth)

= 5.627 − 0.084(predictor_mhealth)

Note that the main effect of predictor_mhealth (0.158) still applies to counties in New York; only its sum with the New York interaction term (−0.242) gives the state-specific slope. That slope is negative, indicating that, within counties in the state of New York, as predictor_mhealth increases, the log of outcome_crime decreases, and vice versa: for every additional poor mental health day in the past 30 days, the log of the violent crime rate will be lower by 0.084. This model also implies that if the number of poor mental health days in the past 30 days for counties in New York were '0', the log of the violent crime rate would be 5.627.
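
As a quick check of this simplified equation (a sketch; output not shown), I can predict for a hypothetical New York county at the median predictor value of 4.79 used in section 6.4.4; the result should be approximately 5.627 − 0.084 × 4.79 = 5.22, which agrees with the fitted value of 5.23 shown there up to coefficient rounding.

# predicted log(violent crime rate) for a New York county at the
# median number of poor mental health days (4.79)
predict(m3_log, newdata = tibble(predictor_mhealth = 4.79, state = "NY"))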

6.4.2 Tidy Summary

I will create a summary of m3_log’s coefficients with a 90% confidence interval using the tidy function and made into an attractive format using kable. I made Ohio my baseline state when I set up my data set.

tidy(m3_log, conf.int = TRUE, conf.level = 0.90) %>%
    select(term, estimate, std.error, conf.low, conf.high) %>%
    kable(digits = 3)
term estimate std.error conf.low conf.high
(Intercept) 4.069 0.875 2.624 5.514
predictor_mhealth 0.158 0.174 -0.128 0.445
stateNJ -3.878 1.910 -7.032 -0.725
stateNY 1.558 1.370 -0.705 3.821
statePA -0.046 1.519 -2.555 2.463
stateVT 0.976 2.887 -3.791 5.743
predictor_mhealth:stateNJ 0.975 0.423 0.277 1.673
predictor_mhealth:stateNY -0.242 0.295 -0.729 0.245
predictor_mhealth:statePA 0.072 0.302 -0.427 0.571
predictor_mhealth:stateVT -0.298 0.627 -1.334 0.737

These data show me that, for counties in Ohio (my baseline state), every additional poor mental health day in the past 30 days is associated with a higher log violent crime rate by 0.158, with a 90% confidence interval of (-0.128, 0.445). Because this interval contains '0', I cannot rule out the possibility of no association, or even a negative one, in Ohio at the 90% confidence level. The only interaction term whose 90% confidence interval does not contain '0' is New Jersey's, and its estimate is positive: I am 90% confident that the slope relating poor mental health to the log of violent crime is more strongly positive in New Jersey than in Ohio.
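
To see each state's implied slope directly (a sketch; output not shown), I can add the main effect to the relevant interaction coefficient, using the term names from the tidy summary above:

# implied slope of predictor_mhealth within each state:
# the OH main effect plus that state's interaction term
b <- coef(m3_log)
tibble(state = c("OH", "NJ", "NY", "PA", "VT"),
       slope = b["predictor_mhealth"] +
         c(0,
           b["predictor_mhealth:stateNJ"],
           b["predictor_mhealth:stateNY"],
           b["predictor_mhealth:statePA"],
           b["predictor_mhealth:stateVT"]))

This yields slopes of approximately 0.158 (OH), 1.133 (NJ), -0.084 (NY), 0.230 (PA), and -0.140 (VT).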

6.4.3 Specified Summary Data

I will use the glance function to view the model’s R-squared, residual standard error, and the number of observations to which the model was fit.

glance(m3_log) %>% 
  select(r.squared, sigma, nobs) %>%
  kable(digits = 3)
r.squared sigma nobs
0.163 0.596 247

The multiple R-squared is 0.163, which implies that 16.3% of the variation in the log of outcome_crime is explained by predictor_mhealth, state, and their interaction, as per my model. The residual standard error is 0.596, so I would anticipate that roughly 95% of the errors made by my model would fall between -1.192 and +1.192. The number of observations is 247.
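
A rough empirical check of this claim (a sketch; output not shown) is the share of residuals falling within two residual standard errors of zero, which should be close to 0.95 if the residuals are approximately normally distributed:

# proportion of residuals within 2 residual standard errors of zero
mean(abs(resid(m3_log)) < 2 * sigma(m3_log))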

6.4.4 Making Predictions at the median number of mental health days, for each state

I will determine the median value of my predictor in my data set. Then I will create a new data set in which each of my states of interest has a predictor value equal to the median of the data set.

median(chr_a3$predictor_mhealth)
[1] 4.789679
new_data <- 
  tibble(predictor_mhealth = c(4.79, 4.79, 4.79, 4.79, 4.79),
         state = c("NJ", "NY", "OH", "PA","VT"))

new_data %>% kable(digits = 2)
predictor_mhealth state
4.79 NJ
4.79 NY
4.79 OH
4.79 PA
4.79 VT

The median predictor value for the data set is 4.79. The table above shows each state paired with this predictor value. Now, I will use augment with this new data to calculate the fitted values; since new_data contains no observed outcomes, augment returns predictions only, not residuals.

augment(m3_log, newdata = new_data) %>%
  kable(digits = 2)
predictor_mhealth state .fitted
4.79 NJ 5.62
4.79 NY 5.23
4.79 OH 4.83
4.79 PA 5.13
4.79 VT 4.37

The 2 states with the highest fitted values here are New Jersey (5.62) and New York (5.23), while Vermont has the lowest (4.37). The spread of these fitted values, all evaluated at the same median predictor value, shows how much state matters in this model: a model that did not account for the impact of state would predict counties in the states at the extremes least successfully.

6.5 Residual Analysis

In this section, I will analyze the residuals from my transformed linear model. I will use augment to obtain the residuals.

m3_log_aug <- augment(m3_log, data = chr_a3)

m3_log_aug
# A tibble: 247 x 11
   outcome_crime predictor_mhealth county      state    id .fitted .resid   .hat
           <dbl>             <dbl> <chr>       <fct> <int>   <dbl>  <dbl>  <dbl>
 1         373.               4.72 Atlantic C~ NJ        1    5.54  0.387 0.0914
 2          84.3              4.00 Bergen Cou~ NJ        2    4.72 -0.288 0.113 
 3         163.               4.45 Burlington~ NJ        3    5.24 -0.143 0.0491
 4         465.               4.60 Camden Cou~ NJ        4    5.40  0.742 0.0650
 5         236.               4.55 Cape May C~ NJ        5    5.34  0.121 0.0575
 6         516.               4.93 Cumberland~ NJ        6    5.78  0.464 0.170 
 7         606.               4.37 Essex Coun~ NJ        7    5.15  1.26  0.0477
 8         116.               4.51 Gloucester~ NJ        8    5.30 -0.550 0.0532
 9         342.               4.28 Hudson Cou~ NJ        9    5.04  0.791 0.0528
10          42.3              4.02 Hunterdon ~ NJ       10    4.74 -0.999 0.106 
# ... with 237 more rows, and 3 more variables: .sigma <dbl>, .cooksd <dbl>,
#   .std.resid <dbl>

The residual values are stored as .resid for each row (county) in the above tibble.

6.5.1 Assessment of linearity and normality

I will create a pair of residual plots: a scatter plot of residuals vs. fitted values to assess linearity and a normal Q-Q plot to assess normality of the residuals. I will use ggplot and the patchwork package and include a linear model and a loess smooth curve.

p1 <- ggplot(data = m3_log_aug, aes(x = .fitted, y = .resid)) +
    geom_point(size = 2, alpha = 0.6) + 
    geom_smooth(method = "lm", formula = y ~ x,
                se = FALSE, size = 1, col = "blue") + 
    geom_smooth(method = "loess", formula = y ~ x, 
                se = FALSE, size = 1, col = "red") +
    labs(title = "Residuals vs. Fitted",
         subtitle = "with OLS (blue), loess (red)",
         x = "Fitted Values",
         y = "Residuals")

p2 <- ggplot(data = m3_log_aug, aes(sample = .resid)) +
    geom_qq() + 
    geom_qq_line(col = "red") +
    labs(title = "Normal Q-Q plot",
         x = "Theoretical Quantiles",
         y = "Residuals")

p1 + p2 +
   plot_annotation(title = "Assessment of Residuals",
                  subtitle = "using mental health to predict the log of violent crime with the impact of state",
                  caption = "247 counties in NJ, NY, OH, PA, and VT in m3_log_aug")

This shows a scatterplot of residuals vs. fitted values which helps me determine linearity of my transformed model, and a normal Q-Q plot of residuals which helps me determine normality of my transformed model. A description of this pair of plots is below.

6.5.2 Residual plot interpretation

My scatter plot appears to have 2 clusters of data, at fitted values of around 4.9 and 5.25. This is very minor, and there is no substantial bend in the loess curve, so this plot suggests I can assume linearity for my transformed model. The normal Q-Q plot to the right suggests I can assume normality of the residuals. There are 2-4 slight outliers on the Q-Q plot, but I do not expect something this minor to affect my model meaningfully.

6.5.3 Ability of m3 to predict violent crime rate in Cuyahoga County

I will create an untransformed model and display its prediction for violent crime in Cuyahoga County in Ohio, where CWRU’s campus is located, and compare the prediction, or fitted value, to the actual value of this outcome.

m3 <- lm(outcome_crime ~ predictor_mhealth * state, data = chr_a3)

m3_aug <- augment(m3, data = chr_a3)

m3_aug %>%
  filter(county == "Cuyahoga County") %>%
  summarise(id, county, state, outcome_crime, .fitted, .resid) %>%
  kable(digits = 2)
id county state outcome_crime .fitted .resid
100 Cuyahoga County OH 645.99 160.15 485.84

After adjusting for state, the model becomes less successful at this prediction for Cuyahoga County. My model’s prediction for violent crime here is 160.15, while the actual value is 645.99, creating a residual of 485.84 (which is higher than in the model which did not account for the impact of state).

6.5.4 Identification of counties for which m3 is least successful

I will use the arrange function to create a tibble of my untransformed data set that is arranged by absolute value of the residuals.

m3_aug %>% select(id, county, state, .resid) %>%
  arrange(desc(abs(.resid))) %>%
  head() %>%
  kable(digits = 2)
id county state .resid
218 Philadelphia County PA 752.84
128 Lucas County OH 662.92
100 Cuyahoga County OH 485.84
7 Essex County NJ 373.14
24 Bronx County NY 364.87
64 Richmond County NY 359.40

The 2 counties for which my model m3 is least successful at predicting the outcome are Philadelphia County in Pennsylvania and Lucas County in Ohio, as these 2 counties have the largest absolute residuals. The residual for Philadelphia County is 752.84 and the residual for Lucas County is 662.92.

6.6 Conclusion and Limitations

I sought to determine if populations with a higher rate of poor mental health days also have a higher rate of violent crime after adjusting for the impact of state, in 247 counties in the states of New Jersey, New York, Ohio, Pennsylvania, and Vermont. The answer is yes for counties in New Jersey, and no for counties in New York, Pennsylvania, Ohio, and Vermont. In New Jersey counties, a higher number of poor mental health days is associated with a higher violent crime rate, with a 90% confidence interval that does not contain '0'. The confidence intervals for the counties in the other states do contain '0'.

Again, the major limitation here is the lack of randomization in my sample. All my counties are within northeastern states. Ohio is sometimes considered northeastern and sometimes Midwestern, but this grey area does not add any meaningful variation to my selected counties. This affects my external validity and the application of my findings to my target population, all counties in the US. As shown by my residuals vs. fitted plot and normal Q-Q plot above, normality and linearity were not issues in my transformed data set.

7 Session Information

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggridges_0.5.3     mosaicData_0.20.2  ggformula_0.10.1   ggstance_0.3.5    
 [5] Matrix_1.3-4       lattice_0.20-44    forcats_0.5.1      stringr_1.4.0     
 [9] dplyr_1.0.7        purrr_0.3.4        readr_2.0.2        tidyr_1.1.4       
[13] tibble_3.1.4       tidyverse_1.3.1    broom_0.7.9        car_3.0-11        
[17] carData_3.0-4      patchwork_1.1.1    equatiomatic_0.3.0 ggrepel_0.9.1     
[21] ggplot2_3.3.5      naniar_0.6.1       magrittr_2.0.1     janitor_2.1.0     
[25] rmdformats_1.0.3   knitr_1.36        

loaded via a namespace (and not attached):
  [1] colorspace_2.0-2    ellipsis_0.3.2      rio_0.5.27         
  [4] visdat_0.5.3        leaflet_2.0.4.1     htmlTable_2.3.0    
  [7] snakecase_0.11.0    base64enc_0.1-3     ggdendro_0.1.22    
 [10] fs_1.5.0            rstudioapi_0.13     farver_2.1.0       
 [13] bit64_4.0.5         fansi_0.5.0         lubridate_1.7.10   
 [16] xml2_1.3.2          mosaic_1.8.3        splines_4.1.1      
 [19] polyclip_1.10-0     Formula_1.2-4       jsonlite_1.7.2     
 [22] cluster_2.1.2       dbplyr_2.1.1        png_0.1-7          
 [25] ggforce_0.3.3       shiny_1.7.1         compiler_4.1.1     
 [28] httr_1.4.2          backports_1.2.1     assertthat_0.2.1   
 [31] fastmap_1.1.0       cli_3.0.1           later_1.3.0        
 [34] tweenr_1.0.2        htmltools_0.5.2     tools_4.1.1        
 [37] gtable_0.3.0        glue_1.4.2          Rcpp_1.0.7         
 [40] cellranger_1.1.0    jquerylib_0.1.4     vctrs_0.3.8        
 [43] nlme_3.1-152        crosstalk_1.1.1     xfun_0.26          
 [46] openxlsx_4.2.4      rvest_1.0.2         mime_0.12          
 [49] lifecycle_1.0.1     mosaicCore_0.9.0    MASS_7.3-54        
 [52] scales_1.1.1        vroom_1.5.5         hms_1.1.1          
 [55] promises_1.2.0.1    parallel_4.1.1      RColorBrewer_1.1-2 
 [58] yaml_2.2.1          curl_4.3.2          gridExtra_2.3      
 [61] sass_0.4.0          rpart_4.1-15        labelled_2.9.0     
 [64] latticeExtra_0.6-29 stringi_1.7.4       highr_0.9          
 [67] checkmate_2.0.0     zip_2.2.0           rlang_0.4.11       
 [70] pkgconfig_2.0.3     evaluate_0.14       labeling_0.4.2     
 [73] htmlwidgets_1.5.4   bit_4.0.4           tidyselect_1.1.1   
 [76] plyr_1.8.6          bookdown_0.24       R6_2.5.1           
 [79] generics_0.1.1      Hmisc_4.6-0         DBI_1.1.1          
 [82] mgcv_1.8-36         pillar_1.6.4        haven_2.4.3        
 [85] foreign_0.8-81      withr_2.4.2         nnet_7.3-16        
 [88] survival_3.2-11     abind_1.4-5         modelr_0.1.8       
 [91] crayon_1.4.2        utf8_1.2.2          tzdb_0.1.2         
 [94] rmarkdown_2.11      jpeg_0.1-9          grid_4.1.1         
 [97] readxl_1.3.1        data.table_1.14.2   reprex_2.0.1       
[100] digest_0.6.28      
 [ reached getOption("max.print") -- omitted 4 entries ]