hw1
descriptive statistics
probability
Intro to Quantitative Analysis
Author

Ollie Murphy

Published

February 28, 2023

Question 1: Use the LungCapData to answer the following questions.

Load packages and import data

Code
library(here)
here() starts at C:/Users/Ollie/OneDrive - University of Massachusetts/Spring_2023/Quant_Analysis/603_Spring_2023
Code
library(readxl)
Warning: package 'readxl' was built under R version 4.1.3
Code
library(dplyr)
Warning: package 'dplyr' was built under R version 4.1.3

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
df <- read_excel("_data/LungCapData.xls")

a) What does the distribution of LungCap look like?

Code
hist(df$LungCap, main = "Histogram of Lung Capacity", xlab = "Lung Capacity")

Based on the histogram, lung capacity appears to be approximately normally distributed, centered around 7 to 8, with values ranging from roughly 0 to 15.

b) Compare the probability distribution of the LungCap with respect to Males and Females.

Code
boxplot(LungCap ~ Gender, data = df)

The two distributions have a similar spread, but lung capacity for males is shifted higher: the median, both quartiles, the minimum, and the maximum are all greater than for females.

c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code
df %>%
  group_by(Smoke) %>%
  summarise(LungCap = (mean(LungCap,na.rm = TRUE)))
# A tibble: 2 x 2
  Smoke LungCap
  <chr>   <dbl>
1 no       7.77
2 yes      8.65

At first glance the means don’t make sense: smokers have a higher mean lung capacity (8.65) than non-smokers (7.77), whereas you would expect smoking to diminish lung capacity. One possible explanation is that people with greater lung capacity tolerate smoking better and therefore smoke more. Another is the inclusion of children, who are less likely to smoke and also have smaller lung capacities.

d) Examine the relationship between Smoking and Lung Capacity within age groups: “Less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code
smoke_df = df%>%
  mutate(
   AgeRange = case_when(
    Age < 14 ~ "13 and Under",
    Age == 14 ~ "14 to 15",
    Age == 15 ~ "14 to 15",
    Age == 16 ~ "16 to 17",
    Age == 17 ~ "16 to 17",
    Age > 17 ~ "18 and Over"
  )) %>%
  filter(Smoke == "yes")%>%
  group_by(AgeRange)%>%
    summarize(mean(LungCap, na.rm = TRUE))

nosmoke_df  = df%>%
  mutate(
   AgeRange = case_when(
    Age < 14 ~ "13 and Under",
    Age == 14 ~ "14 to 15",
    Age == 15 ~ "14 to 15",
    Age == 16 ~ "16 to 17",
    Age == 17 ~ "16 to 17",
    Age > 17 ~ "18 and Over"
  )) %>%
  filter(Smoke == "no")%>%
  group_by(AgeRange)%>%
    summarize(mean(LungCap, na.rm = TRUE))

smokeByAge = left_join(smoke_df, nosmoke_df, by = ("AgeRange"))
names(smokeByAge) = c("AgeRange", "Smoker", "NonSmoker")

smokeByAge
# A tibble: 4 x 3
  AgeRange     Smoker NonSmoker
  <chr>         <dbl>     <dbl>
1 13 and Under   7.20      6.36
2 14 to 15       8.39      9.14
3 16 to 17       9.38     10.5 
4 18 and Over   10.5      11.1 
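The age grouping above can also be sketched with base R's cut(), which assigns every value to an interval in one call. This is only an illustration: the `age` vector below is a hypothetical stand-in for df$Age, and the break points assume whole-number ages (for fractional ages such as 13.5, the intervals would not match the case_when() conditions above).

```r
# Hypothetical ages, standing in for df$Age (not taken from LungCapData)
age <- c(3, 13, 14, 15, 16, 17, 18, 19)

# right = TRUE closes each interval on the right:
# (-Inf, 13], (13, 15], (15, 17], (17, Inf)
age_range <- cut(
  age,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("13 and Under", "14 to 15", "16 to 17", "18 and Over"),
  right = TRUE
)

# Count how many ages fall into each group
table(age_range)
```

Because cut() returns a factor with the labels in order, the groups also sort correctly in tables and plots without relying on alphabetical order.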

e) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

While the youngest age group follows the same trend observed in part c (smokers have a higher mean lung capacity than non-smokers), the opposite holds for every older group. I suspect that, as I mentioned above, the overall comparison is confounded by age: older people are both more likely to smoke and more likely to have a larger lung capacity. When smokers and non-smokers are compared within the same age range (and therefore at similar stages of development), the results suggest that smoking likely diminishes lung capacity.
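How an overall comparison can point the opposite way from every within-group comparison is easier to see on a tiny example. The data frame below is invented purely for illustration (it is not drawn from LungCapData, and the column names are hypothetical): most smokers sit in the older group, which has larger lung capacities regardless of smoking.

```r
# Hypothetical toy data, invented only to show the pattern
toy <- data.frame(
  age_group = c("young", "young", "young", "young", "old", "old", "old", "old"),
  smoke     = c("no", "no", "no", "yes", "no", "yes", "yes", "yes"),
  lung_cap  = c(5, 5, 5, 4, 11, 10, 10, 10)
)

# Overall means by smoking status: smokers look better off
overall <- tapply(toy$lung_cap, toy$smoke, mean)
overall

# Means by age group and smoking status: smokers are lower in both groups
within <- tapply(toy$lung_cap, list(toy$age_group, toy$smoke), mean)
within
```

In this toy case the overall smoker mean is higher only because smokers are concentrated in the older group, which is the same mechanism suspected in the answer above.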

Question 2: Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.

Code
convictionsfreq = data.frame(convictions = c(0,1,2,3,4), frequency = c(128,434,160,64,24))
convictionsfreq
  convictions frequency
1           0       128
2           1       434
3           2       160
4           3        64
5           4        24

a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code
160/(128+434+160+64+24)
[1] 0.1975309

b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code
(128+434)/(128+434+160+64+24)
[1] 0.6938272

c) What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code
(128+434+160)/(128+434+160+64+24)
[1] 0.891358

d) What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code
(64+24)/(128+434+160+64+24)
[1] 0.108642
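The four probabilities in parts a) to d) can also be read off a probability table built once from the frequencies, rather than retyping the counts each time. This is a minimal sketch; the variable names `p` and `cdf` are my own and do not appear in the original answers.

```r
convictions <- c(0, 1, 2, 3, 4)
frequency   <- c(128, 434, 160, 64, 24)

# P(X = x): relative frequency of each number of prior convictions
p <- frequency / sum(frequency)
names(p) <- convictions

# P(X <= x): running total of the probabilities
cdf <- cumsum(p)

p[["2"]]        # a) P(X = 2)
cdf[["1"]]      # b) P(X < 2), which equals P(X <= 1)
cdf[["2"]]      # c) P(X <= 2)
1 - cdf[["2"]]  # d) P(X > 2), the complement of part c
```

Parts c) and d) are complements, so their probabilities sum to 1, which is a quick sanity check on the arithmetic.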

e) What is the expected value for the number of prior convictions?

Code
(0*(128/(128+434+160+64+24)))+(1*(434/(128+434+160+64+24)))+(2*(160/(128+434+160+64+24)))+(3*(64/(128+434+160+64+24)))+(4*(24/(128+434+160+64+24)))
[1] 1.28642
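The expected value is the sum of each value times its probability, E[X] = Σ x · P(X = x). A shorter way to write the same calculation, sketched here with base R's weighted.mean() and the frequencies as weights (the variable names are my own):

```r
convictions <- c(0, 1, 2, 3, 4)
frequency   <- c(128, 434, 160, 64, 24)

# weighted.mean() computes sum(x * w) / sum(w), i.e. sum(x * P(X = x))
expected_value <- weighted.mean(convictions, w = frequency)
expected_value

# The same quantity written out explicitly
sum(convictions * frequency / sum(frequency))
```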

f) Calculate the variance and the standard deviation for the Prior Convictions.

Code
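# var()/sd() on the five distinct values would ignore the counts, so weight each value by its probability
p <- convictionsfreq$frequency / sum(convictionsfreq$frequency)
mu <- sum(convictionsfreq$convictions * p)
sum((convictionsfreq$convictions - mu)^2 * p)
[1] 0.8562353
Code
sqrt(sum((convictionsfreq$convictions - mu)^2 * p))
[1] 0.9253298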
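One way to cross-check a variance computed from a frequency table is to expand the table into one value per inmate with rep() and compute the variance on those 810 values. Applying var() directly to the five distinct values 0 to 4 would ignore the frequencies, so the expansion step matters. The sketch below assumes the population form of the variance (dividing by n) is the one wanted for the distribution of X; R's var() divides by n - 1, so the two differ slightly.

```r
convictions <- c(0, 1, 2, 3, 4)
frequency   <- c(128, 434, 160, 64, 24)

# One entry per inmate: 128 zeros, 434 ones, and so on
x <- rep(convictions, times = frequency)
n <- length(x)

# Population variance and standard deviation (divide by n)
pop_var <- mean((x - mean(x))^2)
pop_sd  <- sqrt(pop_var)

# Sample variance as computed by var() (divides by n - 1)
samp_var <- var(x)

c(pop_var = pop_var, pop_sd = pop_sd, samp_var = samp_var)
```

With n = 810 the two versions are close, since they differ only by the factor (n - 1) / n.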