Homework 1

hw1

descriptive statistics

probability

Intro to Quantitative Analysis

Author

Ollie Murphy

Published

February 28, 2023

Question 1: Use the LungCapData to answer the following questions.

Load the required packages and import the data.

Code

library(here)

here() starts at C:/Users/Ollie/OneDrive - University of Massachusetts/Spring_2023/Quant_Analysis/603_Spring_2023

Code

library(readxl)

Warning: package 'readxl' was built under R version 4.1.3

Code

library(dplyr)

Warning: package 'dplyr' was built under R version 4.1.3


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

df <- read_excel("_data/LungCapData.xls")

a) What does the distribution of LungCap look like?

Code

hist(df$LungCap, main = "Histogram of Lung Capacity", xlab = "Lung Capacity")

Based on the histogram, lung capacity appears to be approximately normally distributed, centered around 7 to 8, with values ranging from about 0 to about 15.

b) Compare the probability distribution of LungCap with respect to Males and Females.

Code

boxplot(LungCap ~ Gender, data = df)

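The two distributions have a similar spread, but the male distribution is shifted upward: its median, quartiles, minimum, and maximum are all higher than the corresponding female values. (A boxplot shows the median rather than the mean.)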

c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code

df %>%
  group_by(Smoke) %>%
  summarise(LungCap = (mean(LungCap,na.rm = TRUE)))

# A tibble: 2 x 2
  Smoke LungCap
  <chr>   <dbl>
1 no       7.77
2 yes      8.65

At first glance the result does not make sense: smokers have a higher mean lung capacity (8.65) than non-smokers (7.77), whereas one would expect smoking to diminish lung capacity. One possible explanation is that people with better lung capacity find smoking easier and therefore do it more. A more likely one is that the sample includes children, who are less likely to smoke and also have smaller lung capacities, which pulls the non-smoker mean down.

d) Examine the relationship between Smoking and Lung Capacity within age groups: “Less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code

# Assign each observation to an age group once and reuse it for both summaries
aged_df <- df %>%
  mutate(
    AgeRange = case_when(
      Age < 14 ~ "13 and Under",
      Age %in% c(14, 15) ~ "14 to 15",
      Age %in% c(16, 17) ~ "16 to 17",
      Age > 17 ~ "18 and Over"
    ))

# Mean lung capacity of smokers in each age group
smoke_df <- aged_df %>%
  filter(Smoke == "yes") %>%
  group_by(AgeRange) %>%
  summarize(Smoker = mean(LungCap, na.rm = TRUE))

# Mean lung capacity of non-smokers in each age group
nosmoke_df <- aged_df %>%
  filter(Smoke == "no") %>%
  group_by(AgeRange) %>%
  summarize(NonSmoker = mean(LungCap, na.rm = TRUE))

# Put the two summaries side by side, one row per age group
smokeByAge <- left_join(smoke_df, nosmoke_df, by = "AgeRange")

smokeByAge

# A tibble: 4 x 3
  AgeRange     Smoker NonSmoker
  <chr>         <dbl>     <dbl>
1 13 and Under   7.20      6.36
2 14 to 15       8.39      9.14
3 16 to 17       9.38     10.5 
4 18 and Over   10.5      11.1

e) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

The youngest age group follows the same trend observed in part c (smokers have a higher mean lung capacity than non-smokers), but the inverse is true for all three older groups. As suggested above, the overall comparison is confounded by age: older people are both more likely to smoke and more likely to have a larger lung capacity. When smokers and non-smokers are compared within the same age range (and therefore at similar stages of development), the results are consistent with smoking diminishing lung capacity.
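The confounding argument can be illustrated with a small made-up example. The sketch below uses a hypothetical `toy` data frame (its values are invented for illustration and are not taken from LungCapData) in which smokers have lower lung capacity within each age group, yet a higher pooled mean because most smokers are in the older group.

```r
# Hypothetical toy data (NOT LungCapData): lung capacity grows with age,
# and smoking is more common in the older group.
toy <- data.frame(
  age_group = c(rep("child", 6), rep("teen", 6)),
  smoke     = c("no", "no", "no", "no", "no", "yes",
                "no", "yes", "yes", "yes", "yes", "yes"),
  lung_cap  = c(5.0, 5.2, 4.8, 5.1, 4.9, 4.6,
                10.0, 9.0, 9.2, 8.8, 9.1, 8.9)
)

# Pooled means: smokers look better because most of them are teens
pooled <- tapply(toy$lung_cap, toy$smoke, mean)

# Means within each age group: smokers are lower in both groups
within <- tapply(toy$lung_cap, list(toy$age_group, toy$smoke), mean)

print(pooled)
print(within)
```

In this toy example the pooled comparison and the within-group comparison point in opposite directions, which is the same kind of pattern seen in the three older age groups above.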

Question 2: Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners

Code

convictionsfreq = data.frame(convictions = c(0,1,2,3,4), frequency = c(128,434,160,64,24))
convictionsfreq

  convictions frequency
1           0       128
2           1       434
3           2       160
4           3        64
5           4        24

a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code

160/(128+434+160+64+24)

[1] 0.1975309

b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code

(128+434)/(128+434+160+64+24)

[1] 0.6938272

c) What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code

(128+434+160)/(128+434+160+64+24)

[1] 0.891358

d) What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code

(64+24)/(128+434+160+64+24)

[1] 0.108642
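The four probabilities in parts a) through d) can also be read off the frequency table without retyping the counts. The following is a minimal sketch of that approach; the variable names (`p`, `p_exactly_2`, and so on) are illustrative and not part of the original solution.

```r
# Frequency table from Question 2
convictions <- c(0, 1, 2, 3, 4)
frequency   <- c(128, 434, 160, 64, 24)

# Relative frequencies act as the probability mass function of X
p <- frequency / sum(frequency)

p_exactly_2   <- p[convictions == 2]        # part a: P(X = 2)
p_fewer_2     <- sum(p[convictions < 2])    # part b: P(X < 2)
p_at_most_2   <- sum(p[convictions <= 2])   # part c: P(X <= 2)
p_more_than_2 <- sum(p[convictions > 2])    # part d: P(X > 2)

round(c(a = p_exactly_2, b = p_fewer_2, c = p_at_most_2, d = p_more_than_2), 4)
```

Parts c) and d) are complements, so their probabilities should sum to 1, which gives a quick sanity check.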

e) What is the expected value for the number of prior convictions?

Code

# Expected value: each number of convictions weighted by its relative frequency
sum(convictionsfreq$convictions * convictionsfreq$frequency) / sum(convictionsfreq$frequency)

[1] 1.28642
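Written out as a formula, the expected value is the probability-weighted sum of the possible values of X, with each probability taken as the relative frequency in the table:

```latex
E[X] = \sum_{x=0}^{4} x \, P(X = x)
     = \frac{0(128) + 1(434) + 2(160) + 3(64) + 4(24)}{810}
     = \frac{1042}{810}
     \approx 1.286
```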

f) Calculate the variance and the standard deviation for the Prior Convictions.

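Code

# var() on the convictions column alone would ignore the frequencies,
# so weight each value by its probability: Var(X) = E[X^2] - (E[X])^2
p <- convictionsfreq$frequency / sum(convictionsfreq$frequency)
v <- sum(convictionsfreq$convictions^2 * p) - sum(convictionsfreq$convictions * p)^2
v

[1] 0.8562353

Code

sqrt(v)

[1] 0.9253298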
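The `convictions` column lists each value only once, so a variance that reflects all 810 inmates has to take the frequencies into account. One way to check a frequency-weighted result is to expand the table into one entry per inmate. The sketch below assumes the 810 inmates are treated as the whole population, so the `n - 1` denominator used by `var()` is rescaled to `n`; the names `x`, `pop_var`, and `pop_sd` are illustrative.

```r
# Frequency table from Question 2
convictions <- c(0, 1, 2, 3, 4)
frequency   <- c(128, 434, 160, 64, 24)

# One entry per inmate: 128 zeros, 434 ones, and so on
x <- rep(convictions, times = frequency)
n <- length(x)

# var() uses the n - 1 (sample) denominator; rescale for the population version
pop_var <- var(x) * (n - 1) / n
pop_sd  <- sqrt(pop_var)

c(n = n, mean = mean(x), pop_var = pop_var, pop_sd = pop_sd)
```

If the inmates were instead regarded as a sample from a larger population, `var(x)` and `sd(x)` on the expanded vector could be used directly.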
