Homework 1

hw1

descriptive statistics

probability

Author

Ken Docekal

Published

October 3, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Question 1

a

Read in the data from the Excel file:

Code

library(readr)
library(readxl)

LungCapData <- read_excel("_data/LungCapData.xls")
View(LungCapData)

The distribution of LungCap looks as follows:

Code

hist(LungCapData$LungCap)

b

The distribution of LungCap for males and females, compared in a box plot:

Code

boxplot(LungCapData$LungCap ~ LungCapData$Gender)

c

Lung capacities for smokers and non-smokers, mean and standard deviation:

Code

LungCapData %>% 
  group_by(Smoke) %>% 
  summarise(mean = mean(LungCap, na.rm = TRUE), sd = sd(LungCap, na.rm = TRUE))

# A tibble: 2 × 3
  Smoke  mean    sd
  <chr> <dbl> <dbl>
1 no     7.77  2.73
2 yes    8.65  1.88

The results suggest that smokers have greater lung capacity than non-smokers. This is counterintuitive and may indicate that a factor other than smoking, such as age, is influencing lung capacity.

d

The relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”:

age 13 and lower:

Code

LungCapData %>% 
  group_by(Smoke) %>% 
  dplyr::filter(Age <= 13) %>% 
  summarise(mean = mean(LungCap, na.rm = TRUE), sd = sd(LungCap, na.rm = TRUE))

# A tibble: 2 × 3
  Smoke  mean    sd
  <chr> <dbl> <dbl>
1 no     6.36  2.21
2 yes    7.20  1.58

age 14 to 15:

Code

LungCapData %>% 
  group_by(Smoke) %>% 
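  dplyr::filter(Age %in% 14:15) %>% 
  summarise(mean = mean(LungCap, na.rm = TRUE), sd = sd(LungCap, na.rm = TRUE))

The filter uses %in% because Age == 14:15 recycles c(14, 15) down the column and compares row by row, which raises a "longer object length is not a multiple of shorter object length" warning and keeps only part of the age group. The table below was produced with the == comparison, so its values should be regenerated with the filter above.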

# A tibble: 2 × 3
  Smoke  mean    sd
  <chr> <dbl> <dbl>
1 no     8.84 1.36 
2 yes    8.91 0.865

age 16 to 17:

Code

LungCapData %>% 
  group_by(Smoke) %>% 
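  dplyr::filter(Age %in% 16:17) %>% 
  summarise(mean = mean(LungCap, na.rm = TRUE), sd = sd(LungCap, na.rm = TRUE))

The filter uses %in% because Age == 16:17 recycles c(16, 17) row by row and keeps only part of the age group. The table below was produced with the == comparison, so its values should be regenerated with the filter above.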

# A tibble: 2 × 3
  Smoke  mean    sd
  <chr> <dbl> <dbl>
1 no    10.4   1.73
2 yes    9.60  1.41

age 18 and over:

Code

LungCapData %>% 
  group_by(Smoke) %>% 
  dplyr::filter(Age >= 18) %>% 
  summarise(mean = mean(LungCap, na.rm = TRUE), sd = sd(LungCap, na.rm = TRUE))

# A tibble: 2 × 3
  Smoke  mean    sd
  <chr> <dbl> <dbl>
1 no     11.1  1.56
2 yes    10.5  1.25
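Range filters such as the 14 to 15 and 16 to 17 groups are easy to get wrong with ==, because R recycles the shorter vector instead of testing membership. The sketch below uses a made-up age vector (not the LungCap data) to show how the two comparisons differ:

```r
# Made-up ages for illustration only (not the LungCap data)
age <- c(14, 15, 15, 14, 16, 14)

# `==` recycles c(14, 15) along `age`, pairing elements by position:
# 14 vs 14, 15 vs 15, 15 vs 14, 14 vs 15, 16 vs 14, 14 vs 15
by_equals <- age == 14:15

# `%in%` asks, for every element, whether it is one of 14 or 15
by_in <- age %in% 14:15

sum(by_equals)  # rows kept by the recycled comparison
sum(by_in)      # rows that actually fall in the 14-15 range
```

Here the length of age happens to be a multiple of 2, so R gives no warning at all even though the == comparison misses most of the matching rows.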

e

Comparing mean lung capacity for smokers and non-smokers within age groups shows capacity rising steadily with age. In the two youngest groups non-smokers have the lower mean capacity, although the gap narrows with age. From age 16 onwards the pattern reverses and non-smokers have the higher mean. In every age group non-smokers also show a larger standard deviation than smokers, with the largest (2.21) among non-smokers aged 13 and under. The overall result in part c most likely mirrors the youngest group because respondents aged 13 and under make up the largest share of the sample.
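The four age-group summaries, along with the group sizes that the explanation above relies on, could also be produced in a single pass. This is a minimal sketch rather than the code used for the results above: the toy data frame is hypothetical, the AgeGroup column name is an assumption, and the cut points assume Age is recorded in whole years.

```r
library(dplyr)

# Hypothetical toy data standing in for LungCapData (values are made up)
toy <- data.frame(
  Age     = c(10, 12, 13, 14, 15, 15, 16, 17, 18, 19),
  Smoke   = c("no", "no", "yes", "no", "yes", "no", "yes", "no", "yes", "no"),
  LungCap = c(5.1, 6.0, 7.2, 8.5, 8.9, 9.0, 9.6, 10.4, 10.5, 11.1)
)

age_summary <- toy %>%
  # cut() uses right-closed intervals: (-Inf,13], (13,15], (15,17], (17,Inf)
  mutate(AgeGroup = cut(Age,
                        breaks = c(-Inf, 13, 15, 17, Inf),
                        labels = c("<=13", "14-15", "16-17", ">=18"))) %>%
  group_by(AgeGroup, Smoke) %>%
  # n shows how many respondents sit behind each mean and sd
  summarise(n = n(),
            mean = mean(LungCap, na.rm = TRUE),
            sd = sd(LungCap, na.rm = TRUE),
            .groups = "drop")

age_summary
```

Replacing toy with LungCapData would give one table with all eight age-by-smoking cells, and the n column would show directly whether the youngest group dominates the sample.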

f

Covariance between lung capacity and age:

Code

cov(LungCapData$Age,LungCapData$LungCap)

[1] 8.738289

The covariance is positive, which indicates that lung capacity tends to increase as age increases.

Correlation between lung capacity and age:

Code

cor(LungCapData$Age,LungCapData$LungCap)

[1] 0.8196749

The correlation coefficient is also positive, so like the covariance it indicates a positive relationship between age and lung capacity. Because 0.82 is close to 1, the value that would indicate a perfect positive linear relationship, the relationship is strong: older respondents tend to have higher lung capacity and younger respondents tend to have lower lung capacity.
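The link between the two measures can be checked directly: the correlation is the covariance divided by the product of the two standard deviations, which is why it is bounded by 1 and does not depend on units. A small sketch with made-up values (not the LungCap data):

```r
# Made-up values for illustration only (not the LungCap data)
age      <- c(5, 8, 11, 14, 17)
lung_cap <- c(3.9, 6.1, 7.4, 9.8, 10.6)

cov_xy <- cov(age, lung_cap)
cor_xy <- cor(age, lung_cap)

# Correlation is covariance rescaled by both standard deviations
same_as_formula <- all.equal(cor_xy, cov_xy / (sd(age) * sd(lung_cap)))

# Measuring age in months multiplies the covariance by 12 ...
cov_scales <- all.equal(cov(age * 12, lung_cap), 12 * cov_xy)

# ... but leaves the correlation unchanged
cor_unchanged <- all.equal(cor(age * 12, lung_cap), cor_xy)

c(same_as_formula, cov_scales, cor_unchanged)
```

This is why the size of the covariance on its own says little about strength, while the correlation can be read against the fixed scale from -1 to 1.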

Question 2

a

The probability that a randomly selected inmate has exactly 2 prior convictions:

Create data frame:

Code

convictions<- c(0,1,2,3,4)
prisoners<- c(128, 434, 160, 64, 24)

df <- data.frame(convictions, prisoners)

tibble(df)

# A tibble: 5 × 2
  convictions prisoners
        <dbl>     <dbl>
1           0       128
2           1       434
3           2       160
4           3        64
5           4        24

Probability of exactly 2 prior convictions:

Code

160/sum(prisoners)

[1] 0.1975309

b

Probability of fewer than 2 prior convictions (total # of prisoners with fewer than 2 prior convictions = 128 + 434 = 562):

Code

562/sum(prisoners)

[1] 0.6938272

c

Probability of 2 or fewer prior convictions (total # of prisoners with 2 or fewer prior convictions = 722):

Code

722/sum(prisoners)

[1] 0.891358

d

Probability of more than 2 prior convictions (total # of prisoners with more than 2 prior convictions = 88):

Code

88/sum(prisoners)

[1] 0.108642
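The counts 562, 722 and 88 used above were added up by hand. As a sketch of an alternative, the same probabilities can be computed from the convictions and prisoners vectors by logical subsetting, which avoids arithmetic slips; the p_* variable names are introduced here for illustration.

```r
convictions <- c(0, 1, 2, 3, 4)
prisoners   <- c(128, 434, 160, 64, 24)
total       <- sum(prisoners)

# Select the groups that satisfy each condition, then divide by the total
p_exactly_2   <- sum(prisoners[convictions == 2]) / total
p_fewer_2     <- sum(prisoners[convictions <  2]) / total
p_at_most_2   <- sum(prisoners[convictions <= 2]) / total
p_more_than_2 <- sum(prisoners[convictions >  2]) / total

# "2 or fewer" and "more than 2" are complements, so they should sum to 1
complement_ok <- all.equal(p_at_most_2 + p_more_than_2, 1)
```

The complement check is a quick sanity test that parts c and d are consistent with each other.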

e

The expected value for the number of prior convictions (using the probability of observing each prisoner prior conviction group):

Code

# Exact probabilities from the counts; hand-rounded values shift the result slightly
pprob <- prisoners / sum(prisoners)

sum(convictions * pprob)

[1] 1.28642
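As a cross-check on the expected value, the probability-weighted sum can be compared with base R's weighted.mean(), using the inmate counts as weights. A minimal sketch; the ev_* names are introduced here for illustration.

```r
convictions <- c(0, 1, 2, 3, 4)
prisoners   <- c(128, 434, 160, 64, 24)

# Expected value as a probability-weighted sum: sum of x * P(x)
ev_sum <- sum(convictions * prisoners / sum(prisoners))

# Same quantity via weighted.mean(), with the counts as weights
ev_wm <- weighted.mean(convictions, w = prisoners)

all.equal(ev_sum, ev_wm)
```

Both forms work from the unrounded counts, so they avoid the small error that rounding each probability to three decimals introduces.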

f

Variance and standard deviation for prior convictions:

Code

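# var(prisoners) would measure the spread of the five group sizes,
# not the spread of the number of prior convictions
p <- prisoners / sum(prisoners)   # probability of each conviction count
mu <- sum(convictions * p)        # expected value

sum((convictions - mu)^2 * p)

[1] 0.8562353

Code

sqrt(sum((convictions - mu)^2 * p))

[1] 0.9253298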
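One way to sanity-check the variance of the number of prior convictions is to expand the frequency table into one observation per inmate and compare against var(). This is a sketch under the assumption that the 810 inmates are treated as the whole population, so the variance divides by n rather than n - 1.

```r
convictions <- c(0, 1, 2, 3, 4)
prisoners   <- c(128, 434, 160, 64, 24)

# One entry per inmate: 128 zeros, 434 ones, 160 twos, and so on
x <- rep(convictions, times = prisoners)
n <- length(x)

# Population variance from the probability formula: sum of (x - mu)^2 * P(x)
p       <- prisoners / n
mu      <- sum(convictions * p)
var_pop <- sum((convictions - mu)^2 * p)

# var() divides by n - 1, so rescale it before comparing
matches <- all.equal(var_pop, var(x) * (n - 1) / n)

sqrt(var_pop)  # standard deviation of the number of prior convictions
```

With 810 inmates the difference between dividing by n and by n - 1 is small, but applying var() to the prisoners vector itself answers a different question: how much the five group sizes vary.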