hw1
descriptive statistics
probability
The first homework on descriptive statistics and probability
Author

Ethan Campbell

Published

September 21, 2022

Question 1

First, let’s read in the data from the Excel file:

Code
library(readxl)
Code
library(tidyverse)
Code
library(dplyr)
df <- read_excel("_data/LungCapData.xls")

1.a

(The distribution of LungCap looks as follows:)

The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).

Code
head(df)
# A tibble: 6 x 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
1    6.48     6   62.1 no    male   no       
2   10.1     18   74.7 yes   female no       
3    9.55    16   69.7 no    female yes      
4   11.1     14   71   no    male   no       
5    4.8      5   56.9 no    male   no       
6    6.22    11   58.7 no    female no       
Code
hist(df$LungCap)
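
One way to back up the visual impression of normality with numbers is to check how much of the data falls within one and two standard deviations of the mean; for a normal distribution these shares are roughly 68% and 95%. The sketch below runs on simulated values, since the real data are not reproduced here: the vector `lung_cap` and its mean and standard deviation are assumptions chosen for illustration, not figures from LungCapData.xls.

```r
# Hypothetical stand-in for df$LungCap: simulated values, not the real data.
set.seed(1)
lung_cap <- rnorm(500, mean = 8, sd = 2.5)

# Standardize, then measure the share of observations near the mean
z <- (lung_cap - mean(lung_cap)) / sd(lung_cap)
within_1sd <- mean(abs(z) <= 1)  # share within one SD of the mean
within_2sd <- mean(abs(z) <= 2)  # share within two SDs of the mean

c(within_1sd = within_1sd, within_2sd = within_2sd)
```

Replacing `lung_cap` with `df$LungCap` would apply the same check to the actual sample.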

1.b

(Comparing lung cap by gender)

Males tend to have a higher lung capacity than females. In the boxplot, the female median sits around 8, while the male median is closer to 9.

Code
boxplot(LungCap ~ Gender, data = df)

1.c

(smoker vs non-smoker lung cap)

Interestingly, non-smokers have a lower mean lung capacity (7.77) than smokers (8.65). This does not make sense at first glance and goes against my expectation. I believe it might be due to age.

Code
df %>%
  group_by(Smoke) %>%
  summarize(mean = mean(LungCap))
# A tibble: 2 x 2
  Smoke  mean
  <chr> <dbl>
1 no     7.77
2 yes    8.65

1.d

(relation between age and lung cap across age groups)

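Mean lung capacity is about 9.63 for everyone aged 13 and over, 9.05 for ages 14 to 15, 10.25 for ages 16 to 17, and 11.26 for those over 18. The first filter (Age >= 13) overlaps the older groups, so the apparent dip from 9.63 to 9.05 reflects the grouping rather than a real drop. Across the three non-overlapping groups, lung capacity rises steadily with age, and I believe it keeps rising until it reaches a certain point.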

Code
# Ages 13 and over (overlaps the groups below): mean LungCap is about 9.63
df %>%
  select(Age, LungCap) %>%
  filter(Age >= 13) %>%
  colMeans()
      Age   LungCap 
15.609290  9.628757 
Code
# Ages 14 to 15: mean LungCap is about 9.05
df %>%
  select(Age, LungCap) %>%
  filter(Age >= 14 & Age <= 15) %>%
  colMeans()
      Age   LungCap 
14.533333  9.045417 
Code
# Ages 16 to 17: mean LungCap is about 10.25
df %>%
  select(Age, LungCap) %>%
  filter(Age >= 16 & Age <= 17) %>%
  colMeans()
     Age  LungCap 
16.44330 10.24588 
Code
# Over 18 (Age > 18 excludes 18-year-olds): mean LungCap is about 11.26
df %>%
  select(Age, LungCap) %>%
  filter(Age > 18) %>%
  colMeans()
     Age  LungCap 
19.00000 11.26149 
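
Separate filters make it easy to end up with overlapping or missing age ranges. A minimal sketch of an alternative is to assign every row to exactly one group with `cut()` and then summarize once. The toy data frame and the break points (13 or under, 14-15, 16-17, 18 and over) are assumptions for illustration, not values or groupings confirmed by the assignment.

```r
# Toy data for illustration only; these are not values from LungCapData.xls.
toy <- data.frame(
  Age     = c(10, 13, 14, 15, 16, 17, 18, 19),
  LungCap = c(6.0, 7.5, 8.5, 9.0, 10.0, 10.5, 11.0, 11.5)
)

# cut() uses right-closed intervals by default:
# (-Inf, 13], (13, 15], (15, 17], (17, Inf)
toy$AgeGroup <- cut(
  toy$Age,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("13 or under", "14-15", "16-17", "18 and over")
)

# Mean lung capacity per non-overlapping age group
group_means <- tapply(toy$LungCap, toy$AgeGroup, mean)
group_means
```

Because each observation lands in exactly one interval, the group means can be compared directly without one group's rows leaking into another.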

1.e

(lung cap for smokers and non smokers broken into age groups)

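Within each age group, smokers have a lower mean lung capacity than non-smokers (9.21 vs. 9.71 for ages 13 and over, 8.39 vs. 9.14 for ages 14 to 15, and 9.38 vs. 10.5 for ages 16 to 17), except among those over 18, where both means round to 11.3. This is the opposite of the pooled comparison in 1.c, which is consistent with age driving that result.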

Code
df %>%
  select(Age, LungCap, Smoke) %>%
  group_by(Smoke) %>%
  filter(Age >= 13) %>%
  summarize(mean = mean(LungCap))
# A tibble: 2 x 2
  Smoke  mean
  <chr> <dbl>
1 no     9.71
2 yes    9.21
Code
df %>%
  select(Age, LungCap, Smoke) %>%
  group_by(Smoke) %>%
  filter(Age >= 14 & Age <= 15) %>%
  summarize(mean = mean(LungCap))
# A tibble: 2 x 2
  Smoke  mean
  <chr> <dbl>
1 no     9.14
2 yes    8.39
Code
df %>%
  select(Age, LungCap, Smoke) %>%
  group_by(Smoke) %>%
  filter(Age >= 16 & Age <= 17) %>%
  summarize(mean = mean(LungCap))
# A tibble: 2 x 2
  Smoke  mean
  <chr> <dbl>
1 no    10.5 
2 yes    9.38
Code
df %>%
  select(Age, LungCap, Smoke) %>%
  group_by(Smoke) %>%
  filter(Age > 18) %>%
  summarize(mean = mean(LungCap))
# A tibble: 2 x 2
  Smoke  mean
  <chr> <dbl>
1 no     11.3
2 yes    11.3
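
A pooled comparison and a within-group comparison can point in opposite directions when group membership is related to both variables. The sketch below shows the mechanism on a small made-up data set in which smokers are concentrated in the older group. All values are hypothetical and chosen only to make the effect visible.

```r
# Toy data for illustration only (hypothetical values, not the real sample):
# smokers are concentrated in the older group, where lung capacity is higher.
toy <- data.frame(
  AgeGroup = c(rep("young", 5), rep("old", 5)),
  Smoke    = c("no", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"),
  LungCap  = c(6, 6, 6, 6, 5, 11, 10, 10, 10, 10)
)

# Pooled comparison: ignores age
pooled <- tapply(toy$LungCap, toy$Smoke, mean)

# Stratified comparison: one mean per age group and smoking status
by_age <- tapply(toy$LungCap, list(toy$AgeGroup, toy$Smoke), mean)

pooled
by_age
```

In this toy example the pooled mean is higher for smokers, yet within each age group the smokers' mean is lower, which is the same pattern as the difference between 1.c and 1.e.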

1.f

(correlation and covariance between lung capacity and age)

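The correlation is about 0.82, which indicates a strong positive linear relationship: lung capacity tends to increase with age. The covariance (8.74) is also positive, but its size depends on the units of the two variables, so the correlation is easier to interpret. Note that a correlation is a unitless value between -1 and 1, not a percentage.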

Code
cov(df$LungCap, df$Age)
[1] 8.738289
Code
cor(df$LungCap, df$Age)
[1] 0.8196749
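
The link between the two numbers is that the correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on toy vectors (hypothetical values, not the real data) and shows that rescaling a variable changes the covariance but leaves the correlation untouched.

```r
# Toy vectors for illustration only (hypothetical values).
age      <- c(6, 9, 11, 14, 16, 18)
lung_cap <- c(5.1, 6.4, 7.0, 9.2, 10.1, 10.8)

# Correlation is covariance divided by the product of the standard deviations
r_manual <- cov(lung_cap, age) / (sd(lung_cap) * sd(age))
r_manual
cor(lung_cap, age)  # matches r_manual up to floating-point error

# Rescaling age from years to months multiplies the covariance by 12 ...
cov_ratio <- cov(lung_cap, age * 12) / cov(lung_cap, age)
cov_ratio

# ... but leaves the correlation unchanged
cor(lung_cap, age * 12)
```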

Question 2

Code
# creating the Tibble
df <- tibble(X=c(0,1,2,3,4), Freq=c(128,434,160,64,26))

# Creating the probability of an event occurring
df1 <- df %>%
  select(X, Freq) %>%
  mutate(Probability = Freq/sum(Freq))
df1
# A tibble: 5 x 3
      X  Freq Probability
  <dbl> <dbl>       <dbl>
1     0   128      0.158 
2     1   434      0.534 
3     2   160      0.197 
4     3    64      0.0788
5     4    26      0.0320

2.a

Probability of exactly 2 convictions: 19.7%

Code
df1 %>%
  select(X, Freq, Probability) %>%
  filter(X == 2)
# A tibble: 1 x 3
      X  Freq Probability
  <dbl> <dbl>       <dbl>
1     2   160       0.197

2.b

Probability of fewer than 2 convictions: 69.2%

Code
sum(df1$Probability[1:2])
[1] 0.6921182

2.c

Probability of 2 or fewer convictions: 88.9%

Code
sum(df1$Probability[1:3])
[1] 0.8891626

2.d

Probability of more than 2 convictions: 11.1%

Code
sum(df1$Probability[4:5])
[1] 0.1108374
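
The answers to 2.b through 2.d can also be read off a single cumulative distribution instead of summing by row position. This is a minimal sketch using the frequency table from this question; the variable names are chosen for illustration.

```r
# Frequency table from Question 2
x    <- 0:4
freq <- c(128, 434, 160, 64, 26)

p   <- freq / sum(freq)  # P(X = x)
cdf <- cumsum(p)         # P(X <= x)
names(cdf) <- x          # label each entry by its value of X

p_fewer_than_2 <- cdf[["1"]]      # P(X < 2) = P(X <= 1)
p_2_or_fewer   <- cdf[["2"]]      # P(X <= 2)
p_more_than_2  <- 1 - cdf[["2"]]  # P(X > 2), by the complement rule

c(p_fewer_than_2, p_2_or_fewer, p_more_than_2)
```

Indexing by the value of X rather than by row position avoids off-by-one mistakes, because X starts at 0 while R vectors start at 1.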

2.e

The expected value is 1.29 prior convictions.

Code
df1 %>%
  select(X, Freq, Probability) %>%
  mutate(expected_value = sum(X * Probability))
# A tibble: 5 x 4
      X  Freq Probability expected_value
  <dbl> <dbl>       <dbl>          <dbl>
1     0   128      0.158            1.29
2     1   434      0.534            1.29
3     2   160      0.197            1.29
4     3    64      0.0788           1.29
5     4    26      0.0320           1.29
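
The expected value is the probability-weighted sum of the outcomes, which is the same thing as a mean of X weighted by the frequencies. The sketch below computes it both ways from the frequency table in this question; the variable names are chosen for illustration.

```r
# Frequency table from Question 2
x    <- 0:4
freq <- c(128, 434, 160, 64, 26)

# E[X] = sum over x of x * P(X = x)
ev_from_probs <- sum(x * freq / sum(freq))

# The same quantity as a frequency-weighted mean
ev_weighted <- weighted.mean(x, w = freq)

c(ev_from_probs, ev_weighted)
```

Computing from the table directly avoids retyping rounded probabilities by hand.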

2.f

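The variance of the number of prior convictions is about 0.87 and the standard deviation is about 0.93. Both come from the probability distribution, using Var(X) = E[X^2] - (E[X])^2. Note that var(df$Freq) would not answer this question: it returns the variance of the five frequency counts, not of X.

Code
variance <- sum(df1$X^2 * df1$Probability) - sum(df1$X * df1$Probability)^2
variance
[1] 0.8722184
Code
sqrt(variance)
[1] 0.9339264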
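
As a cross-check on the variance of the number of prior convictions, the frequency table can be expanded into one entry per person and the variance computed on that raw vector. This is a sketch using the frequencies from this question; it treats the table as the full distribution (dividing by n), which is an assumption about what the question intends. R's `var()` divides by n - 1 instead, so it comes out slightly larger.

```r
# Frequency table from Question 2
x    <- 0:4
freq <- c(128, 434, 160, 64, 26)
p    <- freq / sum(freq)

# Variance from the probability distribution: E[X^2] - (E[X])^2
var_dist <- sum(x^2 * p) - sum(x * p)^2

# Same quantity from the expanded raw data (one entry per person)
raw     <- rep(x, times = freq)
n       <- length(raw)
var_pop <- mean((raw - mean(raw))^2)  # divides by n

# var() divides by n - 1, so it is slightly larger than var_pop
var_sample <- var(raw)

c(var_dist = var_dist, var_pop = var_pop, var_sample = var_sample)
sqrt(var_dist)  # standard deviation of X
```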