hw1
descriptive statistics
probability
The first homework on descriptive statistics and probability
Author

Steve O’Neill

Published

September 20, 2022

Question 1: Lung Capacity

This exercise focuses on lung capacity data (LungCapData.xls). First, the data must be “read into” R:

Code
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
df <- read_excel("_data/LungCapData.xls")
df
# A tibble: 725 × 6
   LungCap   Age Height Smoke Gender Caesarean
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
 1    6.48     6   62.1 no    male   no       
 2   10.1     18   74.7 yes   female no       
 3    9.55    16   69.7 no    female yes      
 4   11.1     14   71   no    male   no       
 5    4.8      5   56.9 no    male   no       
 6    6.22    11   58.7 no    female no       
 7    4.95     8   63.3 no    male   yes      
 8    7.32    11   70.4 no    male   no       
 9    8.88    15   70.5 no    male   no       
10    6.8     11   59.2 no    male   no       
# … with 715 more rows
# ℹ Use `print(n = ...)` to see more rows

1a. What does the distribution look like?

Here is the distribution of LungCap with probability density on the y-axis (instead of frequency, the default):

Code
hist(df$LungCap, freq = FALSE)

The distribution appears to be approximately normal.
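One way to check the impression of normality is to overlay a fitted normal curve on the density histogram and to draw a normal Q-Q plot. The sketch below uses simulated values as a stand-in for df$LungCap, so it runs without the spreadsheet. The name `lung`, the mean of 8, and the standard deviation of 2.5 are made-up assumptions, not estimates from the data.

```r
# Simulated stand-in for df$LungCap; the mean and sd here are assumptions
set.seed(1)
lung <- rnorm(725, mean = 8, sd = 2.5)

# Estimate the parameters first, because curve() uses the name `x` internally
m <- mean(lung)
s <- sd(lung)

# Density histogram with the fitted normal curve drawn on top
hist(lung, freq = FALSE)
curve(dnorm(x, mean = m, sd = s), add = TRUE)

# Points that stay close to the reference line suggest approximate normality
qqnorm(lung)
qqline(lung)
```

With the real data, replacing `lung` with df$LungCap should be the only change needed.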

1b. Comparing Males and Females

Males seem to have a higher LungCap in general:

Code
boxplot(df$LungCap ~ df$Gender)

1c. Smokers

Surprisingly, smokers in this dataset have a higher median lung capacity than non-smokers. This is not what I expected, so there may be other factors at play.

Code
boxplot(df$LungCap ~ df$Smoke)

1d. & 1e. Smoking, Lung Capacity, and Age

I am approaching this as a compound question. I think a clustered bar chart does the best job of representing the relationship between smoking, lung capacity, and age:

Code
grouped_df <- df %>% 
  mutate(group = case_when(
    between(Age, 0, 13) ~ "age_13_or_under",
    between(Age, 14, 15) ~ "age_14_to_15",
    between(Age, 16, 17) ~ "age_16_to_17",
    Age >= 18 ~ "age_18_or_older",
    TRUE ~ NA_character_
  ))

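# Note: with stat = "identity", one bar is drawn per row and bars for the same
# age group and smoking status overlap, so the visible height is the largest
# LungCap in that cell, not the mean
grouped_df %>% ggplot(aes(fill=Smoke, y=LungCap, x=group)) + 
    geom_bar(position="dodge", stat="identity")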

As the chart shows, older individuals tend to have higher lung capacities whether or not they smoke. Age seems to be a stronger determinant of lung capacity than smoking status.
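Because the chart above uses stat = "identity", every row gets its own bar, and bars that share an age group and smoking status are drawn on top of each other. The visible height is therefore effectively the largest LungCap in each cell rather than a typical value. A sketch of a mean-based alternative in base R follows. The `toy` data frame and its values are invented purely for illustration and stand in for grouped_df.

```r
# Invented stand-in for grouped_df (values are illustrative only)
toy <- data.frame(
  group   = c("age_13_or_under", "age_13_or_under", "age_13_or_under",
              "age_14_to_15", "age_14_to_15", "age_14_to_15"),
  Smoke   = c("no", "no", "yes", "no", "yes", "yes"),
  LungCap = c(6, 8, 5, 10, 8, 10)
)

# Mean lung capacity for each smoking status (rows) and age group (columns)
means <- tapply(toy$LungCap, list(toy$Smoke, toy$group), mean)
means

# Clustered bar chart of the means: one cluster of bars per age group
barplot(means, beside = TRUE, legend.text = TRUE, ylab = "Mean LungCap")
```

Plotting a summary such as the mean makes the bar heights comparable across groups of very different sizes.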

Here is a 100% stacked bar chart, which shows that the 18-and-older group also has a larger share of smokers than the other cohorts (as expected):

Code
grouped_df %>% ggplot(aes(x=group, fill = Smoke)) +
  geom_bar(position = "fill")

Because the smokers in this sample tend to be older than the non-smokers, they naturally tend to have larger lung capacities.
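The shares drawn by position = "fill" can also be read as numbers by tabulating smoking status within each age group. A small sketch, with an invented `toy` data frame standing in for grouped_df (the counts are illustrative only):

```r
# Invented stand-in for grouped_df (counts are illustrative only)
toy <- data.frame(
  group = c("age_13_or_under", "age_13_or_under", "age_13_or_under",
            "age_13_or_under", "age_16_to_17", "age_16_to_17"),
  Smoke = c("no", "no", "no", "yes", "no", "yes")
)

# Counts of smokers and non-smokers in each age group
counts <- table(toy$group, toy$Smoke)

# Row-wise proportions: each age group sums to 1, like a 100% stacked bar
shares <- prop.table(counts, margin = 1)
shares
```

Each row of `shares` corresponds to one bar of the 100% stacked chart.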

Perhaps more importantly, this dataset is weighted heavily toward young people, so smaller lung capacities make up much more of the overall sample:

Code
grouped_df %>% ggplot(aes(x=group, fill = Smoke)) +
  geom_bar()

1f. Correlation & Covariance

Code
cor(grouped_df$LungCap, grouped_df$Age)
[1] 0.8196749
Code
cov(grouped_df$LungCap, grouped_df$Age)
[1] 8.738289

In this case, Pearson’s correlation coefficient is about 0.82. A correlation is generally considered strong when the absolute value of r exceeds 0.7, so age and lung capacity are strongly correlated.

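Their covariance is also positive, which means the two variables tend to move together: lung capacity tends to increase with age. Unlike the correlation, the size of the covariance depends on the units of measurement, so only its sign is easy to interpret. Neither statistic shows causation on its own.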
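Correlation is covariance rescaled by the two standard deviations, r = cov(x, y) / (sd(x) * sd(y)), which is why the two always share a sign while only r is bounded between -1 and 1. A quick check on made-up vectors (the `age` and `lung` values below are hypothetical, not taken from the dataset):

```r
# Hypothetical values for illustration only
age  <- c(5, 8, 11, 14, 16, 18)
lung <- c(4.8, 5.0, 6.8, 11.1, 9.6, 10.1)

# Built-in Pearson correlation versus the rescaled covariance
r_builtin <- cor(age, lung)
r_manual  <- cov(age, lung) / (sd(age) * sd(lung))
all.equal(r_builtin, r_manual)

# Measuring age in months multiplies the covariance by 12
# but leaves the correlation unchanged
cov_ratio <- cov(age * 12, lung) / cov(age, lung)
r_months  <- cor(age * 12, lung)
```

This is the reason the correlation, not the covariance, is used to judge the strength of the relationship.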

Question 2

I will make a dataframe from the values provided:

Code
prior_convictions=c(0,1,2,3,4)
freq=c(128, 434, 160, 64, 24)
prisondata <- data.frame(prior_convictions, freq)
prisondata
  prior_convictions freq
1                 0  128
2                 1  434
3                 2  160
4                 3   64
5                 4   24

And add a probability column:

Code
prison_prob <- prisondata %>% mutate(prob = freq/sum(freq))
prison_prob
  prior_convictions freq       prob
1                 0  128 0.15802469
2                 1  434 0.53580247
3                 2  160 0.19753086
4                 3   64 0.07901235
5                 4   24 0.02962963

2a.

What is the probability that a randomly selected inmate has exactly 2 prior convictions?

From the table above, the probability is 0.19753086, nearly 20 percent.

2b.

What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code
head(prison_prob,2) %>% summarise(sum(prob))
  sum(prob)
1 0.6938272

The probability that a randomly selected inmate has fewer than 2 prior convictions is ~69%.

2c.

What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code
head(prison_prob,3) %>% summarise(sum(prob))
  sum(prob)
1  0.891358

The probability that a randomly selected inmate has 2 or fewer prior convictions is ~89%.

2d.

What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code
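tail(prison_prob,2) %>% summarise(sum(prob))
  sum(prob)
1  0.108642

The probability that a randomly selected inmate has more than 2 prior convictions is ~10.9%. Only the rows for 3 and 4 prior convictions count here, so this is also 1 minus the ~89% found in 2c.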
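Selecting rows by position with head() and tail() works only while the table stays sorted and the row counts are chosen correctly. A sketch of the same calculations that filters on the value of prior_convictions instead, using the counts from the table above (the names `p_fewer_than_2`, `p_at_most_2`, and `p_more_than_2` are introduced here for illustration):

```r
# Counts from the table above
prior_convictions <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)
prob <- freq / sum(freq)

# Select probabilities by the value of prior_convictions, not by row position
p_fewer_than_2 <- sum(prob[prior_convictions < 2])
p_at_most_2    <- sum(prob[prior_convictions <= 2])
p_more_than_2  <- sum(prob[prior_convictions > 2])

# Complement rule: P(X > 2) equals 1 - P(X <= 2)
all.equal(p_more_than_2, 1 - p_at_most_2)
```

Writing the condition explicitly makes it easy to see whether the boundary value 2 is included.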

2e.

What is the expected value of the number of prior convictions?

Code
sum(prison_prob$prior_convictions*prison_prob$prob)
[1] 1.28642
Code
#Or another way,

weighted.mean(prison_prob$prior_convictions,prison_prob$prob)
[1] 1.28642

The expected value of the number of prior convictions is 1.28642, or about 1.29.

2f.

Code
prison_prob
  prior_convictions freq       prob
1                 0  128 0.15802469
2                 1  434 0.53580247
3                 2  160 0.19753086
4                 3   64 0.07901235
5                 4   24 0.02962963
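Code
# Expected value of the number of prior convictions
ev <- sum(prison_prob$prior_convictions * prison_prob$prob)
# Variance: probability-weighted squared deviations from the expected value
variance <- sum((prison_prob$prior_convictions - ev)^2 * prison_prob$prob)
round(variance, 4)
[1] 0.8562
Code
round(sqrt(variance), 4)
[1] 0.9253

Treating the table as the full probability distribution, the variance of the number of prior convictions is about 0.856 and the standard deviation is about 0.925.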
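As a cross-check, the table can be expanded into one value per inmate with rep(), after which the usual summary functions apply to the number of prior convictions itself. This sketch assumes the 810 inmates are the whole population of interest. Note that R's var() and sd() divide by n - 1, so they return sample estimates that are slightly larger than the population values.

```r
# Counts from the table above
prior_convictions <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)

# One entry per inmate: 128 zeros, 434 ones, and so on
x <- rep(prior_convictions, times = freq)
n <- length(x)  # 810 inmates in total

# Population variance and standard deviation (divide by n)
pop_var <- mean((x - mean(x))^2)
pop_sd  <- sqrt(pop_var)

# var() divides by n - 1, so it differs by a factor of n / (n - 1)
sample_var <- var(x)
all.equal(sample_var, pop_var * n / (n - 1))
```

With 810 observations the two versions differ only in the third decimal place, but the distinction matters for small tables.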