Homework 1

hw1

descriptive statistics

probability

The first homework on descriptive statistics and probability

Author

Steve O’Neill

Published

September 20, 2022

Question 1: Lung Capacity

This exercise focuses on lung capacity data (LungCapData.xls). First, the data must be “read into” R:

Code

library(readxl)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

df <- read_excel("_data/LungCapData.xls")
df

# A tibble: 725 × 6
   LungCap   Age Height Smoke Gender Caesarean
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
 1    6.48     6   62.1 no    male   no       
 2   10.1     18   74.7 yes   female no       
 3    9.55    16   69.7 no    female yes      
 4   11.1     14   71   no    male   no       
 5    4.8      5   56.9 no    male   no       
 6    6.22    11   58.7 no    female no       
 7    4.95     8   63.3 no    male   yes      
 8    7.32    11   70.4 no    male   no       
 9    8.88    15   70.5 no    male   no       
10    6.8     11   59.2 no    male   no       
# … with 715 more rows
# ℹ Use `print(n = ...)` to see more rows

1a.: What does the distribution look like?

Here is the distribution of LungCap with probability density on the y-axis (instead of frequency, the default):

Code

hist(df$LungCap, freq = FALSE)

The distribution looks approximately normal: roughly symmetric, with a single peak.
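One way to check that impression is to overlay a normal curve with the sample's mean and standard deviation on the density histogram. The sketch below uses a simulated stand-in vector (`lung_cap`, generated with `rnorm`; its parameters are made up) so that it runs on its own. With the real data, `df$LungCap` would take its place.

```r
# Stand-in for df$LungCap so the sketch is self-contained
# (simulated values, not the real data).
set.seed(1)
lung_cap <- rnorm(725, mean = 8, sd = 2.5)

m <- mean(lung_cap)
s <- sd(lung_cap)

# Density-scaled histogram, matching freq = FALSE in the call above.
hist(lung_cap, freq = FALSE, main = "LungCap with fitted normal curve")

# Overlay the normal density that shares the sample mean and sd.
curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2)
```

If the bars track the curve closely, the normal approximation is reasonable; systematic gaps in one tail would suggest skew.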

1b. Comparing Males and Females

Males seem to have a higher LungCap in general:

Code

boxplot(df$LungCap ~ df$Gender)

1c. Smokers

Surprisingly, smokers in this dataset have a higher lung capacity overall than non-smokers. This is not what I expected, so other factors are probably at play.

Code

boxplot(df$LungCap ~ df$Smoke)

1d. & 1e.: Smoking, Lung Capacity, and Age

I am approaching this as a compound question. I think a clustered bar chart does the best job of representing the relationship between smoking, lung capacity, and age:

Code

grouped_df <- df %>% 
  mutate(group = case_when(
    between(Age, 0, 13) ~ "age_13_or_under",
    between(Age, 14, 15) ~ "age_14_to_15",
    between(Age, 16, 17) ~ "age_16_to_17",
    Age >= 18 ~ "age_18_to_older",
    TRUE ~ NA_character_
  ))

grouped_df %>% ggplot(aes(fill=Smoke, y=LungCap, x=group)) + 
    geom_bar(position="dodge", stat="identity")

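As we see, older individuals tend to have higher lung capacities whether or not they smoke. (Because stat="identity" draws one bar per participant and overlays them, each visible bar reaches the largest LungCap in its cell rather than the mean.) Age seems to be a stronger determinant of lung capacity than smoking status.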

Here is a 100% stacked bar chart, which shows that those 18 and older are also more likely to smoke than the other cohorts (which is expected):

Code

grouped_df %>% ggplot(aes(x=group, fill = Smoke)) +
  geom_bar(position = "fill")

Because the smokers in this sample are disproportionately older, and older participants have larger lung capacities, smokers as a group naturally show larger lung capacities than non-smokers.
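The mechanism described here, where age is mixed up with smoking status, can be sketched with a tiny invented data frame. The values in `toy` are made up purely for illustration and are not taken from LungCapData; the point is only that an overall comparison and a within-group comparison can disagree.

```r
# Toy data, invented purely to illustrate the mechanism
# (not taken from LungCapData).
toy <- data.frame(
  group   = c(rep("younger", 6), rep("older", 6)),
  Smoke   = c("no", "no", "no", "no", "no", "yes",
              "no", "yes", "yes", "yes", "yes", "yes"),
  LungCap = c(5, 6, 5, 6, 5, 4,
              11, 10, 9, 10, 9, 10)
)

# Overall comparison: smokers look better because most of them
# sit in the older group.
overall <- tapply(toy$LungCap, toy$Smoke, mean)

# Stratified comparison: within each age group the ordering can flip.
by_group <- tapply(toy$LungCap, list(toy$group, toy$Smoke), mean)

overall
by_group
```

Applying the same two `tapply()` calls to `grouped_df` would show whether the real data behave this way; the toy result says nothing about that.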

Perhaps more importantly, this dataset is heavily weighted toward young people, so smaller lung capacities make up much more of the overall sample:

Code

grouped_df %>% ggplot(aes(x=group, fill = Smoke)) +
  geom_bar()

1f.: Correlation & Covariance

Code

cor(grouped_df$LungCap, grouped_df$Age)

[1] 0.8196749

Code

cov(grouped_df$LungCap, grouped_df$Age)

[1] 8.738289

In this case, Pearson’s correlation coefficient is about 0.82. Generally, two variables are considered strongly correlated when the absolute value of r is larger than 0.7, so I will say that age and lung capacity are strongly correlated.

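Their covariance is also positive, so higher values of one tend to go along with higher values of the other: lung capacity tends to rise with age. Neither number establishes causation on its own, and the size of the covariance depends on the units of both variables, which is why the correlation is easier to interpret.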
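The link between the two numbers is that Pearson's r is the covariance divided by the product of the two standard deviations. A minimal sketch with made-up vectors (the values of `age` and `lung_cap` below are illustrative, not rows from the dataset):

```r
# Small illustrative vectors (made up; any numeric pair works).
age      <- c(6, 8, 11, 14, 15, 18)
lung_cap <- c(5.1, 5.9, 7.2, 9.8, 9.1, 11.0)

covariance  <- cov(age, lung_cap)
correlation <- cor(age, lung_cap)

# Pearson's r is the covariance rescaled by both standard deviations.
rescaled <- covariance / (sd(age) * sd(lung_cap))
all.equal(correlation, rescaled)  # TRUE

# Changing units (years -> months) scales the covariance by 12
# but leaves r unchanged.
cov(age * 12, lung_cap) / covariance
cor(age * 12, lung_cap)
```

This is why the covariance of 8.74 cannot be read as "strong" or "weak" by itself, while the correlation can.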

Question 2

I will make a dataframe from the values provided:

Code

prior_convictions=c(0,1,2,3,4)
freq=c(128, 434, 160, 64, 24)
prisondata <- data.frame(prior_convictions, freq)
prisondata

  prior_convictions freq
1                 0  128
2                 1  434
3                 2  160
4                 3   64
5                 4   24

And add a probability column:

Code

prison_prob <- prisondata %>% mutate(prob = freq/sum(freq))
prison_prob

  prior_convictions freq       prob
1                 0  128 0.15802469
2                 1  434 0.53580247
3                 2  160 0.19753086
4                 3   64 0.07901235
5                 4   24 0.02962963

2a.

What is the probability that a randomly selected inmate has exactly 2 prior convictions?

From the table above, the probability is 0.19753086, nearly 20 percent.

2b.

What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code

head(prison_prob,2) %>% summarise(sum(prob))

  sum(prob)
1 0.6938272

The probability a randomly selected inmate has fewer than 2 prior convictions is ~69%.

2c.

What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code

head(prison_prob,3) %>% summarise(sum(prob))

  sum(prob)
1  0.891358

The probability a randomly selected inmate has 2 or fewer prior convictions is ~89%.
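Selecting rows with `head()` works only while the table stays sorted by `prior_convictions`. A slightly more defensive sketch filters on the value itself and uses `cumsum()` to get every cumulative probability at once. It rebuilds the vectors from the values given above so that it runs on its own; the names `p_fewer_than_2`, `p_2_or_fewer`, and `cdf` are mine.

```r
prior_convictions <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)
prob <- freq / sum(freq)

# Select rows by the value of the variable, not by their position.
p_fewer_than_2 <- sum(prob[prior_convictions < 2])   # about 0.694
p_2_or_fewer   <- sum(prob[prior_convictions <= 2])  # about 0.891

# cumsum() gives P(X <= x) for every x at once.
cdf <- cumsum(prob)
names(cdf) <- prior_convictions
round(cdf, 3)
```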

2d.

What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code

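tail(prison_prob,2) %>% summarise(sum(prob))

  sum(prob)
1  0.108642

The probability a randomly selected inmate has more than 2 prior convictions is ~10.9%. Only the last two rows (3 and 4 prior convictions) qualify; taking three rows would also count inmates with exactly 2. The result is the complement of 2c: 1 - 0.891358 = 0.108642.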

2e.

What is the expected value of the number of prior convictions?

Code

sum(prison_prob$prior_convictions*prison_prob$prob)

[1] 1.28642

Code

#Or another way,

weighted.mean(prison_prob$prior_convictions,prison_prob$prob)

[1] 1.28642

The expected value of the number of prior convictions is 1.28642, or about 1.29 per inmate.
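As a sanity check, the expected value should equal the ordinary mean of the data when each inmate is listed individually. The sketch below expands the frequency table with `rep()`; the names `per_inmate`, `expected_value`, and `sample_mean` are mine.

```r
prior_convictions <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)

# One entry per inmate: 128 zeros, 434 ones, and so on (810 values).
per_inmate <- rep(prior_convictions, times = freq)

expected_value <- sum(prior_convictions * freq / sum(freq))
sample_mean    <- mean(per_inmate)

all.equal(expected_value, sample_mean)  # TRUE
```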

2f.

Code

prison_prob

  prior_convictions freq       prob
1                 0  128 0.15802469
2                 1  434 0.53580247
3                 2  160 0.19753086
4                 3   64 0.07901235
5                 4   24 0.02962963

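Code

mu <- sum(prison_prob$prior_convictions * prison_prob$prob)
variance <- sum((prison_prob$prior_convictions - mu)^2 * prison_prob$prob)
round(variance, 4)

[1] 0.8562

Code

round(sqrt(variance), 4)

[1] 0.9253

The variance of the number of prior convictions is about 0.856, and the standard deviation is about 0.925. Calling var() or sd() on the freq column would instead measure the spread of the five counts themselves (25948 and 161.0838), which says nothing about how the number of prior convictions varies across inmates.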
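For the distribution of prior convictions itself, the variance follows from the probability column through the identity Var(X) = E[X^2] - (E[X])^2. This sketch rebuilds the table from the values given earlier so that it runs on its own; the variable names are mine, and the per-inmate expansion is included only as a cross-check.

```r
prior_convictions <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)
prob <- freq / sum(freq)

mu  <- sum(prior_convictions * prob)     # E[X]
ex2 <- sum(prior_convictions^2 * prob)   # E[X^2]

variance_x <- ex2 - mu^2                 # Var(X) = E[X^2] - (E[X])^2
sd_x <- sqrt(variance_x)

# Same quantity from per-inmate data, dividing by n rather than n - 1.
per_inmate <- rep(prior_convictions, times = freq)
variance_check <- mean((per_inmate - mean(per_inmate))^2)

round(c(variance = variance_x, sd = sd_x), 4)
```

Note that `var(per_inmate)` would divide by n - 1 instead of n, so with 810 inmates it comes out slightly larger than `variance_x`.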