hw1
descriptive statistics
probability
nboonstra
The first homework on descriptive statistics and probability
Author

Nick Boonstra

Published

October 5, 2022

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Question 1

a

First, let’s read in the data from the Excel file:

Code
library(readxl)
lungcap <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code
hist(lungcap$LungCap)

The histogram suggests that the distribution is approximately normal: most observations lie close to the mean, and very few fall near the extremes of the range (around 0 and 15).

b

These are the boxplots of the distributions for the lung capacity of males and females in the sample:

Code
lungcap %>% 
  ggplot(aes(x=Gender,y=LungCap)) +
  geom_boxplot()

According to these boxplots, it appears that males and females have similar median lung capacities, but that males may be more likely to have a higher lung capacity than females.

c

Code
lungcap %>% 
  group_by(Smoke) %>% 
  summarise(mean_lungcap=mean(LungCap))
# A tibble: 2 × 2
  Smoke mean_lungcap
  <chr>        <dbl>
1 no            7.77
2 yes           8.65

In this sample, smokers have a higher mean lung capacity (8.65) than non-smokers (7.77). This is counter-intuitive, since one would expect smoking to reduce lung function and, by extension, capacity.

d

In order to complete this examination by group, we must create a new nominal variable that groups observations by age; this can be accomplished fairly simply using the mutate() and case_when() functions:

Code
lungcap_age <- lungcap %>% 
  mutate(age_group = case_when(
    Age <= 13 ~ "13 and under",
    Age == 14 | Age == 15 ~ "14 to 15",
    Age == 16 | Age == 17 ~ "16 to 17",
    Age >= 18 ~ "18 and older"
  ))
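The same grouping can also be sketched with base R's cut(), which bins a numeric vector at given break points. The ages below are made-up values chosen to exercise every bin, not rows from the LungCap data, and `ages` is a hypothetical name; the labels match the ones used above.

```r
# Hypothetical ages, invented for illustration only (not from the dataset)
ages <- c(3, 13, 14, 15, 16, 17, 18, 19)

# cut() uses right-closed intervals by default:
# (-Inf, 13], (13, 15], (15, 17], (17, Inf]
age_group <- cut(
  ages,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("13 and under", "14 to 15", "16 to 17", "18 and older")
)

table(age_group)
```

One difference from the case_when() version: a non-integer age such as 15.5 would fall into a bin here, whereas the equality tests above would leave it as NA. cut() also returns a factor whose levels are in age order, which keeps tables and plots sorted sensibly.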

With this new dataframe, we can use the group_by() function to calculate mean lung capacity by age group and smoker status:

Code
lungcap_age %>% 
  group_by(age_group,Smoke) %>% 
  summarise(mean(LungCap))
# A tibble: 8 × 3
# Groups:   age_group [4]
  age_group    Smoke `mean(LungCap)`
  <chr>        <chr>           <dbl>
1 13 and under no               6.36
2 13 and under yes              7.20
3 14 to 15     no               9.14
4 14 to 15     yes              8.39
5 16 to 17     no              10.5 
6 16 to 17     yes              9.38
7 18 and older no              11.1 
8 18 and older yes             10.5 

According to these data, lung capacity generally increases with age. Interestingly, mean lung capacity is lower for smokers than for non-smokers in every age group except “13 and under”. This is surprising at first, given that, when the data are ungrouped, smokers have a higher mean lung capacity than non-smokers (see part c). However, it begins to make more sense when we see how much more heavily the “13 and under” group is represented in this dataset than the others:

Code
lungcap_age %>% 
  group_by(age_group) %>% 
  count()
# A tibble: 4 × 2
# Groups:   age_group [4]
  age_group        n
  <chr>        <int>
1 13 and under   428
2 14 to 15       120
3 16 to 17        97
4 18 and older    80

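Because this group is much larger than the others and has the lowest lung capacities, it pulls the overall means toward its own values. If the smokers in the sample are concentrated in the older, higher-capacity age groups, the ungrouped comparison in part c would reflect age more than smoking.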
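A reversal like this between grouped and pooled means (often called Simpson's paradox) can be reproduced with a small toy example. The group sizes and means below are invented for illustration and are not taken from the LungCap data; `toy`, `pooled_no`, and `pooled_yes` are hypothetical names.

```r
# Hypothetical group sizes and group means, invented for illustration only
toy <- data.frame(
  age_group = c("young", "young", "old", "old"),
  smoke     = c("no", "yes", "no", "yes"),
  n         = c(90, 10, 10, 40),
  mean_cap  = c(6, 5.5, 11, 10)
)

# Within each age group, smokers have the lower mean (5.5 < 6 and 10 < 11).
# The pooled mean per smoking status weights each group mean by its size.
pooled_no  <- with(subset(toy, smoke == "no"),  weighted.mean(mean_cap, n))
pooled_yes <- with(subset(toy, smoke == "yes"), weighted.mean(mean_cap, n))

c(no = pooled_no, yes = pooled_yes)
```

In this toy data the pooled smoker mean comes out higher even though smokers are lower within each group, because most smokers sit in the older, higher-capacity group. Whether the real data follow the same pattern could be checked by counting observations by both variables, for example with count(lungcap_age, age_group, Smoke); that check is not run here.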

e

It is not clear to me how this part is different from part d; from what I do understand, I believe the question being asked here is addressed in that part.

f

Code
cov(lungcap$LungCap, lungcap$Age)
[1] 8.738289
Code
cor(lungcap$LungCap, lungcap$Age)
[1] 0.8196749

Lung capacity and age covary positively, and the correlation of about 0.82 indicates a strong linear association: higher ages tend to go with higher lung capacities. We can confirm this with a simple visualization:

Code
lungcap %>% 
  ggplot(aes(x=Age,y=LungCap)) +
  geom_point() +
  geom_smooth(method='lm')
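The covariance above depends on the units of both variables, while the correlation is the same quantity rescaled to lie between -1 and 1. A minimal sketch of that relationship follows; the vectors `x` and `y` are made-up values for illustration, not the LungCap data.

```r
# Hypothetical paired measurements, invented for illustration only
x <- c(3, 6, 9, 12, 15, 18)
y <- c(2.1, 4.0, 6.8, 8.1, 9.9, 11.2)

# Pearson correlation is the covariance divided by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Should agree with the built-in cor() up to floating-point tolerance
all.equal(r_manual, cor(x, y))
```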

Question 2

Before we begin answering the parts of this question, we must create a dataframe in R that represents the necessary data.

Code
priors <- c(0,1,2,3,4)
freq <- c(128,434,160,64,24)
prisoners <- data.frame(priors,freq)
prisoners
  priors freq
1      0  128
2      1  434
3      2  160
4      3   64
5      4   24

a

The probability that a randomly selected inmate has exactly 2 prior convictions is 160 / 810 = 0.1975309.

b

The probability that a randomly selected inmate has fewer than 2 prior convictions is (128+434) / 810 = 0.6938272.

c

The probability that a randomly selected inmate has 2 or fewer prior convictions is (128+434+160) / 810 = 0.891358.

d

The probability that a randomly selected inmate has more than 2 prior convictions is (64+24) / 810 = 0.108642.
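The four probabilities in parts a through d can also be computed from the frequency table rather than by hand. The sketch below redefines the two vectors so that it runs on its own; the `p_*` names are introduced here for illustration.

```r
priors <- c(0, 1, 2, 3, 4)
freq   <- c(128, 434, 160, 64, 24)
n      <- sum(freq)                        # 810 inmates in total

# Sum the frequencies that satisfy each condition, then divide by the total
p_exactly_2 <- sum(freq[priors == 2]) / n  # part a
p_fewer_2   <- sum(freq[priors <  2]) / n  # part b
p_at_most_2 <- sum(freq[priors <= 2]) / n  # part c
p_more_2    <- sum(freq[priors >  2]) / n  # part d

round(c(a = p_exactly_2, b = p_fewer_2, c = p_at_most_2, d = p_more_2), 4)
```

Parts c and d are complements, so their probabilities should sum to 1, which is a quick sanity check on the arithmetic.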

e

Before calculating expected value, we should put together a probability mass function for the prisoners data.

Code
prisoners <- prisoners %>% 
  mutate(prob = freq / sum(freq),    # probability of each number of priors
         expect = prob * priors)     # each value's contribution to the expected value

prisoners %>% 
  summarise(sum(expect))
  sum(expect)
1     1.28642

The expected value for the number of prior convictions is about 1.29 priors.

EDIT: There is a much simpler way to compute this! Rather than using the dataframe I created, storing values and their frequencies, I can create one vector that stores each value a certain number of times, according to the given frequencies:

Code
prisoners_full <- rep(c(0,1,2,3,4),times=c(128,434,160,64,24))
prisoners_full
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[445] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[556] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[593] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[630] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[667] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[741] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[778] 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Because each value appears in this vector in proportion to its probability, the mean of the vector equals the expected value.

Code
mean(prisoners_full)
[1] 1.28642

f

Creating this numerical vector also makes the standard deviation calculation extremely simple in R.

Code
sd(prisoners_full)
[1] 0.9259016
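Note that sd() divides by n - 1, treating the vector as a sample. If the 810 inmates are instead treated as the whole population, the standard deviation follows directly from the probability mass function and is slightly smaller. The sketch below shows both forms under that assumption; `mu`, `sd_pop`, and `sd_sample` are names introduced here for illustration.

```r
priors <- c(0, 1, 2, 3, 4)
freq   <- c(128, 434, 160, 64, 24)
n      <- sum(freq)
prob   <- freq / n

mu     <- sum(priors * prob)                   # expected value
sd_pop <- sqrt(sum(prob * (priors - mu)^2))    # population form (divides by n)

# sd() divides by n - 1, so rescaling recovers the sample value it reports
sd_sample <- sd_pop * sqrt(n / (n - 1))

round(c(population = sd_pop, sample = sd_sample), 4)
```

With 810 observations the two versions differ only in the fourth decimal place, so the choice matters little here, but it is worth stating which one is being reported.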