Homework 1 Solution

hw1

desriptive statistics

probability

The first homework on descriptive statistics and probability

Author

Rosemary Pang

Published

March 1, 2023

Question 1

a

First, let’s read in the data from the Excel file:

Code

library(readxl)
library(dplyr, warn.conflicts = F)
library(magrittr)
df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code

hist(df$LungCap, xlab = 'Lung Capacity', main = '', freq = F)

The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).

b

Code

boxplot(LungCap ~ Gender, data = df)

The shape of the distribution is similar for males and females. The median, first quartile, third quartile lung capacity values all seem to be somewhat higher for males.

c

Code

df %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))

# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no       7.77
2 yes      8.65

The lung capacity for smokers seems to be higher than non-smokers. It goes against the common idea that smoking would hurt lung capacity.

d

Less than or equal to 13

Code

df %>%
  filter(Age <= 13) %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))

# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no       6.36
2 yes      7.20

14 to 15

Code

df %>%
  filter(Age == 14 | Age == 15) %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))

# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no       9.14
2 yes      8.39

16 to 17

Code

df %>%
  filter(Age == 16 | Age == 17) %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))

# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no      10.5 
2 yes      9.38

Greater than or equal to 18

Code

df %>%
  filter(Age >= 18) %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))

# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no       11.1
2 yes      10.5

e

For three out of the four groups, lung capacity if smaller for smokers. This makes another explanation plausible. Smoking is inversely related to lung capacity, but older people both smoke more and have more lung capacity. Thus, considering the relationship between smoking and lung capacity without looking at age makes the relationship look the opposite of what it is.

Question 2

Let’s recreate the data in the question in R.

Code

tb <- tibble(
  X = c(0, 1, 2, 3, 4),
  Frequency = c(128, 434, 160, 64, 24)
  )

Store the total number of observations, 810, into n

Code

n <- sum(tb$Frequency)

a

Code

tb %>%
  filter(X == 2) %>%
  pull(Frequency) %>%
  divide_by(n)

[1] 0.1975309

b

Code

tb %>%
  filter(X < 2) %>%
  pull(Frequency) %>%
  sum() %>%
  divide_by(n)

[1] 0.6938272

c

Code

tb %>%
  filter(X <= 2) %>%
  pull(Frequency) %>%
  sum() %>%
  divide_by(n)

[1] 0.891358

d

Code

tb %>%
  filter(X > 2) %>%
  pull(Frequency) %>%
  sum() %>%
  divide_by(n)

[1] 0.108642

e

Expected number of prior convictions is just a weighted average of the number of prior convictions.

Method 1: Multiply every value with their frequency, then divide by total frequency i.e. (0 * 128 + 1 * 434 + 2 * 160 ……) / 810.

Code

sum(tb$X * tb$Frequency) / n

[1] 1.28642

Method 2: Multiply every value with their probility, sum them up.

Code

tb %>%
  mutate(probability = Frequency / n) -> tb

print(tb)

# A tibble: 5 × 3
      X Frequency probability
  <dbl>     <dbl>       <dbl>
1     0       128      0.158 
2     1       434      0.536 
3     2       160      0.198 
4     3        64      0.0790
5     4        24      0.0296

Code

sum(tb$X * tb$probability)

[1] 1.28642

Method 3: Recreate the whole sample (a vector that has 128 zeroes, 434 ones, 160 twos, ….) with a total length/size of 810. Take the mean.

Code

sample <- c(rep(0, 128), rep(1, 434), rep(2, 160), rep(3, 64), rep(4, 24))
mean(sample)

[1] 1.28642

f

Method 1: Let’s start from the end: we have the sample, just call var() and sd()

Code

cat('Variance:', var(sample))

Variance: 0.8572937

Code

cat('\nStandard Deviation:', sd(sample))


Standard Deviation: 0.9259016

Method 2: Manually apply the formula using weights.

Standard deviation is square root of variance. So let’s calculate variance first. For that we need the mean. Let’s pull the expected value from the previous section:

Code

m <- sum(tb$X * tb$Frequency) / n

For every observation, we’ll need the squared difference from mean (squared deviation from mean).

Code

tb %>%
  mutate(sq_deviation = (X - m)^2) -> tb 
print(tb)

# A tibble: 5 × 4
      X Frequency probability sq_deviation
  <dbl>     <dbl>       <dbl>        <dbl>
1     0       128      0.158        1.65  
2     1       434      0.536        0.0820
3     2       160      0.198        0.509 
4     3        64      0.0790       2.94  
5     4        24      0.0296       7.36

Then, we can now multiply them with probability.

Code

sum(tb$sq_deviation * tb$probability)

[1] 0.8562353

This gives us the ‘population’ variance. If we wanted the ‘sample’ variance, what the var() function does, we could manually apply the Bessel’s correction:

Code

variance <- sum(tb$sq_deviation * tb$probability) * (n / (n-1))
print(variance)

[1] 0.8572937

Standard deviation is then just the square root:

Code

sqrt(variance)

[1] 0.9259016

This replicated what we found directly using the sample.