hw1
desriptive statistics
probability
The first homework on descriptive statistics and probability
Author

Omer Yalcin

Published

October 4, 2022

Question 1

a

First, let’s read in the data from the Excel file:

Code
library(readxl)
library(dplyr, warn.conflicts = F)
library(magrittr)
df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code
hist(df$LungCap, xlab = 'Lung Capacity', main = '', freq = F)

The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).

b

Code
boxplot(LungCap ~ Gender, data = df)

The shape of the distribution is similar for males and females. The median, first quartile, third quartile lung capacity values all seem to be somewhat higher for males.

c

Code
df %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))
# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no       7.77
2 yes      8.65

The lung capacity for smokers seems to be higher than non-smokers. It goes against the common idea that smoking would hurt lung capacity.

d

  • Less than or equal to 13
Code
df %>%
  filter(Age <= 13) %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))
# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no       6.36
2 yes      7.20
  • 14 to 15
Code
df %>%
  filter(Age == 14 | Age == 15) %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))
# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no       9.14
2 yes      8.39
  • 16 to 17
Code
df %>%
  filter(Age == 16 | Age == 17) %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))
# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no      10.5 
2 yes      9.38
  • Greater than or equal to 18
Code
df %>%
  filter(Age >= 18) %>%
  group_by(Smoke) %>%
  summarize(LungCap = mean(LungCap))
# A tibble: 2 × 2
  Smoke LungCap
  <chr>   <dbl>
1 no       11.1
2 yes      10.5

e

For three out of the four groups, lung capacity if smaller for smokers. This makes another explanation plausible. Smoking is inversely related to lung capacity, but older people both smoke more and have more lung capacity. Thus, considering the relationship between smoking and lung capacity without looking at age makes the relationship look the opposite of what it is.

f

Code
cov(df$LungCap, df$Age)
[1] 8.738289
Code
cor(df$LungCap, df$Age)
[1] 0.8196749

Both the correlation and the covariance are positive (when one of them is the other has to). Positive values suggest that people who are older tend to have higher lung capacity, confirming what we found. Since correlation is standardized (needs to be between -1 and 1), its absolute value tells us about the strength of the relationship. 0.82 suggests a pretty strong relationship.

Question 2

Let’s recreate the data in the question in R.

Code
tb <- tibble(
  X = c(0, 1, 2, 3, 4),
  Frequency = c(128, 434, 160, 64, 24)
  )

There is no super elegant way of doing these (as far as I know), here’s one way.

Store the total number of observations, 810, into n

Code
n <- sum(tb$Frequency) 

a

Code
tb %>%
  filter(X == 2) %>%
  pull(Frequency) %>%
  divide_by(n)
[1] 0.1975309

b

Code
tb %>%
  filter(X < 2) %>%
  pull(Frequency) %>%
  sum() %>%
  divide_by(n)
[1] 0.6938272

c

Code
tb %>%
  filter(X <= 2) %>%
  pull(Frequency) %>%
  sum() %>%
  divide_by(n)
[1] 0.891358

d

Code
tb %>%
  filter(X > 2) %>%
  pull(Frequency) %>%
  sum() %>%
  divide_by(n)
[1] 0.108642

e

Expected number of prior convictions is just a weighted average of the number of prior convictions.

  • Method 1: Multiply every value with their frequency, then divide by total frequency i.e. (0 * 128 + 1 * 434 + 2 * 160 ……) / 810.
Code
sum(tb$X * tb$Frequency) / n
[1] 1.28642
  • Method 2: Multiply every value with their probility, sum them up.
Code
tb %>%
  mutate(probability = Frequency / n) -> tb

print(tb)
# A tibble: 5 × 3
      X Frequency probability
  <dbl>     <dbl>       <dbl>
1     0       128      0.158 
2     1       434      0.536 
3     2       160      0.198 
4     3        64      0.0790
5     4        24      0.0296
Code
sum(tb$X * tb$probability)
[1] 1.28642
  • Method 3: Recreate the whole sample (a vector that has 128 zeroes, 434 ones, 160 twos, ….) with a total length/size of 810. Take the mean.
Code
sample <- c(rep(0, 128), rep(1, 434), rep(2, 160), rep(3, 64), rep(4, 24))
mean(sample)
[1] 1.28642

f

  • Method 1: Let’s start from the end: we have the sample, just call var() and sd()
Code
cat('Variance:', var(sample))
Variance: 0.8572937
Code
cat('\nStandard Deviation:', sd(sample))

Standard Deviation: 0.9259016

Method 2: Manually apply the formula using weights.

Standard deviation is square root of variance. So let’s calculate variance first. For that we need the mean. Let’s pull the expected value from the previous section:

Code
m <- sum(tb$X * tb$Frequency) / n

For every observation, we’ll need the squared difference from mean (squared deviation from mean).

Code
tb %>%
  mutate(sq_deviation = (X - m)^2) -> tb 
print(tb)
# A tibble: 5 × 4
      X Frequency probability sq_deviation
  <dbl>     <dbl>       <dbl>        <dbl>
1     0       128      0.158        1.65  
2     1       434      0.536        0.0820
3     2       160      0.198        0.509 
4     3        64      0.0790       2.94  
5     4        24      0.0296       7.36  

Then, we can now multiply them with probability.

Code
sum(tb$sq_deviation * tb$probability)
[1] 0.8562353

This gives us the ‘population’ variance. If we wanted the ‘sample’ variance, what the var() function does, we could manually apply the Bessel’s correction:

Code
variance <- sum(tb$sq_deviation * tb$probability) * (n / (n-1))
print(variance)
[1] 0.8572937

Standard deviation is then just the square root:

Code
sqrt(variance)
[1] 0.9259016

This replicated what we found directly using the sample.