hw1
descriptive statistics
probability
Homework 1
Author

Zhiyuan Zhou

Published

February 28, 2023

Code
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Question 1

a

First, let’s read in the data from the Excel file:

Code
library(readxl)
df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code
hist(df$LungCap)

The histogram suggests that LungCap is approximately normally distributed: most observations cluster around the mean, and very few fall near the extremes of the range (0 and 15).
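As a rough numerical complement to the histogram, one could check how many observations fall within one and two standard deviations of the mean; for normally distributed data these shares are about 68% and 95%. The sketch below is only illustrative: `share_within` is a helper name introduced here, and `lung_cap_sim` is a simulated stand-in with made-up parameters, not the real `df$LungCap`.

```r
# Share of observations within k standard deviations of the mean.
share_within <- function(x, k = 1) {
  x <- x[!is.na(x)]
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Simulated stand-in for df$LungCap (hypothetical values, not the real data).
set.seed(1)
lung_cap_sim <- rnorm(725, mean = 8, sd = 2.5)

share_within(lung_cap_sim, 1)  # roughly 0.68 if the data are close to normal
share_within(lung_cap_sim, 2)  # roughly 0.95 if the data are close to normal
```

With the real data, the same check would be `share_within(df$LungCap, 1)`. A normal Q-Q plot via `qqnorm(df$LungCap); qqline(df$LungCap)` is another common visual check.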

b

Code
# Side-by-side boxplots of lung capacity for each gender
boxplot(LungCap ~ Gender,
        data = df,
        main = "Lung Capacity by Gender",
        xlab = "Gender",
        ylab = "Lung Capacity")

c

Code
df %>%
  group_by(Smoke) %>%
  summarize(mean = mean(LungCap))
# A tibble: 2 × 2
  Smoke  mean
  <chr> <dbl>
1 no     7.77
2 yes    8.65

This result is surprising: the mean lung capacity of smokers (8.65) is higher than that of non-smokers (7.77).

d

Code
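# Bin ages into right-closed intervals: (0,13], (13,15], (15,17], (17,Inf]
df$AgeGroup <- cut(df$Age,
                   breaks = c(0, 13, 15, 17, Inf),
                   labels = c("<=13", "14-15", "16-17", ">=18"),
                   right = TRUE)

# Mean lung capacity, mean age, and group size by age group and smoking status
df %>%
  group_by(AgeGroup, Smoke) %>%
  summarize(meanLungCap = mean(LungCap), meanAge = mean(Age), count = n())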
`summarise()` has grouped output by 'AgeGroup'. You can override using the
`.groups` argument.
# A tibble: 8 × 5
# Groups:   AgeGroup [4]
  AgeGroup Smoke meanLungCap meanAge count
  <fct>    <chr>       <dbl>   <dbl> <int>
1 <=13     no           6.36    9.49   401
2 <=13     yes          7.20   11.7     27
3 14-15    no           9.14   14.5    105
4 14-15    yes          8.39   14.6     15
5 16-17    no          10.5    16.4     77
6 16-17    yes          9.38   16.6     20
7 >=18     no          11.1    18.5     65
8 >=18     yes         10.5    18.1     15

e

In the "<=13" age group, smokers have a higher mean lung capacity than non-smokers; in every other age group, smokers have a lower one. The group sizes explain the surprising result in 1c: 401 of the 648 non-smokers are 13 or younger, compared with only 27 of the 77 smokers, so the overall non-smoker mean is pulled down by young children with small lungs. Even within the "<=13" group, smokers are older on average (11.7 vs. 9.49 years). Age, rather than smoking, is therefore the more likely reason for the higher lung capacity among smokers.
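The reversal between the grouped and the overall comparison can be reproduced from the summary table in d alone. The sketch below rebuilds each overall mean as a count-weighted average of the group means; the object names (`groups`, `overall`, `share_youngest`) are introduced here for illustration, and because the table values are rounded, the results only approximate the 7.77 and 8.65 reported in 1c.

```r
# Group sizes and (rounded) mean lung capacities copied from the table in d.
groups <- data.frame(
  AgeGroup    = rep(c("<=13", "14-15", "16-17", ">=18"), each = 2),
  Smoke       = rep(c("no", "yes"), times = 4),
  meanLungCap = c(6.36, 7.20, 9.14, 8.39, 10.5, 9.38, 11.1, 10.5),
  count       = c(401, 27, 105, 15, 77, 20, 65, 15)
)

# Overall mean per smoking status = count-weighted mean of the group means.
overall <- sapply(split(groups, groups$Smoke), function(g) {
  weighted.mean(g$meanLungCap, g$count)
})
overall

# Share of each smoking status that falls in the youngest age group.
share_youngest <- with(
  groups,
  tapply(count * (AgeGroup == "<=13"), Smoke, sum) / tapply(count, Smoke, sum)
)
share_youngest
```

The weighted means come out higher for smokers even though smokers have the lower mean in three of the four age groups, because a much larger share of non-smokers sits in the youngest group.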

Question 2

a

Code
prob_2 <- (160 / 810)
prob_2
[1] 0.1975309

b

Code
prob_fewer2 <- (128 + 434) / 810
prob_fewer2
[1] 0.6938272

c

Code
prob_2OrFewer <- (128 + 434 + 160) / 810
prob_2OrFewer
[1] 0.891358

d

Code
prob_more2 <- (64 + 24) / 810
prob_more2
[1] 0.108642
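The result in d can be cross-checked with the complement rule, since "more than 2" is the complement of "2 or fewer" from c. A minimal sketch, assuming the five frequencies correspond to the values 0 through 4 as in the expectation calculation in e; the variable names are introduced here for illustration:

```r
# Frequencies for the values 0, 1, 2, 3, 4 (taken from the calculations above).
counts <- c(128, 434, 160, 64, 24)

# P(X <= 2), then its complement P(X > 2) = 1 - P(X <= 2).
p_two_or_fewer  <- sum(counts[1:3]) / sum(counts)
p_more_than_two <- 1 - p_two_or_fewer
p_more_than_two
```

This should agree with the direct count `(64 + 24) / 810` used in d.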

e

Code
expectation <- (0 * 128 + 1 * 434 + 2 * 160 + 3 * 64 + 4 * 24) / 810
expectation
[1] 1.28642

f

Code
variance <- sum(128 * (0 - expectation) ^ 2,
                434 * (1 - expectation) ^ 2,
                160 * (2 - expectation) ^ 2,
                64 * (3 - expectation) ^ 2,
                24 * (4 - expectation) ^ 2) / 810
variance
[1] 0.8562353
Code
std_dev <- sqrt(variance)  # named std_dev to avoid shadowing stats::sd()
std_dev
[1] 0.9253298
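The expectation, variance, and standard deviation from e and f can also be written in vectorized form, which avoids retyping each term. The sketch below is one possible restatement, not a replacement for the calculations above; the names `x`, `counts`, `p`, and `std_dev` are introduced here. It uses the shortcut Var(X) = E[X^2] - (E[X])^2 and, like f, divides by the full total of 810 (a population variance), whereas R's built-in `var()` divides by n - 1.

```r
# Possible values and their observed frequencies.
x      <- 0:4
counts <- c(128, 434, 160, 64, 24)

# Relative frequencies used as probabilities.
p <- counts / sum(counts)

# E[X] = sum of x * p(x)
expectation <- sum(x * p)

# Var(X) = E[X^2] - (E[X])^2, the population form used in f.
variance <- sum(x^2 * p) - expectation^2
std_dev  <- sqrt(variance)

c(expectation = expectation, variance = variance, std_dev = std_dev)
```

These values should match the outputs shown in e and f up to rounding.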