library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Zhiyuan Zhou
February 28, 2023
First, let’s read in the data from the Excel file:
The distribution of LungCap looks as follows:
The histogram suggests that LungCap is approximately normally distributed: most observations lie close to the mean, and very few lie near the extremes of the range (around 0 and 15).
## b
## c
# A tibble: 2 × 2
Smoke mean
<chr> <dbl>
1 no 7.77
2 yes 8.65
This result is surprising: on average, smokers have a higher lung capacity than non-smokers (8.65 vs. 7.77).
## d
`summarise()` has grouped output by 'AgeGroup'. You can override using the
`.groups` argument.
# A tibble: 8 × 5
# Groups: AgeGroup [4]
AgeGroup Smoke meanLungCap meanAge count
<fct> <chr> <dbl> <dbl> <int>
1 <=13 no 6.36 9.49 401
2 <=13 yes 7.20 11.7 27
3 14-15 no 9.14 14.5 105
4 14-15 yes 8.39 14.6 15
5 16-17 no 10.5 16.4 77
6 16-17 yes 9.38 16.6 20
7 >=18 no 11.1 18.5 65
8 >=18 yes 10.5 18.1 15
## e
In the “<=13” age group, smokers have a higher mean lung capacity than non-smokers; in every other age group, smokers have a lower mean lung capacity. The group sizes explain the surprising result in 1c: most non-smokers (401 of 648) are 13 or younger, whereas the 77 smokers are spread more evenly across the older groups, so the overall non-smoker mean is pulled down by young children with small lungs. Even within the “<=13” group, smokers are older on average (11.7 vs. 9.49 years), so the higher lung capacity there is more likely due to age than to smoking.
# Question 2
## a
## b
## c
## d
## e
## f
---
title: "Homework 1"
author: "Zhiyuan Zhou"
description: "Homework 1"
date: "02/28/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw1
  - descriptive statistics
  - probability
---
```{r}
library(dplyr)
```
# Question 1
## a
First, let's read in the data from the Excel file:
```{r, echo=T}
library(readxl)
df <- read_excel("_data/LungCapData.xls")
```
The distribution of LungCap looks as follows:
```{r, echo=T}
hist(df$LungCap)
```
The histogram suggests that LungCap is approximately normally distributed: most observations lie close to the mean, and very few lie near the extremes of the range (around 0 and 15).
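
A rough numerical check can complement the visual impression. The sketch below is a minimal illustration, not part of the original analysis: `share_within_sd()` is a hypothetical helper that reports the share of values lying within k standard deviations of the mean. For roughly normal data this share is about 68% for k = 1 and about 95% for k = 2.

```{r}
# Hypothetical helper (not in the original analysis): share of values
# lying within k sample standard deviations of the mean.
share_within_sd <- function(x, k = 1) {
  x <- x[!is.na(x)]
  mean(abs(x - mean(x)) <= k * stats::sd(x))
}

# Toy example: the mean is 3 and the sd is about 1.58,
# so only 2, 3 and 4 lie within one sd of the mean.
toy_share <- share_within_sd(c(1, 2, 3, 4, 5))
toy_share
```

Applied to the data, `share_within_sd(df$LungCap)` and `share_within_sd(df$LungCap, k = 2)` can be compared against those reference values.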
## b
```{r, echo=T}
# Compare the distribution of lung capacity across genders
boxplot(LungCap ~ Gender,
        data = df,
        main = "Lung Capacity by Gender",
        xlab = "Gender",
        ylab = "Lung Capacity")
```
## c
```{r, echo=T}
df %>%
  group_by(Smoke) %>%
  summarize(mean = mean(LungCap))
```
This result is surprising: on average, smokers have a higher lung capacity than non-smokers (8.65 vs. 7.77).
## d
```{r}
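# Bin ages into four groups; right = TRUE closes each interval on the right,
# so age 13 falls in "<=13" and age 15 in "14-15".
df$AgeGroup <- cut(df$Age,
                   breaks = c(0, 13, 15, 17, Inf),
                   labels = c("<=13", "14-15", "16-17", ">=18"),
                   right = TRUE)

# Mean lung capacity, mean age, and sample size per age group and smoking status
df %>%
  group_by(AgeGroup, Smoke) %>%
  summarize(meanLungCap = mean(LungCap), meanAge = mean(Age), count = n())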
```
## e
In the "<=13" age group, smokers have a higher mean lung capacity than non-smokers; in every other age group, smokers have a lower mean lung capacity. The group sizes explain the surprising result in 1c: most non-smokers (401 of 648) are 13 or younger, whereas the 77 smokers are spread more evenly across the older groups, so the overall non-smoker mean is pulled down by young children with small lungs. Even within the "<=13" group, smokers are older on average (11.7 vs. 9.49 years), so the higher lung capacity there is more likely due to age than to smoking.
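
The role of the group sizes can be checked with a little arithmetic: each overall mean in 1c is the count-weighted average of the corresponding group means in 1d. The sketch below re-enters the rounded values from the 1d table by hand (the vector names are illustrative, not from the original analysis), so the results match the 1c means only up to rounding.

```{r}
# Rounded group means and counts copied from the 1d table
# (age groups in order: <=13, 14-15, 16-17, >=18).
mean_no  <- c(6.36, 9.14, 10.5, 11.1)
n_no     <- c(401, 105, 77, 65)
mean_yes <- c(7.20, 8.39, 9.38, 10.5)
n_yes    <- c(27, 15, 20, 15)

# Overall mean = count-weighted average of the group means
overall_no  <- weighted.mean(mean_no, n_no)
overall_yes <- weighted.mean(mean_yes, n_yes)
c(no = overall_no, yes = overall_yes)

# Share of each smoking group that is 13 or younger
c(no = n_no[1] / sum(n_no), yes = n_yes[1] / sum(n_yes))
```

About 62% of non-smokers but only about 35% of smokers fall in the youngest group, which is why the overall non-smoker average is pulled toward the low lung capacities of young children.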
# Question 2
## a
```{r}
prob_2 <- (160 / 810)
prob_2
```
## b
```{r}
prob_fewer2 <- (128 + 434) / 810
prob_fewer2
```
## c
```{r}
prob_2OrFewer <- (128 + 434 + 160) / 810
prob_2OrFewer
```
## d
```{r}
prob_more2 <- (64 + 24) / 810
prob_more2
```
## e
```{r}
expectation <- (0 * 128 + 1 * 434 + 2 * 160 + 3 * 64 + 4 * 24) / 810
expectation
```
## f
```{r}
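# Population variance: squared deviations from the mean, weighted by frequency
variance <- sum(128 * (0 - expectation)^2,
                434 * (1 - expectation)^2,
                160 * (2 - expectation)^2,
                 64 * (3 - expectation)^2,
                 24 * (4 - expectation)^2) / 810
variance
# Named std_dev rather than sd to avoid masking stats::sd()
std_dev <- sqrt(variance)
std_dev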
```
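
The calculations in Question 2 can also be cross-checked in vectorized form. The sketch below is an illustration under the assumption, read off the code above, that the variable takes the values 0 to 4 with frequencies 128, 434, 160, 64 and 24 out of 810. The names `values`, `counts`, `probs`, `mu`, `sigma2` and `sigma2_alt` are not from the original.

```{r}
values <- 0:4
counts <- c(128, 434, 160, 64, 24)
probs  <- counts / sum(counts)   # sum(counts) is 810

# Complement rule: P(X > 2) = 1 - P(X <= 2)
p_more2 <- 1 - sum(probs[values <= 2])

# Expectation and population variance as probability-weighted sums
mu     <- sum(values * probs)
sigma2 <- sum((values - mu)^2 * probs)

# Shortcut identity: Var(X) = E[X^2] - (E[X])^2
sigma2_alt <- sum(values^2 * probs) - mu^2

c(p_more2 = p_more2, mean = mu, variance = sigma2, sd = sqrt(sigma2))
```

Keeping the values and frequencies in vectors means each quantity is a one-line sum, and the two variance formulas give an independent check on each other.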