hw1
descriptive statistics
probability
Homework 1
Author

Tyler Tewksbury

Published

February 28, 2023

Question 1

a

First, let’s read in the data from the Excel file:

Code
library(readxl)
Code
library(tidyverse)
Code
df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code
hist(df$LungCap)

The histogram suggests that LungCap is approximately normally distributed. Most observations cluster near the mean, and very few fall near the extremes (0 and 15).

b

Code
boxplot(LungCap ~ Gender, df )

The distributions of lung capacity for males and females are very similar, though the minimum, maximum, and median are all slightly higher for males.

c

Code
mean(subset(df$LungCap, df$Smoke == "no"))
[1] 7.770188
Code
mean(subset(df$LungCap, df$Smoke == "yes"))
[1] 8.645455

In this dataset, mean lung capacity is higher for smokers (about 8.65) than for nonsmokers (about 7.77), which seems counterintuitive.

d

Code
df <- df %>% 
  mutate(
    age_group = dplyr::case_when(
      Age <= 13            ~ "<=13",
      Age == 14 | Age == 15 ~ "14-15",
      Age == 16 | Age == 17 ~ "16-17",
      Age >= 18             ~ ">=18"
    )
)

df2 <- df %>%
  group_by(age_group, Smoke) %>%
  summarise_at(vars(LungCap),  list(AvgLungCap = mean))
 
df2
# A tibble: 8 × 3
# Groups:   age_group [4]
  age_group Smoke AvgLungCap
  <chr>     <chr>      <dbl>
1 14-15     no          9.14
2 14-15     yes         8.39
3 16-17     no         10.5 
4 16-17     yes         9.38
5 <=13      no          6.36
6 <=13      yes         7.20
7 >=18      no         11.1 
8 >=18      yes        10.5 

For both smokers and nonsmokers, mean lung capacity increases with each successive age group, which implies that lung capacity grows as one gets older.
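The age grouping above can also be written with base R's `cut()`, which returns a factor whose levels keep the intended order, so summary tables would list `<=13` first instead of sorting the labels alphabetically. This is a minimal sketch that assumes whole-number ages; the `ages` vector is made up purely for illustration and is not taken from the dataset.

```r
# Hypothetical ages for illustration only (not from LungCapData)
ages <- c(3, 13, 14, 15, 16, 17, 18)

# Right-closed intervals: (-Inf, 13], (13, 15], (15, 17], (17, Inf)
age_group <- cut(
  ages,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("<=13", "14-15", "16-17", ">=18")
)

# Factor levels are stored in the order given, not alphabetically
levels(age_group)
table(age_group)
```

With `case_when()` the column is a character vector, which is why the summary tables in this question show `14-15` before `<=13`.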

e

Within age groups, smokers have a lower mean lung capacity than nonsmokers in every group except <=13. This differs from part c, where smokers as a whole had the higher mean. The group sizes help explain the discrepancy.

Code
df %>% group_by(Smoke, age_group) %>% summarise(count = n())
`summarise()` has grouped output by 'Smoke'. You can override using the
`.groups` argument.
# A tibble: 8 × 3
# Groups:   Smoke [2]
  Smoke age_group count
  <chr> <chr>     <int>
1 no    14-15       105
2 no    16-17        77
3 no    <=13        401
4 no    >=18         65
5 yes   14-15        15
6 yes   16-17        20
7 yes   <=13         27
8 yes   >=18         15

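The <=13 group is by far the largest (428 of 725 observations), and almost all of its members (401) are nonsmokers. Nonsmokers also outnumber smokers in every other age group, including >=18. Because the youngest group has the lowest lung capacity and makes up most of the nonsmokers, it pulls the overall nonsmoker mean down, which likely explains the reversal seen in part c.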
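The effect of group sizes can be checked by rebuilding the overall means from part c as count-weighted averages of the age-group means. The sketch below copies the rounded means and counts from the two tables above into a small data frame; the names `groups`, `overall`, and `share_youngest` are illustrative and do not appear in the original analysis.

```r
# Group means and counts copied from the two tables above (means are rounded)
groups <- data.frame(
  age_group = rep(c("<=13", "14-15", "16-17", ">=18"), times = 2),
  Smoke     = rep(c("no", "yes"), each = 4),
  avg       = c(6.36, 9.14, 10.5, 11.1, 7.20, 8.39, 9.38, 10.5),
  n         = c(401, 105, 77, 65, 27, 15, 20, 15)
)

# Overall mean per smoking status = count-weighted mean of the group means
overall <- sapply(split(groups, groups$Smoke),
                  function(g) weighted.mean(g$avg, g$n))
overall

# Share of each smoking status that falls in the youngest age group
share_youngest <- sapply(split(groups, groups$Smoke),
                         function(g) g$n[g$age_group == "<=13"] / sum(g$n))
share_youngest
```

Because the table means are rounded, the weighted means match part c only approximately. Roughly 62% of nonsmokers fall in the <=13 group, compared with roughly 35% of smokers, so the two overall means are averages over very different age mixes.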

Question 2

a

Code
prob = (160/810)
prob
[1] 0.1975309

About 19.75%

b

Code
prob2 = ((434+128)/810)
prob2
[1] 0.6938272

About 69.38%

c

Code
prob3 = ((434+128+160)/810)
prob3
[1] 0.891358

About 89.14%

d

Code
prob4 = ((64+24)/810)
prob4
[1] 0.108642

About 10.86%
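Parts a through d all divide a subset of the same frequency counts by the total of 810, so they can be computed from one named vector. This sketch assumes the counts correspond to 0 through 4 convictions, as in the vector built in part e. The event descriptions in the comments are inferred from which counts each part sums, and the variable names are illustrative.

```r
# Frequency of each number of convictions (values taken from part e)
counts <- c("0" = 128, "1" = 434, "2" = 160, "3" = 64, "4" = 24)

# Empirical probability of each value
p <- counts / sum(counts)

p_eq_2 <- p[["2"]]                   # part a: exactly 2
p_lt_2 <- sum(p[c("0", "1")])        # part b: fewer than 2
p_le_2 <- sum(p[c("0", "1", "2")])   # part c: 2 or fewer
p_gt_2 <- sum(p[c("3", "4")])        # part d: more than 2

round(c(a = p_eq_2, b = p_lt_2, c = p_le_2, d = p_gt_2), 4)
```

Parts c and d are complements, so their probabilities sum to 1.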

e

Code
convictions <- c(rep(0, 128), rep(1, 434), rep(2, 160), rep(3, 64), rep(4, 24))
mean(convictions)
[1] 1.28642
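The mean of the expanded vector equals the expected value of the empirical distribution, E[X] = Σ x · p(x). As a sketch, it can be computed directly from the frequencies without building an 810-element vector; the names `x`, `counts`, `p`, and `expected` are illustrative.

```r
x      <- 0:4                        # possible values
counts <- c(128, 434, 160, 64, 24)   # frequency of each value
p      <- counts / sum(counts)       # empirical probabilities

# Expected value: each value weighted by its probability
expected <- sum(x * p)
expected
```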

f

Code
var(convictions)
[1] 0.8572937
Code
sd(convictions)
[1] 0.9259016

Variance ≈ 0.857; standard deviation ≈ 0.926. Both are sample estimates, since var() and sd() divide by n - 1.
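If the 810 observations are instead treated as the full probability distribution, the variance is Var(X) = Σ (x − μ)² · p(x), which divides by n rather than n − 1. The sketch below computes that population version and shows how it relates to the `var()` result; the variable names are illustrative.

```r
x      <- 0:4
counts <- c(128, 434, 160, 64, 24)
n      <- sum(counts)
p      <- counts / n

mu      <- sum(x * p)             # expected value
var_pop <- sum((x - mu)^2 * p)    # population variance (divides by n)
sd_pop  <- sqrt(var_pop)

# Rescaling by n / (n - 1) recovers the sample variance reported by var()
var_sample <- var_pop * n / (n - 1)

c(var_pop = var_pop, sd_pop = sd_pop, var_sample = var_sample)
```

With n = 810 the two versions differ only in the third decimal place (about 0.856 versus 0.857), so the choice matters little here.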