Homework - 1

hw1

descriptive statistics

probability

Author

Tyler Tewksbury

Published

February 28, 2023

Question 1

a

First, let’s read in the data from the Excel file:

Code

library(readxl)

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.0
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code

hist(df$LungCap)

The histogram suggests that LungCap is approximately normally distributed. Most observations cluster around the mean, and very few fall near the extremes of the range (close to 0 or 15).

b

Code

boxplot(LungCap ~ Gender, data = df)

The distribution of lung capacity is very similar for males and females. The minimum, maximum, and median are all slightly higher for males.

c

Code

mean(subset(df$LungCap, df$Smoke == "no"))

[1] 7.770188

Code

mean(subset(df$LungCap, df$Smoke == "yes"))

[1] 8.645455

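Mean lung capacity is higher for smokers (8.65) than for non-smokers (7.77) in this dataset. This is counterintuitive and suggests that another variable may be influencing the comparison.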

d

Code

df <- df %>% 
  mutate(
    age_group = dplyr::case_when(
      Age <= 13            ~ "<=13",
      Age == 14 | Age == 15 ~ "14-15",
      Age == 16 | Age == 17 ~ "16-17",
      Age >= 18             ~ ">=18"
    )
)

df2 <- df %>%
  group_by(age_group, Smoke) %>%
  summarise_at(vars(LungCap),  list(AvgLungCap = mean))
 
df2

# A tibble: 8 × 3
# Groups:   age_group [4]
  age_group Smoke AvgLungCap
  <chr>     <chr>      <dbl>
1 14-15     no          9.14
2 14-15     yes         8.39
3 16-17     no         10.5 
4 16-17     yes         9.38
5 <=13      no          6.36
6 <=13      yes         7.20
7 >=18      no         11.1 
8 >=18      yes        10.5

For both smokers and non-smokers, mean lung capacity increases with each successive age group.

e

Within age groups, smokers have a lower mean lung capacity than non-smokers in every group except <=13. This differs from part c, where smokers overall had the higher mean. The number of observations in each group points to an explanation.

Code

df %>% group_by(Smoke, age_group) %>% summarise(count = n())

`summarise()` has grouped output by 'Smoke'. You can override using the
`.groups` argument.

# A tibble: 8 × 3
# Groups:   Smoke [2]
  Smoke age_group count
  <chr> <chr>     <int>
1 no    14-15       105
2 no    16-17        77
3 no    <=13        401
4 no    >=18         65
5 yes   14-15        15
6 yes   16-17        20
7 yes   <=13         27
8 yes   >=18         15

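Most of the observations (428 of 725) are aged 13 or under, and nearly all of them (401) are non-smokers. Because lung capacity increases with age, this large group of young non-smokers pulls the overall non-smoker mean down, while smokers are spread more evenly across the older age groups. This difference in age composition is the likely reason smokers appeared to have a higher lung capacity in part c.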
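The effect of age composition can be checked arithmetically, since each overall mean in part c is the count-weighted average of the group means in part d. The sketch below re-enters the rounded values from the two printed tables, so it matches part c only up to rounding. The vector names are illustrative and not part of the original analysis.

```r
# Group means (part d) and counts (part e), in the order <=13, 14-15, 16-17, >=18.
# Values are copied from the printed tables above, so they are rounded.
mean_no   <- c(6.36, 9.14, 10.5, 11.1)
mean_yes  <- c(7.20, 8.39, 9.38, 10.5)
count_no  <- c(401, 105, 77, 65)
count_yes <- c(27, 15, 20, 15)

# Overall mean = count-weighted average of the group means
overall_no  <- weighted.mean(mean_no, count_no)
overall_yes <- weighted.mean(mean_yes, count_yes)

# Share of each smoking group that is aged 13 or under
share_young_no  <- count_no[1] / sum(count_no)
share_young_yes <- count_yes[1] / sum(count_yes)

round(c(overall_no = overall_no, overall_yes = overall_yes), 2)
round(c(share_young_no = share_young_no, share_young_yes = share_young_yes), 2)
```

The weighted means come out at roughly 7.78 and 8.64, in line with part c. About 62% of non-smokers are aged 13 or under, compared with about 35% of smokers.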

Question 2

a

Code

prob = (160/810)
prob

[1] 0.1975309

About 19.75%

b

Code

prob2 = ((434+128)/810)
prob2

[1] 0.6938272

About 69.38%

c

Code

prob3 = ((434+128+160)/810)
prob3

[1] 0.891358

About 89.14%

d

Code

prob4 = ((64+24)/810)
prob4

[1] 0.108642

About 10.86%
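The four probabilities in parts a through d can also be read off one frequency table instead of retyping the counts each time. This is a minimal sketch. It assumes the counts map to 0 through 4 prior convictions as in the `convictions` vector of part e, and the event descriptions in the comments are inferred from the arithmetic above rather than quoted from the assignment. `freq` and `p` are illustrative names.

```r
# Number of prisoners with 0, 1, 2, 3, and 4 prior convictions
freq <- c("0" = 128, "1" = 434, "2" = 160, "3" = 64, "4" = 24)
p <- freq / sum(freq)  # relative frequencies; sum(freq) is 810

p_exactly_2    <- unname(p["2"])            # part a: 160 / 810
p_fewer_than_2 <- sum(p[c("0", "1")])       # part b: (128 + 434) / 810
p_at_most_2    <- sum(p[c("0", "1", "2")])  # part c: (128 + 434 + 160) / 810
p_more_than_2  <- sum(p[c("3", "4")])       # part d: (64 + 24) / 810

c(a = p_exactly_2, b = p_fewer_than_2, c = p_at_most_2, d = p_more_than_2)
```

Parts c and d describe complementary events, so their probabilities sum to 1, which is a quick consistency check on the counts.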

e

Code

convictions <- c(rep(0, 128), rep(1, 434), rep(2, 160), rep(3, 64), rep(4, 24))
mean(convictions)

[1] 1.28642
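The same value follows from the definition of the expected value of a discrete random variable, $E[X] = \sum_x x \, p(x)$, without expanding the data into 810 individual observations. The sketch below uses the counts from the code above. `x`, `freq`, and `p` are illustrative names.

```r
x    <- 0:4                          # possible numbers of prior convictions
freq <- c(128, 434, 160, 64, 24)     # prisoners observed at each value of x
p    <- freq / sum(freq)             # probability of each value

expected <- sum(x * p)               # E[X] = sum of x * p(x)
expected
```

This gives 1042 / 810, or about 1.286, matching `mean(convictions)`.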

f

Code

var(convictions)

[1] 0.8572937

Code

sd(convictions)

[1] 0.9259016

Variance: 0.857. Standard deviation: 0.926.
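Note that `var()` and `sd()` in R use the sample formulas, which divide by n - 1. If the 810 prisoners are instead treated as the entire population (an assumption, since the assignment text is not shown here), the variance divides by n. The sketch below rescales the sample variance accordingly. `pop_var` and `pop_sd` are illustrative names.

```r
convictions <- c(rep(0, 128), rep(1, 434), rep(2, 160), rep(3, 64), rep(4, 24))
n <- length(convictions)

sample_var <- var(convictions)          # divides by n - 1
pop_var    <- sample_var * (n - 1) / n  # rescaled to divide by n
pop_sd     <- sqrt(pop_var)

round(c(sample_var = sample_var, pop_var = pop_var, pop_sd = pop_sd), 4)
```

With n = 810 the two versions differ only in the third decimal place (about 0.857 versus 0.856 for the variance), so the choice has little practical effect here.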