# Question 1

## a

First, let’s read in the data from the Excel file:
```r
library(readxl)
library(tidyverse)
df <- read_excel("_data/LungCapData.xls")
```
The distribution of LungCap looks as follows:
```r
hist(df$LungCap)
```
The histogram suggests that LungCap is approximately normally distributed: most observations lie close to the mean, and very few fall near the extremes (0 and 15).
## b

```r
boxplot(LungCap ~ Gender, df)
```
The distributions of lung capacity for males and females are very similar. The minimum, maximum, and median are all slightly higher for males.
## c

```r
mean(subset(df$LungCap, df$Smoke == "no"))
```

```
[1] 7.770188
```

```r
mean(subset(df$LungCap, df$Smoke == "yes"))
```

```
[1] 8.645455
```
Mean lung capacity is higher for smokers (8.65) than for nonsmokers (7.77) in this dataset, which is counterintuitive.
## d

```r
df <- df %>%
  mutate(age_group = dplyr::case_when(
    Age <= 13 ~ "<=13",
    Age == 14 | Age == 15 ~ "14-15",
    Age == 16 | Age == 17 ~ "16-17",
    Age >= 18 ~ ">=18"
  ))

df2 <- df %>%
  group_by(age_group, Smoke) %>%
  summarise_at(vars(LungCap), list(AvgLungCap = mean))

df2
```

```
# A tibble: 8 × 3
# Groups:   age_group [4]
  age_group Smoke AvgLungCap
  <chr>     <chr>      <dbl>
1 14-15     no          9.14
2 14-15     yes         8.39
3 16-17     no         10.5
4 16-17     yes         9.38
5 <=13      no          6.36
6 <=13      yes         7.20
7 >=18      no         11.1
8 >=18      yes        10.5
```
Mean lung capacity increases with each successive age group, for both smokers and nonsmokers.
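As an alternative to the `case_when()` conditions above, base R's `cut()` can bin ages into the same four groups and keep them as an ordered factor, so summary tables list the groups in age order rather than alphabetically. The sketch below uses a made-up `ages` vector (not the homework data) and assumes the same group boundaries.

```r
# Hypothetical ages for illustration only; not taken from LungCapData.xls
ages <- c(3, 13, 14, 15, 16, 17, 18, 19)

# Right-closed intervals: (-Inf, 13], (13, 15], (15, 17], (17, Inf]
age_group <- cut(
  ages,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("<=13", "14-15", "16-17", ">=18"),
  ordered_result = TRUE
)

# Levels follow age order, so the table is sorted by age
table(age_group)
```

Unlike the equality checks `Age == 14 | Age == 15`, the intervals also cover non-integer ages: a value such as 13.5 would be placed in the 14-15 group instead of becoming `NA`.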
## e

Within age groups, smokers have a lower mean lung capacity than nonsmokers in every group except <=13. This contrasts with part c, where smokers had the higher mean overall. The number of observations in each group suggests an explanation.
```
`summarise()` has grouped output by 'Smoke'. You can override using the
`.groups` argument.
# A tibble: 8 × 3
# Groups:   Smoke [2]
  Smoke age_group count
  <chr> <chr>     <int>
1 no    14-15       105
2 no    16-17        77
3 no    <=13        401
4 no    >=18         65
5 yes   14-15        15
6 yes   16-17        20
7 yes   <=13         27
8 yes   >=18         15
```
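Most of the dataset is aged 13 or under (428 of 725 observations), and the great majority of that group (401) are nonsmokers. Because lung capacity rises with age, this large, young group pulls the nonsmokers' overall mean down, while smokers are concentrated in the older age groups (50 of 77 are 14 or older). Nonsmokers also outnumber smokers among those 18 and over (65 versus 15). The comparison in part c is therefore likely confounded by age.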
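The reversal between parts c and d can be checked arithmetically: an overall mean is the count-weighted average of the group means, so the age mix of each smoking group matters. The sketch below retypes the rounded group means from part d and the counts from part e (the vector names are introduced here for illustration), so its results only approximate the exact means from part c.

```r
# Group order: <=13, 14-15, 16-17, >=18
# Rounded means from part d and counts from part e, typed in by hand
mean_no  <- c(6.36, 9.14, 10.5, 11.1)
n_no     <- c(401, 105, 77, 65)
mean_yes <- c(7.20, 8.39, 9.38, 10.5)
n_yes    <- c(27, 15, 20, 15)

# Overall mean = count-weighted average of the group means
overall_no  <- weighted.mean(mean_no, n_no)
overall_yes <- weighted.mean(mean_yes, n_yes)

# Share of each smoking group that is 13 or younger
share_young_no  <- n_no[1] / sum(n_no)
share_young_yes <- n_yes[1] / sum(n_yes)

round(c(overall_no, overall_yes), 2)
round(c(share_young_no, share_young_yes), 2)
```

The weighted means come out at roughly 7.78 and 8.64, close to the values in part c. About 62% of nonsmokers are 13 or younger, compared with about 35% of smokers, which is why the nonsmokers' overall mean is lower even though their mean is higher in three of the four age groups.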
---
title: "Homework - 1"
author: "Tyler Tewksbury"
description: "Homework 1"
date: "02/28/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw1
  - descriptive statistics
  - probability
---

# Question 1

## a

First, let's read in the data from the Excel file:

```{r, echo=T}
library(readxl)
library(tidyverse)
df <- read_excel("_data/LungCapData.xls")
```

The distribution of LungCap looks as follows:

```{r, echo=T}
hist(df$LungCap)
```

The histogram suggests that LungCap is approximately normally distributed: most observations lie close to the mean, and very few fall near the extremes (0 and 15).

## b

```{r}
boxplot(LungCap ~ Gender, df)
```

The distributions of lung capacity for males and females are very similar. The minimum, maximum, and median are all slightly higher for males.

## c

```{r}
mean(subset(df$LungCap, df$Smoke == "no"))
mean(subset(df$LungCap, df$Smoke == "yes"))
```

Mean lung capacity is higher for smokers than for nonsmokers in this dataset, which is counterintuitive.

## d

```{r}
df <- df %>%
  mutate(age_group = dplyr::case_when(
    Age <= 13 ~ "<=13",
    Age == 14 | Age == 15 ~ "14-15",
    Age == 16 | Age == 17 ~ "16-17",
    Age >= 18 ~ ">=18"
  ))

df2 <- df %>%
  group_by(age_group, Smoke) %>%
  summarise_at(vars(LungCap), list(AvgLungCap = mean))

df2
```

Mean lung capacity increases with each successive age group, for both smokers and nonsmokers.

## e

Within age groups, smokers have a lower mean lung capacity than nonsmokers in every group except <=13. This contrasts with part c, where smokers had the higher mean overall. The number of observations in each group suggests an explanation.

```{r}
df %>%
  group_by(Smoke, age_group) %>%
  summarise(count = n())
```

Most of the dataset is aged 13 or under, and the great majority of that group are nonsmokers. Because lung capacity rises with age, this large, young group pulls the nonsmokers' overall mean down, while smokers are concentrated in the older age groups. Nonsmokers also outnumber smokers among those 18 and over. The comparison in part c is therefore likely confounded by age.
# Question 2

## a

```{r}
prob = (160/810)
prob
```

About 19.75%

## b

```{r}
prob2 = ((434+128)/810)
prob2
```

About 69.38%

## c

```{r}
prob3 = ((434+128+160)/810)
prob3
```

About 89.14%

## d

```{r}
prob4 = ((64+24)/810)
prob4
```

About 10.86%

## e

```{r}
convictions <- c(rep(0, 128), rep(1, 434), rep(2, 160), rep(3, 64), rep(4, 24))
mean(convictions)
```

## f

```{r}
var(convictions)
sd(convictions)
```

Variance: 0.857

Standard deviation: 0.926
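
The mean and variance in parts e and f can also be computed directly from the frequency table, without expanding it into a vector with `rep()`. This is a sketch using the same counts as the code above; the names `x`, `freq`, and `n` are introduced here for illustration, and the variance uses the sample (n - 1) denominator so that it matches `var()`.

```r
x    <- 0:4                        # number of prior convictions
freq <- c(128, 434, 160, 64, 24)   # counts used in the code above
n    <- sum(freq)                  # 810 in total

# Mean: frequency-weighted average of the values
mean_x <- sum(x * freq) / n

# Sample variance, matching var(): squared deviations divided by n - 1
var_x <- sum(freq * (x - mean_x)^2) / (n - 1)
sd_x  <- sqrt(var_x)

round(c(mean = mean_x, variance = var_x, sd = sd_x), 3)
```

Each probability in parts a to d is likewise a sum of entries of `freq` divided by `n`, for example `sum(freq[x < 2]) / n` for part b.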