Homework 1

hw1

homework1

abigailbalint

lungcap

prisoner

ggplot2

Author

Abigail Balint

Published

February 23, 2023

Code

library(tidyverse)
library(ggplot2)
library(dplyr)
library(readxl)
knitr::opts_chunk$set(echo = TRUE)

Question 1 - Lung Capacity

Reading in LungCapData –

Code

lung <- read_excel("_data/LungCapData.xls")
head(lung,2)

# A tibble: 2 × 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
1    6.48     6   62.1 no    male   no       
2   10.1     18   74.7 yes   female no

Looking at some basic descriptive stats –

Code

glimpse(lung)

Rows: 725
Columns: 6
$ LungCap   <dbl> 6.475, 10.125, 9.550, 11.125, 4.800, 6.225, 4.950, 7.325, 8.…
$ Age       <dbl> 6, 18, 16, 14, 5, 11, 8, 11, 15, 11, 19, 17, 12, 10, 10, 13,…
$ Height    <dbl> 62.1, 74.7, 69.7, 71.0, 56.9, 58.7, 63.3, 70.4, 70.5, 59.2, …
$ Smoke     <chr> "no", "yes", "no", "no", "no", "no", "no", "no", "no", "no",…
$ Gender    <chr> "male", "female", "female", "male", "male", "female", "male"…
$ Caesarean <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…

Code

mean(lung$LungCap, na.rm = TRUE)

[1] 7.863148

Code

var(lung$LungCap, na.rm = TRUE)

[1] 7.086288

Code

sd(lung$LungCap, na.rm = TRUE)

[1] 2.662008

Code

range(lung$LungCap, na.rm = TRUE)

[1]  0.507 14.675

What does the distribution of LungCap look like?

Code

ggplot(lung, aes(x = LungCap)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution looks approximately normal. The sample is clearly concentrated around 7-8, close to the mean of 7.86, and the outliers make up only a very small portion of the sample.
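
One rough way to back up this reading of the histogram is to check how far the extremes sit from the mean in standard-deviation units. The sketch below uses only the mean, standard deviation, and range printed earlier. It is a rule-of-thumb check rather than a formal normality test, and the variable names are illustrative.

```r
# Summary values copied from the output above
lungcap_mean  <- 7.863148
lungcap_sd    <- 2.662008
lungcap_range <- c(0.507, 14.675)

# Distance of the minimum and maximum from the mean, in standard deviations
z_extremes <- (lungcap_range - lungcap_mean) / lungcap_sd
print(round(z_extremes, 2))

# For roughly normal data, nearly all observations fall within 3 SDs of the mean
within_3sd <- all(abs(z_extremes) < 3)
print(within_3sd)
```

Both extremes fall within three standard deviations of the mean, which is consistent with the outliers being mild.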

Compare the probability distribution of the LungCap with respect to Males and Females

Code

ggplot(lung, aes(x = LungCap)) +
  geom_boxplot(fill = "slateblue", alpha = 0.2) +
  xlab("Lung Capacity") +
  facet_wrap("Gender")

The distributions are fairly similar for males and females, but males skew toward a higher lung capacity overall: the male median is around 8, whereas the female median is closer to 7.5.

Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code

lung %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap), n = n())

# A tibble: 2 × 3
  Smoke  mean     n
  <chr> <dbl> <int>
1 no     7.77   648
2 yes    8.65    77

We would expect non-smokers to have the higher lung capacity, but the mean for smokers (8.65) is actually a little higher than the mean for non-smokers (7.77).

Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Recoding the age groups –

Code

lunggroup <- lung %>%
  mutate(AgeGroup = dplyr::case_when(
    Age >= 0 & Age < 14 ~ "0-13",
    Age >= 14 & Age < 16 ~ "14-15",
    Age >= 16 & Age < 18 ~ "16-17",
    Age >= 18 ~ "18+"
  ))
head(lunggroup)

# A tibble: 6 × 7
  LungCap   Age Height Smoke Gender Caesarean AgeGroup
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>     <chr>   
1    6.48     6   62.1 no    male   no        0-13    
2   10.1     18   74.7 yes   female no        18+     
3    9.55    16   69.7 no    female yes       16-17   
4   11.1     14   71   no    male   no        14-15   
5    4.8      5   56.9 no    male   no        0-13    
6    6.22    11   58.7 no    female no        0-13
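
The same binning can also be sketched with base R's `cut()`, shown here as an alternative and not as the method used above. The example uses only the six ages printed in the output, so it runs without the data file. Setting `right = FALSE` makes each interval closed on the left, which matches the `>=` and `<` conditions in the recode.

```r
# Ages from the six rows printed above
age <- c(6, 18, 16, 14, 5, 11)

# right = FALSE gives the intervals [0,14), [14,16), [16,18), [18,Inf)
age_group <- cut(
  age,
  breaks = c(0, 14, 16, 18, Inf),
  labels = c("0-13", "14-15", "16-17", "18+"),
  right  = FALSE
)
print(data.frame(age, age_group))
```

Unlike `case_when()`, which returns a character column here, `cut()` returns a factor whose levels keep the age groups in order, which can help when the groups are used in plots.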

Mean lung capacity by age group –

Code

lunggroup %>%
  group_by(Smoke, AgeGroup) %>%
  summarise(mean = mean(LungCap), n = n())

`summarise()` has grouped output by 'Smoke'. You can override using the
`.groups` argument.

# A tibble: 8 × 4
# Groups:   Smoke [2]
  Smoke AgeGroup  mean     n
  <chr> <chr>    <dbl> <int>
1 no    0-13      6.36   401
2 no    14-15     9.14   105
3 no    16-17    10.5     77
4 no    18+      11.1     65
5 yes   0-13      7.20    27
6 yes   14-15     8.39    15
7 yes   16-17     9.38    20
8 yes   18+      10.5     15

For both smokers and non-smokers, mean lung capacity goes up as age increases, with the 18+ group having the highest average. In every age range besides 0-13 (the broadest range), the mean is higher for non-smokers than for smokers.
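
To make the comparison explicit, the gap between the two groups can be computed per age group. This is a small sketch that uses the rounded means copied from the table above, so the differences are approximate; the object names are illustrative.

```r
# Group means copied from the summary table above
age_group      <- c("0-13", "14-15", "16-17", "18+")
mean_nonsmoker <- c(6.36, 9.14, 10.5, 11.1)
mean_smoker    <- c(7.20, 8.39, 9.38, 10.5)

# Positive values mean non-smokers have the higher mean in that age group
gap <- data.frame(age_group, diff = mean_nonsmoker - mean_smoker)
print(gap)
```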

Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

Code

ggplot(lunggroup, aes(x = LungCap, fill = Smoke)) +
  geom_boxplot() +
  xlab("Lung Capacity") +
  facet_wrap("AgeGroup")

I’m seeing that the results by age group are different from part C. Within every age group besides 0-13, the average is higher for non-smokers, yet the overall average is higher for smokers. I can see in my results in part D that the 0-13 non-smokers are by far the largest group (401 of the 648 non-smokers, compared with 27 of the 77 smokers), and this youngest group also has the lowest lung capacity. The overall non-smoker mean is therefore pulled down by young children, while smokers are concentrated in the older age groups, where lung capacity is higher. Within each age group, the median lines are actually pretty close.
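
A quick way to probe what is going on is to look at how smokers and non-smokers are spread across the age groups, and to rebuild the overall means as count-weighted averages of the age-group means. The sketch below uses only the rounded means and counts from the summary table, so it reproduces the earlier overall means (7.77 and 8.65) only approximately. The object names are illustrative.

```r
# Means and counts copied from the age-group summary table
age_group <- c("0-13", "14-15", "16-17", "18+")
mean_no   <- c(6.36, 9.14, 10.5, 11.1)
n_no      <- c(401, 105, 77, 65)
mean_yes  <- c(7.20, 8.39, 9.38, 10.5)
n_yes     <- c(27, 15, 20, 15)

# Share of each smoking group that falls in each age group
share <- data.frame(
  age_group,
  nonsmokers = round(n_no / sum(n_no), 3),
  smokers    = round(n_yes / sum(n_yes), 3)
)
print(share)

# Overall means are count-weighted averages of the age-group means
pooled_no  <- weighted.mean(mean_no, n_no)
pooled_yes <- weighted.mean(mean_yes, n_yes)
print(round(c(nonsmokers = pooled_no, smokers = pooled_yes), 2))
```

Roughly 62% of non-smokers but only about 35% of smokers are in the 0-13 group, so the two overall means weight the age groups very differently.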

Question 2

Creating a data frame –

Code

priorconviction <- c(0,1,2,3,4)
prisoners <- c(128,434,160,64,24)
q2 <- data.frame(priorconviction, prisoners)
head(q2)

  priorconviction prisoners
1               0       128
2               1       434
3               2       160
4               3        64
5               4        24

What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code

160/810

[1] 0.1975309

I found it to be 0.1975, or 19.75%: the 160 inmates with exactly 2 prior convictions divided by the total of 810.

What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code

(434+128)/810

[1] 0.6938272

To get this I added the counts of inmates with 0 or 1 prior convictions and divided by the total of 810, which comes out to 0.69, or 69%.

What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code

(128+434+160)/810

[1] 0.891358

To get this I added the counts of inmates with 0, 1, or 2 prior convictions and divided by the total of 810, which comes out to 0.89, or 89%.

What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code

(64+24)/810

[1] 0.108642

To get this I added the counts of inmates with 3 or 4 prior convictions and divided by the total of 810, which comes out to 0.109, or about 11%.
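
The four probabilities above can also be computed from the frequency table itself instead of typing the counts by hand. This is a minimal sketch in base R; the helper `prob()` is a hypothetical name introduced only for this example.

```r
# Frequency table from the question
priorconviction <- c(0, 1, 2, 3, 4)
prisoners       <- c(128, 434, 160, 64, 24)
total           <- sum(prisoners)

# Hypothetical helper: probability that the number of priors meets a condition
prob <- function(keep) sum(prisoners[keep]) / total

p_exactly_2 <- prob(priorconviction == 2)
p_fewer_2   <- prob(priorconviction < 2)
p_at_most_2 <- prob(priorconviction <= 2)
p_more_2    <- prob(priorconviction > 2)

print(round(c(p_exactly_2, p_fewer_2, p_at_most_2, p_more_2), 4))

# "2 or fewer" and "more than 2" are complements, so they should sum to 1
print(isTRUE(all.equal(p_at_most_2 + p_more_2, 1)))
```

The complement check is a cheap way to catch a count that was left out or added twice.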

What is the expected value for the number of prior convictions?

Code

sum(q2$priorconviction * q2$prisoners)

[1] 1042

Code

1042/810

[1] 1.28642

To get this I multiplied each number of prior convictions by the number of prisoners with that many convictions and summed the results (1042), then divided this by the total sample (810) to get a final expected value of about 1.29 prior convictions.
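
The same expected value can be written as a sum of each value times its probability, or obtained with base R's `weighted.mean()`. The sketch below shows both on the frequency table from the question; the object names are illustrative.

```r
# Frequency table from the question
priorconviction <- c(0, 1, 2, 3, 4)
prisoners       <- c(128, 434, 160, 64, 24)

# Expected value as the sum of x * P(x)
p <- prisoners / sum(prisoners)
expected_value <- sum(priorconviction * p)

# Equivalent: mean of the values weighted by the prisoner counts
expected_value_wm <- weighted.mean(priorconviction, prisoners)

print(expected_value)
print(isTRUE(all.equal(expected_value, expected_value_wm)))
```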

Calculate the variance and the standard deviation for the Prior Convictions.

Code

var(q2$priorconviction)

[1] 2.5

Code

var(q2$priorconviction)*(5-1)/5

[1] 2

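I used the above code to find a sample variance of 2.5 and a population variance of 2. Note that these figures treat the five values 0 through 4 as five equally weighted observations. They do not take into account how many prisoners have each number of prior convictions, so they are not the variance of prior convictions across the 810 inmates.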

Code

sd(q2$priorconviction)

[1] 1.581139

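I used the standard deviation function to calculate the above. As with the variance, this is the standard deviation of the five values 0 through 4 and is not weighted by the prisoner counts.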
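
For comparison, a variance that accounts for the prisoner counts can be computed by weighting each squared deviation by its probability. The following is a sketch of that calculation and not the code used above; all object names are illustrative. The final check expands the table into one value per inmate and confirms that `var()` on the expanded data agrees with the sample version.

```r
# Frequency table from the question
priorconviction <- c(0, 1, 2, 3, 4)
prisoners       <- c(128, 434, 160, 64, 24)
n               <- sum(prisoners)

p  <- prisoners / n
mu <- sum(priorconviction * p)

# Variance of the distribution: E[(X - mu)^2]
var_pop <- sum((priorconviction - mu)^2 * p)
sd_pop  <- sqrt(var_pop)

# Sample version, treating the 810 inmates as a sample (divides by n - 1)
var_samp <- var_pop * n / (n - 1)

print(round(c(var_pop = var_pop, var_samp = var_samp, sd_pop = sd_pop), 4))

# Cross-check: expand the table to one value per inmate and use var()
x <- rep(priorconviction, times = prisoners)
print(isTRUE(all.equal(var(x), var_samp)))
```

The weighted variance is about 0.86 and the standard deviation about 0.93, noticeably smaller than the unweighted 2 and 1.58, because most inmates have 0, 1, or 2 prior convictions.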
