hw1
homework1
abigailbalint
lungcap
prisoner
ggplot2
Author

Abigail Balint

Published

February 23, 2023

Code
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readxl)
knitr::opts_chunk$set(echo = TRUE)

Question 1 - Lung Capacity

Reading in LungCapData –

Code
lung <- read_excel("_data/LungCapData.xls")
head(lung,2)
# A tibble: 2 × 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
1    6.48     6   62.1 no    male   no       
2   10.1     18   74.7 yes   female no       

Looking at some basic descriptive stats –

Code
glimpse(lung)
Rows: 725
Columns: 6
$ LungCap   <dbl> 6.475, 10.125, 9.550, 11.125, 4.800, 6.225, 4.950, 7.325, 8.…
$ Age       <dbl> 6, 18, 16, 14, 5, 11, 8, 11, 15, 11, 19, 17, 12, 10, 10, 13,…
$ Height    <dbl> 62.1, 74.7, 69.7, 71.0, 56.9, 58.7, 63.3, 70.4, 70.5, 59.2, …
$ Smoke     <chr> "no", "yes", "no", "no", "no", "no", "no", "no", "no", "no",…
$ Gender    <chr> "male", "female", "female", "male", "male", "female", "male"…
$ Caesarean <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
Code
mean(lung$LungCap, na.rm = TRUE)
[1] 7.863148
Code
var(lung$LungCap, na.rm = TRUE)
[1] 7.086288
Code
sd(lung$LungCap, na.rm = TRUE)
[1] 2.662008
Code
range(lung$LungCap, na.rm = TRUE)
[1]  0.507 14.675
  a. What does the distribution of LungCap look like?
Code
ggplot(lung, aes(x = LungCap)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution looks relatively normal. There is a clear concentration of the sample around 7-8 and the outliers are only a very small portion of the sample.

  b. Compare the probability distribution of the LungCap with respect to Males and Females
Code
ggplot(lung, aes(x = LungCap)) +
  geom_boxplot(fill = "slateblue", alpha = 0.2) +
  xlab("Lung Capacity") +
  facet_wrap("Gender")

The distributions are fairly similar for males and females, but males skew toward a higher lung capacity overall: the male median is around 8, whereas the female median is closer to 7.5.

  c. Compare the mean lung capacities for smokers and non-smokers. Does it make sense?
Code
lung %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap), n = n())
# A tibble: 2 × 3
  Smoke  mean     n
  <chr> <dbl> <int>
1 no     7.77   648
2 yes    8.65    77

We would expect the mean lung capacity to be higher for non-smokers, but the mean for smokers (8.65) is actually a little higher than for non-smokers (7.77). Note that the smoker group is much smaller (77 versus 648).

  d. Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Recoding the age groups –

Code
lunggroup <- lung %>%
  mutate(`AgeGroup` = dplyr::case_when(
    `Age` >= 0 & `Age` < 14 ~ "0-13",
    `Age` >= 14 & `Age` < 16 ~ "14-15",
    `Age` >= 16 & `Age` < 18 ~ "16-17",
    `Age` >= 18 ~ "18+" ))
head(lunggroup)
# A tibble: 6 × 7
  LungCap   Age Height Smoke Gender Caesarean AgeGroup
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>     <chr>   
1    6.48     6   62.1 no    male   no        0-13    
2   10.1     18   74.7 yes   female no        18+     
3    9.55    16   69.7 no    female yes       16-17   
4   11.1     14   71   no    male   no        14-15   
5    4.8      5   56.9 no    male   no        0-13    
6    6.22    11   58.7 no    female no        0-13    

Mean lung capacity by age group –

Code
lunggroup %>%
  group_by(Smoke, AgeGroup) %>%
  summarise(mean = mean(LungCap), n = n())
`summarise()` has grouped output by 'Smoke'. You can override using the
`.groups` argument.
# A tibble: 8 × 4
# Groups:   Smoke [2]
  Smoke AgeGroup  mean     n
  <chr> <chr>    <dbl> <int>
1 no    0-13      6.36   401
2 no    14-15     9.14   105
3 no    16-17    10.5     77
4 no    18+      11.1     65
5 yes   0-13      7.20    27
6 yes   14-15     8.39    15
7 yes   16-17     9.38    20
8 yes   18+      10.5     15

For both smokers and non-smokers, mean lung capacity rises with age, with the 18+ group having the highest average. In every age range except 0-13 (the broadest range), the mean is higher for non-smokers than for smokers.

  e. Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?
Code
ggplot(lunggroup, aes(x = LungCap, fill = Smoke)) +
  geom_boxplot() +
  xlab("Lung Capacity") +
  facet_wrap("AgeGroup")

I’m seeing that the results by age group are different from part c. Within every age group except 0-13, the mean is higher for non-smokers, yet in part c the overall mean was higher for smokers. The counts in part d explain this: 401 of the 648 non-smokers (about 62%) are in the 0-13 group, which has the lowest lung capacity, while only 27 of the 77 smokers (about 35%) are that young. Smokers in this sample are older on average, and lung capacity rises with age, so age is confounding the overall comparison. In the boxplots, the median lines for smokers and non-smokers are actually pretty close within each age group.
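To make the role of the age mix concrete, here is a minimal sketch in base R that rebuilds the overall means from the group means and counts in the part d table. The vector names (`mean_no`, `n_no`, `share_no`, and so on) are introduced here for illustration only, and because the table means are rounded, the results only approximately match the 7.77 and 8.65 from part c.

```r
# Group means and counts copied from the Smoke x AgeGroup table in part d
age_group <- c("0-13", "14-15", "16-17", "18+")
mean_no   <- c(6.36, 9.14, 10.5, 11.1)
n_no      <- c(401, 105, 77, 65)
mean_yes  <- c(7.20, 8.39, 9.38, 10.5)
n_yes     <- c(27, 15, 20, 15)

# Overall mean = group means weighted by group size
overall_no  <- weighted.mean(mean_no, w = n_no)
overall_yes <- weighted.mean(mean_yes, w = n_yes)

# Share of each smoking group that falls in each age group
share_no  <- setNames(n_no / sum(n_no), age_group)
share_yes <- setNames(n_yes / sum(n_yes), age_group)

round(c(no = overall_no, yes = overall_yes), 2)
round(rbind(no = share_no, yes = share_yes), 2)
```

The shares show that non-smokers are concentrated in the youngest group while smokers are spread across the older groups, which is enough to pull the smokers' overall mean above the non-smokers' even though smokers are lower within most age groups.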

Question 2

Creating a data frame –

Code
priorconviction <- c(0,1,2,3,4)
prisoners <- c(128,434,160,64,24)
q2 <- data.frame(priorconviction, prisoners)
head(q2)
  priorconviction prisoners
1               0       128
2               1       434
3               2       160
4               3        64
5               4        24
  a. What is the probability that a randomly selected inmate has exactly 2 prior convictions?
Code
160/810
[1] 0.1975309

I divided the 160 inmates with exactly 2 prior convictions by the total of 810 and found it to be 0.1975, or 19.75%.

  b. What is the probability that a randomly selected inmate has fewer than 2 prior convictions?
Code
(434+128)/810
[1] 0.6938272

To get this I added the counts of inmates with 0 or 1 prior convictions and divided by the total, which comes out to 0.69, or 69%.

  c. What is the probability that a randomly selected inmate has 2 or fewer prior convictions?
Code
(128+434+160)/810
[1] 0.891358

To get this I added the counts of inmates with 0, 1, or 2 prior convictions and divided by the total, which comes out to 0.89, or 89%.

  d. What is the probability that a randomly selected inmate has more than 2 prior convictions?
Code
(64+24)/810
[1] 0.108642

To get this I added the counts of inmates with 3 or 4 prior convictions and divided by the total, which comes out to 0.109, or about 11%.
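The four probabilities above can also be derived from the counts rather than typed in by hand. This is a minimal sketch in base R that rebuilds the two vectors from the data frame above; the names `p` and `cdf` are introduced here for illustration only.

```r
# Counts from the Question 2 table
priorconviction <- c(0, 1, 2, 3, 4)
prisoners <- c(128, 434, 160, 64, 24)

# Relative frequency of each number of prior convictions
p <- prisoners / sum(prisoners)
names(p) <- priorconviction

# Running total of the probabilities, i.e. P(X <= x)
cdf <- cumsum(p)

p_exactly_2 <- p[["2"]]        # P(X = 2)
p_fewer_2   <- cdf[["1"]]      # P(X < 2), same as P(X <= 1)
p_at_most_2 <- cdf[["2"]]      # P(X <= 2)
p_more_2    <- 1 - cdf[["2"]]  # P(X > 2), complement of P(X <= 2)

round(c(p_exactly_2, p_fewer_2, p_at_most_2, p_more_2), 4)
```

Working from the complement for "more than 2" also gives a quick consistency check, since the answers to parts c and d should sum to 1.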

  e. What is the expected value for the number of prior convictions?
Code
sum(q2$priorconviction * q2$prisoners)
[1] 1042
Code
1042/810
[1] 1.28642

To get this I multiplied each number of prior convictions by the number of prisoners with that many convictions and summed the products (1042), then divided by the total sample (810) to get a final expected value of about 1.29 prior convictions.
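The same expected value can be written as a sum of value times probability. The sketch below uses base R only; `p`, `ev`, and `ev_check` are names introduced here for illustration, and `weighted.mean()` is used as an independent cross-check.

```r
# Counts from the Question 2 table
priorconviction <- c(0, 1, 2, 3, 4)
prisoners <- c(128, 434, 160, 64, 24)

# Expected value = sum of (value * probability of that value)
p  <- prisoners / sum(prisoners)
ev <- sum(priorconviction * p)

# weighted.mean() with the counts as weights gives the same quantity
ev_check <- weighted.mean(priorconviction, w = prisoners)
all.equal(ev, ev_check)
```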

  f. Calculate the variance and the standard deviation for the Prior Convictions.
Code
var(q2$priorconviction)
[1] 2.5
Code
var(q2$priorconviction)*(5-1)/5
[1] 2

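I used the above code to find a sample variance of 2.5 and a population variance of 2. Note that var() is applied here to the five values 0 to 4 only, so each value counts once and the prisoner counts are ignored. These figures therefore describe the values 0 to 4 themselves, not the spread of prior convictions across the 810 inmates, which would require weighting each value by its count.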

Code
sd(q2$priorconviction)
[1] 1.581139

I used sd() to get a standard deviation of 1.58. As with var(), this is computed on the five values 0 to 4 without weighting by the prisoner counts.
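For comparison, here is a minimal sketch of a frequency-weighted variance and standard deviation, which treats each of the 810 inmates as one observation. It is not the calculation used above; the names `mu`, `var_pop`, `var_samp`, and `expanded` are introduced here for illustration.

```r
# Counts from the Question 2 table
priorconviction <- c(0, 1, 2, 3, 4)
prisoners <- c(128, 434, 160, 64, 24)

n  <- sum(prisoners)                        # total number of inmates
mu <- sum(priorconviction * prisoners) / n  # expected value from part e

# Population variance: squared deviations from the mean, weighted by count
var_pop <- sum(prisoners * (priorconviction - mu)^2) / n
# Sample variance: same weighted sum of squares, divided by n - 1
var_samp <- var_pop * n / (n - 1)

sd_pop  <- sqrt(var_pop)
sd_samp <- sqrt(var_samp)

# Cross-check: expand the table to one value per inmate and use var()
expanded <- rep(priorconviction, times = prisoners)
all.equal(var(expanded), var_samp)
```

Under this weighting the population variance is roughly 0.86 and the standard deviation roughly 0.93, noticeably smaller than the unweighted 2 and 1.58, because most inmates sit at 1 prior conviction.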
