Homework 1

Author

Steph Roberts

Published

October 3, 2022

Code

library(tidyverse)
library(dplyr)
library(readxl)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE)

##1. Use the LungCapData to answer the following questions. (Hint: Using dplyr, especially group_by() and summarize(), can help you answer the following questions relatively efficiently.)

Code

df<- read_excel("_data/LungCapData.xls")
head(df)

# A tibble: 6 × 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
1    6.48     6   62.1 no    male   no       
2   10.1     18   74.7 yes   female no       
3    9.55    16   69.7 no    female yes      
4   11.1     14   71   no    male   no       
5    4.8      5   56.9 no    male   no       
6    6.22    11   58.7 no    female no

#Summarize

Code

summary(df)

    LungCap            Age            Height         Smoke          
 Min.   : 0.507   Min.   : 3.00   Min.   :45.30   Length:725        
 1st Qu.: 6.150   1st Qu.: 9.00   1st Qu.:59.90   Class :character  
 Median : 8.000   Median :13.00   Median :65.40   Mode  :character  
 Mean   : 7.863   Mean   :12.33   Mean   :64.84                     
 3rd Qu.: 9.800   3rd Qu.:15.00   3rd Qu.:70.30                     
 Max.   :14.675   Max.   :19.00   Max.   :81.80                     
    Gender           Caesarean        
 Length:725         Length:725        
 Class :character   Class :character  
 Mode  :character   Mode  :character

Code

mean(df$LungCap)

[1] 7.863148

Code

median(df$LungCap)

[1] 8

Code

var(df$LungCap)

[1] 7.086288

Code

sd(df$LungCap)

[1] 2.662008

Code

min(df$LungCap)

[1] 0.507

Code

max(df$LungCap)

[1] 14.675

#a. What does the distribution of LungCap look like? (Hint: Plot a histogram with probability density on the y axis)

Code

#Put probability density, not counts, on the y axis as the hint asks
ggplot(df, aes(x=LungCap, y=after_stat(density))) + 
  geom_histogram(binwidth=0.5,col='black',fill='gray')

The histogram is close to a normal distribution. In fact, if we change binwidth slightly, it appears even closer to normal.

Code

#Same histogram on the density scale with wider bins
ggplot(df, aes(x=LungCap, y=after_stat(density))) + 
  geom_histogram(binwidth=1,col='black',fill='gray')

This illustrates how much the choice of binwidth can shape our interpretation of a histogram.
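To make the effect of binwidth concrete, here is a minimal sketch that bins simulated values at the same two widths. The data are an assumption (normally distributed stand-in values, not the actual LungCapData), and the variable names are only illustrative.

```r
# Simulated stand-in for LungCap; these values are made up, not the homework data.
set.seed(1)
lungcap_sim <- rnorm(725, mean = 7.86, sd = 2.66)

# Shared integer limits so both sets of breaks cover every value.
lo <- floor(min(lungcap_sim))
hi <- ceiling(max(lungcap_sim))

# Bin the data at two widths without drawing a plot.
bins_narrow <- hist(lungcap_sim, breaks = seq(lo, hi, by = 0.5), plot = FALSE)
bins_wide   <- hist(lungcap_sim, breaks = seq(lo, hi, by = 1),   plot = FALSE)

# Halving the bin width doubles the number of bars.
length(bins_narrow$counts)
length(bins_wide$counts)

# On the density scale, bar height times bar width sums to 1 for either width.
sum(bins_narrow$density * 0.5)
sum(bins_wide$density * 1)
```

Narrower bins spread the same 725 points over more bars, so each bar is noisier. Wider bins smooth that noise but can hide real features of the distribution.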

#b. Compare the probability distribution of the LungCap with respect to Males and Females? (Hint: make boxplots separated by gender using the boxplot() function)

Code

ggplot(df, aes(x = LungCap, y = Gender)) +        
  geom_boxplot()

Male lung capacity is centered higher and spans a wider range than female lung capacity.
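The hint mentions base R's boxplot() function. As a rough sketch on made-up values (not the LungCapData), calling it with plot = FALSE returns the numbers behind each box, which makes the comparison between groups explicit.

```r
# Made-up values with the same column names as LungCapData.
toy <- data.frame(
  LungCap = c(4, 5, 6, 7, 8, 5, 7, 9, 11, 13),
  Gender  = rep(c("female", "male"), each = 5)
)

# plot = FALSE returns the statistics behind each box instead of drawing it.
box <- boxplot(LungCap ~ Gender, data = toy, plot = FALSE)

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker.
five_num <- box$stats
colnames(five_num) <- box$names
five_num
```

In this toy example the "male" column has both a higher median (row 3) and a wider spread between the whiskers, which is the pattern described above.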

#c. Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code

df %>%
  filter(Smoke == 'yes') %>%
  pull(LungCap) %>%
  mean()

[1] 8.645455

Code

df %>%
  filter(Smoke == 'no') %>%
  pull(LungCap) %>%
  mean()

[1] 7.770188

It does not make sense at face value. In this sample, smokers have a higher mean lung capacity than non-smokers. Let’s check how big each subsample is.

Code

length(which(df$Smoke == 'yes'))

[1] 77

Code

length(which(df$Smoke == 'no'))

[1] 648

As suspected, there are far more non-smokers, more than 8 times as many (648 vs. 77). With so few smokers, their mean is less stable, and a larger sample of smokers might give a different picture. It may also be that our sample is made up of young people whose lungs have not yet been affected much by smoking.

Code

df %>%
  filter(Smoke == 'yes') %>%
  pull(Age) %>%
  median()

[1] 15

Again, as suspected, the smokers in our sample are young (median age 15). Therefore, it is not too surprising that smoking has not yet visibly reduced their lung capacity.
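The separate filter() calls above can be condensed into one grouped summary, as the hint about group_by() and summarize() suggests. This is a sketch on a tiny made-up tibble that only borrows the column names from LungCapData; `toy` and `smoke_summary` are illustrative names.

```r
library(dplyr)

# Tiny made-up stand-in with the same column names as LungCapData (not the real values).
toy <- tibble(
  LungCap = c(6.5, 10.1, 9.6, 11.1, 4.8, 6.2),
  Age     = c(6, 18, 16, 14, 5, 11),
  Smoke   = c("no", "yes", "no", "yes", "no", "no")
)

# One row per smoking status: group size, mean lung capacity, and median age.
smoke_summary <- toy %>%
  group_by(Smoke) %>%
  summarize(
    n            = n(),
    mean_lungcap = mean(LungCap),
    median_age   = median(Age)
  )
smoke_summary
```

Putting the group size and age next to the mean makes it easier to spot an unbalanced or age-skewed comparison at a glance.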

#d. Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code

#Create age groups
df <- df %>% 
  mutate(agegroup = case_when(
    Age <= 13  ~ "less than or equal to 13",
    Age >= 14 & Age <= 15 ~ "14 to 15",
    Age >= 16 & Age <= 17 ~ "16 to 17",
    Age >= 18 ~ "greater than or equal to 18"))

table(df$agegroup)


                   14 to 15                    16 to 17 
                        120                          97 
greater than or equal to 18    less than or equal to 13 
                         80                         428
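The table above lists the groups alphabetically, so the youngest group comes last. One possible alternative, sketched here on made-up ages and assuming ages are whole numbers, is base R's cut(), which returns a factor whose levels stay in age order.

```r
# Made-up ages covering every group boundary (not the homework data).
age <- c(3, 13, 14, 15, 16, 17, 18, 19)

agegroup <- cut(
  age,
  breaks = c(-Inf, 13, 15, 17, Inf),  # right-closed: (-Inf,13], (13,15], (15,17], (17,Inf)
  labels = c("less than or equal to 13", "14 to 15", "16 to 17",
             "greater than or equal to 18")
)

# Levels follow the order of `labels`, so tables and facets sort by age.
table(agegroup)
```

With fractional ages the two approaches differ: cut() would place 13.5 in "14 to 15", whereas the case_when() version would leave it as NA.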

Code

df %>%
  filter(Smoke == 'yes') %>%
  ggplot(aes(x=LungCap)) + 
  geom_histogram(binwidth=1,col='black',fill='gray')+
  facet_wrap(~agegroup)

These histograms, which show smokers only, suggest that participants 13 or younger have smaller lung capacity. Lung capacity seems to generally increase with age as children grow.

#e. Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

Code

ggplot(df, aes(x = LungCap, 
           fill = agegroup)) +
  geom_density(alpha = 0.4)+
  facet_wrap(~Smoke)

This visualization helps explain the unexpected result for smokers vs. non-smokers. As we have deduced, lung capacity generally increases with age during the growing years. However, teenagers approaching adulthood are also more likely to have access to cigarettes or to be influenced to smoke. It is likely that smokers are concentrated among the older participants, who have larger lungs simply because of their age, so the overall comparison in part c mixes the effect of smoking with the effect of age.
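A small numeric sketch can show how this kind of reversal arises. The numbers below are hypothetical and chosen only for illustration (they are not taken from LungCapData): smokers have lower lung capacity within each age group, yet a higher mean overall because most of them are in the older group.

```r
# Hypothetical data: smokers are 0.5 lower within each age group,
# but most smokers are older and most non-smokers are younger.
toy <- data.frame(
  agegroup = c(rep("13 or under", 5), rep("18 or over", 4)),
  Smoke    = c("no", "no", "no", "no", "yes", "no", "yes", "yes", "yes"),
  LungCap  = c(6, 6, 6, 6, 5.5, 11, 10.5, 10.5, 10.5)
)

# Overall means, ignoring age: smokers look better off.
overall <- tapply(toy$LungCap, toy$Smoke, mean)
overall

# Means within each age group: smokers are lower in both rows.
by_age <- tapply(toy$LungCap, list(toy$agegroup, toy$Smoke), mean)
by_age
```

Whether the real data show the same within-group pattern has to be checked with a grouped comparison on df itself; the sketch only shows that an overall difference and the within-group differences can point in opposite directions.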

#f. Calculate the correlation and covariance between Lung Capacity and Age. (use the cov() and cor() functions in R). Interpret your results.

Code

cov(df$LungCap, df$Age) #calculate covariance

[1] 8.738289

Code

cor(df$LungCap, df$Age) #calculate correlation

[1] 0.8196749

A positive covariance (8.74) indicates that lung capacity and age tend to increase together. The correlation (0.82) is fairly close to 1, which indicates a strong positive linear relationship between the two variables.
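The two statistics are linked: correlation is covariance divided by the product of the two standard deviations, which is why it is unit-free. The sketch below checks this on made-up vectors (not the homework data).

```r
# Made-up ages and lung capacities, only to illustrate the identity.
age     <- c(6, 18, 16, 14, 5, 11)
lungcap <- c(6.5, 10.1, 9.6, 11.1, 4.8, 6.2)

# Correlation rebuilt from covariance and the two standard deviations.
cov_xy       <- cov(lungcap, age)
cor_from_cov <- cov_xy / (sd(lungcap) * sd(age))
all.equal(cor_from_cov, cor(lungcap, age))

# Rescaling a variable (years to months) changes the covariance but not the correlation.
cov(lungcap, age * 12)
cor(lungcap, age * 12)
```

This is why the size of a covariance is hard to interpret on its own, while a correlation can be compared across variables measured in different units.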

##2. Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.

Code

#create the sample
x<-rep(c(0,1,2,3,4),times=c(128, 434, 160, 64, 24))
sample(x, 10)

 [1] 2 1 1 0 1 1 2 1 0 1

Code

#Verify n of sample
sum(128, 434, 160, 64, 24)

[1] 810

Code

#Calculate the mean
mean(x)

[1] 1.28642

Code

#Verify the mean
sample_mean <- (((128*0)+(434*1)+(160*2)+(64*3)+(24*4))/810)
print(sample_mean)

[1] 1.28642

Code

#Calculate the sd
sd(x)

[1] 0.9259016

#a. What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code

#X is discrete (0 to 4), so each probability is the share of the 810 inmates with that count
#probability of exactly 2 convictions
equal.to <- mean(x == 2)
print(equal.to)

[1] 0.1975309

The probability of exactly 2 convictions is 0.20 (160 of 810 inmates).

#b. What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code

#probability of <2 convictions
less.than <- mean(x < 2)
print(less.than)

[1] 0.6938272

The probability of <2 convictions is 0.69.

#c. What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code

#probability of <=2 convictions
at.most <- mean(x <= 2)
print(at.most)

[1] 0.891358

The probability of less than or equal to 2 convictions is 0.89.

#d. What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code

#probability of >2 convictions
greater.than <- 1 - at.most
print(greater.than)

[1] 0.108642

The probability of greater than 2 convictions is 0.11.

Code

#Verify all probabilities add to 1
less.than + equal.to + greater.than

[1] 1

#e. What is the expected value for the number of prior convictions?

Code

# The expected value of a discrete distribution is μ = Σ x * P(x), where x = data value and P(x) = probability of that value.

#Calculate the probability of each value as its share of the 810 inmates
p0 <- mean(x == 0)
p0

[1] 0.1580247

Code

p1 <- mean(x == 1)
p1

[1] 0.5358025

Code

p2 <- mean(x == 2)
p2

[1] 0.1975309

Code

p3 <- mean(x == 3)
p3

[1] 0.07901235

Code

p4 <- mean(x == 4)
p4

[1] 0.02962963

Code

#Calculate expected value
ev <- sum((0*p0), (1*p1), (2*p2), (3*p3), (4*p4))
ev

[1] 1.28642

Code

#The expected value equals the mean of x
mean(x)

[1] 1.28642

The expected value is 1.29.

#f. Calculate the variance and the standard deviation for the Prior Convictions.

Code

#Calculate variance
var(x)

[1] 0.8572937

Code

#Calculate the sd
sd(x)

[1] 0.9259016
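Note that var() and sd() use the sample formulas, which divide by N - 1. If the 810 inmates are treated as the whole population, which is an assumption about how the question is meant, the variance can instead be computed directly from the probability distribution. The sketch below uses the counts from the problem; the variable names are illustrative.

```r
# Counts come from the problem statement; the names below are illustrative.
values <- 0:4
counts <- c(128, 434, 160, 64, 24)
p      <- counts / sum(counts)        # P(X = x) for each value

mu      <- sum(values * p)            # expected value
var_pop <- sum((values - mu)^2 * p)   # population variance, divides by N
sd_pop  <- sqrt(var_pop)
var_pop
sd_pop

# var() divides by N - 1 instead, so it is larger by a factor of N / (N - 1).
x <- rep(values, times = counts)
var_pop * length(x) / (length(x) - 1)  # agrees with var(x)
```

With N = 810 the two versions differ only in the third decimal place, so the interpretation is the same either way.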