hw1
descriptive statistics
probability
DACSS 603 HW#1
Author

Alexis Gamez

Published

February 20, 2023

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE)

Question 1

a)

First, let’s read in the data from the Excel file:

Code
getwd()
[1] "C:/Users/Leshiii/Desktop/DACSS Master's/DACSS 603/603_Spring_2023/posts"
Code
df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code
hist(df$LungCap, main = "Lung Capacity Distribution", xlab = "Lung Capacity", ylab = "Probability Density", prob = TRUE)

The histogram suggests that lung capacity is approximately normally distributed. Most of the observations cluster around the mean, and very few fall near the extremes (close to 0 or 15).
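One way to back up the visual impression with numbers is the 68-95-99.7 rule: for a normal distribution, roughly 68%, 95%, and 99.7% of values fall within one, two, and three standard deviations of the mean. The sketch below is a rough check under stated assumptions: `share_within_sd` is a hypothetical helper, and the simulated `placeholder` vector only stands in for `df$LungCap` so the block runs on its own.

```r
# Hypothetical helper (not part of the original homework): share of values
# that lie within k standard deviations of the sample mean.
share_within_sd <- function(values, k) {
  m <- mean(values, na.rm = TRUE)
  s <- sd(values, na.rm = TRUE)
  mean(abs(values - m) <= k * s, na.rm = TRUE)
}

# Placeholder data so the sketch is self-contained; these are simulated
# values, NOT the lung capacity data. With the homework data, pass
# df$LungCap instead.
set.seed(1)
placeholder <- rnorm(1000, mean = 8, sd = 2.5)

# Shares within 1, 2, and 3 standard deviations of the mean.
shares <- sapply(1:3, function(k) share_within_sd(placeholder, k))
shares
```

If the shares for `df$LungCap` land near 0.68, 0.95, and 0.997, that supports the "close to normal" reading; large departures would suggest skew or heavy tails.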

b)

The box plot below compares the distribution of lung capacity for males and females.

Code
boxplot(df$LungCap ~ df$Gender, 
        ylab = "Gender",
        xlab = "Lung Capacity",
        horizontal = TRUE,
        col = "maroon")

c)

The result below is not what I would expect: smokers show a higher mean lung capacity (8.65) than non-smokers (7.77). I would argue that smokers are much more likely to have a smaller lung capacity than non-smokers, so I suspect that something is going on with our sample.

Code
boxplot(df$LungCap ~ df$Smoke, 
        ylab = "Smoking Preference",
        xlab = "Lung Capacity",
        horizontal = TRUE,
        col = "bisque")

Code
no_smoke <- subset(df, Smoke == "no")
yes_smoke <- subset(df, Smoke == "yes")
mean(no_smoke$LungCap)
[1] 7.770188
Code
mean(yes_smoke$LungCap)
[1] 8.645455

d)

Comparing the charts below, we can see that as smokers in the sample get older, both the mean and the range of lung capacity increase. This could be due to natural maturation, with lungs growing as children become adolescents and growth leveling off around 18 years old. However, I suspect there is more to it, particularly because our sample of smokers is so small compared to that of non-smokers.

Code
no_smoke_mean <- mean(no_smoke$LungCap)
`13_under` <- subset(yes_smoke, Age <= 13)

`14_to_15` <- subset(yes_smoke, subset = Age == 14 | Age == 15)

`16_to_17` <- subset(yes_smoke, subset = Age == 16 | Age == 17)

`18_over` <- subset(yes_smoke, Age >= 18)

hist(`13_under`$LungCap, main = "Smoker Lung Capacity Distribution by Age", xlab = "Lung Capacity", ylab = "Probability Density", prob = TRUE)
legend("topright", legend=c("13 & Under","14 to 15", "16 to 17", "18 & Over"), col=c("gray", rgb(0,0,1,0.5), 
     rgb(0,1,0,0.5), rgb(1,0,0,0.5)), pt.cex=2, pch=15)

Code
hist(`14_to_15`$LungCap, prob = TRUE, col = rgb(0,0,1,0.5), main = "Smoker Lung Capacity Distribution by Age", xlab = "Lung Capacity", ylab = "Probability Density")
legend("topright", legend=c("13 & Under","14 to 15", "16 to 17", "18 & Over"), col=c("gray", rgb(0,0,1,0.5), 
     rgb(0,1,0,0.5), rgb(1,0,0,0.5)), pt.cex=2, pch=15)

Code
hist(`16_to_17`$LungCap, prob = TRUE, col = rgb(0,1,0,0.5), main = "Smoker Lung Capacity Distribution by Age", xlab = "Lung Capacity", ylab = "Probability Density")
legend("topright", legend=c("13 & Under","14 to 15", "16 to 17", "18 & Over"), col=c("gray", rgb(0,0,1,0.5), 
     rgb(0,1,0,0.5), rgb(1,0,0,0.5)), pt.cex=2, pch=15)

Code
hist(`18_over`$LungCap, prob = TRUE, col = rgb(1,0,0,0.5), main = "Smoker Lung Capacity Distribution by Age", xlab = "Lung Capacity", ylab = "Probability Density")
legend("topright", legend=c("13 & Under","14 to 15", "16 to 17", "18 & Over"), col=c("gray", rgb(0,0,1,0.5), 
     rgb(0,1,0,0.5), rgb(1,0,0,0.5)), pt.cex=2, pch=15)

e)

Code
df_Agegroup <- df %>% 
  mutate(
    Age_group = dplyr::case_when(
      Age <= 13            ~ "0-13",
      Age > 13 & Age <= 15 ~ "14-15",
      Age > 15 & Age <= 17 ~ "16-17",
      Age >= 18            ~ "18+"))

ggplot(data = df_Agegroup, aes(x=Age_group, y=LungCap)) + 
  geom_boxplot(aes(fill=Smoke)) +
  labs(x="Age Group",y="Lung Capacity",title="Lung Capacity of Smokers vs Non Smokers by Age Group")

Looking at the box plot above, it seems that within the three older age groups, the range of lung capacity is larger for non-smokers than for smokers. Additionally, the typical values for non-smokers in those groups (the medians marked inside the boxes) are also higher! What seems to be skewing the overall comparison is the large number of participants aged 13 or younger. I believe this is why our data hasn't made much sense until now. There might be other factors in play here, like respondent bias, but I believe the uneven sample size is the main influence.
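The sample-size explanation can be checked directly by counting rows per age group and smoking status. The sketch below shows one way to do that in base R. The `toy` data frame contains made-up values purely so the block runs on its own; with the homework data, `df` (with its `Age` and `Smoke` columns) would take its place.

```r
# Toy data frame with MADE-UP values, only so the sketch is self-contained.
# With the homework data, use df instead of toy.
toy <- data.frame(
  Age   = c(8, 10, 12, 13, 14, 15, 16, 17, 18, 19),
  Smoke = c("no", "no", "no", "yes", "no", "yes", "no", "yes", "no", "yes")
)

# Same age bands as the case_when() call (for whole-number ages),
# built with base R's cut(); intervals are closed on the right.
toy$Age_group <- cut(
  toy$Age,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("0-13", "14-15", "16-17", "18+")
)

# Rows per age group and smoking status.
counts <- table(toy$Age_group, toy$Smoke)
counts

# Share of smokers and non-smokers within each age group (rows sum to 1).
prop.table(counts, margin = 1)
```

If the "0-13" row holds most of the non-smokers and very few smokers, the overall non-smoker mean is pulled toward the smaller lung capacities of young children, which would be consistent with the pattern described above.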

Question 2

Let X = Number of prior convictions

Sample = 810 prisoners

a)

Reading in our data.

Code
Convictions <- seq(0,4)
Freq <- c(128, 434, 160, 64, 24)/810

The probability that a randomly selected inmate has exactly 2 prior convictions is:

Code
dbinom(x=1, size = 1, prob = 160/810)
[1] 0.1975309
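For a single trial, `dbinom(x = 1, size = 1, prob = p)` simply returns `p`, so the same answer can be read straight off the relative frequencies. A minimal sketch of that lookup follows; the names `convictions`, `counts`, `pmf`, and `p_exactly_2` are illustrative and not taken from the original code.

```r
# Counts from the question: prisoners with 0, 1, 2, 3, and 4 prior convictions.
convictions <- 0:4
counts <- c(128, 434, 160, 64, 24)

# Relative frequencies; sum(counts) is the sample of 810 prisoners.
pmf <- counts / sum(counts)

# P(X = 2): pick the entry whose conviction count equals 2.
p_exactly_2 <- pmf[convictions == 2]
p_exactly_2
```

This should agree with the `dbinom()` result, since both reduce to 160/810.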

b)

The probability that a randomly selected inmate has fewer than 2 prior convictions is:

Code
dbinom(x = 1, size = 1, prob = (128+434)/810)
[1] 0.6938272

c)

The probability that a randomly selected inmate has 2 or fewer prior convictions is:

Code
dbinom(x = 1, size = 1, prob = (128+434+160)/810)
[1] 0.891358

d)

The probability that a randomly selected inmate has more than 2 prior convictions is:

Code
dbinom(x = 1, size = 1, prob = (64+24)/810)
[1] 0.108642
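Parts b) through d) are all statements about cumulative probability, so they can also be obtained from a running sum of the relative frequencies. The sketch below is one way to do this; the variable names are illustrative only.

```r
# Relative frequencies for 0, 1, 2, 3, and 4 prior convictions (from the question).
convictions <- 0:4
pmf <- c(128, 434, 160, 64, 24) / 810

# Cumulative distribution: cdf[i] is P(X <= convictions[i]).
cdf <- cumsum(pmf)

p_fewer_than_2 <- cdf[convictions == 1]   # P(X < 2) is the same as P(X <= 1)
p_2_or_fewer   <- cdf[convictions == 2]   # P(X <= 2)
p_more_than_2  <- 1 - p_2_or_fewer        # complement of P(X <= 2)

c(p_fewer_than_2, p_2_or_fewer, p_more_than_2)
```

Using the complement for part d) also gives a quick consistency check: the answers to c) and d) should add up to 1.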

e)

The expected value for the number of prior convictions is:

Code
Expected_v <- sum(Freq*Convictions)
Expected_v
[1] 1.28642

f)

The variance for prior convictions is:

Code
Variance <- sum((Convictions-Expected_v)^2*Freq)
Variance
[1] 0.8562353

The standard deviation for prior convictions is:

Code
SD <- sqrt(Variance)
SD
[1] 0.9253298
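As a cross-check, the variance of a discrete distribution can also be computed with the shortcut Var(X) = E[X^2] - (E[X])^2. The sketch below compares that route with the definition used above; the variable names are illustrative and not part of the original code.

```r
# Relative frequencies for 0, 1, 2, 3, and 4 prior convictions (from the question).
convictions <- 0:4
pmf <- c(128, 434, 160, 64, 24) / 810

expected_x  <- sum(convictions * pmf)     # E[X]
expected_x2 <- sum(convictions^2 * pmf)   # E[X^2]

# Definition: probability-weighted squared deviation from the mean.
var_definition <- sum((convictions - expected_x)^2 * pmf)

# Shortcut: E[X^2] - (E[X])^2.
var_shortcut <- expected_x2 - expected_x^2

all.equal(var_definition, var_shortcut)   # TRUE if the two routes agree
sqrt(var_shortcut)                        # standard deviation
```

Both routes weight by probabilities, so they describe the distribution itself. Calling `var()` on raw observations would instead divide by n - 1 and give a slightly different number.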

Plotting the distribution gives a visual check on our calculations: the probabilities computed above coincide with their respective points on the plot.

Code
Conv_data <- tibble(
  x= Convictions,
  y= Freq)

ggplot(Conv_data, aes(x,y))+
  geom_line()+
  geom_vline(xintercept = 2, col="red",linewidth=1)+  # linewidth replaces the deprecated size argument for lines
  annotate("text",x=2.15,y=.245,label="c = 2")+  # annotate() draws the label once instead of once per data row
  labs(x="# of Prior Convictions",y="Probability",title="Probability Distribution of Prisoner # of Prior Convictions")