hw1
descriptive statistics
probability
Template of course blog qmd file
Author

Caitlin Rowley

Published

February 26, 2023

Question 1a: What does the distribution of LungCap look like?

First, let’s read in the data from the Excel file:

Code
library(readxl)
Warning: package 'readxl' was built under R version 4.2.2
Code
LungCapData <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code
hist(LungCapData$LungCap)

The histogram suggests that the distribution is approximately normal: most of the observations fall close to the mean, and very few lie near the extremes of the range (0 and 15).

Question 1b: Compare the probability distribution of LungCap for males and females. (Hint: make boxplots separated by gender using the boxplot() function)

Code
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(magrittr)
library(ggplot2)
library(markdown)
library(ggtext)
Warning: package 'ggtext' was built under R version 4.2.2
Code
# generate boxplot:

# aes(x=Gender) already splits the data by gender, so no group_by() is needed
LungCap_Gender <- LungCapData%>%
  ggplot(aes(x=Gender, y=LungCap)) +
  # raw observations, drawn faintly underneath the boxes
  geom_point(alpha=.08, size=5) +
  geom_boxplot() +
  labs(x="Gender", y="LungCap", title="LungCap by Gender") +
  theme_light() +
  theme(axis.text.x=element_markdown(hjust=1))
LungCap_Gender

Code
# would next like to identify specific values for boxplots...

LungCap values are higher overall for males than for females. For males, the Q1 value is approximately 6.5, the median is approximately 8, and the Q3 value is approximately 10. For females, the Q1 value is approximately 5.5, the median is approximately 7.5, and the Q3 value is approximately 9.
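The quartiles read off the boxplot can also be computed directly. The following is a minimal sketch, assuming the `Gender` and `LungCap` columns of `LungCapData`; the helper name `quartiles_by_group` is hypothetical, and the small `toy` data frame is made up so the snippet runs on its own (it does not reproduce the real values).

```r
# Made-up stand-in for LungCapData (illustration only, not the real data)
toy <- data.frame(
  Gender  = c("female", "female", "female", "female", "male", "male", "male", "male"),
  LungCap = c(5, 6, 7, 8, 7, 8, 9, 10)
)

# Q1, median and Q3 of LungCap within each gender
# (hypothetical helper; returns a matrix with one column per gender)
quartiles_by_group <- function(df) {
  sapply(split(df$LungCap, df$Gender), quantile, probs = c(0.25, 0.5, 0.75))
}

q <- quartiles_by_group(toy)
q
```

With the real data loaded, `quartiles_by_group(LungCapData)` would give the exact values behind the approximate ones quoted from the plot.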

Question 1c: Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code
Smoker_Mean <- LungCapData %>%
  filter(Smoke=="yes")%>%
  select(Smoke, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
Smoker_Mean
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          8.65
Code
NonSmoker_Mean <- LungCapData %>%
  filter(Smoke=="no")%>%
  select(Smoke, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
NonSmoker_Mean
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          7.77

The mean LungCap for smokers is 8.645, while the mean LungCap for non-smokers is 7.77. Taken at face value this is counterintuitive, since smoking would be expected to lower lung capacity rather than raise it.
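Both means can also be produced in a single grouped summary, which additionally shows how many observations sit in each group. This is a sketch only: the `toy` tibble is made up for illustration, and the column names `n` and `mean_LungCap` are my own choices.

```r
library(dplyr)

# Made-up stand-in for LungCapData (illustration only, not the real data)
toy <- tibble(
  Smoke   = c("no", "no", "no", "yes", "yes"),
  LungCap = c(6, 7, 8, 9, 10)
)

# One row per smoking status: group size and mean lung capacity
smoke_summary <- toy %>%
  group_by(Smoke) %>%
  summarize(n = n(), mean_LungCap = mean(LungCap, na.rm = TRUE))

smoke_summary
```

Seeing the group sizes next to the means helps when judging how much weight a difference between the two groups can carry.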

Question 1d: Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code
Smoker_Age_1 <- LungCapData %>%
  filter(Smoke=="yes")%>%
  filter(Age<=13)%>%
  select(Smoke, Age, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
Smoker_Age_1
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          7.20
Code
Smoker_Age_2 <- LungCapData %>%
  filter(Smoke=="yes")%>%
  filter(Age==14:15)%>%
  select(Smoke, Age, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
Warning in Age == 14:15: longer object length is not a multiple of shorter
object length
Code
Smoker_Age_2
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          8.91
Code
Smoker_Age_3 <- LungCapData %>%
  filter(Smoke=="yes")%>%
  filter(Age==16:17)%>%
  select(Smoke, Age, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
Warning in Age == 16:17: longer object length is not a multiple of shorter
object length
Code
Smoker_Age_3
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          9.60
Code
Smoker_Age_4 <- LungCapData %>%
  filter(Smoke=="yes")%>%
  filter(Age>=18)%>%
  select(Smoke, Age, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
Smoker_Age_4
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          10.5

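The mean LungCap value for smokers aged 13 or younger is 7.202, compared with 8.909 for ages 14-15, 9.602 for ages 16-17, and 10.513 for ages 18 and older, so lung capacity rises with age among smokers. The two middle values should be treated with caution: `Age==14:15` and `Age==16:17` compare the Age column against a recycled two-element vector (hence the warnings above), so each row is tested against only one of the two ages and some matching rows are dropped. `Age %in% 14:15` would select every row in the bracket.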
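The "longer object length is not a multiple of shorter object length" warnings printed above come from R's vector recycling. The toy example below (made-up ages, not the course data) sketches the difference between `==` and `%in%` when the right-hand side has more than one value.

```r
age <- c(14, 15, 15, 14, 16, 15)   # made-up ages, illustration only

# `==` against a length-2 vector recycles 14, 15, 14, 15, ... along `age`,
# so each element is compared with only one of the two ages
by_equals <- age == 14:15

# `%in%` asks whether each element belongs to the set {14, 15}
by_in <- age %in% 14:15

sum(by_equals)   # rows kept by the recycled comparison
sum(by_in)       # rows kept by the membership test
```

Here the recycled comparison keeps 3 of the 5 rows aged 14 or 15, while `%in%` keeps all 5, which is why a filter such as `filter(Age %in% 14:15)` is the safer way to select an age bracket.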

Question 1e: Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

Code
NonSmoker_Age_1 <- LungCapData %>%
  filter(Smoke=="no")%>%
  filter(Age<=13)%>%
  select(Smoke, Age, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
NonSmoker_Age_1
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          6.36
Code
NonSmoker_Age_2 <- LungCapData %>%
  filter(Smoke=="no")%>%
  filter(Age==14:15)%>%
  select(Smoke, Age, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
NonSmoker_Age_2
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          8.84
Code
NonSmoker_Age_3 <- LungCapData %>%
  filter(Smoke=="no")%>%
  filter(Age==16:17)%>%
  select(Smoke, Age, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
NonSmoker_Age_3
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          10.4
Code
NonSmoker_Age_4 <- LungCapData %>%
  filter(Smoke=="no")%>%
  filter(Age>=18)%>%
  select(Smoke, Age, LungCap)%>%
  summarize(mean(LungCap, na.rm = TRUE))
NonSmoker_Age_4
# A tibble: 1 × 1
  `mean(LungCap, na.rm = TRUE)`
                          <dbl>
1                          11.1

The corresponding mean LungCap values for non-smokers are 6.359, 8.844, 10.394, and 11.069. Non-smokers have the higher mean in the 16-17 and 18-and-older brackets, while smokers have the higher mean in the 13-and-under bracket (7.202 vs. 6.359) and, marginally, in the 14-15 bracket (8.909 vs. 8.844), a gap too small to read much into. Within age groups, then, the picture differs from part 1c, where smokers had the higher mean overall. A likely explanation is that age is a confounder: lung capacity increases with age in both groups, and if smokers are concentrated in the older brackets, their overall mean is pulled up even where non-smokers of the same age do better. Other factors, such as health issues, development, or exposure, could also contribute, particularly in the youngest bracket.
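The eight separate filters can be replaced by binning Age once and summarizing by bracket and smoking status. The sketch below assumes Age is recorded in whole years; the `toy` data frame is made up for illustration, and `AgeGroup` is a hypothetical column name.

```r
# Made-up stand-in for LungCapData (illustration only, not the real data)
toy <- data.frame(
  Age     = c(10, 12, 14, 15, 16, 17, 18, 19),
  Smoke   = c("no", "no", "no", "yes", "no", "yes", "yes", "yes"),
  LungCap = c(5, 6, 8, 8.5, 10, 9.5, 11, 10.5)
)

# Bin Age into the four brackets from question 1d.
# cut() uses intervals that are open on the left and closed on the right,
# so with whole-year ages the bins are <=13, 14-15, 16-17 and >=18.
toy$AgeGroup <- cut(
  toy$Age,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("<=13", "14-15", "16-17", ">=18")
)

# Mean lung capacity for each combination of age bracket and smoking status
means <- aggregate(LungCap ~ AgeGroup + Smoke, data = toy, FUN = mean)
means

# Mean age by smoking status: a large gap would point to age as a confounder
age_by_smoke <- tapply(toy$Age, toy$Smoke, mean)
age_by_smoke
```

Comparing mean age across the two smoking groups is a quick way to check whether the overall difference in part 1c could be driven by age rather than by smoking.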

Question 2a: What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code
inmate <- c(128, 434, 160, 64, 24)
mean(inmate)
[1] 162
Code
sd(inmate)
[1] 161.0838
Code
# mean=162
# sd=161.1

convictions <- c(0:4)

data <- tibble(inmate, convictions)

two <- 160/summarise(data, sum(inmate))
  two
  sum(inmate)
1   0.1975309

Here, we can see that the probability of an inmate having exactly two prior convictions is 19.8%.

Question 2b: What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code
fewer_than_two <- 128 + 434
fewer_than_two/summarise(data, sum(inmate))
  sum(inmate)
1   0.6938272

Here, we can see that the probability of a randomly selected inmate having fewer than two prior convictions is 69.4%.

Question 2c: What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code
two_or_fewer <- 128 + 434 + 160
two_or_fewer/summarise(data, sum(inmate))
  sum(inmate)
1    0.891358

Here, we can see that the probability of a randomly selected inmate having two or fewer prior convictions is 89.1%.

Question 2d: What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code
# "more than 2" means 3 or 4 prior convictions; the 160 inmates with exactly 2 are excluded
more_than_two <- 64 + 24
more_than_two/summarise(data, sum(inmate))
  sum(inmate)
1    0.108642

Here, we can see that the probability of a randomly selected inmate having more than two prior convictions is 10.9%.
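All four probabilities in parts 2a to 2d follow from the probability mass function and its cumulative sum. The sketch below uses the counts given in the question; the variable names (`pmf`, `cdf`, `p_more_2`, and so on) are my own.

```r
convictions <- 0:4
inmates     <- c(128, 434, 160, 64, 24)   # counts from the question

# Probability mass function: share of inmates at each number of convictions
pmf <- inmates / sum(inmates)
names(pmf) <- convictions

# Cumulative distribution: P(X <= x)
cdf <- cumsum(pmf)

p_exactly_2  <- pmf["2"]       # P(X = 2)
p_fewer_2    <- cdf["1"]       # P(X < 2) = P(X <= 1)
p_2_or_fewer <- cdf["2"]       # P(X <= 2)
p_more_2     <- 1 - cdf["2"]   # P(X > 2), the complement of P(X <= 2)

round(c(p_exactly_2, p_fewer_2, p_2_or_fewer, p_more_2), 4)
```

A useful sanity check is that P(X <= 2) and P(X > 2) are complements, so the two must add up to 1.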

Question 2e: What is the expected value for the number of prior convictions?

Code
data_prob <- transform(data, probability = (inmate/810))
data_ev <- transform(data_prob, x = convictions*probability)

EV <- sum(data_ev$x)
EV
[1] 1.28642

The expected value for the number of prior convictions is approximately 1.29.
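Written out by hand, the same calculation weights each number of prior convictions by its share of the 810 inmates. The notation $X$ for the number of prior convictions is my own:

$$
E[X] = \sum_{x=0}^{4} x \, P(X = x)
     = \frac{0 \cdot 128 + 1 \cdot 434 + 2 \cdot 160 + 3 \cdot 64 + 4 \cdot 24}{810}
     = \frac{1042}{810} \approx 1.286
$$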

Question 2f: Calculate the variance and the standard deviation for the Prior Convictions.

Code
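inmate <- c(128, 434, 160, 64, 24)
# Variance of the number of prior convictions, weighting each value by its probability
prob <- inmate/sum(inmate)
EV <- sum(convictions*prob)
variance <- sum((convictions - EV)^2 * prob)
variance
[1] 0.8562353
Code
sqrt(variance)
[1] 0.9253298

The variance of the number of prior convictions is approximately 0.856, and the standard deviation is approximately 0.925.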
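For a distribution given as a frequency table, the variance of the number of prior convictions weights each squared value by its probability and then subtracts the squared mean. As a worked check, writing $X$ for the number of prior convictions (my notation) and using $E[X] = 1042/810$ from part 2e:

$$
E[X^2] = \frac{0^2 \cdot 128 + 1^2 \cdot 434 + 2^2 \cdot 160 + 3^2 \cdot 64 + 4^2 \cdot 24}{810}
       = \frac{2034}{810} \approx 2.5111
$$

$$
\mathrm{Var}(X) = E[X^2] - \left(E[X]\right)^2
               = \frac{2034}{810} - \left(\frac{1042}{810}\right)^2
               \approx 2.5111 - 1.6549 = 0.8562,
\qquad
\mathrm{SD}(X) = \sqrt{0.8562} \approx 0.925
$$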