hw1
descriptive statistics
probability
Template of course blog qmd file
Author

Meredith Derian-Toth

Published

February 5, 2023

Question 1

(1a) What does the distribution of LungCap look like?

First, let’s read in the data from the Excel file:

Code
library("quarto")
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(readxl)
# dplyr and ggplot2 are already attached by library("tidyverse") above
df <- read_excel("_data/LungCapData.xls")
#View(df)

The distribution of LungCap looks as follows:

Code
hist(df$LungCap)

The histogram suggests that LungCap is approximately normally distributed. Most observations fall close to the mean, and very few lie near the extremes (around 0 and 15).

(1b) The distribution of LungCap by gender is as follows:

Code
boxplot(df$LungCap ~ df$Gender)

(1c) The mean lung capacities for smokers and non-smokers can be found in the table below:

Code
df %>%
  group_by(Smoke) %>%
  summarise_at(vars(LungCap), list(name = mean))
# A tibble: 2 × 2
  Smoke  name
  <chr> <dbl>
1 no     7.77
2 yes    8.65

These means are not what I would expect. It looks like those who smoke (“yes”) have a higher mean lung capacity (8.65) than those who do not smoke (7.77).

(1d) The relationship between Smoking and Lung Capacity within age groups

Code
# Age groups defined by:
# "less than or equal to 13",
# "14 to 15",
# "16 to 17",
# "greater than or equal to 18".

# Create variable
df <- df %>% 
  mutate(age_group = case_when(
      Age <= 13 ~ "0-13",
      Age > 13 & Age < 16 ~ "14-15",
      Age > 15 & Age < 18 ~ "16-17",
      Age >= 18 ~ ">= 18"),
    # Convert to factor, with levels in age order
    age_group = factor(
      age_group,
      levels = c("0-13", "14-15","16-17", ">= 18")))

#View(df)

df %>%
  group_by(age_group,Smoke) %>%
  summarise_at(vars(LungCap), list(name = mean))
# A tibble: 8 × 3
# Groups:   age_group [4]
  age_group Smoke  name
  <fct>     <chr> <dbl>
1 0-13      no     6.36
2 0-13      yes    7.20
3 14-15     no     9.14
4 14-15     yes    8.39
5 16-17     no    10.5 
6 16-17     yes    9.38
7 >= 18     no    11.1 
8 >= 18     yes   10.5 
Code
dbinom(x=8,size=8,prob=.5)
[1] 0.00390625
Code
dbinom(x=6,size=8,prob=.5)
[1] 0.109375
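
The age grouping in (1d) can also be written with base R's `cut()`. The sketch below is only an illustration: the `ages` vector is hypothetical, and the break points assume ages are recorded as whole numbers (with fractional ages, an age such as 17.5 would land in a different group than under the `case_when()` rules).

```r
# Hypothetical whole-number ages, for illustration only
ages <- c(3, 13, 14, 15, 16, 17, 18, 19)

age_group <- cut(
  ages,
  breaks = c(-Inf, 13, 15, 17, Inf),   # intervals are right-closed by default
  labels = c("0-13", "14-15", "16-17", ">= 18")
)

# Number of ages falling in each group
table(age_group)
```

Because `cut()` returns a factor whose levels follow the order of `labels`, no separate conversion to a factor is needed.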

(1e) Compare the lung capacities for smokers and non-smokers within each age group.

Code
ggplot(df, aes(x=age_group, y=LungCap, color = Smoke)) +
  geom_boxplot()

This data visualization is closer to what we would expect when comparing smokers to non-smokers. Lung capacity increases as the participants get older, and in every age group above 13 the non-smokers have the higher mean lung capacity. The smokers in the data may be disproportionately older participants. This imbalance could be skewing the overall average lung capacity.
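
To see how such an imbalance can flip the overall comparison, here is a minimal sketch. The group means are copied from the (1d) table, but the group sizes `n_no` and `n_yes` are hypothetical numbers chosen for illustration; they are not the counts in the data.

```r
# Mean LungCap per age group (youngest to oldest), copied from the (1d) table
mean_no  <- c(6.36, 9.14, 10.5, 11.1)   # non-smokers
mean_yes <- c(7.20, 8.39, 9.38, 10.5)   # smokers

# HYPOTHETICAL group sizes: non-smokers mostly young, smokers mostly older
n_no  <- c(400, 100, 80, 60)
n_yes <- c(5, 20, 30, 25)

# The overall mean is the group means weighted by group size
overall_no  <- weighted.mean(mean_no, n_no)
overall_yes <- weighted.mean(mean_yes, n_yes)

c(overall_no = overall_no, overall_yes = overall_yes)
```

With these made-up sizes the smokers come out with the higher overall mean, even though they have the lower mean in three of the four age groups. The actual group sizes could be checked with something like `df %>% count(age_group, Smoke)`.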

Question 2: Setting up the Dataframe

Code
StatePrison <- data.frame(number_convictions = 0:4, InMateCount = c(128, 434, 160, 64, 24)) %>%
  mutate(Probability = InMateCount / sum(InMateCount))  # 810 inmates in total

View(StatePrison)

(2a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code
# The probability is the share of inmates with exactly 2 prior convictions
160/810
[1] 0.1975309

(2b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code
# Inmates with 0 or 1 prior convictions
(128 + 434)/810
[1] 0.6938272

(2c) What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code
# Inmates with 0, 1, or 2 prior convictions
(128 + 434 + 160)/810
[1] 0.891358

(2d) What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code
# Inmates with 3 or 4 prior convictions
(64 + 24)/810
[1] 0.108642
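
The four answers in (2a) to (2d) can be cross-checked from a single table by adding a cumulative-probability column. This is a sketch rather than part of the original solution; the names `prison`, `p`, and `cum_p` are illustrative, and the counts are the ones given in the question.

```r
# Counts taken from the Question 2 table
prison <- data.frame(
  number_convictions = 0:4,
  inmates = c(128, 434, 160, 64, 24)
)
prison$p     <- prison$inmates / sum(prison$inmates)  # P(X = k)
prison$cum_p <- cumsum(prison$p)                      # P(X <= k)

p_exactly_2 <- prison$p[prison$number_convictions == 2]      # (2a)
p_fewer_2   <- prison$cum_p[prison$number_convictions == 1]  # (2b)
p_at_most_2 <- prison$cum_p[prison$number_convictions == 2]  # (2c)
p_more_2    <- 1 - p_at_most_2                               # (2d), complement rule

prison
```

The complement rule in the last step gives the same value as adding the probabilities for 3 and 4 prior convictions directly.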

(2e) What is the expected value for the number of prior convictions?

Code
EV <- sum(StatePrison$number_convictions * StatePrison$Probability)
print(EV)
[1] 1.28642

(2f) Calculate the variance and the standard deviation for the Prior Convictions.

Code
Var <- sum((StatePrison$number_convictions - EV) ^ 2 * StatePrison$Probability)

print(Var)
[1] 0.8562353
Code
SD <- sqrt(Var)

print(SD)
[1] 0.9253298
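
As a sanity check on (2e) and (2f), the same numbers can be reproduced in two other ways: with the shortcut Var(X) = E[X²] − (E[X])², and by expanding the table into one row per inmate. This is a sketch with illustrative variable names. Note that R's built-in `var()` divides by n − 1, so it is slightly larger than the population variance computed above.

```r
# Counts taken from the Question 2 table
convictions <- 0:4
inmates <- c(128, 434, 160, 64, 24)
p <- inmates / sum(inmates)

# Shortcut formula: Var(X) = E[X^2] - (E[X])^2
ev <- sum(convictions * p)
var_shortcut <- sum(convictions^2 * p) - ev^2

# Same quantities from one value per inmate
x <- rep(convictions, times = inmates)
var_population <- mean((x - mean(x))^2)  # divides by n
var_sample <- var(x)                     # divides by n - 1

c(ev = ev, var_shortcut = var_shortcut,
  var_population = var_population, var_sample = var_sample)
```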