Homework_One

hw1

descriptive statistics

probability

Template of course blog qmd file

Author

Meredith Derian-Toth

Published

February 5, 2023

Question 1

(1a) What does the distribution of LungCap look like?

First, let’s read in the data from the Excel file:

Code

library("quarto")
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library("palmerpenguins")
library(readxl)
library(dplyr)
library(ggplot2)
df <- read_excel("_data/LungCapData.xls")
#View(df)

The distribution of LungCap looks as follows:

Code

hist(df$LungCap)

The histogram suggests that LungCap is approximately normally distributed. Most observations cluster around the mean, and very few fall near the extremes of the range (around 0 and 15).
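A histogram's shape depends on the bin width, so a normal Q-Q plot is a useful second check on the "close to normal" reading. The sketch below runs on simulated values as a stand-in: the vector `x` and its size, mean, and standard deviation are assumptions for illustration, not the LungCap data. With the real data one would pass `df$LungCap` instead.

```r
# Simulated stand-in for a roughly bell-shaped variable (NOT the real LungCap values)
set.seed(1)
x <- rnorm(500, mean = 8, sd = 2.5)

# Normal Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(x)
qqline(x)

# Numeric companion: correlation between theoretical and sample quantiles
qq <- qqnorm(x, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
qq_cor
```

A correlation very close to 1 is consistent with normality, but it is only a rough guide and does not replace looking at the plot.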

(1b) The distribution of LungCap by gender is compared in the boxplots below:

Code

boxplot(LungCap ~ Gender, data = df)

(1c) The mean lung capacities for smokers and non-smokers can be found in the table below:

Code

df %>%
  group_by(Smoke) %>%
  summarise_at(vars(LungCap), list(name = mean))

# A tibble: 2 × 2
  Smoke  name
  <chr> <dbl>
1 no     7.77
2 yes    8.65

These means are not what I would expect. It looks like those who smoke (“yes”) have a higher mean lung capacity (8.65) than those who do not smoke (7.77).

(1d) The relationship between Smoking and Lung Capacity within age groups

Code

# Age groups are defined as:
#   "less than or equal to 13"
#   "14 to 15"
#   "16 to 17"
#   "greater than or equal to 18"

# Create variable
df <- df %>% 
  mutate(age_group = case_when(
      Age <= 13 ~ "0-13",
      Age > 13 & Age < 16 ~ "14-15",
      Age > 15 & Age < 18 ~ "16-17",
      Age >= 18 ~ ">= 18"),
    # Convert to factor
    age_group = factor(
      age_group,
      levels = c("0-13", "14-15", "16-17", ">= 18")))

View(df)

df %>%
  group_by(age_group,Smoke) %>%
  summarise_at(vars(LungCap), list(name = mean))

# A tibble: 8 × 3
# Groups:   age_group [4]
  age_group Smoke  name
  <fct>     <chr> <dbl>
1 0-13      no     6.36
2 0-13      yes    7.20
3 14-15     no     9.14
4 14-15     yes    8.39
5 16-17     no    10.5 
6 16-17     yes    9.38
7 >= 18     no    11.1 
8 >= 18     yes   10.5

Code

dbinom(x=8,size=8,prob=.5)

[1] 0.00390625

Code

dbinom(x=6,size=8,prob=.5)

[1] 0.109375

(1e) Compare the lung capacities for smokers and non-smokers within each age group.

Code

ggplot(df, aes(x=age_group, y=LungCap, color = Smoke)) +
  geom_boxplot()

This visualization is closer to what we would expect when comparing smokers with non-smokers. In every age group above 13, the group means in (1d) show smokers with a lower lung capacity than non-smokers. Lung capacity also increases as the participants get older. If the smokers in the sample are disproportionately older, that imbalance would pull the smokers' overall average upward, which could explain the surprising overall means in (1c).
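The imbalance argument can be made concrete with a small weighted-mean calculation. The numbers below are made up purely for illustration (the data frame `toy` and its group sizes and means are hypothetical, not taken from the LungCap data): smokers have the lower mean within each age group, yet the higher mean overall, because most of them sit in the older group.

```r
# Hypothetical toy data (NOT the LungCap data): made-up group sizes and means
toy <- data.frame(
  age_group = c("young", "young", "old", "old"),
  smoke     = c("no", "yes", "no", "yes"),
  n         = c(90, 10, 10, 40),
  mean_cap  = c(6, 5.5, 11, 10)
)

# Overall mean per smoking status = size-weighted average of the group means
overall <- sapply(
  split(toy, toy$smoke),
  function(g) weighted.mean(g$mean_cap, g$n)
)
overall
# no: 6.5, yes: 9.1
```

On the real data, a count of participants by age group and smoking status would show whether the sample is actually unbalanced in this way.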

Question 2: Setting up the Data Frame

Code

# Inmate counts by number of prior convictions; 810 is the total number of inmates
StatePrison <- data.frame(number_convictions = 0:4,
                          InMateCount = c(128, 434, 160, 64, 24)) %>%
  mutate(Probability = InMateCount / 810)

View(StatePrison)

(2a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code

# A single Bernoulli trial with success probability 160/810 returns that probability
dbinom(x = 1, size = 1, prob = 160/810)

[1] 0.1975309

(2b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code

# Fewer than 2 means 0 or 1 prior convictions
dbinom(x = 1, size = 1, prob = (128 + 434)/810)

[1] 0.6938272

(2c) What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code

# 2 or fewer means 0, 1, or 2 prior convictions
dbinom(x = 1, size = 1, prob = (128 + 434 + 160)/810)

[1] 0.891358

(2d) What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code

# More than 2 means 3 or 4 prior convictions
dbinom(x = 1, size = 1, prob = (64 + 24)/810)

[1] 0.108642
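Because each answer in (2a) through (2d) is a relative frequency, the same probabilities can also be read straight off the frequency table without `dbinom()`. The following is a minimal sketch of that alternative, using only the counts given in the question. The variable names (`convictions`, `counts`, `p`, and the `p_*` results) are introduced here for illustration.

```r
# Counts from the question: inmates with 0, 1, 2, 3, 4 prior convictions
convictions <- 0:4
counts <- c(128, 434, 160, 64, 24)
p <- counts / sum(counts)  # relative frequencies; sum(counts) is 810

p_exactly_2 <- p[convictions == 2]        # (2a)
p_fewer_2   <- sum(p[convictions < 2])    # (2b)
p_at_most_2 <- sum(p[convictions <= 2])   # (2c)
p_more_2    <- 1 - p_at_most_2            # (2d), the complement of (2c)

round(c(p_exactly_2, p_fewer_2, p_at_most_2, p_more_2), 4)
```

Computing (2d) as a complement makes the link between the last two answers explicit: 2 or fewer and more than 2 together cover every inmate.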

(2e) What is the expected value for the number of prior convictions?

Code

EV <- sum(StatePrison$number_convictions * StatePrison$Probability)
print(EV)

[1] 1.28642
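Written out by hand, the expected value is the probability-weighted sum of the conviction counts. The notation ($X$ for the number of prior convictions of a randomly selected inmate) is introduced here for illustration:

```latex
E[X] = \sum_{x=0}^{4} x \, P(X = x)
     = \frac{0(128) + 1(434) + 2(160) + 3(64) + 4(24)}{810}
     = \frac{1042}{810}
     \approx 1.286
```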

(2f) Calculate the variance and the standard deviation for the Prior Convictions.

Code

Var <- sum((StatePrison$number_convictions - EV) ^ 2 * StatePrison$Probability)

print(Var)

[1] 0.8562353

Code

SD <- sqrt(Var)

print(SD)

[1] 0.9253298
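As a cross-check, the frequency table can be expanded into one value per inmate and the same quantities computed directly. This is a sketch under the assumption that the 810 inmates are treated as the whole population, so the variance divides by n. The names `obs`, `pop_mean`, `pop_var`, `pop_sd`, and `sample_var` are introduced here for illustration.

```r
convictions <- 0:4
counts <- c(128, 434, 160, 64, 24)

# Expand the table into one value per inmate (810 values in total)
obs <- rep(convictions, times = counts)

pop_mean <- mean(obs)
pop_var  <- mean((obs - pop_mean)^2)  # divides by n, like the probability-weighted sum
pop_sd   <- sqrt(pop_var)

# var() divides by n - 1, so it comes out slightly larger than pop_var
sample_var <- var(obs)

c(mean = pop_mean, var = pop_var, sd = pop_sd, sample_var = sample_var)
```

The population versions agree with `EV`, `Var`, and `SD`. R's built-in `var()` and `sd()` would give slightly larger values because they use the n - 1 denominator.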