hw2
confidence interval
probability
Author

Caleb Hill

Published

October 14, 2022

Question 1

First, let’s load the relevant libraries.

Code
library(readxl)
library(tidyverse)
df <- read_excel("_data/LungCapData.xls")

For Question 1, we need to construct a 90% confidence interval to estimate the actual mean wait time for each of the two procedures.

Code
s_mean_b <- 19       # sample mean wait time (bypass)
s_sd_b <- 10         # sample standard deviation
s_size_b <- 539      # sample size
standard_error_b <- s_sd_b / sqrt(s_size_b)
confidence_level <- 0.90
tail_area <- (1 - confidence_level) / 2
t_score_b <- qt(p = 1 - tail_area, df = s_size_b - 1)
CI_b <- c(s_mean_b - t_score_b * standard_error_b,
          s_mean_b + t_score_b * standard_error_b)
print(CI_b)
[1] 18.29029 19.70971

This is the CI for bypass. The following code chunk is for angiography.

Code
s_mean_a <- 18
s_sd_a <- 9
s_size_a <- 847
standard_error_a <- s_sd_a / sqrt(s_size_a)
confidence_level <- 0.90
tail_area <- (1-confidence_level)/2
t_score_a <- qt(p = 1 - tail_area, df = s_size_a - 1)
CI_a <- c(s_mean_a - t_score_a * standard_error_a,
        s_mean_a + t_score_a * standard_error_a)
print(CI_a)
[1] 17.49078 18.50922

Is the confidence interval narrower for angiography or bypass surgery? Answer: angiography.
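This can be confirmed numerically by recomputing both intervals from the same summary statistics and comparing their widths (a quick sketch):

```r
# 90% t-based CI from summary statistics: mean m, sd s, size n
ci <- function(m, s, n, conf = 0.90) {
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)
  c(m - t_crit * s / sqrt(n), m + t_crit * s / sqrt(n))
}
width_b <- diff(ci(19, 10, 539))  # bypass interval width
width_a <- diff(ci(18, 9, 847))   # angiography interval width
width_a < width_b                 # angiography is narrower
```

Angiography wins on both counts: a smaller standard deviation and a larger sample size both shrink the standard error.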

Question 2

Code
p_hat_NCPP <- 567 / 1031          # sample proportion: 567 of 1,031 respondents
s_size_NCPP <- 1031
standard_error_NCPP <- sqrt(p_hat_NCPP * (1 - p_hat_NCPP) / s_size_NCPP)
confidence_level <- 0.95
tail_area <- (1 - confidence_level) / 2
t_score_NCPP <- qt(p = 1 - tail_area, df = s_size_NCPP - 1)
CI_NCPP <- c(p_hat_NCPP - t_score_NCPP * standard_error_NCPP,
        p_hat_NCPP + t_score_NCPP * standard_error_NCPP)
print(round(CI_NCPP, 4))
[1] 0.5195 0.5804

Calling sd() on the single number 567 returns NA, because one value has no standard deviation. Since the statistic here is a sample proportion, the standard error comes from sqrt(p(1 − p)/n) instead. (Note: the sample size of 1,031, the denominator of the proportion, is used throughout.)
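As a cross-check on the proportion interval, base R's prop.test() gives a similar answer; it uses the Wilson score interval with continuity correction, so its endpoints differ slightly from a hand-rolled formula:

```r
# 95% CI for 567 "yes" responses out of 1,031
prop.test(x = 567, n = 1031, conf.level = 0.95)$conf.int
```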

Question 3

Code
s_mean_3 <- 50
s_sd_3 <- 9
s_size_3 <- 15
standard_error_3 <- s_sd_3 / sqrt(s_size_3)
confidence_level <- 0.95
tail_area <- (1-confidence_level)/2
t_score_3 <- qt(p = 1 - tail_area, df = s_size_3 - 1)
CI_3 <- c(s_mean_3 - t_score_3 * standard_error_3,
        s_mean_3 + t_score_3 * standard_error_3)
print(CI_3)
[1] 45.01597 54.98403

After experimenting with different sample sizes, the minimum sample size is 15 if the CI must have a length of $10 or less. A sample of 14 or fewer produces an interval wider than $10, outside the range where the estimate would be useful.
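The trial-and-error search can be automated; a short sketch, assuming the same $9 standard deviation used above:

```r
# Length of a 95% t-based CI for a given sample size n (s = 9 assumed)
ci_length <- function(n, s = 9, conf = 0.95) {
  2 * qt(1 - (1 - conf) / 2, df = n - 1) * s / sqrt(n)
}
n <- 2
while (ci_length(n) > 10) n <- n + 1
n                 # smallest n giving length <= 10
ci_length(14:15)  # n = 14 is just over $10, n = 15 just under
```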

Question 4

A

Code
s_mean_4a <- 410
s_sd_4a <- 90
s_size_4a <- 9
standard_error_4a <- s_sd_4a / sqrt(s_size_4a)
confidence_level <- 0.95
tail_area <- (1-confidence_level)/2
t_score_4a <- qt(p = 1 - tail_area, df = s_size_4a - 1)
CI_4a <- c(s_mean_4a - t_score_4a * standard_error_4a,
        s_mean_4a + t_score_4a * standard_error_4a)
print(CI_4a)
[1] 340.8199 479.1801

Based upon the data provided, the entire 95% CI falls below $500, so we can conclude at the 95% confidence level that mean weekly income for female employees is less than $500. If Ha : μ < 500, the CI supports rejecting the null hypothesis in its favor. For part B, we'll report the p-value computed from the observed t statistic.

B

Code
t_stat_4 <- (s_mean_4a - 500) / standard_error_4a   # observed t statistic: -3
p_value <- pt(q = t_stat_4, df = s_size_4a - 1, lower.tail = TRUE)
print(round(p_value, 4))
[1] 0.0085

With a p-value of roughly 0.0085, we reject the null hypothesis in favor of Ha : μ < 500. Now let's switch lower.tail to FALSE to examine Ha : μ > 500.

C

Code
t_stat_4 <- (s_mean_4a - 500) / standard_error_4a   # same statistic, right tail
p_value <- pt(q = t_stat_4, df = s_size_4a - 1, lower.tail = FALSE)
print(round(p_value, 4))
[1] 0.9915

Just as I thought. We have to reject the second hypothesis, Ha : μ > 500, as its p-value of roughly 0.99 is far above the 0.05 significance threshold.
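Because the t distribution is continuous, the two one-sided p-values are exact complements; a quick check from the observed t statistic, t = (410 − 500)/(90/√9) = −3:

```r
t_stat <- (410 - 500) / (90 / sqrt(9))               # observed t statistic: -3
p_less <- pt(t_stat, df = 8)                         # Ha: mu < 500 (left tail)
p_greater <- pt(t_stat, df = 8, lower.tail = FALSE)  # Ha: mu > 500 (right tail)
all.equal(p_less + p_greater, 1)                     # the two tails sum to 1
```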

Question 5

A

For Jones:

Code
s_mean_5a <- 519.5
standard_error_5a <- 10
s_size_5a <- 1000
s_sd_5a <- standard_error_5a * sqrt(s_size_5a)
confidence_level <- 0.95
tail_area <- (1-confidence_level)/2
t_score_5a <- qt(p = 1 - tail_area, df = s_size_5a - 1)
print(t_score_5a)
[1] 1.962341
Code
t_stat_5a <- (s_mean_5a - 500) / standard_error_5a   # observed t statistic: 1.95
p_value <- 2 * pt(abs(t_stat_5a), df = s_size_5a - 1, lower.tail = FALSE)
print(round(p_value, 3))
[1] 0.051
Code
CI_5a <- c(s_mean_5a - t_score_5a * standard_error_5a,
        s_mean_5a + t_score_5a * standard_error_5a)
print(CI_5a)
[1] 499.8766 539.1234

For Smith:

Code
s_mean_5a <- 519.7
standard_error_5a <- 10
s_size_5a <- 1000
s_sd_5a <- standard_error_5a * sqrt(s_size_5a)
confidence_level <- 0.95
tail_area <- (1-confidence_level)/2
t_score_5a <- qt(p = 1 - tail_area, df = s_size_5a - 1)
print(t_score_5a)
[1] 1.962341
Code
t_stat_5a <- (s_mean_5a - 500) / standard_error_5a   # observed t statistic: 1.97
p_value <- 2 * pt(abs(t_stat_5a), df = s_size_5a - 1, lower.tail = FALSE)
print(round(p_value, 3))
[1] 0.049
Code
CI_5a <- c(s_mean_5a - t_score_5a * standard_error_5a,
        s_mean_5a + t_score_5a * standard_error_5a)
print(CI_5a)
[1] 500.0766 539.3234

B

The p-values for part B are shown in the code chunks above. Are they statistically significant? Jones's p-value of about 0.051 sits just above the 0.05 cutoff while Smith's, about 0.049, sits just below it, so Smith's result is statistically significant and Jones's is not, despite nearly identical sample means.

C

The p-value is the probability of observing results at least as extreme as the actual sample, if the null hypothesis were true. Because the p-value, as traditionally used in frequentist statistics, is tied to one particular set of observations, it summarizes the evidence from that single sample alone.

Therefore, it can be misleading to treat a p-value of 0.05 as a hard cutoff. Confidence intervals instead report a range of plausible values for the parameter. We can see this problem best with the above results from Jones and Smith: they do not get the same sample mean, even with very similar observations, and so can end up with different reported conclusions.
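A small simulation makes the point concrete. Assuming, hypothetically, a true mean of 520 and a population standard deviation of about 316, chosen so that a sample of 1,000 has a standard error near 10 as in the Jones and Smith studies, repeated studies of the same population land on both sides of the 0.05 cutoff:

```r
set.seed(1)
p_vals <- replicate(1000, {
  x <- rnorm(1000, mean = 520, sd = 316)   # hypothetical population
  t.test(x, mu = 500)$p.value              # two-sided test of H0: mu = 500
})
mean(p_vals < 0.05)   # only some replications come out "significant"
```

Roughly half of the simulated studies clear the 0.05 bar even though every one samples the same population, which is exactly the Jones-versus-Smith situation.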

Question 6

Code
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
s_mean_g <- mean(gas_taxes)
s_sd_g <- sd(gas_taxes)
s_size_g <- length(gas_taxes)
standard_error_g <- s_sd_g / sqrt(s_size_g)
confidence_level <- 0.95
tail_area <- (1-confidence_level)/2
t_score_g <- qt(p = 1 - tail_area, df = s_size_g - 1)
CI_g <- c(s_mean_g - t_score_g * standard_error_g,
        s_mean_g + t_score_g * standard_error_g)
print(CI_g)
[1] 36.23386 45.49169

There is not enough information to conclude, at the 95% confidence level, that the average tax per gallon of gas in the US in 2005 was less than 45 cents. Why? The 95% CI tops out at 45.49, about half a cent above our cutoff of 45 cents, so the interval still contains values at or above 45. Therefore, we fail to reject the null hypothesis.
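The hand-rolled interval matches what base R's t.test() reports directly:

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
t.test(gas_taxes, conf.level = 0.95)$conf.int   # (36.23, 45.49), as above
```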