hw2
Roy Yoon
dataset
ggplot2
Author

Roy Yoon

Published

October 18, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Question 1

Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?

Bypass Sample Size, Mean Wait Time, Standard Deviation

Code
# Bypass information
bypass_sample_size <- 539
bypass_mean <- 19
bypass_sd <- 10

Standard Error for Bypass

Code
bypass_standard_error <- bypass_sd / sqrt(bypass_sample_size)

bypass_standard_error
[1] 0.4307305

T-Value Steps for Bypass

Code
bypass_confidence_level <- 0.90
bypass_tail_area <- (1 - bypass_confidence_level)/2

bypass_tail_area
[1] 0.05
Code
bypass_t_score <- qt(p = 1 - bypass_tail_area, df = bypass_sample_size - 1)

bypass_t_score
[1] 1.647691

Calculate Bypass Confidence Interval

Code
bypass_confidence_interval <- c(bypass_mean - (bypass_t_score * bypass_standard_error),
                                bypass_mean + (bypass_t_score * bypass_standard_error))

bypass_confidence_interval
[1] 18.29029 19.70971

Margin of Error for Bypass

Code
bypass_margin_of_error <- bypass_t_score * bypass_standard_error

bypass_margin_of_error
[1] 0.7097107

Angiography Sample Size, Mean Wait Time, Standard Deviation

Code
# Angiography information
angiography_sample_size <- 847
angiography_mean <- 18
angiography_sd <- 9

Standard Error for Angiography

Code
angiography_standard_error <- angiography_sd / sqrt(angiography_sample_size)

angiography_standard_error
[1] 0.3092437

T-Value Steps for Angiography

Code
angiography_confidence_level <- 0.90
angiography_tail_area <- (1 - angiography_confidence_level)/2

angiography_tail_area
[1] 0.05
Code
angiography_t_score <- qt(p = 1 - angiography_tail_area, df = angiography_sample_size - 1)

angiography_t_score
[1] 1.646657

Calculate Angiography Confidence Interval

Code
angiography_confidence_interval <- c(angiography_mean - (angiography_t_score * angiography_standard_error),
                                     angiography_mean + (angiography_t_score * angiography_standard_error))

angiography_confidence_interval
[1] 17.49078 18.50922

Margin of Error for Angiography

Code
angiography_margin_of_error <- angiography_t_score * angiography_standard_error

angiography_margin_of_error
[1] 0.5092182

Comparing Bypass and Angiography Confidence Intervals

Code
angiography_confidence_interval
[1] 17.49078 18.50922
Code
18.50922 - 17.49078
[1] 1.01844
Code
bypass_confidence_interval
[1] 18.29029 19.70971
Code
19.70971 - 18.29029
[1] 1.41942

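Angiography has the narrower confidence interval: its width is about 1.02, compared with about 1.42 for bypass surgery. This follows from angiography's larger sample size and smaller standard deviation, which together give it a smaller standard error.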
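The same steps (standard error, t-score, margin of error, interval) are repeated for both procedures, so they can be wrapped in one function. The sketch below is only an illustration: `t_confidence_interval` and its argument names are hypothetical and not part of the original analysis, and it assumes the same summary statistics given in the question.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
t_confidence_interval <- function(sample_mean, sample_sd, sample_size, confidence_level = 0.90) {
  standard_error <- sample_sd / sqrt(sample_size)
  tail_area <- (1 - confidence_level) / 2
  t_score <- qt(p = 1 - tail_area, df = sample_size - 1)
  margin_of_error <- t_score * standard_error
  c(lower = sample_mean - margin_of_error, upper = sample_mean + margin_of_error)
}

bypass_ci <- t_confidence_interval(sample_mean = 19, sample_sd = 10, sample_size = 539)
angiography_ci <- t_confidence_interval(sample_mean = 18, sample_sd = 9, sample_size = 847)

# Interval width is the upper bound minus the lower bound
bypass_width <- unname(diff(bypass_ci))
angiography_width <- unname(diff(angiography_ci))

# TRUE when the angiography interval is the narrower one
angiography_width < bypass_width
```

Computing the widths from the stored intervals, rather than retyping the rounded endpoints, avoids small rounding differences.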

Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.

Point Estimate

Code
sample_size <- 1031

education_essential <- 567

point_estimate <- education_essential / sample_size

point_estimate 
[1] 0.5499515
Code
prop.test(education_essential, sample_size)

    1-sample proportions test with continuity correction

data:  education_essential out of sample_size, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5189682 0.5805580
sample estimates:
        p 
0.5499515 

The point estimate of the proportion of adult Americans who believe a college education is essential for success is 0.5499515. The 95 percent confidence interval is (0.5189682, 0.5805580). We can be 95% confident that the true population proportion lies in this interval: if the sampling procedure were repeated many times, about 95% of the intervals calculated this way would contain the true proportion.
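As its output title notes, prop.test applies a continuity correction, so its interval will not exactly match a hand calculation. For comparison, here is a minimal sketch of the textbook large-sample (Wald) interval, point estimate ± z × standard error. The variable names `proportion_standard_error`, `z_score`, and `wald_interval` are illustrative and not from the original analysis.

```r
sample_size <- 1031
education_essential <- 567
point_estimate <- education_essential / sample_size

# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
proportion_standard_error <- sqrt(point_estimate * (1 - point_estimate) / sample_size)

# z-score for a 95% confidence level (2.5% in each tail)
z_score <- qnorm(0.975)

# Large-sample (Wald) interval: estimate minus/plus the margin of error
wald_interval <- point_estimate + c(-1, 1) * z_score * proportion_standard_error

wald_interval
```

With a sample this large, the two approaches should agree closely, differing only in the third decimal place or so.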

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per semester for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range (in other words, you can assume they know the population standard deviation). Assuming the significance level to be 5%, what should be the size of the sample?

Code
standard_dev <- (200 - 30)/4
standard_dev
[1] 42.5

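The assumed population standard deviation is 42.5 (a quarter of the $30 to $200 range). With the standard deviation known, we can find the required sample size from the margin-of-error formula for a confidence interval.

Solving for n by way of the confidence interval, with a margin of error of 5 and z = 1.96 for a 5% significance level:

  1. 5 = 1.96(42.5/sqrt(n))

  2. 5 = 83.3/sqrt(n)

  3. 5 * sqrt(n) = 83.3

  4. sqrt(n) = 83.3 / 5

  5. n = (83.3/5)^2

  6. n = 277.56, which rounds up to a required sample size of n = 278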
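The algebra above can also be checked in R. This sketch assumes the same inputs (margin of error of 5, standard deviation of 42.5, 5% significance level). It uses the unrounded critical value from qnorm instead of 1.96, and rounds up because a sample size must be a whole number large enough to meet the target. The names `margin_of_error`, `z_score`, and `required_n` are illustrative.

```r
margin_of_error <- 5
standard_dev <- (200 - 30) / 4

# Two-sided critical value for a 5% significance level
z_score <- qnorm(1 - 0.05 / 2)

# Rearranged margin-of-error formula: n = (z * sd / E)^2
required_n <- (z_score * standard_dev / margin_of_error)^2

# Round up to the next whole observation
ceiling(required_n)
```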

Question 4

According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.

A.

Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

Assumptions:

  • The sample is random and the data are normally distributed; significance level is 5%

Hypotheses:

  • Null Hypothesis (H0): μ = 500

  • Alternative Hypothesis (Ha): μ ≠ 500 (two-sided: μ may be below or above 500)

A: Standard Error

Code
female_standard_dev <- 90
female_sample_size <- 9
female_sample_mean <- 410 
null_hypothesis_mean <- 500

female_standard_error <- female_standard_dev / sqrt(female_sample_size)

female_standard_error
[1] 30

A: Test Statistic

Code
t_stat <- (female_sample_mean - null_hypothesis_mean) / female_standard_error

t_stat
[1] -3

A: p-value

Code
# Two-sided p-value: double the tail area beyond |t|
p_value <- 2 * pt(-abs(t_stat), df = female_sample_size - 1)

p_value
[1] 0.01707168

The p-value (0.01707168) is smaller than the 5% significance level, so we reject the null hypothesis in favor of the alternative: there is evidence that the mean weekly income of female employees differs from $500.

B. Report the P-value for Ha : μ < 500. Interpret.

Code
low_p_value <- (pt(t_stat, df = 8, lower.tail = TRUE))
low_p_value
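[1] 0.008535841

The p-value (0.008535841) is well below the 5% significance level, so there is strong evidence that the mean weekly income of female employees is less than $500.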

C. Report and interpret the P-value for Ha: μ > 500. (Hint: The P-values for the two possible one-sided tests must sum to 1.)

Code
up_p_value <- (pt(t_stat, df = 8, lower.tail = FALSE))
up_p_value
[1] 0.9914642
Code
#sanity check from hint: The P-values for the two possible one-sided tests must sum to 1
low_p_value + up_p_value
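[1] 1

The p-value for Ha: μ > 500 (0.9914642) is far above the 5% significance level, so there is no evidence that the mean weekly income of female employees is greater than $500.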
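The three p-values in parts A to C all come from the same t statistic and differ only in which tail area is used. The following sketch collects them in one place, assuming the summary statistics from the question. The names `degrees_of_freedom`, `p_less`, `p_greater`, and `p_two_sided` are illustrative.

```r
# Summary statistics from the question
t_stat <- (410 - 500) / (90 / sqrt(9))
degrees_of_freedom <- 9 - 1

# Ha: mu < 500 uses the lower tail; Ha: mu > 500 uses the upper tail
p_less <- pt(t_stat, df = degrees_of_freedom)
p_greater <- pt(t_stat, df = degrees_of_freedom, lower.tail = FALSE)

# Ha: mu != 500 doubles the smaller of the two one-sided tail areas
p_two_sided <- 2 * min(p_less, p_greater)

c(less = p_less, greater = p_greater, two_sided = p_two_sided)
```

Taking the smaller tail before doubling makes the two-sided calculation work whether the t statistic is negative or positive.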

Question 5

Jones and Smith separately conduct studies to test H0: μ = 500 against Ha : μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.

Code
jones_sample_mean <- 519.5
jones_standard_error <- 10
smith_sample_mean <- 519.7
smith_standard_error <- 10
h0_mean <- 500

A

Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.

Code
#Jones

jones_t_stat <- (jones_sample_mean - h0_mean) / jones_standard_error

jones_t_stat
[1] 1.95
Code
jones_p_value <- (pt(jones_t_stat, df = 999, lower.tail=FALSE)) * 2

jones_p_value
[1] 0.05145555
Code
#Smith

smith_t_stat <- (smith_sample_mean - h0_mean) / smith_standard_error

smith_t_stat
[1] 1.97
Code
smith_p_value <- (pt(smith_t_stat, df = 999, lower.tail=FALSE) ) * 2

smith_p_value
[1] 0.04911426

B

Using α = 0.05, for each study indicate whether the result is “statistically significant.”

If the p-value is less than the significance level, the null hypothesis is rejected.

Significance level: α = 0.05

Jones p-value: 0.05145555; the p-value is greater than the significance level, so the result is not statistically significant.

Smith p-value: 0.04911426; the p-value is less than the significance level, so the result is statistically significant.

C

Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0 ,” without reporting the actual P-value.

If the p-value is less than the significance level, the null hypothesis is rejected and the result is considered statistically significant. So labeling Jones's study as not statistically significant and Smith's study as statistically significant follows the rule. However, the sample means from Jones (519.5) and Smith (519.7) differ by only 0.2, and their p-values (0.051 and 0.049) sit just on either side of the 0.05 cutoff. Reporting only “P > 0.05” versus “P ≤ 0.05,” or “do not reject H0” versus “reject H0,” makes it seem as though the two studies reached very different conclusions, when their results are actually very close to each other. Reporting the actual p-values shows that both studies provide nearly the same amount of evidence against H0.
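One way to see how close the two studies are is to put a 95% confidence interval around each sample mean. This is a sketch that goes slightly beyond what the question asks. It assumes the same standard error (10.0) and sample size (n = 1000) given for both studies, and the names `t_critical`, `jones_ci`, and `smith_ci` are illustrative.

```r
h0_mean <- 500
standard_error <- 10
degrees_of_freedom <- 1000 - 1

# Critical t value for a two-sided 95% interval
t_critical <- qt(0.975, df = degrees_of_freedom)

# Sample mean minus/plus the margin of error for each study
jones_ci <- 519.5 + c(-1, 1) * t_critical * standard_error
smith_ci <- 519.7 + c(-1, 1) * t_critical * standard_error

jones_ci
smith_ci

# Does each interval contain the null value of 500?
c(jones = jones_ci[1] <= h0_mean & h0_mean <= jones_ci[2],
  smith = smith_ci[1] <= h0_mean & h0_mean <= smith_ci[2])
```

The two intervals have the same width and are shifted by only 0.2, yet one just includes 500 and the other just excludes it, which is the same borderline situation the p-values describe.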

Question 6

Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.

Code
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

t.test(gas_taxes, mu = 45, alternative = 'less')

    One Sample t-test

data:  gas_taxes
t = -1.8857, df = 17, p-value = 0.03827
alternative hypothesis: true mean is less than 45
95 percent confidence interval:
     -Inf 44.67946
sample estimates:
mean of x 
 40.86278 

With a sample mean of 40.86278 and a p-value of 0.03827, which is less than the assumed 0.05 significance level, the null hypothesis is rejected and the alternative hypothesis is favored.

Thus, there is enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents.
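To make explicit what t.test computes here, the test statistic and one-sided p-value can be reproduced by hand from the sample. This is a sketch using the same data and null value of 45. The names `gas_sample_size`, `gas_standard_error`, `gas_t_stat`, and `gas_p_value` are illustrative.

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

gas_sample_size <- length(gas_taxes)

# Standard error of the sample mean
gas_standard_error <- sd(gas_taxes) / sqrt(gas_sample_size)

# t statistic against the null value of 45 cents
gas_t_stat <- (mean(gas_taxes) - 45) / gas_standard_error

# Lower-tail p-value for the alternative "true mean is less than 45"
gas_p_value <- pt(gas_t_stat, df = gas_sample_size - 1)

c(t = gas_t_stat, p_value = gas_p_value)
```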