Homework 2

Categories: hw2, hypothesis testing, confidence intervals

The second homework on hypothesis testing and confidence intervals

Author: Rosemary Pang

Published: March 29, 2023

Please check your answers against the solutions.

Question 1

bypass_n = 539
angio_n = 847

bypass_sample_mean = 19
angio_sample_mean = 18

bypass_sample_sd = 10
angio_sample_sd = 9

bypass_se = bypass_sample_sd/sqrt(bypass_n) # standard errors of the sample means
angio_se = angio_sample_sd/sqrt(angio_n)

bypass_me = qt(0.95, df = bypass_n - 1)*bypass_se # margins of error: t critical value times SE
angio_me = qt(0.95, df = angio_n - 1)*angio_se

The confidence intervals:

print(bypass_sample_mean + c(-bypass_me, bypass_me))
[1] 18.29029 19.70971
print(angio_sample_mean + c(-angio_me, angio_me))
[1] 17.49078 18.50922

The widths of the confidence intervals, each of which is twice the margin of error:

2 * bypass_me
[1] 1.419421
2 * angio_me
[1] 1.018436

The confidence interval for angiography is narrower.
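As a cross-check, here is a small helper that packages the same calculation from summary statistics (the function name mean_ci and its arguments are our own, not part of the assignment); plugging in the two sets of numbers should reproduce the intervals and widths printed above.

mean_ci <- function(x_bar, s, n, q = 0.95){
  se <- s / sqrt(n)            # standard error of the mean
  me <- qt(q, df = n - 1) * se # margin of error
  c(lower = x_bar - me, upper = x_bar + me, width = 2 * me)
}
mean_ci(x_bar = 19, s = 10, n = 539) # bypass
mean_ci(x_bar = 18, s = 9, n = 847)  # angiography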

Question 2

One-step solution, using prop.test():

n = 1031
k = 567
prop.test(k, n)

    1-sample proportions test with continuity correction

data:  k out of n, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5189682 0.5805580
sample estimates:
        p 
0.5499515 

Alternatively, we can compute the normal-approximation interval by hand:

p_hat <- k/n # point estimate
se = sqrt((p_hat*(1-p_hat))/n) # standard error
e = qnorm(0.975)*se # margin of error
p_hat + c(-e, e) # confidence interval 
[1] 0.5195839 0.5803191
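We can also carry out the hypothesis test itself with the normal approximation (a sketch of our own; without the continuity correction the squared statistic is slightly larger than the X-squared value prop.test() reports):

z <- (p_hat - 0.5) / sqrt(0.5 * 0.5 / n) # z statistic under H0: p = 0.5
2 * pnorm(-abs(z))                       # two-sided p-value; z^2 is about 10.3 here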

Alternatively, we can use the exact binomial test. In large samples like the one we have, the results should essentially be the same as prop.test().

binom.test(k, n)

    Exact binomial test

data:  k and n
number of successes = 567, number of trials = 1031, p-value = 0.001478
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5189927 0.5806243
sample estimates:
probability of success 
             0.5499515 

Question 3

range = 200-30 # assumed spread of the variable
population_sd = range/4 # range rule of thumb: sd is roughly range/4

Remember:

\[CI_{95} = \bar x \pm z \frac{s}{\sqrt n}\] (We can use \(z\) because we assume population standard deviation is known.)

We want the number \(n\) that ensures:

\[ z \frac{s}{\sqrt n} = 5 \] \[ zs = 5 \sqrt n\] \[ \frac{zs}{5} = \sqrt n\] \[ (\frac{zs}{5})^2 = n\]

In our case:

z = qnorm(.975)
s = population_sd
n = ((z *s) / 5)^2
print(n)
[1] 277.5454

Rounding up, we need a sample of 278.
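A quick sanity check (our own addition): with n = 278 the margin of error falls just below the target of 5, while with n = 277 it is still slightly above it.

z * s / sqrt(278) # about 4.996, just under 5
z * s / sqrt(277) # about 5.005, still above 5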

Question 4

We can write a function to find the t-statistic, and then do all the tests in parts (a), (b), and (c) using it.

\[t = \frac{\bar x - \mu}{s / \sqrt n}\]

where \(\bar x\) is the sample mean, \(\mu\) is the hypothesized population mean, \(s\) is the sample standard deviation, and \(n\) is the sample size.

Writing this in R:

get_t_stat <- function(x_bar, mu, sd, n){
  return((x_bar - mu) / (sd / sqrt(n)))
}

Find the t-statistic:

t_stat <- get_t_stat(x_bar = 410, mu = 500, sd = 90, n = 9) # equals (410 - 500)/(90/3) = -3

A

Two-tailed test

n = 9
pval_two_tail = 2*pt(t_stat, df = n-1)
pval_two_tail
[1] 0.01707168

We can reject the hypothesis that the population mean is 500.
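One caveat (our own note): 2*pt(t_stat, df = n-1) gives the two-sided p-value here only because t_stat is negative; a form that works regardless of the sign of the statistic is:

2*pt(-abs(t_stat), df = n-1) # same value, robust to the sign of t_stat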

B

pval_lower_tail = pt(t_stat, df = n-1)
pval_lower_tail
[1] 0.008535841

We can reject the hypothesis that the population mean is greater than 500.

C

pval_upper_tail = pt(t_stat, df = n-1, lower.tail=FALSE)
pval_upper_tail
[1] 0.9914642

We fail to reject the hypothesis that the population mean is less than 500.

Alternatively for C, we could just subtract the answer in B from 1:

1 - pval_lower_tail
[1] 0.9914642
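As another cross-check (a sketch of the rejection-region approach, not required by the question), we can compare t_stat = -3 against the relevant critical values at the 0.05 level instead of computing p-values:

qt(0.975, df = n-1) # about 2.31; |t_stat| = 3 exceeds it, so reject in (a)
qt(0.05, df = n-1)  # about -1.86; t_stat = -3 falls below it, so reject in (b)
qt(0.95, df = n-1)  # about 1.86; t_stat = -3 does not exceed it, so fail to reject in (c)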

Question 5

t_jones = ((519.5 - 500)/ 10)
t_smith = ((519.7 - 500)/ 10)
cat("t value for Jones:", t_jones, '\n')
t value for Jones: 1.95 
cat("t value for Smith:", t_smith, '\n')
t value for Smith: 1.97 
cat('p value for Jones:', round(2*pt(t_jones, df = 999, lower.tail=FALSE), 4), '\n')
p value for Jones: 0.0515 
cat('p value for Smith:', round(2*pt(t_smith, df = 999, lower.tail=FALSE), 4), '\n')
p value for Smith: 0.0491 

At the 0.05 level, Smith’s result is statistically significant but Jones’s is not. These results show the arbitrariness of the 0.05 demarcation line and the importance of reporting actual p-values to make sense of results.
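The same point can be made with confidence intervals (our own illustration, using the same standard error of 10 and df = 999 as above): Jones’s 95% interval just includes 500 while Smith’s just excludes it.

519.5 + c(-1, 1) * qt(0.975, df = 999) * 10 # Jones: roughly (499.9, 539.1), includes 500
519.7 + c(-1, 1) * qt(0.975, df = 999) * 10 # Smith: roughly (500.1, 539.3), excludes 500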

Question 6

# Creating the dataframe
grade_level <- c(rep("6th grade", 100), rep("7th grade", 100), rep("8th grade", 100))
snack <- c(rep("healthy snack", 31), rep("unhealthy snack", 69), rep("healthy snack", 43),
           rep("unhealthy snack", 57), rep("healthy snack", 51), rep("unhealthy snack", 49))
snack_data <- data.frame(grade_level, snack)

We are conducting a Chi-square test of independence in this question since we are testing the association between two categorical variables.

table(snack_data$snack,snack_data$grade_level)
                 
                  6th grade 7th grade 8th grade
  healthy snack          31        43        51
  unhealthy snack        69        57        49
chisq.test(snack_data$snack,snack_data$grade_level,correct = FALSE)

    Pearson's Chi-squared test

data:  snack_data$snack and snack_data$grade_level
X-squared = 8.3383, df = 2, p-value = 0.01547

A p-value smaller than 0.05 means we reject the null hypothesis of independence: there is evidence of a relationship between grade level and snack choice.
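To see where the statistic comes from, here is a by-hand version (our own sketch): compute the expected counts under independence and sum the squared, scaled deviations of the observed counts from them.

obs <- table(snack_data$snack, snack_data$grade_level)   # observed counts
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs) # expected counts under independence
x2 <- sum((obs - expected)^2 / expected)                 # about 8.34, matching chisq.test()
pchisq(x2, df = 2, lower.tail = FALSE)                   # about 0.015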

Question 7

# Creating the dataframe
Area <- c(rep("Area1", 6), rep("Area2", 6), rep("Area3", 6))
cost <- c(6.2, 9.3, 6.8, 6.1, 6.7, 7.5, 7.5, 8.2, 8.5, 8.2, 7.0, 9.3,
          5.8, 6.4, 5.6, 7.1, 3.0, 3.5)
Area_cost <- data.frame(Area,cost)

Since we are comparing the means of more than two groups, we use a one-way ANOVA in this question.

one.way <- aov(cost ~ Area, data = Area_cost)
summary(one.way)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Area         2  25.66  12.832   8.176 0.00397 **
Residuals   15  23.54   1.569                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value tells us that the mean cost is not the same across the three areas; at least one area’s mean differs from the others.
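The ANOVA alone does not say which areas differ; a quick look at the group means (our own addition; a follow-up such as TukeyHSD(one.way) could test the pairwise differences) shows where the gaps are.

tapply(Area_cost$cost, Area_cost$Area, mean) # roughly 7.10, 8.12, and 5.23 for Areas 1-3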