Homework 2

hw2

confidence_interval

hypothesis_testing

emma_narkewicz

Emma Narkewicz HW 2: Confidence Intervals and Hypothesis Testing

Author

Emma Narkewicz

Published

March 28, 2023

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.0      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Question 1

I constructed the 90% confidence interval of each of the two heart procedures manually using the equation: CI = (X bar) ± (t × s/sqrt(n)) and the mean wait time, sample standard deviation, and sample size provided for the Bypass and Angiography heart procedures.

For both procedures I calculated the t-score based on 2 tails, as there was no information indicating we should only focus on the upper or lower tail.

Code

# Calculating 90% CI  manually of Bypass heart surgery
## sample mean wait time
s_mean_bypass <- 19
## sample sd
s_sd_bypass <- 10
## sample size (n)
s_size_bypass <- 539
## standard error  (sd/swrt(n))
s_se_bypass <- s_sd_bypass / sqrt (s_size_bypass)
s_se_bypass

[1] 0.4307305

Code

## Specify confidence level 
confidence_level_bypass <- 0.9
##Calculate tail area of 2-tail t-test
tail_area_bypass <- (1 - confidence_level_bypass)/2
tail_area_bypass

[1] 0.05

Code

#calculation t-score using qt
t_score_bypass <-  qt(p = 1 - tail_area_bypass, df = s_size_bypass - 1)
t_score_bypass

[1] 1.647691

Code

##Calculating upper & lower CI bypass
CI_bypass <- c(s_mean_bypass - t_score_bypass * s_se_bypass, s_mean_bypass + t_score_bypass * s_se_bypass)
print(CI_bypass)

[1] 18.29029 19.70971

The 90% confidence interval for the Bypass wait time is 18.29 days - 19.71 days

Code

#Calculating the 90% CI manually of angiography heart surgery
##sample mean wait time
s_mean_angio <- 18
##sample sd
s_sd_angio <- 9
## sample size (n)
s_size_angio <- 847
## standard error  (sd/swrt(n)
s_se_angio <- s_sd_angio / sqrt (s_size_angio)
s_se_angio

[1] 0.3092437

Code

## Specify confidence level 
confidence_level_angio <- 0.9
##Calculate tail area of 2-tail t-test
tail_area_angio <- (1 - confidence_level_angio)/2
tail_area_angio

[1] 0.05

Code

#calculation t-score using qt
t_score_angio <-  qt(p = 1 - tail_area_angio, df = s_size_angio - 1)
t_score_angio

[1] 1.646657

Code

##Calculating upper & lower CI bypass
CI_angio <- c(s_mean_angio - t_score_angio * s_se_angio, s_mean_angio + t_score_angio * s_se_angio)
print(CI_angio)

[1] 17.49078 18.50922

The 90% confidence interval for the Angiography wait time is 17.49 days - 18.51 days

Code

#narrower CI
delta_CI_bypass = 19.71 - 18.29
delta_CI_bypass

[1] 1.42

Code

delta_CI_angio = 18.51 - 17.49
delta_CI_angio

[1] 1.02

The confidence interval is narrower for the Angiography (CI delta of 1.02 days) than the confidence interval for the Bypass (CI delta of 1.42 days)

Question 2

To calculate & interpret a 95% CI for the proportion of all adult Americans who believe that a college education is essential for success I used prop.test() because this involves estimating a proportion, not a probability. In this test, a “success” is defined as an adult American reporting a college education is essential for success, with the alternative being a response of an adult American reporting a college education is not essential for success.

The point estimate for the proportion of adult Americans who believe that college is essential is 0.55.

Code

#Calculate point estimate college is essential

sample_size_survey <- 1031
point_estimate_college_essential <- 567/sample_size_survey
point_estimate_college_essential

[1] 0.5499515

Specific information plugged in was:

sample proportion p = point estimate (0.5499515),
number sample “successes” x = 567
sample size n = 1031

Code

#Performing prop.test
prop.test(x = 567, n = 1031, p = point_estimate_college_essential)


    1-sample proportions test without continuity correction

data:  567 out of 1031, null probability point_estimate_college_essential
X-squared = 6.9637e-30, df = 1, p-value = 1
alternative hypothesis: true p is not equal to 0.5499515
95 percent confidence interval:
 0.5194543 0.5800778
sample estimates:
        p 
0.5499515

The 95% confidence interval is [0.52, 0.58]

This should be interpreted as 95% of confidence intervals calculated with this procedure would contain the true proportion of adult Americans who believe that a college education is essential for success.

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per semester for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range (in other words, you can assume they know the population standard deviation).

Assuming the significance level to be 5%, what should be the size of the sample?

Because we are assuming we know the population sd, this signals we should a 2-sided z-test to calculate the sample size. I can solve for sample size using the z-score through the Margin of Error (MOE) equation that:

MOE = zscore @ confidence level * (sqrt(sd^2/n))

rearranging the equation to solve for n by squaring both sides, multiplying by n, and then dividing by MOE^2 results in the equation:

n = (z^2 * sd^2)/ (MOE^2)

MOE = +/- 5
pop sd = (200-30)/4 = 42.5
at an alpha level 0.05, 2-sided z-score = 1.959964

Code

#solving for sample size

n = ((1.959964^2) * (42.5^2)) / (5^2) 
n

[1] 277.5454

The ideal sample size is 278

Question 4

According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90

A

Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

Assumptions of a 1-sample t-test:

population is normally distributed
observations in our sample are generated independently of one another

Information we are provided is that:

Population mean (μ) = 500
Sample mean (y bar) = 410
Sample sd = 90 s
Sample n = 9

Null Hypothesis: (H0) female mean income μ = 500 Alternative Hypothesis: (Ha) female mean income μ ≠ 500

Calculate the t statistic using the equation:

t = (y bar - μ ) / (sd estimate / sqrt(N))

Code

## Calculating tscore

t =  (410 - 500) / (90/sqrt(9))
t

[1] -3

The t-score is -3

To solve for the p-value from the t-score I will use the pt() function, with a q = the t-statistic of -3, df = 8 (9-1).

It it is important to note that an assumption of the pt() function is you are looking for the probability of the lower tail, not both tails. Because the t score is negative looking at the lower tail is appropriate, but to get the probability for 2-tailed test we need to multiply the lower tail probability by 2.

Code

#Calculating p-value

p_lower_tail = pt(q = t, df =8, lower.tail = TRUE)
p_lower_tail

[1] 0.008535841

This gives us the probability that μ is less than 500, but that is only 1 tail, so we multiply this problity by 2 to get the p value for our 2-sided t-test

Code

#Two-side t
p_twotail =  p_lower_tail * 2
p_twotail

[1] 0.01707168

With a 2-tail p value of 0.017, we can confidently reject the null hypothesis that female salary μ = 500 as p< 0.05. This suggests that the true mean salary of female workers is not equal to $500.

B

Report the P-value for Ha: μ < 500. Interpret.

Code

#P-value Ha: μ < 500

p_lower_tail = pt(q = t, df =8, lower.tail = TRUE)

p_lower_tail

[1] 0.008535841

With a small p-value of 0.00854 we reject the null hypothesis that μ = 500, suggesting that the true mean salary of female workers is less than $500.

C

Report and interpret the P-value for Ha: μ > 500.

Code

#P-value Ha: μ > 500

p_upper_tail = pt(q = t, df =8, lower.tail = FALSE)

p_upper_tail

[1] 0.9914642

With a large p-value of 0.99 we fail to reject the null hypothesis that μ = 500, suggesting that the true means salary of female workers is not greater than $500.

Question 5

Jones and Smith separately conduct studies to test H0: μ = 500 against Ha: μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7,with se = 10.0.

A

Show that Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.

Because we are provided with standard error not standard deviation, I used the equation:

t = (ȳ - μ) / se

Code

#Jones T score

t_Jones = (519.5 - 500) / 10.0
t_Jones

[1] 1.95

For Jones the t score is indeed 1.95

To calculate the p-value, I will use pt(), q = 1.95, df = (1000-1) = 999). Because Jone’s t-score is positive, we look at the upper tail.

Code

#Calculate 1-tail p-value Jones
p_upper_tail_Jones = pt(q = t_Jones, df = 999, lower.tail = FALSE)
p_upper_tail_Jones

[1] 0.02572777

The p value of the upper tail for Jones is 0.02572777, but because the alternative hypothesis is 2-tailed, it needs to be multiplied by 2

Code

#Calculating p_value Jones
p_Jones = p_upper_tail_Jones * 2
p_Jones

[1] 0.05145555

This shows the p-value of Jones is indeed 0.51

Code

# Smith T score 
t_Smith = (519.7 - 500) / 10.0
t_Smith

[1] 1.97

The t-score for Smith is indeed 1.97. The p value is calculated the same way as for Jones, using pt() & the upper tail:

Code

#Calculate 1-tail p-value Smith
p_upper_tail_Smith = pt(q = t_Smith, df = 999, lower.tail = FALSE)
p_upper_tail_Smith

[1] 0.02455713

Code

#Smith p=value
p_Smith = p_upper_tail_Smith * 2
p_Smith

[1] 0.04911426

Smith’s p-value does = 0.049

B

Using an alpha level = 0.05 for both

Jones’ p-value = 0.51 means a non-statistically significant result, where we fail to reject the null hypothesis that μ = 500
Smith’s p=value = 0.49 means a statistically significant results, where we reject the null hypothesis that μ = 500

C

This example showcases how reporting the actual p-value is important, as in this case the p-values are very close between Jones & Smith, which are both right on the edge of significance (0.050 +/- 0.01). A p-value of 0.049 & a p-value of 0.0000000049 are both less than 0.05 & would both lead us to reject the null hypothesis, but the second p-value is highly significant while the first p-value is only barely significant at an alpha level of 0.05. Reporting the p-value itself gives readers & other researchers information to much better understand your statistical findings than just reporting if the p value is > or < 0.05 or if you do or do not reject H0.

Question 6

A school nurse wants to determine whether age is a factor in whether children choose a healthy snack after school. She conducts a survey of 300 middle school students, with the results below. Test at α = 0.05 the claim that the proportion who choose a healthy snack differs by grade level. What is the null hypothesis? Which test should we use? What is the conclusion?

The null hypothesis is that there is no association between the grade of the student and if they choose a healthy snack (they are independent). In this question independence would result in the proportion of students in each grade who choose a healthy snack is the same: H0: p6th = p7th = p8th
Because I want to test if there is an association between 2 categorical variables, snack preference & grade level, I used a chi-squared test.

First I recreated the HW2 snacks data table in R:

Code

#Recreating nurse data

Health_tab <- matrix(c(31, 43, 51, 69, 57, 49), nrow=2, byrow=TRUE)
colnames(Health_tab) <- c('6th','7th','8th')
rownames(Health_tab) <- c('Healthy','Unhealthy')
Health_tab <- as.table(Health_tab)

Health_tab

          6th 7th 8th
Healthy    31  43  51
Unhealthy  69  57  49

Then a chi-squared was run on the data set:

Code

#Chi Squared test

chisq.test(Health_tab)


    Pearson's Chi-squared test

data:  Health_tab
X-squared = 8.3383, df = 2, p-value = 0.01547

Conclusion : The p-value of 0.015 from the chi-squared test is less than alpha=0.05, meaning we reject the null hypothesis that there is no association between student grade level and preference for healthy snacks.
At a significance level of 0.05 and df = 2, the critical value = 5.991. The X-squared = 8.3383 is greater than the critical value, once again supporting rejecting the null hypothesis. This suggests snack preference is not independent of grade level.

Question 7

Per-pupil costs (in thousands of dollars) for cyber charter school tuition for school districts in three areas are shown. Test the claim that there is a difference in means for the three areas, using an appropriate test. What is the null hypothesis? Which test should we use? What is the conclusion?

The null hypothesis is that there is no difference between the mean per-pupil cost for cyber charter tuition in 3 different areas, which can be written out as:

H0: μ1 = μ2 = μ3

Because we are testing if there is a difference between the means of 3 or more groups that differ by the categorical variable of area we should use a ANOVA test
I entered in the data below manually, and then pivoted longer to get 2 columns, Area and Cost.

Code

Area_1 <- c( 6.2, 9.3, 6.8, 6.1, 6.7, 7.5)
Area_2 <- c( 7.5, 8.2, 8.5, 8.2, 7.0, 9.3)
Area_3 <- c( 5.8, 6.4, 5.6, 7.1, 3.0, 3.5)

charter_data <- data.frame (Area_1, Area_2, Area_3)

Code

#Pivoted Longer

pivot_charter <- charter_data %>%
  pivot_longer(c(Area_1, Area_2, Area_3 ), names_to= "Area", values_to="Cost")

pivot_charter

# A tibble: 18 × 2
   Area    Cost
   <chr>  <dbl>
 1 Area_1   6.2
 2 Area_2   7.5
 3 Area_3   5.8
 4 Area_1   9.3
 5 Area_2   8.2
 6 Area_3   6.4
 7 Area_1   6.8
 8 Area_2   8.5
 9 Area_3   5.6
10 Area_1   6.1
11 Area_2   8.2
12 Area_3   7.1
13 Area_1   6.7
14 Area_2   7  
15 Area_3   3  
16 Area_1   7.5
17 Area_2   9.3
18 Area_3   3.5

Lastly, I ran the ANOVA, specifying Area as the independent variable & Cost (per pupil) as the dependent variable.

Code

#Anova 
anova_charter <- aov( Cost ~ Area, data = pivot_charter)

summary(anova_charter)

            Df Sum Sq Mean Sq F value  Pr(>F)   
Area         2  25.66  12.832   8.176 0.00397 **
Residuals   15  23.54   1.569                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The resulting p-value of 0.00397 is less than an alpha = 0.05 and therefor significant, leading us to reject out null hypothesis that there is no difference in the mean per-pupil charter school cost between the 3 areas. There does appear to be a difference in per-pupil charter school cost based on which of the 3 areas the school is in.