Homework 2

hw2

homework2

abigailbalint

Author

Abigail Balint

Published

April 3, 2023

Code

library(tidyverse)
library(ggplot2)
library(dplyr)
library(readxl)
knitr::opts_chunk$set(echo = TRUE)

Question 1

Generating a table here to calculate standard deviation

Code

heart <- matrix(c(539, 19, 10, 847, 18, 9), ncol=3, byrow=TRUE)
colnames(heart) <- c('Sample Size','Mean Wait Time','Standard Deviation')
rownames(heart) <- c('Bypass','Angiography')
heart <- as.table(heart)
head(heart)

            Sample Size Mean Wait Time Standard Deviation
Bypass              539             19                 10
Angiography         847             18                  9

Calculating confidence interval for Bypass:

Code

stander1 <- 10/sqrt(539)
confidence_level <- 0.90 
tail_area <- (1-confidence_level)/2
t_score <- qt(p = 1-tail_area, df = 539-1)
t_score

[1] 1.647691

Code

CI <- c(19 - t_score * stander1,
        19 + t_score * stander1)
print(CI)

[1] 18.29029 19.70971

Calculating confidence interval for Angiography:

Code

stander2 <- 9/sqrt(847)
confidence_level <- 0.90 
tail_area <- (1-confidence_level)/2
t_score <- qt(p = 1-tail_area, df = 847-1)
t_score

[1] 1.646657

Code

CI <- c(18 - t_score * stander2,
        18 + t_score * stander2)
print(CI)

[1] 17.49078 18.50922

Code

BypassInt <- 19.70971 - 18.29029
AngiographyInt <- 18.50922-17.49078 
print(BypassInt)

[1] 1.41942

Code

print(AngiographyInt)

[1] 1.01844

Question: Is the confidence interval narrower for angiography or bypass surgery? Answer: The interval for angiography surgery is shorter at an interval of 1.01.

Question 2

Here I am using a prop test function to find the p value -

Code

prop.test(567, 1031, conf.level = 0.95)


    1-sample proportions test with continuity correction

data:  567 out of 1031, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5189682 0.5805580
sample estimates:
        p 
0.5499515

Answer: To interpret this, I am looking at my confidence interval and seeing that the true population mean of those who believe college is needed for success is somewhere between 52-58% which makes sense because the reported percentage from this random sample who believe college is needed for success is 54%, inside of that range.

Question 3

My calculations to find the sample size are step by step in the below code block:

Code

#Starting by trying to find margin of error:
#Formula is error= z * SD/sqrt of n
#By squaring the whole thing, can transform this formular to to n=z^2*SD^2/error^2
# Z score for 95% confidence is 1.96
1.96^2

[1] 3.8416

Code

#n=3.8416*SD^2/error^2
#Finding the range and dividing it by 4 since we know SD is quarter of the range
(200-30)/4

[1] 42.5

Code

42.5^2

[1] 1806.25

Code

#Now formula looks like this and just need to find standard error n=3.8416*1806.25/error^2
#We know that margin of error is 5 (within $5)
5^2

[1] 25

Code

#Now we can fill in the final formula n=3.8416*1806.25/25
(3.8416*1806.25)/25

[1] 277.5556

Code

#n=277.56

Answer: The sample size should be about 278 (rounded)

Question 4

A. Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

Assumptions - -Randomly generated sample -From the same general population -Distribution is normal

Null hypothesis - females earn $500 a week Alternative hypothesis - females earn more or less than $500 a week

Calculating the t value below using standard formula -

t= (sample mean-population mean)/(standard deviation/sample size^2)

Code

tvalue =  (410 - 500) / (90/sqrt(9))
tvalue

[1] -3

Answer: t=-3

This tells us that the mean of the female group is three standard deviations away from the mean of the overall group’s pay.

B. Report the P-value for alternative hypothesis: μ < 500. Interpret.

Calculating p-value -

Formula pt(q = t, df =standardeviation-1, lower.tail = TRUE)

Code

pt(q = tvalue, df =8, lower.tail = TRUE)

[1] 0.008535841

Answer: .01 (rounded)

A p-value of .01 means we can reject the alternative hypothesis (mean for females < 500 a week)

C. Report and interpret the P-value for alternative hypothesis: μ > 500.

Code

pt(q = tvalue, df =8, lower.tail = FALSE)

[1] 0.9914642

Answer: .99 (rounded) This is the same but flipped, can reject the alternative hypothesis (mean for females > 500 a week)

Question 5

Using formula t= sample mean-population mean/standard error

Code

t=(519.5-500)/10
print(t)

[1] 1.95

Code

pt(q = t, df = 999, lower.tail = FALSE)*2

[1] 0.05145555

Answer: Jones t=1.95, p=.051

Code

t=(519.7-500)/10
print(t)

[1] 1.97

Code

pt(q = t, df = 999, lower.tail = FALSE)*2

[1] 0.04911426

Answer: Smith t=1.97, p=.049

B. This makes Smith statistically significant because .049 falls below .05 but .051 does not.

C. This shows that results presented this way can be misleading because even though the p-values are extremely close here, one would report rejecting the null hypothesis and one wouldn’t even though the differences in results are marginal.

Question 6

Null hypothesis: Proportion of those who choose healthy snacks is not equal by grade level.

Generating table -

Code

snack <- matrix(c(31, 43, 51, 69, 57, 49), ncol=3, byrow=TRUE)
colnames(snack) <- c('6th','7th','8th')
rownames(snack) <- c('healthy','unhealthy')
snack <- as.table(snack)
head(snack)

          6th 7th 8th
healthy    31  43  51
unhealthy  69  57  49

Performing chi squared test -

Code

chisq.test(snack, .05, correct = FALSE)


    Pearson's Chi-squared test

data:  snack
X-squared = 8.3383, df = 2, p-value = 0.01547

Since the p value is .01 we can assume that there is a difference by grade level in those who choose unhealthy vs healthy snacks, rejecting null hypothesis.

Question 7

Null hypothesis: There is no difference in per-pupil costs between areas.

Generating data frame -

Code

area <- c(rep("Area1", 6), rep("Area2", 6), rep("Area3", 6))
cost <- c(6.2, 9.3, 6.8, 6.1, 6.7, 7.5, 7.5, 8.2, 8.5, 8.2, 7.0, 9.3,
          5.8, 6.4, 5.6, 7.1, 3.0, 3.5)
tuition <- data.frame(area,cost)
head(tuition)

   area cost
1 Area1  6.2
2 Area1  9.3
3 Area1  6.8
4 Area1  6.1
5 Area1  6.7
6 Area1  7.5

Code

anova <- aov(cost ~ area, data = tuition)
summary(anova)

            Df Sum Sq Mean Sq F value  Pr(>F)   
area         2  25.66  12.832   8.176 0.00397 **
Residuals   15  23.54   1.569                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The p value is very low indicating there is a difference and we can reject the null hypothesis. :