Homework 2

hw2

Adithya Parupudi

my homework 2

Author

Adithya Parupudi

Published

February 27, 2023

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(hrbrthemes)

NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
      Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
      if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

Code

library(viridis)

Warning: package 'viridis' was built under R version 4.2.2

Loading required package: viridisLite

Code

library(readxl)

Question 1

Code

procedure <- c("Bypass", "Angiography")
s_size <- c(539, 847)
mean_wait_time <- c(19, 18)
s_sd <- c(10, 9)

surgery <- data.frame(procedure, s_size, mean_wait_time, s_sd)

colnames(surgery) <- c("Surgical Procedure", "Sample Size","Mean Wait Time", "Standard Deviation")
surgery

  Surgical Procedure Sample Size Mean Wait Time Standard Deviation
1             Bypass         539             19                 10
2        Angiography         847             18                  9

Code

standard_error <- s_sd / sqrt(s_size)
standard_error

[1] 0.4307305 0.3092437

Code

confidence_level <- 0.90
tail_area <- (1-confidence_level)/2
tail_area

[1] 0.05

Code

t_score <- qt(p = 1-tail_area, df = s_size-1)
t_score

[1] 1.647691 1.646657

Code

confidence_interval <- c(mean_wait_time - t_score * standard_error,mean_wait_time + t_score * standard_error)
confidence_interval

[1] 18.29029 17.49078 19.70971 18.50922

We can be 90% confident that the population mean wait time for the bypass procedure is between 18.29029 and 19.70971 days. And for the angiography procedure, We can be 90% confident that the population mean wait time is between 17.49078 and 18.50922 days.

From the above results, confidence interval of angiography procedure is narrower.

Question 2

Code

prop.test(567, 1031, conf.level = .95)


    1-sample proportions test with continuity correction

data:  567 out of 1031, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5189682 0.5805580
sample estimates:
        p 
0.5499515

Question 3

Code

ME <- 5
z <- 1.96
s_sd <- (200-30)/4

sample_size <- ((z*s_sd)/ME)^2
sample_size

[1] 277.5556

The necessary sample size is 278.

Question 4

4A

We assume that the sample is random and that the population has a normal distribution. Null hypothesis: H0: μ = 500 Alternative hypothesis: Ha: μ ≠ 500 We will reject the null hypothesis at a p-value, p <= 0.05

Code

# defining variables
s_mean <- 410
μ <- 500
s_sd <- 90
s_size <- 9

Code

# Calculating test-statistic
t_score <- (s_mean-μ)/(s_sd/sqrt(s_size))
t_score

[1] -3

Code

# Calculating p-value

p <- 2*pt(t_score, s_size-1)
p

[1] 0.01707168

The test-statistic is -3 and p-value is 0.01707168. As p-value is less than the 0.05, we reject the null hypothesis. Therefore, the mean income of female employees is not equal to $500.

4B

We assume that the sample is random and that the population has a normal distribution. Null hypothesis: H0: μ = 500 Alternative hypothesis: Ha: μ < 500 We will reject the null hypothesis at a p-value less than 0.05

Code

p_val <- pt(t_score, s_size-1, lower.tail = TRUE)
p_val

[1] 0.008535841

The p-value is 0.008535841. As p-value is less than the 0.05, we reject the null hypothesis. Therefore, the mean income of female employees is less than $500.

4C

We assume that the sample is random and that the population has a normal distribution. Null hypothesis: H0: μ = 500 Alternative hypothesis: Ha: μ > 500 We will reject the null hypothesis at a p-value less than 0.05

Code

p_val_c <- pt(t_score, s_size-1, lower.tail = FALSE)
p_val_c

[1] 0.9914642

The p-value is 0.9914642. As p-value is less than the 0.05, we reject the null hypothesis. Therefore, the mean income of female employees is greater than $500.

Question 5

5A

Code

# Calculating t-statistic and p-value for Jones
s_mean <- 519.5
μ <- 500
se <- 10
s_size <- 1000

james_t <- (s_mean-μ)/se
james_t

[1] 1.95

Code

p <- 2*pt(james_t, s_size-1, lower.tail = FALSE)
p

[1] 0.05145555

Code

# Calculating t-statistic and p-value for Smith
s_mean <- 519.7
μ <- 500
se <- 10
s_size <- 1000

smith_t <- (s_mean-μ)/se
smith_t

[1] 1.97

Code

p <- 2*pt(smith_t, s_size-1, lower.tail = FALSE)
p

[1] 0.04911426

The test-statistic is 1.95, p-value is 0.05145555 for Jones and the test-statistic is 1.97, p-value is 0.05145555 for Smith.

5B

The p-value is 0.05145555 for Jones. As p-value is greater than the 0.05, we fail to reject the null hypothesis. The p-value is 0.04911426 for Jones. As p-value is less than the 0.05, we reject the null hypothesis. Therefore, the result is statistically significant for Smith, but not Jones.

5C

If we fail to report the P-value and simply state whether the P-value is less than/equal to or greater than the defined significance level of the test, one cannot determine the strength of the conclusion. In the Jones/Smith example, reporting the results only as P ≤ 0.05 versus P > 0.05 will lead to different conclusions about very similar results.

Question 6

To test the claim that the proportion of students who choose a healthy snack differs by grade level, we will use the chi-squared test of independence. The null hypothesis is that the proportion of students who choose a healthy snack is the same across all grade levels. The alternative hypothesis is that the proportion of students who choose a healthy snack differs by grade level.

Code

# Data in a matrix
table_6 <- matrix(c(31, 43, 51, 69, 57, 49), nrow = 2, byrow = TRUE)

# Row and column names
rownames(table_6) <- c("Healthy snack", "Unhealthy snack")
colnames(table_6) <- c("6th grade", "7th grade", "8th grade")

# Chi-squared test of independence
test <- chisq.test(table_6)
test


    Pearson's Chi-squared test

data:  table_6
X-squared = 8.3383, df = 2, p-value = 0.01547

The p-value (0.01547) is less than the significance level (α = 0.05). Therefore, we reject the null hypothesis and conclude that the proportion of students who choose a healthy snack differs by grade level.)

Question 7

To test the claim that there is a difference in means for the three areas, we will use a one-way ANOVA test. The null hypothesis is that the means for the three areas are the same. The alternative hypothesis is that at least one area’s mean is different from the others.

Code

# Data in vectors
area1 <- c(6.2, 9.3, 6.8, 6.1, 6.7, 7.5)
area2 <- c(7.5, 8.2, 8.5, 8.2, 7.0, 9.3)
area3 <- c(5.8, 6.4, 5.6, 7.1, 3.0, 3.5)

# One-way ANOVA test
test <- aov(c(area1, area2, area3) ~ factor(c(rep("Area 1", length(area1)), rep("Area 2", length(area2)), rep("Area 3", length(area3)))))
summary(test)

                                                                                                    Df
factor(c(rep("Area 1", length(area1)), rep("Area 2", length(area2)), rep("Area 3", length(area3))))  2
Residuals                                                                                           15
                                                                                                    Sum Sq
factor(c(rep("Area 1", length(area1)), rep("Area 2", length(area2)), rep("Area 3", length(area3))))  25.66
Residuals                                                                                            23.54
                                                                                                    Mean Sq
factor(c(rep("Area 1", length(area1)), rep("Area 2", length(area2)), rep("Area 3", length(area3))))  12.832
Residuals                                                                                             1.569
                                                                                                    F value
factor(c(rep("Area 1", length(area1)), rep("Area 2", length(area2)), rep("Area 3", length(area3))))   8.176
Residuals                                                                                                  
                                                                                                     Pr(>F)
factor(c(rep("Area 1", length(area1)), rep("Area 2", length(area2)), rep("Area 3", length(area3)))) 0.00397
Residuals                                                                                                  
                                                                                                      
factor(c(rep("Area 1", length(area1)), rep("Area 2", length(area2)), rep("Area 3", length(area3)))) **
Residuals                                                                                             
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (0.00397) is less than the significance level (α = 0.05). Therefore, we reject the null hypothesis and conclude that there is a difference in means for the three areas.