Homework 2

hw2

Author

Young Soo Choi

Published

March 27, 2023

Prepare

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(dplyr)

Question 1

In order to obtain the confidence interval, first, a t-value of 90% confidence level must be obtained with the given statistics.

First, I will look for the confidence interval of the bypass group.

Code

# t of the given bypass group
bypass.t<-qt(p=0.95, df=539)
bypass.t

[1] 1.647686

The confidence interval includes a section that falls margin of error from a specific predicted value, and this problem is to obtain the confidence interval of the mean, so the upper and lower limits of the confidence interval were obtained using the sample mean and the previously obtained t-value, respectively.

Code

# confidence interval of bypass group
bypass.conf.up<-19+bypass.t*(10/sqrt(539))
bypass.conf.low<-19-bypass.t*(10/sqrt(539))
bypass.conf.up

[1] 19.70971

Code

bypass.conf.low

[1] 18.29029

Now let’s look at the confidence interval of 90% of the angiography group in the same way.

Code

# confidence interval of angiography group

angio.t<-qt(p=0.95, df=847)
angio.t

[1] 1.646655

Code

angio.conf.up<-18+angio.t*(9/sqrt(847))
angio.conf.low<-18-angio.t*(9/sqrt(847))
angio.conf.up

[1] 18.50922

Code

angio.conf.low

[1] 17.49078

For a 90% confidence interval, a bypass group is from 18.29 to 19.71, an angiography group is from 17.49 to 18.51.

Random data were generated using the given statistics and each confidence interval was also obtained.

Code

set.seed(1)

bypass.s<-rnorm(539, mean=19, sd=10)
angio.s<-rnorm(847, mean=18, sd=9)

t.test(bypass.s, conf.level = .9)


    One Sample t-test

data:  bypass.s
t = 44.184, df = 538, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 18.44913 19.87842
sample estimates:
mean of x 
 19.16377

Code

t.test(angio.s, conf.level = .9)


    One Sample t-test

data:  angio.s
t = 54.954, df = 846, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 17.15787 18.21788
sample estimates:
mean of x 
 17.68788

The statistics of the generated data did not exactly match the statistics given by the problem, so the exact same confidence interval was not obtained, but they are similar.

Question 2

A data vector of 1 for those who believe that college education is essential for success and 0 for those who do not.

Code

q2<-c(rep(1, 567), rep(0,464))

Then, we can simply obtain a 95% confidence interval using the t.test function.

Code

t.test(q2)


    One Sample t-test

data:  q2
t = 35.477, df = 1030, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.5195334 0.5803696
sample estimates:
mean of x 
0.5499515

As a result of this survey, the proportion of people who believe that college education is important for success is 55.0%, and the 95% confidence interval is from 52.0% to 58.0%.

When the significance level is 5%, the meaning of this confidence interval is that if the survey is conducted several times in the same way, there is a 95% probability that the confidence interval calculated from each survey result contains the parameter within that range.

Question 3

Depending on the value given in the problem, the population standard deviation is 42.5 (170/4), and the allowable margin of error is 5. And margin of error can be obtained by zsd/(n^1/2). So n can be obtained by solving the following equation. margin of error =< zsd(n^1/2) transform that equation, n >= (z*sd/margin of error)^2

The significance level here is 5%, so z is 1.96. And sd is 42.5 margin of error is 5. Then n >= (1.96*42.5/5)^2

Code

(1.96*42.5/5)^2

[1] 277.5556

n >= 277.56, In other words, the number of samples must exceed 278 to meet the conditions in question.

Question 4

A

First, assume that s in the population is equal to the sample in question. And the significance level is set at 5%.

Null Hypothesis: μ = 500 Alternative Hypothesis: μ ≠ 500

Since it is a test that estimates whether the parameter is a specific value, a two-side t-test should be performed. At this time, t is ȳ - μ / s/(n^1/2).

Code

t.a<-((410-500)/(90/3))
t.a

[1] -3

t is -3. And since it is a two-side test, the p-value of this hypothesis can be obtained by multiplying the p-value at t=-3.

Code

p.a<-(pt(t.a, df=8))*2
p.a

[1] 0.01707168

p-value is 0.017. And this is smaller than our significant level(0.05). So, we can reject our null hypothesis. This means that At the 95% confidence level, μ is not 500.

B

Here, the null hypothesis is μ >= 500. Then, the alternative hypothesis is μ < 500. In this case, a one-sided test shall be performed.

Code

t.b<--3
p.b<-pt(t.b, df=8)
p.b

[1] 0.008535841

The p-value can be obtained in the same way as A. Here, the value of p is 0.009. This means that at the 95% confidence level, μ is not greater than 500.

C

Here, contrary to B, the the null hypothesis is μ =< 500. Since the direction of the inequality has changed, the area of the p value in the t-distribution must be opposite to B.

Code

t.c<--3
p.c<-(1-pt(t.c, df=8))
p.c

[1] 0.9914642

Using these facts, the p-value is 0.9914. Now, at the 5% significance level, the null hypothesis cannot be rejected. That is, at a 95% confidence level, μ is less than 500.

Question 5

A

t is (ȳ-μ) / (sd/sqrt(n)). And sd/sqrt(n) is se. So we can compute t through this equation.

Code

# Jones's t and p-value
jones.t<-(519.5-500)/10
jones.p<-(1-pt(jones.t, df=999))*2
jones.t

[1] 1.95

Code

jones.p

[1] 0.05145555

Code

# Smith's t and p-value
smith.t<-(519.7-500)/10
smith.p<-(1-pt(smith.t, df=999))*2
smith.t

[1] 1.97

Code

smith.p

[1] 0.04911426

B

α is a level arbitrarily set according to the researcher’s judgment, and does not affect the t or p-value of each hypothesis.

The p value means the probability that t will have such a value when t is a normal distribution, that is, even if ȳ=μ=500, and se is 10, the probability of ȳ=519.5 can be 5.14%. Therefore, p greater than α means that even if ȳ=μ is true, the probability of accidentally ȳ=519.5 is greater than the type 1 error set by the researcher (false positive), which means that the null hypothesis cannot be rejected.

So when α=0.05, Jones’s null hypothesis cannot be rejected, and Smith’s null hypothesis can be rejected. In other words, only Smith’s research results are statistically significant.

C

As seen in B, the p-value itself means the probability of a specific t, and not revealing a specific p-value does not tell how statistically significant this study will be at the significance level (type 1 error), so information loss occurs.

Question 6

First, I will enter the data frame given in the problem in r.

Code

df1 <- data.frame(
  grade = rep("6th", 100),
  snack = c(rep("healthy", 31), rep("unhealthy", 69)))

df2 <- data.frame(
  grade = rep("7th", 100),
  snack = c(rep("healthy", 43), rep("unhealthy", 57)))

df3 <- data.frame(
  grade = rep("8th", 100),
  snack = c(rep("healthy", 51), rep("unhealthy", 49)))

df<-rbind(df1, df2, df3)
table(df)

     snack
grade healthy unhealthy
  6th      31        69
  7th      43        57
  8th      51        49

Null Hypothesis: Snack selection ratio is the same for each grade.(Healthy Snack Choice Ratio of 6th grade, 7th grade, and 8th grade is equal.)

Test: Here, the categorical data of healthy and unhealthy snacks are given, so Chi-square is used.

Code

chisq.test(table(df))


    Pearson's Chi-squared test

data:  table(df)
X-squared = 8.3383, df = 2, p-value = 0.01547

Test result chi-square value is 8.338. And p-value here(0.015) is less than given α(0.05). In other words, it can be concluded that the null hypothesis is rejected, and the selection ratio of snacks at the significance level of 0.05 varies depending on the grade.

Question 7

Here, the null hypothesis is that the average cost of the 3 area are the same. Therefore, the research hypothesis is that the average cost of at least one area is different.

Since the cost is a numeric variable, should ANOVA be selected to test that the average cost of 3 area are different.

First, I created a data frame using the given data.

Code

df.7<-data.frame(
  cost=c(6.2, 9.3, 6.8, 6.1, 6.7, 7.5, 7.5, 8.2, 8.5, 8.2, 7.0, 9.3, 5.8, 6.4, 5.6, 7.1, 3.0, 3.5),
  area=c(rep("a1", 6), rep("a2", 6), rep("a3", 6))
)

Then let’s do ANOVA using the aov function and look at the resulting statistics.

Code

summary(aov(cost~area, df.7))

            Df Sum Sq Mean Sq F value  Pr(>F)   
area         2  25.66  12.832   8.176 0.00397 **
Residuals   15  23.54   1.569                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F value is 8.176 and p-value is 0.00397. Therefore, if the significance level is 0.05, the null hypothesis is rejected. In other words, at a 95% confidence level, it can be concluded that the average cost of these three regions is not the same.

Finally, I will visually examine the average cost by region with a boxplot. From this, it can be seen that there is a clear difference.

Code

boxplot(cost~area, data=df.7)

Prepare

Question 1

More

Question 2

Question 3

Question 4

A

B

C

Question 5

A

B

C

Question 6

Question 7