HW2
Author

Diana Rinker

Published

March 25, 2023

DACSS 603, spring 2023

Homework 2, Diana Rinker.

Loading necessary libraries:

Code
library(readxl)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(ggplot2)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8     ✔ purrr   1.0.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Question 1

The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population.

Code
`Surgical Procedure` <- c("Bypass", "Angiography" )
`Sample Size` <- c(539,847 )
`Mean Wait Time` <- c(19, 18)
`Standard Deviation` <- c(10, 9)
data<- data.frame(`Surgical Procedure`,`Sample Size`, `Mean Wait Time`, `Standard Deviation` )
knitr::kable(data )
| Surgical.Procedure | Sample.Size | Mean.Wait.Time | Standard.Deviation |
|---|---|---|---|
| Bypass | 539 | 19 | 10 |
| Angiography | 847 | 18 | 9 |

Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?

The formula for the confidence interval of a sample is

\[ CI =\bar{X} \pm t \cdot \frac{s}{\sqrt n} \]
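
As a quick check, plugging the bypass numbers into this formula (mean 19, s = 10, n = 539, and the critical value qt(0.95, 538) ≈ 1.648) gives

\[ 19 \pm 1.648 \cdot \frac{10}{\sqrt{539}} \approx 19 \pm 0.71 = (18.29,\ 19.71), \]

which matches the interval computed by the code below.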

Code
# Calculating the t critical value and 90% confidence interval for each procedure
data <- data %>%
  mutate(confidence_level = 0.9,
         tail_area = (1 - confidence_level) / 2,
         t_score = qt(p = 1 - tail_area, df = Sample.Size - 1),
         CI.low  = Mean.Wait.Time - t_score * Standard.Deviation / sqrt(Sample.Size),
         CI.high = Mean.Wait.Time + t_score * Standard.Deviation / sqrt(Sample.Size),
         CI.width = CI.high - CI.low)
data <- data %>%
  select(Surgical.Procedure, CI.low, CI.high, CI.width, t_score, everything())
knitr::kable(data)
| Surgical.Procedure | CI.low | CI.high | CI.width | t_score | Sample.Size | Mean.Wait.Time | Standard.Deviation | confidence_level | tail_area |
|---|---|---|---|---|---|---|---|---|---|
| Bypass | 18.29029 | 19.70971 | 1.419421 | 1.647691 | 539 | 19 | 10 | 0.9 | 0.05 |
| Angiography | 17.49078 | 18.50922 | 1.018436 | 1.646657 | 847 | 18 | 9 | 0.9 | 0.05 |

From the table above I can see that the confidence interval for angiography is narrower than the one for bypass surgery, because the angiography sample is larger (847 vs. 539) and has a smaller standard deviation (9 vs. 10).

An alternative way to calculate the intervals is to simulate data from the given parameters and run a two-sided t.test(); note that the simulation forces the sample mean, but not the standard deviation, to match the given values exactly:

Code
Angiography.data <- rnorm(n=847, mean =18 , sd = 9)
# Calculate the sample mean
mean.q1 <- mean(Angiography.data )
# Adjust the sample to have a mean of exactly 18
Angiography.data <- Angiography.data + 18 - mean.q1

t.test(Angiography.data, alternative = "two.sided", conf.level = 0.9) 

    One Sample t-test

data:  Angiography.data
t = 59.128, df = 846, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 17.49872 18.50128
sample estimates:
mean of x 
       18 
Code
Bypass.data <- rnorm(n=539, mean =19 , sd = 10)
# Calculate the sample mean
mean.bypass.q1 <- mean(Bypass.data )
# Adjust the sample to have a mean of exactly 19
Bypass.data  <- Bypass.data  + 19 - mean.bypass.q1

t.test(Bypass.data , alternative = "two.sided", conf.level = 0.9) 

    One Sample t-test

data:  Bypass.data
t = 45.732, df = 538, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 18.31544 19.68456
sample estimates:
mean of x 
       19 

Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.

I will assume that the survey responses follow a binomial distribution (i.e., each respondent could either agree or disagree that a college education is essential for success).

The point estimate for this sample is 567/1031 ≈ 0.55:

Code
point.estimate <- 567/1031

Confidence interval = point estimate ± margin of error. To calculate a confidence interval for p, I will use the following formula:

\[ CI = \hat{p} \pm z \cdot SE, \qquad SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Code
Sample.proportion <- 567 / 1031
Sample.Size <- 1031
data2 <- data.frame(Sample.proportion, Sample.Size)
data2 <- data2 %>%
  mutate(SE = sqrt(Sample.proportion * (1 - Sample.proportion) / Sample.Size), # SE of a sample proportion
         z_score = 1.96, # for a 95% confidence interval
         CI.low  = Sample.proportion - z_score * SE,
         CI.high = Sample.proportion + z_score * SE)
knitr::kable(data2)
| Sample.proportion | Sample.Size | SE | z_score | CI.low | CI.high |
|---|---|---|---|---|---|
| 0.5499515 | 1031 | 0.015494 | 1.96 | 0.5195833 | 0.5803197 |

This confidence interval can be interpreted as: "We are 95% confident that the proportion of adult Americans who believe a college education is essential for success is between roughly 52% and 58%."

I could also calculate the confidence interval by generating a sample data set and using the prop.test() function:

Code
# Generating a data set of 1031 binary responses with exactly 567 "agree" answers
successes <- 0
p <- 567 / 1031
n <- 1031
# repeat until the desired number of successes is obtained
while (successes != 567) {
  responses <- rbinom(n = n, size = 1, prob = p)
  # count the number of successes
  successes <- sum(responses)
}
respondent.n <- 1:n
dataset.q2 <- data.frame(respondent.n, responses)

prop.test(x = sum(dataset.q2$responses), n = n, p = p)

    1-sample proportions test without continuity correction

data:  sum(dataset.q2$responses) out of n, null probability p
X-squared = 6.9637e-30, df = 1, p-value = 1
alternative hypothesis: true p is not equal to 0.5499515
95 percent confidence interval:
 0.5194543 0.5800778
sample estimates:
        p 
0.5499515 

This way of calculating the CI gives essentially the same interval as the first approach.
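
In fact, prop.test() can be applied directly to the reported counts, with no need to simulate individual responses; a minimal sketch that should reproduce the interval shown above:

Code
prop.test(x = 567, n = 1031, conf.level = 0.95, correct = FALSE)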

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per semester for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range (in other words, you can assume they know the population standard deviation). Assuming the significance level to be 5%, what should be the size of the sample?

Since I am given the population standard deviation, I can use the following formula for a 95% confidence interval:

\[ CI = \bar{X} \pm 1.96 \cdot \frac{\sigma}{\sqrt n} \]

The width of the CI must be $10:

\[ \text{CI width} = \left(\bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt n}\right) - \left(\bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt n}\right) = 10 \]

\[ 2 \cdot 1.96 \cdot \frac{\sigma}{\sqrt n} = 10 \]

\[ \sqrt n = \frac{1.96\,\sigma}{5} \]

\[ n = \left(\frac{1.96 \cdot 42.5}{5}\right)^2 \]

where σ = (200 − 30)/4 = 42.5.

Code
SD.population <- (200-30)/4
n<- ((1.96*SD.population)/5)^2
knitr::kable(n) 
x
277.5556

To keep the confidence interval width at $10 or less, the financial aid office should sample at least 278 students (rounding 277.56 up).
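
The same calculation can be written with the exact z critical value and rounded up to a whole student; a small sketch, using the assumed σ = 42.5 from above:

Code
z <- qnorm(0.975)          # exact 97.5th percentile of the standard normal, ≈ 1.96
sigma <- (200 - 30) / 4    # assumed population standard deviation
n.required <- ceiling((z * sigma / 5)^2)
n.required                 # 278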

Question 4

According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women's group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = $90.

A. Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

B. Report the P-value for Ha: μ < 500. Interpret.

C. Report and interpret the P-value for Ha: μ > 500. (Hint: The P-values for the two possible one-sided tests must sum to 1.)

A. Two-sided t.test:

Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

Assumptions: the sample is a random sample, representative of all senior-level female workers in the company, and weekly incomes are approximately normally distributed (needed for the t test with n = 9).

H0: the mean female income equals the agreed $500 per week. H1: the mean female income differs from $500 per week.

To test this hypothesis, I will use a one-sample t-test, which compares the sample mean and its confidence interval to the policy level of $500.

The t.test() function in R uses the t distribution, which is appropriate for this small sample size (n = 9).

Code
# Generate a sample with a mean close to 410
female.sample <- rnorm(n=9, mean=410, sd=90)
# Calculate the sample mean
sample.mean <- mean(female.sample)
# Adjust the sample to have a mean of exactly 410
female.sample <- female.sample + 410 - sample.mean
a.ttest.sample<-t.test(female.sample, alternative = "two.sided", mu=500)
a.ttest.sample

    One Sample t-test

data:  female.sample
t = -2.2966, df = 8, p-value = 0.05074
alternative hypothesis: true mean is not equal to 500
95 percent confidence interval:
 319.6305 500.3695
sample estimates:
mean of x 
      410 

The output above shows a 95% confidence interval of roughly (320, 500) with a p-value of about 0.051, right at the 0.05 cutoff. This happens because the simulation fixes the sample mean at 410 but not the standard deviation, which in this draw came out larger than 90. Using the given summary statistics directly (see the sketch below), t = (410 − 500)/(90/3) = −3, the two-sided p-value is about 0.017, and the 95% confidence interval (340.8, 479.2) does not include the policy value of $500. Therefore I reject H0 and accept H1: the mean income of female employees differs from $500 per week.
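
A minimal sketch computing the test statistic, two-sided p-value, and confidence interval directly from the summary statistics given in the problem, with no simulated data involved:

Code
# Direct calculation from the given summary statistics
y.bar <- 410   # sample mean
s     <- 90    # sample standard deviation
n     <- 9     # sample size
mu0   <- 500   # mean under H0

t.stat <- (y.bar - mu0) / (s / sqrt(n))                        # -3
p.two.sided <- 2 * pt(t.stat, df = n - 1)                      # ≈ 0.017
ci <- y.bar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)   # ≈ (340.8, 479.2)
c(t = t.stat, p = p.two.sided, CI.low = ci[1], CI.high = ci[2])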

B. One-sided, “less”

Report the P-value for Ha: μ < 500. Interpret.

H0: the mean female income is at least the agreed $500 per week. H1: the mean female income is lower than $500 per week.

Code
b.ttest.sample<-t.test(female.sample, alternative = "less", mu=500)
b.ttest.sample

    One Sample t-test

data:  female.sample
t = -2.2966, df = 8, p-value = 0.02537
alternative hypothesis: true mean is less than 500
95 percent confidence interval:
     -Inf 482.8735
sample estimates:
mean of x 
      410 

The one-sided confidence interval shown above, (-Inf, 482.9), lies entirely below 500, so it does not include the policy value. I accept the alternative hypothesis that the true mean female income is lower than $500. The p-value of this one-sided test is half that of the two-sided test, because we are only considering departures in one direction from the null value.

C. One-sided, “greater”.

Report and interpret the P-value for Ha: μ > 500.

H0: the mean female income is not greater than the agreed $500 per week. H1: the mean female income is greater than $500 per week.

Code
c.ttest.sample<-t.test(female.sample, alternative = "greater", mu=500)
c.ttest.sample

    One Sample t-test

data:  female.sample
t = -2.2966, df = 8, p-value = 0.9746
alternative hypothesis: true mean is greater than 500
95 percent confidence interval:
 337.1265      Inf
sample estimates:
mean of x 
      410 

The one-sided confidence interval shown above, (337.1, Inf), includes 500, and the p-value of 0.9746 is far above 0.05, so I fail to reject H0: there is no evidence that the mean female income is greater than $500. As the hint suggests, the p-values of the two one-sided tests (μ < 500 and μ > 500) sum to 1.
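
A small sketch, again using the summary statistics from part A, confirming that the two one-sided p-values sum to 1:

Code
t.stat <- (410 - 500) / (90 / sqrt(9))   # -3
p.less    <- pt(t.stat, df = 8)          # Ha: mu < 500, ≈ 0.0085
p.greater <- 1 - pt(t.stat, df = 8)      # Ha: mu > 500, ≈ 0.9915
p.less + p.greater                       # exactly 1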


Question 5

Jones and Smith separately conduct studies to test H0: μ = 500 against Ha: μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.

A. Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.

B. Using α = 0.05, for each study indicate whether the result is “statistically significant.”

C. Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0,” without reporting the actual P-value.

A. Calculating t-statistics for each sample:

Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.

To do that, we will create a summary table containing both studies' parameters and calculate the t-statistic, critical value, and p-value.

Code
# Summary statistics for the two studies
n <- c(1000, 1000)
study.name <- c("Jones", "Smith")
Means <- c(519.5, 519.7)
SE <- c(10.0, 10.0)
summary <- data.frame(study.name, n, Means, SE)

summary <- summary %>%
  mutate(tail_area = (1 - 0.95) / 2,
         t_score = round(qt(p = 1 - tail_area, df = n - 1), 2), # critical t value for a 95% CI
         CI.low  = Means - t_score * SE,
         CI.high = Means + t_score * SE,
         test.statistics = (Means - 500) / SE,
         p.value = (1 - pt(test.statistics, df = n - 1)) * 2)

knitr::kable(summary)
| study.name | n | Means | SE | tail_area | t_score | CI.low | CI.high | test.statistics | p.value |
|---|---|---|---|---|---|---|---|---|---|
| Jones | 1000 | 519.5 | 10 | 0.025 | 1.96 | 499.9 | 539.1 | 1.95 | 0.0514555 |
| Smith | 1000 | 519.7 | 10 | 0.025 | 1.96 | 500.1 | 539.3 | 1.97 | 0.0491143 |

I can see from the table above that the confidence interval (CI) for Jones's data includes the value 500, while Smith's CI does not. The two data sets also produce slightly different test statistics, and therefore p-values, both sitting right at the 0.05 cutoff. Because Jones's p-value is above 0.05, I fail to reject H0; Smith's p-value is below 0.05, so I reject H0 and accept H1.

B. Interpreting statistical significance.

Using α = 0.05, for each study indicate whether the result is “statistically significant.”

We can conclude that the difference between 500 and the mean of Smith's sample is statistically significant, so we reject H0, while the difference between Jones's sample mean and 500 is not statistically significant (i.e., not far enough from 500).

C. Potential misleading of comparing p-value to 0.05:

Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0,” without reporting the actual P-value.

Comparing these two almost identical samples demonstrates how a tiny difference in the sample mean can change the study outcome (rejection vs. non-rejection of H0). It is important to remember that sample means vary due to the randomness of the sampling process, and therefore produce a range of confidence intervals, test statistics, and p-values. Reporting the actual p-value shows the reader how close it is to the 0.05 cutoff and warns about the potential for an erroneous conclusion.
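
For example, rather than reporting only "P ≤ 0.05" or "P > 0.05", the exact p-values can be reported; a one-line sketch using the two test statistics from the table above:

Code
2 * (1 - pt(c(1.95, 1.97), df = 999))   # ≈ 0.051 (Jones) and 0.049 (Smith)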


Question 6

A school nurse wants to determine whether age is a factor in whether children choose a healthy snack after school. She conducts a survey of 300 middle school students, with the results below. Test at α = 0.05 the claim that the proportion who choose a healthy snack differs by grade level. What is the null hypothesis? Which test should we use? What is the conclusion?

Code
`grade level` <- c("6","7","8")
`Healthy snack` <- c(31,43,51)
`Unhealthy snack` <- c(69,57,49)
table <- data.frame (`grade level`, `Healthy snack`, `Unhealthy snack` )
# table <- addmargins( table , margin =1:2,  FUN = sum)
knitr::kable(table)  
| grade.level | Healthy.snack | Unhealthy.snack |
|---|---|---|
| 6 | 31 | 69 |
| 7 | 43 | 57 |
| 8 | 51 | 49 |

This table represents data for two categorical variables: grade level and snack choice.

I will generate sample data matching the counts given above:

Code
  # Generating sample for grade 6:
  successes <- 0
  # repeat until the desired number of successes is obtained
  while (successes != 31) {
          grade6 <- rbinom(n=100, 1, 0.31)
  # count the number of successes
          successes <- sum(grade6)
  }
  
  # Generating sample for grade 7:
  successes <- 0
  while (successes != 43) {
            grade7 <- rbinom(n=100, 1, 0.43)
            successes <- sum(grade7)
            }
  
  # Generating sample for grade 8:
  successes <- 0
  while (successes != 51) {
          grade8 <- rbinom(n=100, 1, 0.51)
          successes <- sum(grade8)
          }
  # Combining all grades in one table for analysis:
  table2<- data.frame(grade6, grade7, grade8)
  table2<- table2 %>%
        pivot_longer(c(grade6, grade7, grade8 ), names_to= "grade", values_to="healthy.choice")

  xtabs <-xtabs(~table2$grade + table2$healthy.choice)
  knitr::kable(  xtabs )  
| | 0 | 1 |
|---|---|---|
| grade6 | 69 | 31 |
| grade7 | 57 | 43 |
| grade8 | 49 | 51 |

The school nurse collected these data to test the following hypotheses:

H0: all grades' samples come from the same general population, and the probability of choosing a healthy snack is equal across grades.

H1: the probability of choosing a healthy snack is not equal across grades.

I will use a chi-square test to test this hypothesis:

Code
  table2$healthy.choice <- as.character (table2$healthy.choice)
  table2$grade <- as.character(table2$grade)
ch.sq<- chisq.test(table2$grade, table2$healthy.choice ,correct = FALSE)
ch.sq

    Pearson's Chi-squared test

data:  table2$grade and table2$healthy.choice
X-squared = 8.3383, df = 2, p-value = 0.01547

The p-value of 0.015 leads me to reject the H0 hypothesis that all grades have an equal probability, and to accept H1: children in different grades differ in their probability of choosing a healthy snack.
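
The same test can be run directly on the table of counts, without simulating individual responses; a minimal sketch (chisq.test() applies no continuity correction to tables larger than 2x2, so it matches the result above):

Code
# Chi-square test of independence on the 3x2 count table
counts <- matrix(c(31, 69,
                   43, 57,
                   51, 49),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(grade = c("6", "7", "8"),
                                 choice = c("healthy", "unhealthy")))
chisq.test(counts)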


Question 7

Per-pupil costs (in thousands of dollars) for cyber charter school tuition for school districts in three areas are shown. Test the claim that there is a difference in means for the three areas, using an appropriate test. What is the null hypothesis? Which test should we use? What is the conclusion?

Code
Area.1 <- c( 6.2, 9.3, 6.8, 6.1, 6.7, 7.5)
Area.2 <- c( 7.5, 8.2, 8.5, 8.2, 7.0, 9.3)
Area.3 <- c( 5.8, 6.4 ,5.6, 7.1, 3.0, 3.5)

table.q7 <- data.frame (Area.1, Area.2, Area.3)
knitr::kable(table.q7 )   
| Area.1 | Area.2 | Area.3 |
|---|---|---|
| 6.2 | 7.5 | 5.8 |
| 9.3 | 8.2 | 6.4 |
| 6.8 | 8.5 | 5.6 |
| 6.1 | 8.2 | 7.1 |
| 6.7 | 7.0 | 3.0 |
| 7.5 | 9.3 | 3.5 |

Ho: there is no difference in means between the three areas. H1: At least one of these areas’ means is significantly different.

To test this hypothesis, I will use a one-way ANOVA to compare the means of the three samples:

Code
 table.q7  <- table.q7  %>%
        pivot_longer(c(Area.1, Area.2, Area.3 ), names_to= "Area", values_to="score")

table.q7$Area <- as.factor (table.q7$Area)
anova.q7 <- aov( score ~ Area, table.q7 )
summary(anova.q7)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Area         2  25.66  12.832   8.176 0.00397 **
Residuals   15  23.54   1.569                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA results show a p-value of 0.00397. I conclude that at least one of the three areas' mean costs differs significantly from the others (i.e., I reject H0 and accept H1).
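
As a follow-up beyond what the question asks, a Tukey HSD post-hoc comparison on the fitted model would show which pairs of areas differ; a brief sketch:

Code
# Pairwise comparisons of area means, controlling the family-wise error rate
TukeyHSD(anova.q7)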