hw2
descriptive statistics
probability
Homework 2
Author

Caitlin Rowley

Published

March 28, 2023

Code
# load libraries

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(magrittr)
library(ggplot2)
library(markdown)
library(ggtext)
Warning: package 'ggtext' was built under R version 4.2.2
Code
library(readxl)
Warning: package 'readxl' was built under R version 4.2.2

Question 1

The time between the date a patient was recommended for heart surgery and the surgery date
for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data
Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean
and sample standard deviation for wait times (in days) of patients for two cardiac procedures
are given in the accompanying table. Assume that the sample is representative of the Ontario
population. Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?

Code
# bypass: sample size = 539, mean wait time = 19, SD = 10

b_size=539
b_mean=19
b_sd=10
b_ci=0.9

# calculate standard error:
bypass_se = b_sd/sqrt(b_size)

# tail area outside the 90% confidence interval (split between the two tails):
bypass_tail <- (1-b_ci)/2
print(bypass_tail)
[1] 0.05
Code
# calculate t-value:
b_t_value <-  qt(p=1-bypass_tail, df=b_size-1)
print(b_t_value)
[1] 1.647691
Code
# calculate confidence intervals:
b_conf_int <- c(b_mean-b_t_value*bypass_se, b_mean+b_t_value*bypass_se)
print(b_conf_int)
[1] 18.29029 19.70971

The 90% confidence interval for the mean wait time for bypass surgery is 18.3–19.7 days.

Code
# angiography: sample size = 847, mean wait time = 18, SD = 9

a_size=847
a_mean=18
a_sd=9
a_ci=0.9

# calculate standard error:
angio_se = a_sd/sqrt(a_size)

# tail area outside the 90% confidence interval (split between the two tails):
angio_tail <- (1-a_ci)/2
print(angio_tail)
[1] 0.05
Code
# calculate t-value:
a_t_value <-  qt(p=1-angio_tail, df=a_size-1)
print(a_t_value)
[1] 1.646657
Code
# calculate confidence intervals:
a_conf_int <- c(a_mean-a_t_value*angio_se, a_mean+a_t_value*angio_se)
print(a_conf_int)
[1] 17.49078 18.50922

The 90% confidence interval for the mean wait time for angiography is 17.5–18.5 days.

Code
# calculate size of confidence interval for bypass:

bypass_ci_size=19.70971-18.29029
print(bypass_ci_size)
[1] 1.41942
Code
# calculate size of confidence interval for angiography:

angio_ci_size=18.50922-17.49078
print(angio_ci_size)
[1] 1.01844

The confidence interval is narrower for angiography (width 1.01844 days) than for bypass surgery (width 1.41942 days).
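
The difference in widths follows directly from the margin of error, t × s/√n: the angiography sample is larger and its standard deviation smaller. A quick check, reusing the objects computed above:

Code
# interval width = 2 * t * s / sqrt(n) for each procedure
b_width <- 2*b_t_value*bypass_se   # bypass: about 1.42 days
a_width <- 2*a_t_value*angio_se    # angiography: about 1.02 days
c(bypass = b_width, angiography = a_width)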

Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public
Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567
believed that college education is essential for success. Find the point estimate, p, of the
proportion of all adult Americans who believe that a college education is essential for success.
Construct and interpret a 95% confidence interval for p.

Code
# binomial test - compares a sample proportion to a hypothesized proportion

total_pop = 1031
survey_pop = 567

binom.test(survey_pop, total_pop)

    Exact binomial test

data:  survey_pop and total_pop
number of successes = 567, number of trials = 1031, p-value = 0.001478
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5189927 0.5806243
sample estimates:
probability of success 
             0.5499515 

The point estimate of the proportion of adult Americans who believe that a college education is essential for success is p = 567/1031 ≈ 0.55 (55%). The 95% confidence interval is approximately 0.52 to 0.58: we can be 95% confident that the true proportion lies between 52% and 58%.
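
For comparison, a minimal sketch of the large-sample (normal-approximation) interval, p̂ ± 1.96 × √(p̂(1−p̂)/n), gives nearly the same bounds as the exact binomial interval above:

Code
# normal-approximation (Wald) 95% CI for the proportion
p_hat <- survey_pop/total_pop                 # point estimate, about 0.55
se_hat <- sqrt(p_hat*(1-p_hat)/total_pop)     # estimated standard error
p_hat + c(-1, 1)*qnorm(0.975)*se_hat          # roughly 0.52 to 0.58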

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost
of textbooks per semester for students. The estimate will be useful if it is within $5 of the true
population mean (i.e. they want the confidence interval to have a length of $10 or less). The
financial aid office is pretty sure that the amount spent on books varies widely, with most values
between $30 and $200. They think that the population standard deviation is about a quarter of
this range (in other words, you can assume they know the population standard deviation). Assuming the significance level to be 5%, what should be the size of the sample?

Code
# find sample size using [error=z*SD/sqrt of n]

# range in cost of books:
range=200-30

# population SD is 1/4 of range:
population_sd = range/4
print(population_sd)
[1] 42.5
Code
# If 95% of the area lies between -z and z, then 5% of the area lies outside this range.
# Since the normal curve is symmetric, half of that amount (2.5%) lies below -z, so the
# area below z must be 0.025 + 0.95 = 0.975. The number z is therefore the 97.5th
# percentile of the standard normal distribution:
z=qnorm(.975)

# estimate is within $5 of true population mean:
n=((z*population_sd)/5)^2
print(n)
[1] 277.5454

Rounding up, the sample size should be at least 278.
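
As a sanity check, reusing z and population_sd from above, n = 278 brings the margin of error just under the $5 target while n = 277 does not:

Code
# margin of error z*sd/sqrt(n) at the two candidate sample sizes
z*population_sd/sqrt(277)   # just over $5
z*population_sd/sqrt(278)   # just under $5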

Question 4

According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.
A. Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, the test statistic, and the P-value. Interpret the result.
B. Report the P-value for Ha: μ < 500. Interpret.
C. Report and interpret the P-value for Ha: μ > 500. (Hint: the P-values for the two possible one-sided tests must sum to 1.)

A)

Code
# mean income = $500/week
# mean income female: y=$410, s=90

# A)
# test statistic: measures how far the observed sample mean lies from the null-hypothesis
# value, in units of the estimated standard error (used to decide whether to reject the null)
# formula: t = (x_bar - mu)/(sd/sqrt(n)), where x_bar is the sample mean, mu is the
# hypothesized population mean, sd is the sample standard deviation, and n is the sample size

test_stat <- function(x_bar, mu, sd, n){return((x_bar-mu)/(sd/sqrt(n)))}

sample_mean=410
pop_mean=500
sd=90
sample_size=9

# find t-value:
t_stat <- test_stat(sample_mean, pop_mean, sd, sample_size)
print(t_stat)
[1] -3
Code
# t-value is negative, so find lower tail
# degree of freedom = n-1
# find p-value:
low_p_value <- pt(q=t_stat, df=8, lower.tail=TRUE)
print(low_p_value)
[1] 0.008535841
Code
# find p-value for two-tailed t-test:
two_tail_p_value <- 2*pt(t_stat, 8)
print(two_tail_p_value)
[1] 0.01707168

Assumptions: the nine female employees are a random sample and their incomes come from an approximately normal population. Hypotheses: H0: μ = 500 versus Ha: μ ≠ 500. The test statistic is t = -3, meaning the sample mean of $410 lies 3 standard errors below the hypothesized mean of $500. The two-sided p-value is 0.017, so at α = 0.05 we reject the null hypothesis that female employees earn $500/week on average; the mean income of female employees differs from the $500/week norm.
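
An equivalent way to see the same result is a 95% confidence interval built from the summary statistics, ȳ ± t(0.975, 8) × s/√n; it excludes $500, which is consistent with rejecting the null hypothesis. A sketch reusing the values defined above:

Code
# 95% confidence interval for the mean female income from summary statistics
margin <- qt(0.975, df=sample_size-1)*sd/sqrt(sample_size)
c(sample_mean-margin, sample_mean+margin)   # roughly $341 to $479, excludes $500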

B)

Code
# B
# calculate p-value for LT alternative hypothesis (Ha: u<500):

p_Ha = pt(q=t_stat, df=8, lower.tail=TRUE)
p_Ha
[1] 0.008535841

The p-value for the lower-tail alternative (Ha: μ < 500) is 0.009. At α = 0.05 we reject H0 in favor of this alternative; in other words, the data indicate that the mean income of female employees is below $500/week.

C)

Code
# C
# calculate p-value for UT alternative hypothesis 2 (Ha: u>500):

p_Ha = pt(q=t_stat, df=8, lower.tail=FALSE)
p_Ha
[1] 0.9914642

The p-value for the upper-tail alternative (Ha: μ > 500) is 0.991, so there is no evidence that μ exceeds $500/week. As the hint notes, the two one-sided p-values sum to 1 (0.009 + 0.991), which is the flip side of the strong lower-tail evidence that, on average, female employees make less than $500/week.
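
A quick check, reusing t_stat from part A, confirms the hint that the two one-sided p-values sum to 1:

Code
# lower-tail and upper-tail p-values add up to 1
pt(t_stat, df=8, lower.tail=TRUE) + pt(t_stat, df=8, lower.tail=FALSE)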

Question 5

Jones and Smith separately conduct studies to test H0: μ = 500 against Ha: μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.
A. Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.
B. Using α = 0.05, for each study indicate whether the result is “statistically significant.”
C. Using this example, explain the misleading aspects of reporting the result of a test as “P ≤0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0,” without reporting the actual P-value.

A)

Code
# H0: u=500
# Ha: u≠500
# n=1000
# Jones: y=519.5, se=10.0
# Smith : y=519.7, se=10.0

# Jones: t=1.95

t_jones = ((519.5 - 500)/ 10)
cat("t value for Jones:", t_jones, '\n')
t value for Jones: 1.95 
Code
# Jones: p-value=0.0515

cat('p value for Jones:', round(2*pt(t_jones, df = 999, lower.tail=FALSE), 4), '\n')
p value for Jones: 0.0515 
Code
# Smith: t=1.97

t_smith = ((519.7 - 500)/ 10)
cat("t value for Smith:", t_smith, '\n')
t value for Smith: 1.97 
Code
# Smith: p-value=0.049

cat('p value for Smith:', round(2*pt(t_smith, df = 999, lower.tail=FALSE), 4), '\n')
p value for Smith: 0.0491 

B)

Using α = 0.05, Smith’s result is statistically significant (p = 0.0491 < 0.05), while Jones’s is not (p = 0.0515 > 0.05).

C)

This example shows why it is misleading to report a test result only as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “do not reject H0,” without the actual P-value. The two studies’ p-values (0.0515 and 0.0491) are nearly identical, and both round to 0.05, yet the binary labels make it look as though the studies reached opposite conclusions. Reporting the p-values themselves shows how marginal the distinction is: Jones falls just above the 0.05 threshold and Smith just below it, so readers can judge the actual strength of evidence instead of over-interpreting a hard cutoff.
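
Confidence intervals make the same point concrete: reusing the values above, each study’s 95% interval is ȳ ± t(0.975, 999) × se, and Jones’s interval barely includes 500 while Smith’s barely excludes it.

Code
# 95% confidence intervals for the two studies
t_crit <- qt(0.975, df=999)
c(519.5 - t_crit*10, 519.5 + t_crit*10)   # Jones: roughly (499.9, 539.1), includes 500
c(519.7 - t_crit*10, 519.7 + t_crit*10)   # Smith: roughly (500.1, 539.3), excludes 500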

Question 6

A school nurse wants to determine whether age is a factor in whether children choose a
healthy snack after school. She conducts a survey of 300 middle school students, with the results
below. Test at α = 0.05 the claim that the proportion who choose a healthy snack differs by grade
level. What is the null hypothesis? Which test should we use? What is the conclusion?

Code
# 6th: healthy = 31, unhealthy = 69
# 7th: healthy = 43, unhealthy = 57
# 8th: healthy = 51, unhealthy = 49
# n=300, each grade has 100 survey participants

# create dataframe:
grade <- c(rep("6th", 100), rep("7th", 100), rep("8th", 100))
snack <- c(rep("healthy", 31), rep("unhealthy", 69), rep("healthy", 43),
           rep("unhealthy", 57), rep("healthy", 51), rep("unhealthy", 49))

survey_data <- data.frame(grade, snack)
head(survey_data)
  grade   snack
1   6th healthy
2   6th healthy
3   6th healthy
4   6th healthy
5   6th healthy
6   6th healthy
Code
# transform dataframe into table:
table(survey_data$snack,survey_data$grade)
           
            6th 7th 8th
  healthy    31  43  51
  unhealthy  69  57  49
Code
# conduct chi-squared test:
chisq.test(survey_data$snack,survey_data$grade,correct = FALSE)

    Pearson's Chi-squared test

data:  survey_data$snack and survey_data$grade
X-squared = 8.3383, df = 2, p-value = 0.01547

The null hypothesis is that the proportion of students who choose a healthy snack is the same at every grade level, i.e., snack choice is independent of grade (age is not a factor). Because we are comparing two categorical variables (grade and snack choice), Pearson’s chi-squared test of independence is the appropriate test. The p-value of 0.015 is below α = 0.05, so we reject the null hypothesis and conclude that the proportion choosing a healthy snack differs by grade level.
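
For reference, the same test can be run directly on the 2×3 table of counts instead of the expanded 300-row data frame; a minimal sketch using the counts from the problem:

Code
# chi-squared test of independence on the table of counts
snack_counts <- matrix(c(31, 43, 51,
                         69, 57, 49), nrow=2, byrow=TRUE,
                       dimnames=list(snack=c("healthy", "unhealthy"),
                                     grade=c("6th", "7th", "8th")))
chisq.test(snack_counts, correct=FALSE)   # same X-squared = 8.34, p = 0.015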

Question 7

Per-pupil costs (in thousands of dollars) for cyber charter school tuition for school
districts in three areas are shown. Test the claim that there is a difference in means for the three
areas, using an appropriate test. What is the null hypothesis? Which test should we use? What is
the conclusion?

Code
# area 1: 6.2, 9.3, 6.8, 6.1, 6.7, 7.5
# area 2: 7.5, 8.2, 8.5, 8.2, 7.0, 9.3
# area 3: 5.8, 6.4, 5.6, 7.1, 3.0, 3.5

# create dataframe:
area <- c(rep("area_1", 6), rep("area_2", 6), rep("area_3", 6))
cost <- c(6.2, 9.3, 6.8, 6.1, 6.7, 7.5, 7.5, 8.2, 8.5, 8.2, 7.0, 9.3,
          5.8, 6.4, 5.6, 7.1, 3.0, 3.5)
area_cost <- data.frame(area,cost)
head(area_cost)
    area cost
1 area_1  6.2
2 area_1  9.3
3 area_1  6.8
4 area_1  6.1
5 area_1  6.7
6 area_1  7.5
Code
# one-way ANOVA test:
one.way <- aov(cost ~ area, data = area_cost)
summary(one.way)
            Df Sum Sq Mean Sq F value  Pr(>F)   
area         2  25.66  12.832   8.176 0.00397 **
Residuals   15  23.54   1.569                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis is that the mean per-pupil cost is the same in all three areas. Because we are comparing the means of more than two groups defined by a single categorical variable, a one-way ANOVA is the appropriate test. The p-value (0.00397) is below α = 0.05, so we reject the null hypothesis and conclude that mean per-pupil cost differs across the three areas.
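
To see where the differences lie, one natural follow-up is to look at the group means and at Tukey’s HSD pairwise comparisons on the fitted ANOVA; a sketch using the objects defined above:

Code
# group means and pairwise comparisons between areas
aggregate(cost ~ area, data=area_cost, FUN=mean)   # area 2 highest, area 3 lowest
TukeyHSD(one.way)                                  # which pairs of areas differ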