DACSS 603 Homework 2

hw2
desriptive statistics
probability
Laura Collazo
Homework 2 for DASS 603.
Author

Laura Collazo

Published

March 23, 2023

Libraries

Code
library(tidyverse)
library(kableExtra)
library(readxl)
library(lsr)

Question 1

The data for the first question is shown below.

Code
surgical_procedure <- c("Bypass", "Angiography")
sample_size <- c(539, 847)
mean_wait_time <- c(19, 18)
standard_deviation <- c(10, 9)

data <- data.frame(surgical_procedure, sample_size, mean_wait_time, standard_deviation)

kable(data) %>% 
  kable_styling()
surgical_procedure sample_size mean_wait_time standard_deviation
Bypass 539 19 10
Angiography 847 18 9
Code
confidence_level <- .90

The 90% confidence interval for the actual mean wait time for patients receiving a bypass is:

Code
# calculate confidence interval for "Bypass"

bypass_se <- 10 / sqrt(539)

bypass_tail <- (1-confidence_level)/2

bypass_t <- qt(p = 1 - bypass_tail, df = 539 -1)

bypass_ci <- c(19 - bypass_t * bypass_se,
        19 + bypass_t * bypass_se)

bypass_ci
[1] 18.29029 19.70971

The 90% confidence interval for the actual mean wait time for patients receiving an angiography is:

Code
# calculate confidence interval for "Angiography"

angi_se <- 9 / sqrt(847)

angi_tail <- (1-confidence_level)/2

angi_t <- qt(p = 1 - angi_tail, df = 847 -1)

angi_ci <- c(18 - angi_t * angi_se,
        18 + angi_t * angi_se)

angi_ci
[1] 17.49078 18.50922

The confidence interval is narrower for the angiography.

Question 2

The data for question two is shown below.

Code
sample_size2 <- 1031
voted_yes <- 567
voted_no <- (sample_size2 - voted_yes)

data2 <- data.frame(sample_size2, voted_yes, voted_no)

kable(data2) %>% 
  kable_styling()
sample_size2 voted_yes voted_no
1031 567 464

The point estimate, p, for the proportion of all adult Americans who believe a college education is essential for success is:

Code
p <- voted_yes/sample_size2

p
[1] 0.5499515

A 95% confidence interval for p is:

Code
margin_of_error <- qnorm(0.975) * sqrt(p*(1-p)/sample_size2)

low <- p - margin_of_error

low
[1] 0.5195839
Code
high <- p + margin_of_error

high
[1] 0.5803191

What this tells us is at the 95% confidence interval between 51.95839% and 58.03191% of the population of adult Americans would believe a college education is essential for success.

Question 3

The size of sample the financial aid office should have to be able to estimate the mean cost of textbooks per semester for students is shown below. It is found using a confidence level of 95%, a range of $170, and knowing they want to be within $5 of the true population mean.

Code
z <- 1.96

range <- 200-30

sd <- range/4

n <- round((z * sd/ 5) ^ 2)

n
[1] 278

Question 4

A

The below shows the results of a one-sample t-test which tests whether the mean income of female employees ($410) differs from the mean income of all senior-level workers ($500).

One-sample t-tests assume that the population distribution is normal and also that the observations in the sample are generated independently of one another.

Code
set.seed(27)

rnorm_fixed <- function(n, mean, sd) {
  as.vector(mean + sd * scale(rnorm(n)))
}

data4 <- rnorm_fixed(9, 410, 90)

oneSampleTTest(x=data4, mu=500)

   One sample t-test 

Data variable:   data4 

Descriptive statistics: 
              data4
   mean     410.000
   std dev.  90.000

Hypotheses: 
   null:        population mean equals 500 
   alternative: population mean not equal to 500 

Test results: 
   t-statistic:  -3 
   degrees of freedom:  8 
   p-value:  0.017 

Other information: 
   two-sided 95% confidence interval:  [340.82, 479.18] 
   estimated effect size (Cohen's d):  1 

The results of this T-test is statistically significant as the p-value is below .05. This means we can reject the null hypothesis.

B

The p-value when the mean is less than 500 is below.

Code
t.test(data4, mu = 500, alternative = "less")

    One Sample t-test

data:  data4
t = -3, df = 8, p-value = 0.008536
alternative hypothesis: true mean is less than 500
95 percent confidence interval:
     -Inf 465.7864
sample estimates:
mean of x 
      410 

The results of this t-test are significant as the p-value is below .05. This means we can reject the null hypothesis that the true mean is not less than 500.

C

The p-value when the mean is greater than 500 is below.

Code
t.test(data4, mu = 500, alternative = "greater")

    One Sample t-test

data:  data4
t = -3, df = 8, p-value = 0.9915
alternative hypothesis: true mean is greater than 500
95 percent confidence interval:
 354.2136      Inf
sample estimates:
mean of x 
      410 

The results of this test are not significant as the p-value is greater than .05. This means we cannot reject the null hypothesis that the true mean is not greater than 500.

Question 5

A

Th first step for this question is to find the standard deviation. It is shown below.

Code
# find the standard deviation (se = sd/sqrt(n))

n <- 1000

se <- 10

sd <- se * sqrt(n)

sd
[1] 316.2278

The below shows the t and p-value for Jones.

Code
rnorm_fixed <- function(n, mean, sd) {
  as.vector(mean + sd * scale(rnorm(n)))
}

data5_j <- rnorm_fixed(1000, 519.5, sd)

t_test_j <- oneSampleTTest(x=data5_j, mu=500)

t_test_j

   One sample t-test 

Data variable:   data5_j 

Descriptive statistics: 
            data5_j
   mean     519.500
   std dev. 316.228

Hypotheses: 
   null:        population mean equals 500 
   alternative: population mean not equal to 500 

Test results: 
   t-statistic:  1.95 
   degrees of freedom:  999 
   p-value:  0.051 

Other information: 
   two-sided 95% confidence interval:  [499.877, 539.123] 
   estimated effect size (Cohen's d):  0.062 

The below shows the t and p-value for Smith.

Code
data5_s <- rnorm_fixed(1000, 519.7, sd)

t_test_s <- oneSampleTTest(x=data5_s, mu=500)

t_test_s

   One sample t-test 

Data variable:   data5_s 

Descriptive statistics: 
            data5_s
   mean     519.700
   std dev. 316.228

Hypotheses: 
   null:        population mean equals 500 
   alternative: population mean not equal to 500 

Test results: 
   t-statistic:  1.97 
   degrees of freedom:  999 
   p-value:  0.049 

Other information: 
   two-sided 95% confidence interval:  [500.077, 539.323] 
   estimated effect size (Cohen's d):  0.062 

B

The above indicates the results of Jones’ study (p = .051) are not statistically significant and the results of Smith’s study are statistically significant (p = .049) when α = 0.05.

C

I am not yet certain of how to describe the misleading aspects of reporting the result of a test as “P is < or = to 0.05” versus “P is > 0.” Perhaps this has to do with the importance of saying that the null hypothesis is rejected if p is statistically significant rather than saying we accept the null hypothesis as true if p is not statistically significant as we should always assume the null hypothesis is true. I think this question is asking for something different than this, though. I look forward to seeing the answer to this question so I can learn from it!

Question 6

The data for question 6 is shown below. A response of yes indicates a healthy snack was chosen after school.

Code
data6 <- read_xlsx("_data/homework_2_data.xlsx")

table(data6$healthy_snack, data6$grade)
     
      6th 7th 8th
  no   69  57  49
  yes  31  43  51

The null hypothesis for question 6 is that middle school students choosing a healthy snack is independent of grade level. The test that should be used to determine if we can reject the null hypothesis is a chi-square test. This test has been chosen as it’s used to determine when there’s an association between two categorical values. The results of the test are below.

Code
chisq.test(data6$healthy_snack, data6$grade, correct = FALSE)

    Pearson's Chi-squared test

data:  data6$healthy_snack and data6$grade
X-squared = 8.3383, df = 2, p-value = 0.01547
Code
table <- data.frame((with(data6, table(healthy_snack, grade))))

ggplot(table, aes(x=healthy_snack,y=Freq, fill=grade))+
  geom_bar(stat="identity",position="dodge")

We can reject the null hypothesis because the p-value is below 0.05. This tells us that choosing a healthy snack is not independent of grade level.

Furthermore, the X-squared value, as seen below, is higher than the critical value so the null hypothesis can be rejected.

Code
qchisq(.05, 2, lower.tail = FALSE)
[1] 5.991465

Question 7

The data for question 7 is shown below. Each tuition value is the per-pupil cost in thousands of dollars for cyber charter school tuition.

Code
data7 <- read_xlsx("_data/homework_2_data.xlsx", sheet = "Question 7")

kable(data7) %>% 
  kable_styling()
area tuition
Area 1 6.2
Area 1 9.3
Area 1 6.8
Area 1 6.1
Area 1 6.7
Area 1 7.5
Area 2 7.5
Area 2 8.2
Area 2 8.5
Area 2 8.2
Area 2 7.0
Area 2 9.3
Area 3 5.8
Area 3 6.4
Area 3 5.6
Area 3 7.1
Area 3 3.0
Area 3 3.5

The null hypothesis for question 7 is that the mean cyber charter school tuition cost per-pupil is the same in all areas. The test that should be used to determine if we can reject the null hypothesis is an ANOVA. This test has been chosen as it’s used to compare the means of two or more independent groups. The results of the test are shown below.

Code
ANOVA7 <- aov(tuition ~ area, data = data7)

summary(ANOVA7)
            Df Sum Sq Mean Sq F value  Pr(>F)   
area         2  25.66  12.832   8.176 0.00397 **
Residuals   15  23.54   1.569                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
ggplot(data7, aes(x = area, y = tuition, color=area))+
  geom_boxplot() +
  ylab("tuition") +
  xlab("area")

We can reject the null hypothesis because the p-value, Pr(>F), is less than .05. This tells us that at least one area has a tuition mean that is different from the others.