Homework 1

hw1

descriptive statistics

probability

Author

Young Soo Choi

Published

February 27, 2023

prepare

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(dplyr)
library(readxl)
df <- read_excel("C:/Users/rotte/Documents/R/603_Spring_2023/posts/_data/LungCapData.xls")

Question 1

a

Code

# descriptive statistics
summary(df$LungCap)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.507   6.150   8.000   7.863   9.800  14.675

Code

sd(df$LungCap)

[1] 2.662008

Code

# making histogram
hist(df$LungCap)

Lung capacity ranges from 0.507 to 14.675, with a median of 8.000. The histogram looks approximately normal, with a mean of 7.863 and a standard deviation of 2.662.

b

Code

# descriptive statistics
df %>%
  group_by(Gender) %>%
  summarise(mean(LungCap), sd(LungCap))

# A tibble: 2 × 3
  Gender `mean(LungCap)` `sd(LungCap)`
  <chr>            <dbl>         <dbl>
1 female            7.41          2.56
2 male              8.31          2.68

Code

# making boxplot
boxplot(LungCap~Gender, df)

The mean and standard deviation of LungCap are 7.406 and 2.564 for females and 8.309 and 2.683 for males. On average, males have a larger lung capacity than females, which the boxplot also shows.

c

Code

# finding lungcap for smokers and non-smokers
df %>% group_by(Smoke) %>%
  summarise(mean(LungCap))

# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no               7.77
2 yes              8.65

Smokers have a larger mean lung capacity (8.65) than non-smokers (7.77), which runs against common sense. A t-test will show whether this difference is statistically significant.

Code

t.test(LungCap~Smoke, df, alternative="less")


    Welch Two Sample t-test

data:  LungCap by Smoke
t = -3.6498, df = 117.72, p-value = 0.0001964
alternative hypothesis: true difference in means between group no and group yes is less than 0
95 percent confidence interval:
       -Inf -0.4776762
sample estimates:
 mean in group no mean in group yes 
         7.770188          8.645455

The one-sided Welch t-test gives p = 0.0002, so the difference is statistically significant at the 5% significance level: the mean lung capacity of non-smokers is lower than that of smokers.

d

First, a new variable cAge is created that assigns each observation to an age group: “Child” for ages 13 and under, “Middle” for ages 14 and 15, “High” for ages 16 and 17, and “Adult” for ages 18 and older.

Code

# assign each observation to an age group
df <- mutate(df, cAge = ifelse(Age <= 13, "Child",
                        ifelse(Age %in% 14:15, "Middle",
                        ifelse(Age %in% 16:17, "High", "Adult"))))
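As a cross-check on the grouping, the same bins can be sketched with base R's `cut()`. The `ages` vector below is made up for illustration and is not taken from the data set. One difference worth noting: the nested `ifelse()` relies on `Age` being a whole number, whereas `cut()` would also place a fractional age such as 13.5 in a bin.

```r
# Hypothetical ages, for illustration only (not from LungCapData)
ages <- c(3, 13, 14, 15, 16, 17, 18, 19)

# Right-closed bins: (-Inf,13], (13,15], (15,17], (17,Inf]
cAge <- cut(ages,
            breaks = c(-Inf, 13, 15, 17, Inf),
            labels = c("Child", "Middle", "High", "Adult"))

# Factor levels stay in age order: Child, Middle, High, Adult
table(cAge)
```

Because `cut()` returns a factor with levels in the order given, grouped summaries would print in age order rather than alphabetically.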

Code

df %>% group_by(cAge) %>%
  summarise(mean(LungCap))

# A tibble: 4 × 2
  cAge   `mean(LungCap)`
  <chr>            <dbl>
1 Adult            11.0 
2 Child             6.41
3 High             10.2 
4 Middle            9.05

Code

ggplot(df, aes(x=Age, y=LungCap)) +
  geom_point()

The mean lung capacity of each group is 6.41 for child, 9.05 for middle, 10.25 for high, and 10.96 for adult. That is, lung capacity grows with age. This suggests why the comparison between smokers and non-smokers in this data runs against common sense: older participants have larger lungs and are also more likely to smoke, which may have pushed the smokers’ overall mean above the non-smokers’.

e

First, let’s look at the children’s group.

Code

childdf<-filter(df, cAge=="Child")
table(childdf$Smoke)


 no yes 
401  27

Code

childdf%>%group_by(Smoke) %>%
  summarise(mean(LungCap))

# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no               6.36
2 yes              7.20

In the child group, smokers have a larger mean lung capacity. Let’s look at the plot in more detail.

Code

ggplot(childdf, aes(x=Age, y=LungCap)) +
  geom_point(aes(col=factor(Smoke)))

The plot shows that lung capacity increases with age among non-smokers. In other words, the child group spans a wide range of ages, and the group mean hides the natural growth of the lungs with age, so smokers’ lung capacity appears larger.

Next, let’s look at the middle group.

Code

middf<-filter(df, cAge=="Middle")
table(middf$Smoke)


 no yes 
105  15

Code

middf%>%group_by(Smoke) %>%
  summarise(mean(LungCap))

# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no               9.14
2 yes              8.39

Code

boxplot(LungCap~Smoke, middf)

In the middle group, non-smokers have a larger mean lung capacity.

Then, let’s look at the high group.

Code

highdf<-filter(df, cAge=="High")
table(highdf$Smoke)


 no yes 
 77  20

Code

highdf%>%group_by(Smoke) %>%
  summarise(mean(LungCap))

# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no              10.5 
2 yes              9.38

Code

boxplot(LungCap~Smoke, highdf)

In the high group, non-smokers also have a larger mean lung capacity.

Lastly, let me check the adult group.

Code

adultdf<-filter(df, cAge=="Adult")
table(adultdf$Smoke)


 no yes 
 65  15

Code

adultdf%>%group_by(Smoke) %>%
  summarise(mean(LungCap))

# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no               11.1
2 yes              10.5

Code

boxplot(LungCap~Smoke, adultdf)

Even in the adult group, non-smokers have a larger mean lung capacity.

Finally, I took a look at the overall plot.

Code

ggplot(df, aes(x=Age, y=LungCap)) +
  geom_point(aes(col=factor(Smoke)))

Overall, at a given age the lung capacity of non-smokers (red) tends to be higher than that of smokers (blue).
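The reversal between the overall comparison and the within-group comparisons can be sketched from the summary tables alone. The snippet below is a rough check, not a re-analysis: it copies the counts and the rounded group means printed above into a hypothetical data frame named `groups`, so the rebuilt overall means only approximate the values from part c.

```r
# Counts and rounded group means copied from the tables above
groups <- data.frame(
  cAge     = c("Child", "Middle", "High", "Adult"),
  n_no     = c(401, 105, 77, 65),
  n_yes    = c(27, 15, 20, 15),
  mean_no  = c(6.36, 9.14, 10.5, 11.1),
  mean_yes = c(7.20, 8.39, 9.38, 10.5)
)

# Share of smokers within each age group
groups$share_yes <- groups$n_yes / (groups$n_no + groups$n_yes)

# Overall means rebuilt as count-weighted averages of the group means
overall_no  <- weighted.mean(groups$mean_no,  groups$n_no)
overall_yes <- weighted.mean(groups$mean_yes, groups$n_yes)

groups[, c("cAge", "share_yes")]
c(no = overall_no, yes = overall_yes)
```

The share of smokers is about 6% in the child group and between roughly 12% and 21% in the older groups, so the smokers’ overall mean is weighted toward ages with larger lungs. The rebuilt means (about 7.78 and 8.64) are close to the 7.77 and 8.65 reported in part c; the small gap comes from rounding.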

Question 2

Code

x<-c(0, 1, 2, 3, 4)
freq<-c(128, 434, 160, 64, 24)
p_df<-data.frame(x, freq)
p_df

a

The probability that a randomly selected inmate has exactly 2 prior convictions is the number of inmates with 2 prior convictions divided by the total number of inmates.

Code

160/sum(p_df$freq)

[1] 0.1975309

The answer is 0.1975.

b

In the same way, the probability of fewer than 2 prior convictions is the combined frequency of 0 and 1 prior convictions divided by the total number of prisoners.

Code

(128+434)/sum(p_df$freq)

[1] 0.6938272

The answer is 0.6938.

c

The probability of 2 or fewer prior convictions is the sum of the probabilities from parts a and b.

Code

(160/sum(p_df$freq))+((128+434)/sum(p_df$freq))

[1] 0.891358

The answer is 0.8914.

d

The probability of more than 2 prior convictions follows from part c: because the probabilities sum to 1, subtract the value from part c from 1.

Code

1-((160/sum(p_df$freq))+((128+434)/sum(p_df$freq)))

[1] 0.108642

The answer is 0.1086.
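Parts b through d can also be read off a cumulative distribution. The following is a minimal sketch using the same counts as above; the names `pmf` and `cdf` are introduced here for illustration only.

```r
x    <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)

pmf <- freq / sum(freq)   # P(X = x)
cdf <- cumsum(pmf)        # P(X <= x)
names(cdf) <- x           # label each entry by its x value

cdf["1"]       # P(X < 2), which equals P(X <= 1)
cdf["2"]       # P(X <= 2)
1 - cdf["2"]   # P(X > 2)
```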

e

To obtain the expected value, multiply each number of prior convictions (x) by its frequency (freq), sum the products, and divide by the total number of prisoners (810).

Code

sum(p_df$x*p_df$freq)/810

[1] 1.28642

Alternatively, compute the probability of each number of prior convictions and sum each x value multiplied by its probability.

Code

p_df<-mutate(p_df, pro=freq/810)
p_df

  x freq        pro
1 0  128 0.15802469
2 1  434 0.53580247
3 2  160 0.19753086
4 3   64 0.07901235
5 4   24 0.02962963

Code

sum(p_df$x*p_df$pro)

[1] 1.28642

The answer is the same.
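As a further check, base R's `weighted.mean()` should give the same expected value when the frequencies are used as weights. This is a small sketch with the same counts; `ev` is a name chosen here for illustration.

```r
x    <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)

# Frequency-weighted mean of x, i.e. sum(x * freq) / sum(freq)
ev <- weighted.mean(x, freq)
ev
```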

f

Code

mean<-sum(p_df$x*p_df$pro)

To obtain the variance, take the squared difference between each x value and the mean (expected value), weight it by the frequency, sum the results, and divide by the total number of prisoners.

The standard deviation is the square root of the variance.

Code

sum((x-mean)^2*p_df$freq)/810

[1] 0.8562353

Code

sqrt(sum((x-mean)^2*p_df$freq)/810)

[1] 0.9253298

The variance is 0.8562 and the standard deviation is 0.9253.

Alternatively, the variance can be obtained by multiplying each squared difference between the x value and the mean by its probability and summing the results.

Code

sum((x-mean)^2*p_df$pro)

[1] 0.8562353

Code

sqrt(sum((x-mean)^2*p_df$pro))

[1] 0.9253298

As expected, the answer is the same.
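A third route is the shortcut identity Var(X) = E[X^2] - (E[X])^2. The sketch below uses the same counts; the names `ex`, `ex2`, `v`, and `s` are introduced here for illustration. Like the calculations above, it divides by the total count, so it is a population variance rather than the sample variance that `var()` would return.

```r
x    <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)
pro  <- freq / sum(freq)   # probability of each x value

ex  <- sum(x * pro)        # E[X]
ex2 <- sum(x^2 * pro)      # E[X^2]
v   <- ex2 - ex^2          # population variance
s   <- sqrt(v)             # standard deviation

c(variance = v, sd = s)
```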