hw1
descriptive statistics
probability
Author

Young Soo Choi

Published

February 27, 2023

prepare

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(dplyr)
library(readxl)
df <- read_excel("C:/Users/rotte/Documents/R/603_Spring_2023/posts/_data/LungCapData.xls")

Question 1

a

Code
# descriptive statistics
summary(df$LungCap)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.507   6.150   8.000   7.863   9.800  14.675 
Code
sd(df$LungCap)
[1] 2.662008
Code
# making histogram
hist(df$LungCap)

LungCap ranges from 0.507 to 14.675, with a median of 8.00. Its distribution looks approximately normal, with a mean of 7.863 and a standard deviation of 2.662.

b

Code
# descriptive statistics
df %>%
  group_by(Gender) %>%
  summarise(mean(LungCap), sd(LungCap))
# A tibble: 2 × 3
  Gender `mean(LungCap)` `sd(LungCap)`
  <chr>            <dbl>         <dbl>
1 female            7.41          2.56
2 male              8.31          2.68
Code
# making boxplot
boxplot(LungCap~Gender, df)

The mean and standard deviation of LungCap are 7.406 and 2.564 for females, and 8.309 and 2.683 for males. On average, males have a larger lung capacity than females, which the boxplot also shows.

c

Code
# finding lungcap for smokers and non-smokers
df %>% group_by(Smoke) %>%
  summarise(mean(LungCap))
# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no               7.77
2 yes              8.65

Smokers have a larger mean lung capacity than non-smokers, which runs against common sense. I will use a t-test to check whether this difference is statistically significant.

Code
t.test(LungCap~Smoke, df, alternative="less")

    Welch Two Sample t-test

data:  LungCap by Smoke
t = -3.6498, df = 117.72, p-value = 0.0001964
alternative hypothesis: true difference in means between group no and group yes is less than 0
95 percent confidence interval:
       -Inf -0.4776762
sample estimates:
 mean in group no mean in group yes 
         7.770188          8.645455 

The one-sided t-test gives p = 0.0001964, so the difference is statistically significant at the 5% significance level (95% confidence level).
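The direction of `alternative="less"` depends on the order of the groups. A minimal sketch, assuming a small made-up data frame `toy` (not the LungCap data): with the formula interface, the groups are ordered by factor level ("no" before "yes"), so "less" tests whether mean(no) minus mean(yes) is below 0.

```r
# Hypothetical toy data for illustration only (not the LungCap data)
toy <- data.frame(
  LungCap = c(6.1, 7.4, 8.0, 7.2, 6.8, 7.9, 9.1, 8.7, 9.5, 8.9),
  Smoke   = rep(c("no", "yes"), each = 5)
)

# Formula interface: groups are ordered alphabetically, "no" then "yes"
res <- t.test(LungCap ~ Smoke, data = toy, alternative = "less")
res$estimate   # group means, "no" first

# The same test written out explicitly: H1 is mean(no) - mean(yes) < 0
res2 <- t.test(toy$LungCap[toy$Smoke == "no"],
               toy$LungCap[toy$Smoke == "yes"],
               alternative = "less")

# Both calls give the same p-value
all.equal(res$p.value, res2$p.value)
```

If the group labels sorted the other way, the same hypothesis would need `alternative="greater"`.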

d

First, a new variable cAge is created that assigns an age group to each observation: “Child” for ages 13 and under, “Middle” for ages 14 and 15, “High” for ages 16 and 17, and “Adult” for ages 18 and over.

Code
df<-mutate(df, cAge = ifelse(Age<=13, "Child", ifelse(Age %in% 14:15, "Middle", ifelse(Age %in% 16:17, "High", "Adult"))))
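The same grouping can also be sketched with `cut()`, which returns a factor whose levels stay in age order. This is an alternative to the nested `ifelse()` above, not the method used in this post; the vector `age` is made up for illustration, and the two versions agree only if ages are whole numbers.

```r
# Hypothetical integer ages for illustration (not the LungCap data)
age <- 3:19

# Nested ifelse(), mirroring the grouping rule used above
c_age_ifelse <- ifelse(age <= 13, "Child",
                ifelse(age %in% 14:15, "Middle",
                ifelse(age %in% 16:17, "High", "Adult")))

# cut() with right-closed intervals: (-Inf,13], (13,15], (15,17], (17,Inf]
c_age_cut <- cut(age,
                 breaks = c(-Inf, 13, 15, 17, Inf),
                 labels = c("Child", "Middle", "High", "Adult"))

# Both approaches assign the same group to every integer age
all(as.character(c_age_cut) == c_age_ifelse)

# Counts print in age order rather than alphabetical order
table(c_age_cut)
```

Because the result is an ordered set of levels, grouped summaries would list Child, Middle, High, Adult instead of sorting the labels alphabetically.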
Code
df %>% group_by(cAge) %>%
  summarise(mean(LungCap))
# A tibble: 4 × 2
  cAge   `mean(LungCap)`
  <chr>            <dbl>
1 Adult            11.0 
2 Child             6.41
3 High             10.2 
4 Middle            9.05
Code
ggplot(df, aes(x=Age, y=LungCap)) +
  geom_point()

The mean lung capacity is 6.41 for the child group, 9.05 for the middle group, 10.25 for the high group, and 10.96 for the adult group. That is, lung capacity grows with age. This suggests why the comparison between smokers and non-smokers in this data contradicts common sense: smokers are more likely to be in the older groups, and older subjects have larger lungs, which may have produced the larger mean lung capacity for smokers.

e

First, let’s look at the children’s group.

Code
childdf<-filter(df, cAge=="Child")
table(childdf$Smoke)

 no yes 
401  27 
Code
childdf%>%group_by(Smoke) %>%
  summarise(mean(LungCap))
# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no               6.36
2 yes              7.20

Within the child group, smokers have a larger mean lung capacity. Let’s look at this in more detail.

Code
ggplot(childdf, aes(x=Age, y=LungCap)) +
  geom_point(aes(col=factor(Smoke)))

The plot shows that, among non-smokers, lung capacity increases with age. In other words, a single average over the whole child group does not reveal the natural growth of lung capacity with age, so smokers appear to have a larger lung capacity.

Next, let’s look at the middle group.

Code
middf<-filter(df, cAge=="Middle")
table(middf$Smoke)

 no yes 
105  15 
Code
middf%>%group_by(Smoke) %>%
  summarise(mean(LungCap))
# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no               9.14
2 yes              8.39
Code
boxplot(LungCap~Smoke, middf)

In the middle group, non-smokers appear to have a larger lung capacity.

Then, let’s look at the high group.

Code
highdf<-filter(df, cAge=="High")
table(highdf$Smoke)

 no yes 
 77  20 
Code
highdf%>%group_by(Smoke) %>%
  summarise(mean(LungCap))
# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no              10.5 
2 yes              9.38
Code
boxplot(LungCap~Smoke, highdf)

In the high group, non-smokers also have a larger lung capacity.

Lastly, let me check the adult group.

Code
adultdf<-filter(df, cAge=="Adult")
table(adultdf$Smoke)

 no yes 
 65  15 
Code
adultdf%>%group_by(Smoke) %>%
  summarise(mean(LungCap))
# A tibble: 2 × 2
  Smoke `mean(LungCap)`
  <chr>           <dbl>
1 no               11.1
2 yes              10.5
Code
boxplot(LungCap~Smoke, adultdf)

Even in the adult group, non-smokers have a larger lung capacity.

Finally, I took a look at the overall plot.

Code
ggplot(df, aes(x=Age, y=LungCap)) +
  geom_point(aes(col=factor(Smoke)))

Overall, at a given age the lung capacity of non-smokers (red) appears to be higher than that of smokers (blue).
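The reversal between the overall comparison and the within-group comparisons can be reproduced from the counts and group means printed above. A rough sketch: the means are the rounded values from the tables, so the reconstructed overall means are approximate, and the data frame `grp` and its column names are illustrative.

```r
# Counts and rounded group means copied from the tables above
grp <- data.frame(
  cAge     = c("Child", "Middle", "High", "Adult"),
  n_no     = c(401, 105, 77, 65),
  n_yes    = c(27, 15, 20, 15),
  mean_no  = c(6.36, 9.14, 10.5, 11.1),
  mean_yes = c(7.20, 8.39, 9.38, 10.5)
)

# Share of smokers within each age group
grp$share_smokers <- grp$n_yes / (grp$n_no + grp$n_yes)
grp

# Overall means are group means weighted by group size
overall_no  <- weighted.mean(grp$mean_no,  grp$n_no)
overall_yes <- weighted.mean(grp$mean_yes, grp$n_yes)
c(no = overall_no, yes = overall_yes)

# Share of each smoking status that falls in the child group
c(no = grp$n_no[1] / sum(grp$n_no), yes = grp$n_yes[1] / sum(grp$n_yes))
```

Most non-smokers are in the child group, where lung capacity is smallest, so their overall mean is pulled down even though non-smokers have the larger mean in three of the four groups.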

Question 2

Code
x<-c(0, 1, 2, 3, 4)
freq<-c(128, 434, 160, 64, 24)
p_df<-data.frame(x, freq)
p_df
  x freq
1 0  128
2 1  434
3 2  160
4 3   64
5 4   24

a

The probability that a randomly selected inmate has exactly 2 prior convictions is the number of inmates with 2 prior convictions divided by the total number of inmates.

Code
160/sum(p_df$freq)
[1] 0.1975309

The answer is 0.1975.

b

In the same way, the frequencies for 0 and 1 prior convictions can be added together and divided by the total number of inmates.

Code
(128+434)/sum(p_df$freq)
[1] 0.6938272

The answer is 0.6938.

c

Add the probabilities obtained from problems a and b to get the answer.

Code
(160/sum(p_df$freq))+((128+434)/sum(p_df$freq))
[1] 0.891358

The answer is 0.8914.

d

This time, we can use the probability obtained in c. The probabilities sum to 1, so the answer is 1 minus the value obtained in c.

Code
1-((160/sum(p_df$freq))+((128+434)/sum(p_df$freq)))
[1] 0.108642

The answer is 0.1086.
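The four answers in a through d can also be read off a single probability vector. A minimal sketch using the frequencies from the table above; the names `pro`, `p_a`, `p_b`, `p_c`, and `p_d` are illustrative.

```r
# Number of prior convictions and their frequencies, as in the table above
x    <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)

# Relative frequencies serve as the probabilities
pro <- freq / sum(freq)

p_a <- pro[x == 2]        # exactly 2
p_b <- sum(pro[x < 2])    # fewer than 2
p_c <- sum(pro[x <= 2])   # 2 or fewer
p_d <- sum(pro[x > 2])    # more than 2

round(c(a = p_a, b = p_b, c = p_c, d = p_d), 4)
```

Selecting with logical conditions on `x` avoids retyping the individual frequencies for each part.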

e

To obtain the expected value, we divide the sum of the number of prior convictions (x) times its frequency (freq) by the total number of inmates.

Code
sum(p_df$x*p_df$freq)/810
[1] 1.28642

Alternatively, we can first compute the probability of each number of prior convictions and then obtain the expected value by summing each x-value multiplied by its probability.

Code
p_df<-mutate(p_df, pro=freq/810)
p_df
  x freq        pro
1 0  128 0.15802469
2 1  434 0.53580247
3 2  160 0.19753086
4 3   64 0.07901235
5 4   24 0.02962963
Code
sum(p_df$x*p_df$pro)
[1] 1.28642

The answer is the same.
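As a cross-check, base R's `weighted.mean()` gives the same expected value directly from the frequencies, without hard-coding the total of 810. A small sketch; the names `ev` and `ev_expanded` are illustrative.

```r
x    <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)

# Expected value as a frequency-weighted mean of x
ev <- weighted.mean(x, freq)
ev

# Equivalent: expand the table to one value per inmate and average
ev_expanded <- mean(rep(x, times = freq))
all.equal(ev, ev_expanded)
```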

f

Code
mean<-sum(p_df$x*p_df$pro)

First, to obtain the variance, sum the squared differences between each x-value and the mean (expected value), weighted by frequency, and divide by the total number of inmates.

The standard deviation is the square root of the variance.

Code
sum((x-mean)^2*p_df$freq)/810
[1] 0.8562353
Code
sqrt(sum((x-mean)^2*p_df$freq)/810)
[1] 0.9253298

The variance is 0.8562 and the standard deviation is 0.9253.

Alternatively, the variance can be obtained by multiplying the squared difference between each x-value and the mean by its probability and summing the results.

Code
sum((x-mean)^2*p_df$pro)
[1] 0.8562353
Code
sqrt(sum((x-mean)^2*p_df$pro))
[1] 0.9253298

As expected, the answer is the same.
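A third check uses the identity Var(X) = E[X^2] - (E[X])^2. The sketch below also compares the result with `var()`, which divides by n - 1 rather than n and therefore has to be rescaled to match. The names `mu`, `ex2`, `v`, and `v_sample` are illustrative; `mu` is used instead of `mean` so the base function `mean()` is not masked.

```r
x    <- c(0, 1, 2, 3, 4)
freq <- c(128, 434, 160, 64, 24)
n    <- sum(freq)
pro  <- freq / n

mu  <- sum(x * pro)     # E[X]
ex2 <- sum(x^2 * pro)   # E[X^2]

# Variance via the shortcut identity
v <- ex2 - mu^2
c(variance = v, sd = sqrt(v))

# var() uses the n - 1 denominator; rescale to the n denominator
v_sample <- var(rep(x, times = freq))
all.equal(v, v_sample * (n - 1) / n)
```

With 810 inmates the two denominators differ only slightly, but the n version is the one that matches the variance of the probability distribution computed above.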