hw1
descriptive statistics
probability
Homework 1
Author

Guanhua Tan

Published

February 5, 2023

Question 1

a)

First, let’s read in the data from the Excel file:

Code
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
df <- read_excel("_data/LungCapData.xls")
head(df)
# A tibble: 6 × 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
1    6.48     6   62.1 no    male   no       
2   10.1     18   74.7 yes   female no       
3    9.55    16   69.7 no    female yes      
4   11.1     14   71   no    male   no       
5    4.8      5   56.9 no    male   no       
6    6.22    11   58.7 no    female no       

The distribution of LungCap looks as follows:

Code
hist(df$LungCap)

The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).
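
As a rough numerical complement to the histogram, the sketch below compares the mean with the median and draws a normal Q-Q plot. It runs on simulated values, not on the homework data: the vector `x` and its mean and standard deviation are assumptions chosen only for illustration. In practice `df$LungCap` would take the place of `x`.

```r
# Simulated stand-in for df$LungCap; the mean and sd are illustrative assumptions.
set.seed(1)
x <- rnorm(500, mean = 8, sd = 2.5)

# For a roughly symmetric distribution the mean and median nearly coincide.
shape <- c(mean = mean(x), median = median(x), sd = sd(x))
print(round(shape, 2))

# Points lying close to the reference line suggest approximate normality.
qqnorm(x)
qqline(x)
```

A histogram alone can hide skew in the tails, so a Q-Q plot is a common second check before calling a distribution "close to normal".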

b) Compare the probability distribution of the LungCap with respect to Males and Females?

Code
df %>%
  ggplot(aes(x = Gender, y = LungCap)) +
  stat_boxplot(geom = "errorbar", # whisker end caps
               width = 0.2) +
  geom_boxplot()

The box plot suggests that the median lung capacity of males is slightly higher than that of females.

c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code
df_c <- df %>%
  group_by(Smoke) %>%
  mutate(mean_lungcap=mean(LungCap))%>%
  distinct(mean_lungcap)
df_c 
# A tibble: 2 × 2
# Groups:   Smoke [2]
  Smoke mean_lungcap
  <chr>        <dbl>
1 no            7.77
2 yes           8.65

The data indicate that smokers have a larger mean lung capacity than non-smokers, which runs counter to intuition.

d) Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code
# Mean lung capacity of smokers within each age group.
# All conditions are joined with `&`; mixing in `|` would let
# non-smokers and ages outside the range through the filter.
# Means are rounded to two decimals.

# less than or equal to 13
df_d_13 <- df %>%
  filter(Smoke == "yes" & Age <= 13) %>%
  mutate(mean_lungcap = round(mean(LungCap), 2)) %>%
  distinct(mean_lungcap)

# 14 to 15
df_d_14_15 <- df %>%
  filter(Smoke == "yes" & Age >= 14 & Age <= 15) %>%
  mutate(mean_lungcap = round(mean(LungCap), 2)) %>%
  distinct(mean_lungcap)

# 16 to 17
df_d_16_17 <- df %>%
  filter(Smoke == "yes" & Age >= 16 & Age <= 17) %>%
  mutate(mean_lungcap = round(mean(LungCap), 2)) %>%
  distinct(mean_lungcap)

# greater than or equal to 18
df_d_18 <- df %>%
  filter(Smoke == "yes" & Age >= 18) %>%
  mutate(mean_lungcap = round(mean(LungCap), 2)) %>%
  distinct(mean_lungcap)

result <- c(df_d_13, df_d_14_15, df_d_16_17, df_d_18)
print(result)
$mean_lungcap
[1] 7.2

$mean_lungcap
[1] 8.39

$mean_lungcap
[1] 9.38

$mean_lungcap
[1] 10.51

The data indicate that mean lung capacity increases with age.
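
The four age groups can also be built in a single pipeline with `cut()`, which avoids repeating the filter for each group. The sketch below is only an illustration: the `toy` tibble and its values are made up, and the names `AgeGroup` and `by_age` are hypothetical. With the real data, `df` would replace `toy`.

```r
library(dplyr)

# Toy rows standing in for df; all values are made up for illustration.
toy <- tibble(
  Age     = c(10, 12, 14, 15, 16, 17, 18, 19),
  Smoke   = c("no", "yes", "no", "yes", "no", "yes", "no", "yes"),
  LungCap = c(6.0, 7.0, 9.0, 8.5, 10.5, 9.5, 11.0, 10.5)
)

by_age <- toy %>%
  # Right-closed bins: (-Inf, 13], (13, 15], (15, 17], (17, Inf)
  mutate(AgeGroup = cut(Age,
                        breaks = c(-Inf, 13, 15, 17, Inf),
                        labels = c("<=13", "14-15", "16-17", ">=18"))) %>%
  group_by(AgeGroup, Smoke) %>%
  summarise(mean_lungcap = mean(LungCap), n = n(), .groups = "drop")

by_age
```

Keeping the group size `n` next to each mean makes it easier to see how many observations each comparison rests on.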

e) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

Code
df_e_13<-df %>%
  filter(Age <= 13) %>%
  group_by(Smoke) %>%
  mutate(mean_lungcap=mean(LungCap)) %>%
  distinct(mean_lungcap)

df_e_14<-df %>%
  filter(Age == 15 | Age == 14) %>%
  group_by(Smoke) %>%
  mutate(mean_lungcap=mean(LungCap)) %>%
  distinct(mean_lungcap)


df_e_16<-df %>%
  filter(Age == 17 | Age == 16) %>%
  group_by(Smoke) %>%
  mutate(mean_lungcap=mean(LungCap)) %>%
  distinct(mean_lungcap)


df_e_18<-df %>%
  filter( Age >= 18) %>%
  group_by(Smoke) %>%
  mutate(mean_lungcap=mean(LungCap)) %>%
  distinct(mean_lungcap)

df_e_13
# A tibble: 2 × 2
# Groups:   Smoke [2]
  Smoke mean_lungcap
  <chr>        <dbl>
1 no            6.36
2 yes           7.20
Code
df_e_14
# A tibble: 2 × 2
# Groups:   Smoke [2]
  Smoke mean_lungcap
  <chr>        <dbl>
1 no            9.14
2 yes           8.39
Code
df_e_16
# A tibble: 2 × 2
# Groups:   Smoke [2]
  Smoke mean_lungcap
  <chr>        <dbl>
1 no           10.5 
2 yes           9.38
Code
df_e_18
# A tibble: 2 × 2
# Groups:   Smoke [2]
  Smoke mean_lungcap
  <chr>        <dbl>
1 yes           10.5
2 no            11.1

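This differs markedly from part c. Only in the youngest age group (13 and under) do smokers have a larger mean lung capacity than non-smokers. In every older age group, contrary to what part c suggests, non-smokers have the larger mean lung capacity. A plausible explanation is confounding by age: lung capacity grows with age, and smokers are presumably concentrated in the older groups, which would raise the pooled smoker mean in part c.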
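
To illustrate how a pooled comparison can point the opposite way from the within-group comparisons, the sketch below combines the group means reported above with group sizes. The group sizes `n_non` and `n_smoke` are hypothetical, chosen only so that non-smokers are concentrated in the youngest group; they are not taken from the data.

```r
# Group means reported above, in the order <=13, 14-15, 16-17, >=18.
mean_non   <- c(6.36, 9.14, 10.5, 11.1)
mean_smoke <- c(7.20, 8.39, 9.38, 10.5)

# HYPOTHETICAL group sizes (not from the data): most non-smokers are young,
# while smokers are spread more evenly across the age groups.
n_non   <- c(400, 100, 80, 60)
n_smoke <- c(30, 20, 20, 10)

# A pooled mean is the size-weighted average of the group means.
pooled_non   <- weighted.mean(mean_non, n_non)
pooled_smoke <- weighted.mean(mean_smoke, n_smoke)

c(non_smokers = pooled_non, smokers = pooled_smoke)
```

Under these assumed sizes the pooled smoker mean exceeds the pooled non-smoker mean even though non-smokers are ahead in three of the four age groups. The actual group sizes can be checked with `count(df, Smoke, Age)`.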

Question 2

  1. Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.

a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code
c <- 160/810
c
[1] 0.1975309

The probability is 19.8%.

b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code
c<-(128+434)/810
c
[1] 0.6938272

The probability is 69.4%.

c) What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code
c <- (434+160+128)/810
c
[1] 0.891358

The probability is 89.1%.

d) What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code
c<-(64+24)/810
c
[1] 0.108642

The probability is 10.9%.
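
The four probabilities above can also be read off a single probability table. The sketch below assumes the frequency table implied by the calculations above (128, 434, 160, 64 and 24 inmates with 0 to 4 prior convictions). The names `counts`, `pmf` and `cdf` are illustrative.

```r
# Inmates with 0, 1, 2, 3, 4 prior convictions (as used in the calculations above).
counts <- c(`0` = 128, `1` = 434, `2` = 160, `3` = 64, `4` = 24)

pmf <- counts / sum(counts)  # P(X = x)
cdf <- cumsum(pmf)           # P(X <= x)

p_exactly_2   <- pmf[["2"]]      # a) P(X = 2)
p_fewer_2     <- cdf[["1"]]      # b) P(X < 2) = P(X <= 1)
p_at_most_2   <- cdf[["2"]]      # c) P(X <= 2)
p_more_than_2 <- 1 - cdf[["2"]]  # d) P(X > 2)

round(c(p_exactly_2, p_fewer_2, p_at_most_2, p_more_than_2), 3)
```

Dividing by `sum(counts)` means the denominator is never typed by hand, and parts c and d are complements by construction.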

e) What is the expected value for the number of prior convictions?

Code
vals <- c(0, 1, 2, 3, 4)
# Every probability uses the same denominator: 810 inmates in total.
probs <- c(128/810, 434/810, 160/810, 64/810, 24/810)
exv <- weighted.mean(vals, probs)
exv
[1] 1.28642

The expected value is 1.29.

f) Calculate the variance and the standard deviation for the Prior Convictions.

Code
# Named var_x and sd_x so the base functions var() and sd() are not masked.
var_x <- sum((vals - exv)^2 * probs)
var_x
[1] 0.8562353
Code
sd_x <- sqrt(var_x)
sd_x
[1] 0.9253298

The variance is 0.8562. The standard deviation is 0.9253.
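
As a cross-check, the same moments can be computed from the raw counts, using the shortcut Var(X) = E[X^2] - (E[X])^2. This is a minimal sketch under the assumption that the counts are 128, 434, 160, 64 and 24 for 0 to 4 prior convictions. The names `ex`, `var_x` and `sd_x` are illustrative.

```r
vals   <- 0:4
counts <- c(128, 434, 160, 64, 24)
probs  <- counts / sum(counts)

# A valid probability distribution must sum to 1.
stopifnot(isTRUE(all.equal(sum(probs), 1)))

ex    <- sum(vals * probs)            # E[X]
var_x <- sum(vals^2 * probs) - ex^2   # Var(X) = E[X^2] - (E[X])^2
sd_x  <- sqrt(var_x)

round(c(mean = ex, variance = var_x, sd = sd_x), 4)
```

The `stopifnot()` line is a cheap guard: if any probability were mistyped, the sum would drift away from 1 and the script would stop before reporting a mean or variance.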