hw1
desriptive statistics
probability
Homework 1
Author

Thrishul

Published

February 5, 2023

Question 1

a

First, let’s read in the data from the Excel file:

Code
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
df <- read_excel("_data/LungCapData.xls")
head(df)
# A tibble: 6 × 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
1    6.48     6   62.1 no    male   no       
2   10.1     18   74.7 yes   female no       
3    9.55    16   69.7 no    female yes      
4   11.1     14   71   no    male   no       
5    4.8      5   56.9 no    male   no       
6    6.22    11   58.7 no    female no       

The distribution of LungCap looks as follows:

Code
hist(df$LungCap)

The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).

update and comit check

Code
# Subset the data frame by gender
male_df <- df[df$Gender == "male", ]
female_df <- df[df$Gender == "female", ]

# Create separate boxplots for males and females
boxplot(male_df$LungCap, female_df$LungCap, 
        names = c("Male", "Female"),
        xlab = "Gender", ylab = "Lung Capacity",
        main = "Lung Capacity by Gender")

Code
no_smoke <- df[df$Smoke == "no",]
yes_smoke <- df[df$Smoke == "yes",]
mean(no_smoke$LungCap)
[1] 7.770188
Code
mean(yes_smoke$LungCap)
[1] 8.645455
Code
age_group_1 <- df[df$Age <= 13, ]
age_group_2 <- df[df$Age >= 14 & df$Age <= 15, ]
age_group_3 <- df[df$Age >= 16 & df$Age <= 17, ]
age_group_4 <- df[df$Age >= 18, ]
par(mfrow=c(2,2)) # Set up a 2x2 grid of plots

boxplot(LungCap ~ Smoke, data = age_group_1, 
        names = c("Non-Smoker", "Smoker"),
        main = "Age Group <= 13")

boxplot(LungCap ~ Smoke, data = age_group_2, 
        names = c("Non-Smoker", "Smoker"),
        main = "Age Group 14-15")

boxplot(LungCap ~ Smoke, data = age_group_3, 
        names = c("Non-Smoker", "Smoker"),
        main = "Age Group 16-17")

boxplot(LungCap ~ Smoke, data = age_group_4, 
        names = c("Non-Smoker", "Smoker"),
        main = "Age Group >= 18")

Code
# Calculate the mean and standard deviation of Lung Capacity for each age group and smoking status
agg_data <- aggregate(LungCap ~ Age + Smoke, data = df, 
                      FUN = function(x) c(mean = mean(x), sd = sd(x)))

agg_data
   Age Smoke LungCap.mean LungCap.sd
1    3    no    2.9466923  1.7725478
2    4    no    2.9416667  1.1691521
3    5    no    3.4987500  1.4349371
4    6    no    4.4610000  1.4426914
5    7    no    4.6202703  1.7044248
6    8    no    5.2743902  1.5602933
7    9    no    6.6743750  1.4851993
8   10    no    6.5861702  1.3697906
9   11    no    7.4325000  1.5284734
10  12    no    7.7471311  1.5590134
11  13    no    8.2700820  1.6003504
12  14    no    8.7785000  1.4940444
13  15    no    9.4663636  1.5326404
14  16    no   10.0577778  1.4230731
15  17    no   11.0492187  1.5239153
16  18    no   10.8475000  1.4681731
17  19    no   11.2585714  1.6228260
18  10   yes    5.9812500  1.4073283
19  11   yes    7.5156250  2.1181586
20  12   yes    6.7464286  0.9939849
21  13   yes    7.8968750  1.1576174
22  14   yes    7.8291667  1.7020147
23  15   yes    8.7666667  1.1875000
24  16   yes    8.8972222  1.3884819
25  17   yes    9.7818182  1.1881756
26  18   yes   10.3903846  1.2926692
27  19   yes   11.3125000  0.6187184

#Question 2 # Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.

X 0 1 2 3 4

Frequency 128 434 160 64 24

a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code
# P(x=2)= 160/810
# P(x) = 0.1975
# p(x) = 19.75%

b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code
# Frequency of X=0 + Frequency of X=1 = 128 + 434 = 562
# p(x) = 562/810
# P(x) = 0.6938 or 69.38%
# Probability = 69.38%

c) What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code
# Frequency of X=0 + Frequency of X=1 + Frequency of X=2 = 128 + 434 + 160 = 722
# P(X ≤ 2) = frequency of X≤2 / total number of inmates
#  722/810 = 0.8914
# probability = 0.8914 or 89.14%

d) What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code
# Frequency of X=3 + Frequency of X=4 = 64 + 24 = 88
# P(X > 2) = frequency of X>2 / total number of inmates

# 88/810 = 0.1086
# Probability = 0.1086 or 10.86%

e) What is the expected value1 for the number of prior convictions?

Code
# E(X) = Σ(xi * pi)
# (0 * 128/810) + (1 * 434/810) + (2 * 160/810) + (3 * 64/810) + (4 * 24/810)
# 1.283

# Expected value for the number of prior convictions is 1.283.

f) Calculate the variance and the standard deviation for the Prior Convictions

Code
# Var(X) = E(X^2) - [E(X)]^2
# E(X^2) = Σ(xi^2 * pi)
# P(X=x) 128/810 434/810 160/810 64/810 24/810
# E(X^2) = Σ(xi^2 * pi)
# = (0^2 * 128/810) + (1^2 * 434/810) + (2^2 * 160/810) + (3^2 * 64/810) + (4^2 * 24/810)
# = 2.105

# VARIANCE
# Var(X) = E(X^2) - [E(X)]^2
# = 2.105 - (1.283)^2
# variance for the number of prior convictions is = 0.921

# STANDARD DEVIATION
# SD(X) = sqrt(Var(X))
# = sqrt(0.921)
# = 0.960
#