hw1
descriptive statistics
probability
Homework 1
Author

Diana Rinker

Published

February 17, 2023

Question 1

First, let’s read in the data from the Excel file:

Code
library(readxl)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(tidyr)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2
──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
df <- read_excel("_data/LungCapData.xls")

a) What does the distribution of LungCap look like? (Hint: Plot a histogram with probability density on the y axis)

Code
hist(df$LungCap, freq = FALSE)  # freq = FALSE puts probability density on the y axis

The histogram suggests that the distribution is approximately normal: most observations lie close to the mean, and very few lie near the extremes (around 0 and 15).
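To illustrate what "probability density on the y axis" means, here is a minimal sketch on simulated values. The vector `lung.cap.sim` and its mean and standard deviation are assumptions standing in for `df$LungCap`; they are not taken from the data file.

```r
# Simulated stand-in for df$LungCap (hypothetical parameters, not the real data)
set.seed(1)
lung.cap.sim <- rnorm(500, mean = 8, sd = 2.5)

# freq = FALSE plots probability density instead of counts
h <- hist(lung.cap.sim, freq = FALSE,
          main = "Simulated LungCap", xlab = "LungCap")

# Overlay a normal curve with the sample mean and standard deviation
curve(dnorm(x, mean = mean(lung.cap.sim), sd = sd(lung.cap.sim)), add = TRUE)

# On a density scale, the bar areas (height * width) add up to 1
total.area <- sum(h$density * diff(h$breaks))
```

If the bars follow the overlaid curve closely, that supports reading the distribution as roughly normal.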

b) Compare the probability distribution of the LungCap with respect to Males and Females? (Hint: make boxplots separated by gender using the boxplot() function)

Code
boxplot(LungCap ~ Gender, data = df)

c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code
df.grouped <- df %>%
  group_by(Smoke) %>%
  summarize(mean.Lunc.Cap = mean(LungCap))
knitr::kable(df.grouped)
Smoke mean.Lunc.Cap
no 7.770188
yes 8.645454

It is surprising that the mean lung capacity of smokers is larger than that of non-smokers. To understand the reason, I would need to break the data into subgroups.

d) Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code
df$age.group <- NA  
df$age.group<- ifelse(df$Age <=13, "under13" , df$age.group )
df$age.group<- ifelse(df$Age >=14 & df$Age <=15, "14-15" , df$age.group )
df$age.group<- ifelse(df$Age >=16 & df$Age <=17, "16-17" , df$age.group )
df$age.group<- ifelse(df$Age >=18, "18+" , df$age.group )
df$age.group <-factor (df$age.group, levels = c("under13",  "14-15", "16-17", "18+") )

df.grouped
# A tibble: 2 × 2
  Smoke mean.Lunc.Cap
  <chr>         <dbl>
1 no             7.77
2 yes            8.65
Code
ggplot (df, mapping=aes(y=LungCap, x = Smoke ))+
  geom_boxplot()+
  facet_wrap (~ age.group)
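The age grouping above could also be sketched with `cut()`, which assigns all four bins in one call. The `age` vector below is a hypothetical example chosen to hit every boundary; it is not the `Age` column itself.

```r
# Hypothetical ages covering each bin boundary
age <- c(3, 13, 14, 15, 16, 17, 18, 19)

# Intervals are right-closed by default: (-Inf,13], (13,15], (15,17], (17,Inf]
age.group <- cut(age,
                 breaks = c(-Inf, 13, 15, 17, Inf),
                 labels = c("under13", "14-15", "16-17", "18+"))

table(age.group)
```

For whole-number ages this gives the same groups as the `ifelse()` chain. One difference: a fractional age such as 13.5 would fall into "14-15" here, whereas the chain would leave it as `NA`.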

e) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

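In the table of counts within each age group, we can see that the “under 13” group makes up over 50% of the records. Within this group, there is no visible difference between smokers and non-smokers (probably because they have not smoked for long, and because only 7% of the group are smokers). Since this large group is almost entirely non-smokers, it dominates the overall non-smoker mean, while smokers are concentrated in the older age groups. This likely explains the overall means in part c.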

Code
df.grouped <- df %>%
  group_by(age.group, Smoke) %>%
  summarize(count = n())
`summarise()` has grouped output by 'age.group'. You can override using the
`.groups` argument.

To compare smokers and non-smokers accurately, we could exclude the group “under 13”.

Code
df.filtered<- df %>%
  filter(age.group != "under13")

ggplot (df.filtered, mapping=aes(y=LungCap, x = Smoke ))+
  geom_boxplot()

Now we can see the difference: in the remaining age groups, smokers tend to have a smaller lung capacity than non-smokers (note that a boxplot shows medians, not means).
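The reversal between part c and the within-group comparison can be reproduced with a toy example. All numbers below are hypothetical and chosen only to show the mechanism; none come from LungCapData.

```r
# Hypothetical group sizes and group means (not taken from the data set)
toy <- data.frame(
  age.group = c("young", "young", "old", "old"),
  Smoke     = c("no", "yes", "no", "yes"),
  n         = c(90, 10, 30, 70),
  mean.cap  = c(6, 5.5, 11, 10)
)

# Overall mean per smoking status = group means weighted by group size
overall <- sapply(split(toy, toy$Smoke),
                  function(g) weighted.mean(g$mean.cap, g$n))
overall
```

Within each toy age group the smokers' mean is lower, yet the overall smoker mean is higher, because most smokers sit in the older group with larger lung capacity.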

Question 2

Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.

Code
X <- c(0, 1, 2, 3, 4)
Frequency <- c(128,434, 160, 64, 24)

df<-tibble (X, Frequency) 
df$Probabilty <- Frequency/sum(Frequency)
knitr::kable(df)
X Frequency Probabilty
0 128 0.1580247
1 434 0.5358025
2 160 0.1975309
3 64 0.0790123
4 24 0.0296296

a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?

The probability comes directly from the empirical distribution in the table above: P(X = 2) = 160/810 ≈ 0.198.

Code
chances.of.2<- df$Probabilty[3]
knitr::kable(chances.of.2)
x
0.1975309

b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

“Fewer than 2” means X = 0 or X = 1, so we sum the first two probabilities in the table, excluding the value 2.

Code
prob.under.2 <- sum(df$Probabilty[1:2])
knitr::kable(prob.under.2   )
x
0.6938272

c) What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

“2 or fewer” means X ≤ 2, so we sum the first three probabilities in the table, this time including the value 2.

Code
prob.under.and.2 <- sum(df$Probabilty[1:3])
knitr::kable(prob.under.and.2)
x
0.891358

d) What is the probability that a randomly selected inmate has over 2 prior convictions?

“Over 2” means X = 3 or X = 4, so we sum the last two probabilities in the table. Equivalently, this is 1 − P(X ≤ 2).

Code
# P(X > 2) = P(X = 3) + P(X = 4)
prob.over.2 <- sum(df$Probabilty[4:5])
knitr::kable(prob.over.2)
x
0.108642
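The answers to parts b) to d) can also be read from a single cumulative distribution. This is a minimal sketch using the frequencies given above; the names `p` and `cdf` are introduced here for illustration only.

```r
X <- 0:4
Frequency <- c(128, 434, 160, 64, 24)
p <- Frequency / sum(Frequency)

# Cumulative probabilities: the entry named "k" is P(X <= k)
cdf <- cumsum(p)
names(cdf) <- X

p.fewer.than.2 <- cdf[["1"]]      # P(X < 2) = P(X <= 1)
p.2.or.fewer   <- cdf[["2"]]      # P(X <= 2)
p.over.2       <- 1 - cdf[["2"]]  # P(X > 2), the complement of P(X <= 2)
```

These should match the sums computed above, since each is just a partial sum of the same probability column.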

e) What is the expected value for the number of prior convictions?

\[ E(X) = \sum_{\text{all } x} x \cdot p(x) = \mu \]

Code
df$Probabilty <- Frequency/sum(Frequency)
(expected.value.of.X <- sum(df$X * df$Probabilty ))
[1] 1.28642
Code
knitr::kable(expected.value.of.X )
x
1.28642
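As a check, the same value can be worked out by hand from the frequency table, since each probability is a frequency divided by 810:

\[ E(X) = \frac{0 \cdot 128 + 1 \cdot 434 + 2 \cdot 160 + 3 \cdot 64 + 4 \cdot 24}{810} = \frac{1042}{810} \approx 1.2864 \]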

f) Calculate the variance and the standard deviation for the Prior Convictions.

Variance of a random variable: \[ \sigma^2 = E[(X-\mu)^2] = \sum_{\text{all } x}(x-\mu)^2 \cdot p(x) \]

Code
variance.X <- sum (((df$X - expected.value.of.X )^2) * df$Probabilty  )
knitr::kable(variance.X  )
x
0.8562353

Alternatively: \[ \sigma^2 = E(X^2)-[E(X)]^2 = E(X^2)-\mu^2 \]

Code
variance <- (sum((df$X^2) * df$Probabilty )) - ((expected.value.of.X)^2)
knitr::kable(variance)
x
0.8562353

The standard deviation is the square root of the variance: \[ \sigma = \sqrt{\sigma^2} \]

Code
std.dev <- sqrt(variance)  # named std.dev so it does not mask stats::sd()
knitr::kable(std.dev)
x
0.9253298
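One way to cross-check the variance and standard deviation is to expand the frequency table into one value per prisoner and use R's built-in `var()`. This is a sketch, assuming the 810 prisoners are treated as the whole population; `convictions`, `pop.variance`, and `pop.sd` are names introduced here.

```r
X <- 0:4
Frequency <- c(128, 434, 160, 64, 24)

# One entry per prisoner: 128 zeros, 434 ones, and so on
convictions <- rep(X, times = Frequency)
n <- length(convictions)

# var() divides by n - 1; rescale to divide by n (population variance)
pop.variance <- var(convictions) * (n - 1) / n
pop.sd <- sqrt(pop.variance)
```

Without the rescaling, `var()` and `sd()` return the slightly larger sample versions, so they would not match the values computed from the probability table.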