Homework 1

hw1

descriptive statistics

probability

Author

Diana Rinker

Published

February 17, 2023

Question 1

First, let’s read in the data from the Excel file:

Code

library(readxl)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

library(tidyr)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2
──

✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

df <- read_excel("_data/LungCapData.xls")

a) What does the distribution of LungCap look like? (Hint: Plot a histogram with probability density on the y axis)

Code

hist(df$LungCap, freq = FALSE)  # freq = FALSE puts probability density on the y axis

The histogram suggests that the distribution is approximately normal: most observations lie close to the mean, and very few fall near the extremes (0 and 15).

b) Compare the probability distribution of the LungCap with respect to Males and Females? (Hint: make boxplots separated by gender using the boxplot() function)

Code

boxplot(LungCap ~ Gender, data = df)

c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code

df.grouped <- df %>%
  group_by(Smoke) %>%
  summarize(mean.Lunc.Cap = mean(LungCap))
knitr::kable(df.grouped)

Smoke	mean.Lunc.Cap
no	7.770188
yes	8.645454

It is surprising that the mean lung capacity of smokers is larger than that of non-smokers. To understand the reason, I would need to break the data up into subgroups.

d) Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code

df$age.group <- NA  
df$age.group<- ifelse(df$Age <=13, "under13" , df$age.group )
df$age.group<- ifelse(df$Age >=14 & df$Age <=15, "14-15" , df$age.group )
df$age.group<- ifelse(df$Age >=16 & df$Age <=17, "16-17" , df$age.group )
df$age.group<- ifelse(df$Age >=18, "18+" , df$age.group )
df$age.group <-factor (df$age.group, levels = c("under13",  "14-15", "16-17", "18+") )

Code

ggplot(df, mapping = aes(y = LungCap, x = Smoke)) +
  geom_boxplot() +
  facet_wrap(~ age.group)
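The age groups used in the faceted plot above can also be built in a single call with base R's `cut()`. The sketch below is an alternative to the chained `ifelse()` calls, not the method used in this homework; the `ages` vector is hypothetical and only illustrates where the bin boundaries fall.

```r
# Hypothetical ages, used only to illustrate the binning (not the LungCap data)
ages <- c(3, 13, 14, 15, 16, 17, 18, 19)

# With the default right = TRUE, the bins are (-Inf, 13], (13, 15], (15, 17], (17, Inf]
age.group <- cut(
  ages,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("under13", "14-15", "16-17", "18+")
)

# cut() returns a factor whose levels are already in the order of the labels
table(age.group)
```

Because `cut()` returns an ordered set of levels directly, the separate `factor()` call is not needed in this variant.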

e) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

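The table of counts within each age group shows that the “under13” group makes up over 50% of the records. Within this group there is no visible difference between smokers and non-smokers (probably because they have not been smoking for long, and only 7% of them are smokers). Because this large, young, mostly non-smoking group likely has the smallest lung capacities, it pulls the overall non-smoker mean down, which contributes to the reversal seen in part c.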

Code

df.grouped <- df %>%
  group_by(age.group, Smoke) %>%
  summarize(count = n())

`summarise()` has grouped output by 'age.group'. You can override using the
`.groups` argument.

To compare smokers and non-smokers accurately, we could exclude the group “under 13”.

Code

df.filtered<- df %>%
  filter(age.group != "under13")

ggplot (df.filtered, mapping=aes(y=LungCap, x = Smoke ))+
  geom_boxplot()

Now the difference is visible: smokers tend to have a smaller lung capacity than non-smokers. Note that the boxplot shows medians rather than means.
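The reversal between the overall comparison and the within-group comparison can be reproduced on a tiny example. The sketch below uses hypothetical toy numbers (not the LungCap data) in which most non-smokers are young and most smokers are older, and it uses base R's `aggregate()` rather than the dplyr pipeline above.

```r
# Hypothetical toy data (NOT the LungCap data): a young group that is mostly
# non-smokers and an older group that is mostly smokers
toy <- data.frame(
  age.group = c(rep("young", 9), rep("old", 6)),
  Smoke     = c(rep("no", 8), "yes", rep("no", 2), rep("yes", 4)),
  LungCap   = c(rep(6, 8), 5.5, rep(11, 2), rep(10, 4))
)

# Overall means: smokers come out higher because most of them are in the older group
overall <- aggregate(LungCap ~ Smoke, data = toy, FUN = mean)

# Means within each age group: smokers are lower in both groups
by.age <- aggregate(LungCap ~ age.group + Smoke, data = toy, FUN = mean)

print(overall)
print(by.age)
```

In this toy data the overall mean is higher for smokers even though smokers are lower within every age group, which is the same pattern as in parts c and e.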

Question 2

Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.

Code

X <- c(0, 1, 2, 3, 4)
Frequency <- c(128,434, 160, 64, 24)

df<-tibble (X, Frequency) 
df$Probabilty <- Frequency/sum(Frequency)
knitr::kable(df)

X	Frequency	Probabilty
0	128	0.1580247
1	434	0.5358025
2	160	0.1975309
3	64	0.0790123
4	24	0.0296296

a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?

The probabilities come directly from the observed frequencies in the table above, so no distributional assumption (such as Poisson) is needed. P(X = 2) = 160/810 ≈ 0.198.

Code

chances.of.2<- df$Probabilty[3]
knitr::kable(chances.of.2)

x
0.1975309

b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

“Fewer than 2” means X = 0 or X = 1 (the value 2 is excluded), so we sum the first two probabilities in the table.

Code

prob.under.2 <- sum(df$Probabilty[1:2])
knitr::kable(prob.under.2   )

x
0.6938272

c) What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

“2 or fewer” includes the value 2, so we sum the first three probabilities in the table.

Code

prob.under.and.2 <- sum(df$Probabilty[1:3])
knitr::kable(prob.under.and.2)

x
0.891358

d) What is the probability that a randomly selected inmate has over 2 prior convictions?

“Over 2” means X = 3 or X = 4, so we sum the last two probabilities in the table. Equivalently, P(X > 2) = 1 − P(X ≤ 2).

Code

# P(X > 2) is the sum of the table probabilities for X = 3 and X = 4
prob.over.2 <- sum(df$Probabilty[4:5])
knitr::kable(prob.over.2)

x
0.108642
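Parts b through d all come from the same cumulative sums, so they can be computed in one pass. The sketch below rebuilds the frequency table from the question and uses `cumsum()`; the column names `at.most`, `fewer.than`, and `more.than` are introduced here for illustration only.

```r
X <- c(0, 1, 2, 3, 4)
Frequency <- c(128, 434, 160, 64, 24)

p <- Frequency / sum(Frequency)  # empirical P(X = x)
cum.p <- cumsum(p)               # P(X <= x)

result <- data.frame(
  X,
  p,
  at.most    = cum.p,      # P(X <= x)
  fewer.than = cum.p - p,  # P(X <  x)
  more.than  = 1 - cum.p   # P(X >  x)
)
print(round(result, 4))
```

The row for X = 2 contains the answers to parts a through d side by side.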

e) What is the expected value for the number of prior convictions?

\[ E(X) = \sum_{\text{all } x} x \cdot p(x) = \mu \]

Code

df$Probabilty <- Frequency/sum(Frequency)
(expected.value.of.X <- sum(df$X * df$Probabilty ))

[1] 1.28642

Code

knitr::kable(expected.value.of.X )

x
1.28642
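As a check, the sum can be written out by hand from the frequency table, with p(x) = Frequency / 810:

\[ E(X) = \frac{0 \cdot 128 + 1 \cdot 434 + 2 \cdot 160 + 3 \cdot 64 + 4 \cdot 24}{810} = \frac{1042}{810} \approx 1.286 \]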

f) Calculate the variance and the standard deviation for the Prior Convictions.

Variance of a random variable: \[ \sigma^2 = E[(X-\mu)^2] = \sum_{\text{all } x}(x-\mu)^2 \cdot p(x) \]

Code

variance.X <- sum (((df$X - expected.value.of.X )^2) * df$Probabilty  )
knitr::kable(variance.X  )

x
0.8562353

Alternatively: \[ \sigma^2 = E(X^2)-[E(X)]^2 = E(X^2)-\mu^2 \]

Code

variance <- (sum((df$X^2) * df$Probabilty )) - ((expected.value.of.X)^2)
knitr::kable(variance)

x
0.8562353

The standard deviation is the square root of the variance: \[ \sigma = \sqrt{\sigma^2} \]

Code

# Named sd.X so that it does not shadow the built-in sd() function
sd.X <- sqrt(variance)
knitr::kable(sd.X)

x
0.9253298
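One way to sanity-check these results is to expand the frequency table into one observation per inmate and use the built-in summary functions. This is a sketch under the assumption that the 810 inmates are treated as the whole population, so the sample variance returned by `var()` (which divides by n − 1) is rescaled to divide by n; the names `obs`, `pop.var`, and `pop.sd` are introduced here for illustration.

```r
X <- c(0, 1, 2, 3, 4)
Frequency <- c(128, 434, 160, 64, 24)

# One entry per inmate: 128 zeros, 434 ones, and so on
obs <- rep(X, times = Frequency)
n <- length(obs)

mean.obs <- mean(obs)              # should match E(X)
pop.var <- var(obs) * (n - 1) / n  # rescale var() from n - 1 to n
pop.sd <- sqrt(pop.var)

print(c(mean = mean.obs, variance = pop.var, sd = pop.sd))
```

The three values agree with the expected value, variance, and standard deviation computed from the probability table.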