a) What does the distribution of LungCap look like? (Hint: Plot a histogram with probability density on the y axis)
Code
hist(df$LungCap)
The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).
b) Compare the probability distribution of the LungCap with respect to Males and Females? (Hint: make boxplots separated by gender using the boxplot() function)
Code
boxplot( LungCap ~ Gender , data =df)
c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?
It is surprising that smoker’s lung capacity mean is larger than for nonsmokers. T o understand the reason, I would need to break up the data into subgroups.
d) Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.
# A tibble: 2 × 2
Smoke mean.Lunc.Cap
<chr> <dbl>
1 no 7.77
2 yes 8.65
Code
ggplot (df, mapping=aes(y=LungCap, x = Smoke ))+geom_boxplot()+facet_wrap (~ age.group)
d) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c. What could possibly be going on here?
In the table of counts within each age group we can see that the group ’under 13” is over 50 of the records. Within this group, there is no difference between smokers and non-smokers (probably because of the length of smoking and the fact that only 7% are smokers), which contibutes to overall sample mean.
b) What is the probability that a randomly selected inmate has fewer than 2 prior convictions?
To calculate “fewer than2”, we will use cumulative probability for Poisson distribution with default “lower.tail =T”, for value “1” to exclude value “2”.
Standard deviation is a square root of variance: \[
\sigma = \sqrt variance
\]
Code
sd<-sqrt(variance)knitr::kable(sd)
x
0.9253298
Source Code
---title: "Homework 1"author: "Diana Rinker"description: "Homework 1"date: "02/17/2023"format: html: toc: true code-fold: true code-copy: true code-tools: truecategories: - hw1 - desriptive statistics - probability---# Question 1First, let's read in the data from the Excel file:```{r, echo=T}library(readxl)library(dplyr)library(tidyr)library(tidyverse)df <-read_excel("_data/LungCapData.xls")```#### a) What does the distribution of LungCap look like? (Hint: Plot a histogram with probability density on the y axis)```{r, echo=T}hist(df$LungCap)```The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).#### b) Compare the probability distribution of the LungCap with respect to Males and Females? (Hint: make boxplots separated by gender using the boxplot() function)```{r, echo=T}boxplot( LungCap ~ Gender , data =df)```#### c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?```{r, echo=T}df.grouped<- df %>%group_by(Smoke)%>%summarize (mean.Lunc.Cap =mean (LungCap))knitr::kable(df.grouped)```It is surprising that smoker's lung capacity mean is larger than for nonsmokers. T o understand the reason, I would need to break up the data into subgroups.#### d) Examine the relationship between Smoking and Lung Capacity within age groups: "less than or equal to 13", "14 to 15", "16 to 17", and "greater than or equal to 18".```{r, echo=T}df$age.group <-NAdf$age.group<-ifelse(df$Age <=13, "under13" , df$age.group )df$age.group<-ifelse(df$Age >=14& df$Age <=15, "14-15" , df$age.group )df$age.group<-ifelse(df$Age >=16& df$Age <=17, "16-17" , df$age.group )df$age.group<-ifelse(df$Age >=18, "18+" , df$age.group )df$age.group <-factor (df$age.group, levels =c("under13", "14-15", "16-17", "18+") )df.groupedggplot (df, mapping=aes(y=LungCap, x = Smoke ))+geom_boxplot()+facet_wrap (~ age.group)```#### d) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c. What could possibly be going on here?In the table of counts within each age group we can see that the group 'under 13" is over 50 of the records. Within this group, there is no difference between smokers and non-smokers (probably because of the length of smoking and the fact that only 7% are smokers), which contibutes to overall sample mean.```{r, echo=T}df.grouped<-df%>%group_by(age.group, Smoke)%>%summarize (count=n())```To compare smokers and non-smokers accurately, we could exclude the group "under 13".```{r, echo=T}df.filtered<- df %>%filter(age.group !="under13")ggplot (df.filtered, mapping=aes(y=LungCap, x = Smoke ))+geom_boxplot()```Now we can see the difference between mean, where smokers have smaller Lung capacity.# Question 2Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.```{r, echo=T}X <-c(0, 1, 2, 3, 4)Frequency <-c(128,434, 160, 64, 24)df<-tibble (X, Frequency) df$Probabilty <- Frequency/sum(Frequency)knitr::kable(df)```#### a) What is the probability that a randomly selected inmate has exactly 2 prior convictions?Number of prior convictions of inmates has Poisson distribution. Probability of X = is 0.197```{r, echo=T}chances.of.2<- df$Probabilty[3]knitr::kable(chances.of.2)```#### b) What is the probability that a randomly selected inmate has **fewer than 2** prior convictions?To calculate "fewer than2", we will use cumulative probability for Poisson distribution with default "lower.tail =T", for value "1" to exclude value "2".```{r, echo=T}prob.under.2<-sum(df$Probabilty[1:2])knitr::kable(prob.under.2 )```#### c) What is the probability that a randomly selected inmate has **2 or fewer** prior convictions?To calculate "fewer than2", we will use cumulative probability for Poisson distribution with default "lower.tail =T":It will include value of "2".```{r, echo=T}prob.under.and.2<-sum(df$Probabilty[1:3])knitr::kable(prob.under.and.2)```#### d) What is the probability that a randomly selected inmate has **over 2** prior convictions?To calculate "over than 2", we will use cumulative probability for Poisson distribution with "lower.tail =F":```{r, echo=T}lambda<-mean (df$X)prob.over.2<-sum(df$Probabilty[4:5] )knitr::kable(prob.over.2 )```#### e) What is the expected value for the number of prior convictions$$E(X) = \sum_{all x} x \cdot p(x) = \mu $$```{r, echo=T}df$Probabilty <- Frequency/sum(Frequency)(expected.value.of.X <-sum(df$X * df$Probabilty ))knitr::kable(expected.value.of.X )```#### f) Calculate the variance and the standard deviation for the Prior Convictions.Variance of a random variable: $$\sigma^2 = E[(X-\mu)^2] = \sum_{all x}(x-\mu)^2 \cdot p(x)$$```{r, echo=T}variance.X <-sum (((df$X - expected.value.of.X )^2) * df$Probabilty )knitr::kable(variance.X )```Alternatively: $$\sigma^2 = E(X^2)-[E(X)]^2 = E(X^2)-\mu^2$$```{r, echo=T}variance <- (sum((df$X^2) * df$Probabilty )) - ((expected.value.of.X)^2)knitr::kable(variance)```Standard deviation is a square root of variance: $$\sigma = \sqrt variance$$```{r, echo=T}sd<-sqrt(variance)knitr::kable(sd)```