Code
library(tidyverse)
library(dplyr)
library(readxl)
library(ggplot2)
::opts_chunk$set(echo = TRUE) knitr
Steph Roberts
October 3, 2022
##1. Use the LungCapData to answer the following questions. (Hint: Using dplyr, especiallygroup_by() and summarize() can help you answer the following questions relatively efficiently.)
# A tibble: 6 × 6
LungCap Age Height Smoke Gender Caesarean
<dbl> <dbl> <dbl> <chr> <chr> <chr>
1 6.48 6 62.1 no male no
2 10.1 18 74.7 yes female no
3 9.55 16 69.7 no female yes
4 11.1 14 71 no male no
5 4.8 5 56.9 no male no
6 6.22 11 58.7 no female no
#Summarize
LungCap Age Height Smoke
Min. : 0.507 Min. : 3.00 Min. :45.30 Length:725
1st Qu.: 6.150 1st Qu.: 9.00 1st Qu.:59.90 Class :character
Median : 8.000 Median :13.00 Median :65.40 Mode :character
Mean : 7.863 Mean :12.33 Mean :64.84
3rd Qu.: 9.800 3rd Qu.:15.00 3rd Qu.:70.30
Max. :14.675 Max. :19.00 Max. :81.80
Gender Caesarean
Length:725 Length:725
Class :character Class :character
Mode :character Mode :character
#a. What does the distribution of LungCap look like? (Hint: Plot a histogram with probability density on the y axis)
The histogram follows a distribution close to normal distibution. In fact, if we change binwidth slightly, it appears even closer to normal distribution.
This helps illustrate the importance of binwidth and what it can do to our visualization interpretations.
#b. Compare the probability distribution of the LungCap with respect to Males and Females? (Hint: make boxplots separated by gender using the boxplot() function)
The distribution of male lung capacity is larger and longer than females’.
#c. Compare the mean lung capacities for smokers and non-smokers. Does it make sense?
[1] 8.645455
[1] 7.770188
It does not make sense at face value. In this sample, smokers have a higher mean lung capacity than non-smokers. Let’s check how big each subsample is.
As suspected, there are far more, almost 10 times as many, non-smokers. If we could gather data from all the smokers, perhaps our means would look a lot different. Maybe our sample was taken from young people whose lungs have not been long affected by the smoking.
Again, as suspected, our sample of smokers is a young age. Therefore, the lack of difference in lung capacity between smokers and non-smokers is not too surprising.
#d. Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.
14 to 15 16 TO 17
120 97
greater than or equal to 18 less than or equal to 13
80 428
These histograms suggest that participants 13 or younger have smaller lung capacity. The Lung capacity seems to generally increase with age as children grow.
#e. Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c. What could possibly be going on here?
This visualization starts to explain furthermore why there is an unexpected result for lung capacity in smokers vs. non-smokers. As we have deducted, lung capacity generally improves with age (in growing years). However, teenagers approaching adulthood are also a group more likely to have access or influence to smoking cigarettes. It is likely that our smokers account for some of the older participants, who happen to be closer to normal smoking age.
#f. Calculate the correlation and covariance between Lung Capacity and Age. (use the cov() and cor() functions in R). Interpret your results.
[1] 8.738289
[1] 0.8196749
A positive coraviance (8.74) indicates lung capacity and age tend to increase together. The positive correlation relatively close to 1 (0.82) indicates there is a fairly strong correlation between the variables.
##2. Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.
[1] 2 1 1 0 1 1 2 1 0 1
[1] 810
[1] 1.28642
[1] 1.28642
#a. What is the probability that a randomly selected inmate has exactly 2 prior convictions?
[1] 0.3201613
The probability of 2 convications in 0.32.
#b. What is the probability that a randomly selected inmate has fewer than 2 prior convictions?
[1] 0.4593924
The probability of <2 convictions is 0.46.
#c. What is the probability that a randomly selected inmate has 2 or fewer prior convictions?
[1] 0.7795537
The probability of less than or equal to 2 convictions is 0.78.
#d. What is the probability that a randomly selected inmate has more than 2 prior convictions?
[1] 0.2204463
The probability of greater than 2 convictions is 0.22.
#e. What is the expected value for the number of prior convictions?
[1] 0.1641252
[1] 0.410739
[1] 0.3201613
[1] 0.07772916
[1] 0.005877753
[1] 1.30776
[1] 1.28642
The expected value is 1.31.
#f. Calculate the variance and the standard deviation for the Prior Convictions.
---
title: "Homework 1"
author: "Steph Roberts"
desription: "Homework 1 - Descriptive Statistics"
date: "10/3/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw1
- challenge1
- Steph Roberts
- dataset
- ggplot2
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(dplyr)
library(readxl)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE)
```
### Homework 1
##1. Use the LungCapData to answer the following questions. (Hint: Using dplyr, especiallygroup_by() and summarize() can help you answer the following questions relatively efficiently.)
```{r}
df<- read_excel("_data/LungCapData.xls")
head(df)
```
#Summarize
```{r}
summary(df)
```
```{r}
mean(df$LungCap)
```
```{r}
median(df$LungCap)
```
```{r}
var(df$LungCap)
```
```{r}
sd(df$LungCap)
```
```{r}
min(df$LungCap)
max(df$LungCap)
```
#a. What does the distribution of LungCap look like? (Hint: Plot a histogram with probability density on the y axis)
```{r}
ggplot(df, aes(x=LungCap)) +
geom_histogram(binwidth=0.5,col='black',fill='gray')
```
The histogram follows a distribution close to normal distibution. In fact, if we change binwidth slightly, it appears even closer to normal distribution.
```{r}
ggplot(df, aes(x=LungCap)) +
geom_histogram(binwidth=1,col='black',fill='gray')
```
This helps illustrate the importance of binwidth and what it can do to our visualization interpretations.
#b. Compare the probability distribution of the LungCap with respect to Males and Females? (Hint: make boxplots separated by gender using the boxplot() function)
```{r}
ggplot(df, aes(x = LungCap, y = Gender)) +
geom_boxplot()
```
The distribution of male lung capacity is larger and longer than females'.
#c. Compare the mean lung capacities for smokers and non-smokers. Does it make sense?
```{r}
df %>%
filter(Smoke == 'yes') %>%
pull(LungCap) %>%
mean()
df %>%
filter(Smoke == 'no') %>%
pull(LungCap) %>%
mean()
```
It does not make sense at face value. In this sample, smokers have a higher mean lung capacity than non-smokers. Let's check how big each subsample is.
```{r}
length(which(df$Smoke == 'yes'))
length(which(df$Smoke == 'no'))
```
As suspected, there are far more, almost 10 times as many, non-smokers. If we could gather data from all the smokers, perhaps our means would look a lot different. Maybe our sample was taken from young people whose lungs have not been long affected by the smoking.
```{r}
df %>%
filter(Smoke == 'yes') %>%
pull(Age) %>%
median()
```
Again, as suspected, our sample of smokers is a young age. Therefore, the lack of difference in lung capacity between smokers and non-smokers is not too surprising.
#d. Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.
```{r}
#Create age groups
df <- df %>%
mutate(agegroup = case_when(
Age <= 13 ~ "less than or equal to 13",
Age >= 14 & Age <= 15 ~ "14 to 15",
Age >= 16 & Age <= 17 ~ "16 TO 17",
Age >= 18 ~ "greater than or equal to 18"))
table(df$agegroup)
df %>%
filter(Smoke == 'yes') %>%
ggplot(aes(x=LungCap)) +
geom_histogram(binwidth=1,col='black',fill='gray')+
facet_wrap(~agegroup)
```
These histograms suggest that participants 13 or younger have smaller lung capacity. The Lung capacity seems to generally increase with age as children grow.
#e. Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c. What could possibly be going on here?
```{r}
ggplot(df, aes(x = LungCap,
fill = agegroup)) +
geom_density(alpha = 0.4)+
facet_wrap(~Smoke)
```
This visualization starts to explain furthermore why there is an unexpected result for lung capacity in smokers vs. non-smokers. As we have deducted, lung capacity generally improves with age (in growing years). However, teenagers approaching adulthood are also a group more likely to have access or influence to smoking cigarettes. It is likely that our smokers account for some of the older participants, who happen to be closer to normal smoking age.
#f. Calculate the correlation and covariance between Lung Capacity and Age. (use the cov() and cor() functions in R). Interpret your results.
```{r}
cov(df$LungCap, df$Age) #calculate covariance
cor(df$LungCap, df$Age) #calculate correlation
```
A positive coraviance (8.74) indicates lung capacity and age tend to increase together. The positive correlation relatively close to 1 (0.82) indicates there is a fairly strong correlation between the variables.
##2. Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.
```{r}
#create the sample
x<-rep(c(0,1,2,3,4),times=c(128, 434, 160, 64, 24))
sample(x, 10)
#Verify n of sample
sum(128, 434, 160, 64, 24)
```
```{r}
#Calculate the mean
mean(x)
#Verify the mean
sample_mean <- (((128*0)+(434*1)+(160*2)+(64*3)+(24*4))/810)
print(sample_mean)
```
```{r}
#Calculate the sd
sd(x)
```
#a. What is the probability that a randomly selected inmate has exactly 2 prior convictions?
```{r}
#probability of 2 convictions?
dnorm.convict <- dnorm(2, mean(x), sd(x))
print(dnorm.convict)
```
The probability of 2 convications in 0.32.
#b. What is the probability that a randomly selected inmate has fewer than 2 prior convictions?
```{r}
#probability of <2 convictions
less.than <- pnorm(2, mean(x), sd(x)) - dnorm.convict
print(less.than)
```
The probability of <2 convictions is 0.46.
#c. What is the probability that a randomly selected inmate has 2 or fewer prior convictions?
```{r}
#probability of =<2 convictions?
pnorm.convict <- pnorm(2, mean(x), sd(x))
print(pnorm.convict)
```
The probability of less than or equal to 2 convictions is 0.78.
#d. What is the probability that a randomly selected inmate has more than 2 prior convictions?
```{r}
#probability of >2 convictions?
greater.than <- 1 - pnorm.convict
print(greater.than)
```
The probability of greater than 2 convictions is 0.22.
```{r}
#Verify all probabilities add to 1
less.than + dnorm.convict + greater.than
```
#e. What is the expected value for the number of prior convictions?
```{r}
# Expected value of a probability distribution can be found with μ = Σx * P(x), where x = data value and P(x) = probability of data.
#Calculate probabilities of data
p0 <- dnorm(0, mean(x), sd(x))
p0
p1 <- dnorm(1, mean(x), sd(x))
p1
p2 <- dnorm(2, mean(x), sd(x))
p2
p3 <- dnorm(3, mean(x), sd(x))
p3
p4 <- dnorm(4, mean(x), sd(x))
p4
#Calculate expected value
ev <- sum((0*p0), (1*p1), (2*p2), (3*p3), (4*p4))
ev
#The expected value should be close to the mean in a normal distribution
mean(x)
```
The expected value is 1.31.
#f. Calculate the variance and the standard deviation for the Prior Convictions.
```{r}
#Calculate variance
var(x)
```
```{r}
#Calculate the sd
sd(x)
```