Code
library(tidyverse)
::opts_chunk$set(echo = TRUE) knitr
Karen Detter
October 3, 2022
##Read in data from Excel file
The histogram suggests that the distribution is close to a normal distribution - most of the observations are close to the mean, with very few close to the margins (0 and 15).
The boxplots show that male lung capacity has a wider range than that of females; however, the minimum, median, and maximum values are all higher than those of females. This implies that, as a group, men are likely to have higher lung capacity than women.
# A tibble: 2 × 3
Smoke mean n
<chr> <dbl> <int>
1 no 7.77 648
2 yes 8.65 77
In this dataset, the mean lung capacity of smokers is actually higher than that of non-smokers. Since this is counter to what would be expected, there is likely another variable exerting a confounding effect on lung capacity.
`summarise()` has grouped output by 'AgeGroup'. You can override using the
`.groups` argument.
# A tibble: 8 × 4
# Groups: AgeGroup [4]
AgeGroup Smoke MeanLungCap n
<chr> <chr> <dbl> <int>
1 14 to 15 no 9.14 105
2 14 to 15 yes 8.39 15
3 16 to 17 no 10.5 77
4 16 to 17 yes 9.38 20
5 greater than or equal to 18 no 11.1 65
6 greater than or equal to 18 yes 10.5 15
7 less than or equal to 13 no 6.36 401
8 less than or equal to 13 yes 7.20 27
When lung capacity data is further broken down by age group, the lung capacities of smokers and non-smokers appear to be more in line with expectations. The one exception is the 13 and under age category - here, mean lung capacity is actually higher for smokers. This anomaly could be due to the fact that the number of observations is significantly higher for this age group than any of the others, likely resulting in a wider range of lung capacities. Also, this age category, which includes ages 3 through 13, covers a broader scope of ages than any of the other categories, likely producing the paradox of a smaller number of smokers exhibiting higher lung capacities than their cohorts simply because they are older.
[1] 0.8196749
[1] 8.738289
Since the correlation coefficient is close to 1, there is a high degree of correlation between lung capacity and age. The covariance of 8.7, being a positive number, indicates that as age increases, lung capacity increases.
probability = frequency/n
probability = frequency(0)/n + frequency(1)/n
probability = frequency(0)/n + frequency(1)/n + frequency(2)/n
probability = frequency(3)/n + frequency(4)/n
---
title: "Homework 1"
author: "Karen Detter"
description: "The first homework on descriptive statistics and probability"
date: "10/03/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw1
- descriptive statistics
- probability
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```
# Q.1a
##Read in data from Excel file
```{r}
library(readxl)
LungCapData <- read_excel('_data/LungCapData.xls')
```
## Plot histogram with probability density on the y axis
```{r}
hist(LungCapData$LungCap, freq = FALSE)
```
The histogram suggests that the distribution is close to a normal distribution - most of the observations are close to the mean, with very few close to the margins (0 and 15).
# Q.1b
## Create boxplots separated by gender
```{r}
boxplot(LungCap ~ Gender, data = LungCapData, horizontal = TRUE)
```
The boxplots show that male lung capacity has a wider range than that of females; however, the minimum, median, and maximum values are all higher than those of females. This implies that, as a group, men are likely to have higher lung capacity than women.
# Q.1c
## Group by smoking status and summarize mean lung capacities
```{r}
library(dplyr)
LungCapData %>%
group_by(Smoke) %>%
summarize(mean = mean(LungCap), n = n())
```
In this dataset, the mean lung capacity of smokers is actually higher than that of non-smokers. Since this is counter to what would be expected, there is likely another variable exerting a confounding effect on lung capacity.
# Q.1d
## Create new data frame with age group category variables
```{r}
LungCapData_AgeGroups <- LungCapData %>%
mutate(AgeGroup = case_when(Age <= 13 ~ "less than or equal to 13",
Age == 14 | Age == 15 ~ "14 to 15",
Age == 16 | Age == 17 ~ "16 to 17",
Age >= 18 ~ "greater than or equal to 18"))
```
## Summarize mean lung capacities by age group and smoking status
```{r}
LungCapData_AgeGroups %>%
group_by(AgeGroup, Smoke) %>%
summarize(MeanLungCap = mean(LungCap), n = n())
```
# Q.1e
When lung capacity data is further broken down by age group, the lung capacities of smokers and non-smokers appear to be more in line with expectations. The one exception is the 13 and under age category - here, mean lung capacity is actually higher for smokers. This anomaly could be due to the fact that the number of observations is significantly higher for this age group than any of the others, likely resulting in a wider range of lung capacities. Also, this age category, which includes ages 3 through 13, covers a broader scope of ages than any of the other categories, likely producing the paradox of a smaller number of smokers exhibiting higher lung capacities than their cohorts simply because they are older.
# Q.1f
## Calculate correlation and covariance between lung capacity and age
```{r}
cor(LungCapData$LungCap, LungCapData$Age)
cov(LungCapData$LungCap, LungCapData$Age)
```
Since the correlation coefficient is close to 1, there is a high degree of correlation between lung capacity and age. The covariance of 8.7, being a positive number, indicates that as age increases, lung capacity increases.
# Q.2a
## Create data frame
```{r}
PriorConv <- c(0,1,2,3,4)
Freq <- c(128,434,160,64,24)
PrisonerData <- data.frame (PriorConv, Freq)
PrisonerData
```
## Calculate probability that an inmate has == 2 prior convictions
probability = frequency/n
```{r}
160/810
```
# Q.2b
## Calculate probability that an inmate has < 2 prior convictions
probability = frequency(0)/n + frequency(1)/n
```{r}
(128/810) + (434/810)
```
# Q.2c
## Calculate probability that an inmate has <= 2 prior convictions
probability = frequency(0)/n + frequency(1)/n + frequency(2)/n
```{r}
(128/810) + (434/810) + (160/810)
```
# Q.2d
## Calculate probability that an inmate has > 2 prior convictions
probability = frequency(3)/n + frequency(4)/n
```{r}
(64/810) + (24/810)
```
# Q.2e
## Calculate expected value for number of prior convictions
## Create a matrix of prior conviction values and their probabilities
```{r}
PriorConv <- c(0,1,2,3,4)
Probs <- c(0.1580247, 0.5358025, 0.1975309, 0.07901235, 0.02962963)
```
## Calculate expected value
```{r}
c(PriorConv %*% Probs)
```
# Q.2f
## Calculate variance and standard deviation for prior convictions
```{r}
var(PriorConv)
sd(PriorConv)
```
## Double-check values
```{r}
sqrt(var(PriorConv)) == sd(PriorConv)
```