hw1
descriptive statistics
probability
The first homework on descriptive statistics and probability
Author

Karen Detter

Published

October 3, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Q.1a

Read in data from Excel file

Code
library(readxl)
LungCapData <- read_excel('_data/LungCapData.xls')

Plot histogram with probability density on the y axis

Code
hist(LungCapData$LungCap, freq = FALSE)

The histogram suggests that lung capacity is approximately normally distributed: most observations cluster near the mean, and very few fall near the extremes of the plotted range (0 and 15).
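One rough numerical companion to the visual check is the empirical rule: for normal data, about 68% of values fall within one standard deviation of the mean and about 95% within two. The sketch below is only an illustration. The helper `share_within_sd` is a hypothetical name, and the simulated vector is a stand-in for `LungCapData$LungCap` with an arbitrary mean and standard deviation, not the real values.

```r
# Hypothetical helper: share of observations within k standard deviations of the mean
share_within_sd <- function(x, k = 1) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Simulated stand-in values (NOT the real LungCap column); mean and sd are arbitrary
set.seed(1)
sim_cap <- rnorm(725, mean = 7.9, sd = 2.7)

share_within_sd(sim_cap, 1)  # near 0.68 when the data are roughly normal
share_within_sd(sim_cap, 2)  # near 0.95 when the data are roughly normal
```

With the real data loaded, the same helper could be called as `share_within_sd(LungCapData$LungCap, 1)`.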

Q.1b

Create boxplots separated by gender

Code
boxplot(LungCap ~ Gender, data = LungCapData, horizontal = TRUE)

The boxplots show that male lung capacity spans a wider range than female lung capacity, and that the minimum, median, and maximum values for males are all higher than the corresponding values for females. This suggests that, as a group, males in this dataset tend to have higher lung capacity than females.

Q.1c

Group by smoking status and summarize mean lung capacities

Code
library(dplyr)
LungCapData %>%
  group_by(Smoke) %>%
  summarize(mean = mean(LungCap), n = n())
# A tibble: 2 × 3
  Smoke  mean     n
  <chr> <dbl> <int>
1 no     7.77   648
2 yes    8.65    77

In this dataset, the mean lung capacity of smokers (8.65) is actually higher than that of non-smokers (7.77). Since this runs counter to what would be expected, another variable is likely confounding the relationship between smoking and lung capacity.

Q.1d

Create new data frame with age group category variables

Code
LungCapData_AgeGroups <- LungCapData %>%
  mutate(AgeGroup = case_when(
    Age <= 13 ~ "less than or equal to 13",
    Age == 14 | Age == 15 ~ "14 to 15",
    Age == 16 | Age == 17 ~ "16 to 17",
    Age >= 18 ~ "greater than or equal to 18"
  ))

Summarize mean lung capacities by age group and smoking status

Code
LungCapData_AgeGroups %>%
  group_by(AgeGroup, Smoke) %>%
  summarize(MeanLungCap = mean(LungCap), n = n())
`summarise()` has grouped output by 'AgeGroup'. You can override using the
`.groups` argument.
# A tibble: 8 × 4
# Groups:   AgeGroup [4]
  AgeGroup                    Smoke MeanLungCap     n
  <chr>                       <chr>       <dbl> <int>
1 14 to 15                    no           9.14   105
2 14 to 15                    yes          8.39    15
3 16 to 17                    no          10.5     77
4 16 to 17                    yes          9.38    20
5 greater than or equal to 18 no          11.1     65
6 greater than or equal to 18 yes         10.5     15
7 less than or equal to 13    no           6.36   401
8 less than or equal to 13    yes          7.20    27
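Because AgeGroup is stored as text, the summary sorts alphabetically and the youngest group lands in the last rows. One possible alternative, sketched here with hypothetical ages rather than the real data, is to bin with base R's `cut()` and request an ordered factor so the groups sort by age. This assumes whole-number ages; a fractional age such as 13.5 would fall into "14 to 15" under this binning.

```r
# Hypothetical ages for illustration only (not taken from LungCapData)
ages <- c(3, 13, 14, 15, 16, 17, 18, 19)

age_group <- cut(
  ages,
  # Right-closed bins: (-Inf, 13], (13, 15], (15, 17], (17, Inf]
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("less than or equal to 13", "14 to 15",
             "16 to 17", "greater than or equal to 18"),
  ordered_result = TRUE
)

# Counts are listed in age order because the factor levels are ordered
table(age_group)
```

Used inside `mutate()`, the same call would replace the `case_when()` step, and grouped summaries would then list the 13-and-under rows first.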

Q.1e

Once the data are broken down by age group, the comparison between smokers and non-smokers falls in line with expectations: in every group aged 14 and older, smokers have a lower mean lung capacity than non-smokers. The one exception is the 13-and-under category, where mean lung capacity is higher for smokers (7.20 vs. 6.36). This category has by far the most observations (428), so it is likely to contain a wider range of lung capacities than the others. More importantly, it spans ages 3 through 13, a much broader range of ages than any other category. Its few smokers (27) are probably concentrated at the older end of that range, which would produce the paradox of smokers showing higher lung capacities than their peers simply because they are older.
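The age effect can be checked directly from the summary table in Q.1d. The sketch below retypes the group means and counts from that table (with shortened age-group labels), rebuilds the overall means as count-weighted averages, and compares how much of each smoking group falls in the youngest category. The object names are illustrative.

```r
# Group means and counts copied from the Q.1d summary table
summary_tbl <- data.frame(
  AgeGroup    = rep(c("<=13", "14-15", "16-17", ">=18"), each = 2),
  Smoke       = rep(c("no", "yes"), times = 4),
  MeanLungCap = c(6.36, 7.20, 9.14, 8.39, 10.5, 9.38, 11.1, 10.5),
  n           = c(401, 27, 105, 15, 77, 20, 65, 15)
)

by_smoke <- split(summary_tbl, summary_tbl$Smoke)

# Overall mean per smoking status = count-weighted average of the group means
overall <- sapply(by_smoke, function(d) weighted.mean(d$MeanLungCap, d$n))

# Share of each smoking group that falls in the 13-and-under category
young_share <- sapply(by_smoke, function(d) d$n[d$AgeGroup == "<=13"] / sum(d$n))

round(overall, 2)
round(young_share, 2)
```

Up to rounding in the table, the weighted averages reproduce the overall means from Q.1c. Roughly 62% of non-smokers but only about 35% of smokers are 13 or under, so the non-smoker average is pulled down by its much younger composition.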

Q.1f

Calculate correlation and covariance between lung capacity and age

Code
cor(LungCapData$LungCap, LungCapData$Age)
[1] 0.8196749
Code
cov(LungCapData$LungCap, LungCapData$Age)
[1] 8.738289

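The correlation coefficient of 0.82 is close to 1, indicating a strong positive linear relationship between lung capacity and age. The covariance of 8.7 is positive, which likewise indicates that lung capacity tends to increase as age increases; its magnitude depends on the units of both variables, so only its sign is directly interpretable.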
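The link between the two statistics can be made concrete: correlation is covariance divided by the product of the two standard deviations, which is why it is unit-free while covariance is not. The sketch below uses small made-up vectors, not the LungCapData columns, to show both points.

```r
# Made-up illustrative values, not taken from LungCapData
age_years <- c(5, 8, 11, 14, 17)
capacity  <- c(4.1, 6.0, 7.2, 9.5, 10.8)

# Correlation is covariance divided by the product of the standard deviations
manual_cor <- cov(age_years, capacity) / (sd(age_years) * sd(capacity))
all.equal(manual_cor, cor(age_years, capacity))

# Changing units (years to months) scales the covariance by 12
# but leaves the correlation unchanged
age_months <- age_years * 12
cov_ratio  <- cov(age_months, capacity) / cov(age_years, capacity)
cor_same   <- all.equal(cor(age_months, capacity), cor(age_years, capacity))
```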

Q.2a

Create data frame

Code
PriorConv <- c(0,1,2,3,4)
Freq <- c(128,434,160,64,24)
PrisonerData <- data.frame (PriorConv, Freq)
PrisonerData
  PriorConv Freq
1         0  128
2         1  434
3         2  160
4         3   64
5         4   24

Calculate probability that an inmate has exactly 2 prior convictions

probability = frequency/n, where n = 810 is the total number of inmates (the sum of Freq)

Code
160/810
[1] 0.1975309

Q.2b

Calculate probability that an inmate has fewer than 2 prior convictions

probability = frequency(0)/n + frequency(1)/n

Code
(128/810) + (434/810)
[1] 0.6938272

Q.2c

Calculate probability that an inmate has 2 or fewer prior convictions

probability = frequency(0)/n + frequency(1)/n + frequency(2)/n

Code
(128/810) + (434/810) + (160/810)
[1] 0.891358

Q.2d

Calculate probability that an inmate has more than 2 prior convictions

probability = frequency(3)/n + frequency(4)/n

Code
(64/810) + (24/810)
[1] 0.108642
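The four probabilities in Q.2a through Q.2d can also be computed from the data frame itself instead of retyping the counts. This is a minimal sketch; `prob_where` is a hypothetical helper introduced only for illustration.

```r
PrisonerData <- data.frame(
  PriorConv = c(0, 1, 2, 3, 4),
  Freq      = c(128, 434, 160, 64, 24)
)

# Hypothetical helper: probability that the number of prior convictions
# falls in the rows selected by the logical vector `keep`
prob_where <- function(keep) {
  sum(PrisonerData$Freq[keep]) / sum(PrisonerData$Freq)
}

p_eq2 <- prob_where(PrisonerData$PriorConv == 2)
p_lt2 <- prob_where(PrisonerData$PriorConv < 2)
p_le2 <- prob_where(PrisonerData$PriorConv <= 2)
p_gt2 <- prob_where(PrisonerData$PriorConv > 2)

# Complement rule: P(X > 2) = 1 - P(X <= 2)
all.equal(p_gt2, 1 - p_le2)
```

Deriving n with `sum(Freq)` avoids hard-coding 810, and the complement check is a quick way to confirm that Q.2c and Q.2d agree.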

Q.2e

Calculate expected value for number of prior convictions

Create vectors of prior conviction values and their probabilities

Code
PriorConv <- c(0,1,2,3,4)
# Derive probabilities from the frequencies rather than typing rounded values
Probs <- Freq / sum(Freq)

Calculate expected value

Code
c(PriorConv %*% Probs)
[1] 1.28642
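The expected value is the probability-weighted sum of the possible values, and it can be cross-checked in more than one way. The sketch below redefines the vectors so it runs on its own; the three `ev_*` names are illustrative.

```r
PriorConv <- c(0, 1, 2, 3, 4)
Freq      <- c(128, 434, 160, 64, 24)

# Derive probabilities from the counts so no rounding is introduced
Probs <- Freq / sum(Freq)
stopifnot(isTRUE(all.equal(sum(Probs), 1)))

# Three equivalent routes to the expected value
ev_sum      <- sum(PriorConv * Probs)          # sum of value * probability
ev_weighted <- weighted.mean(PriorConv, Freq)  # frequency-weighted mean
ev_expanded <- mean(rep(PriorConv, Freq))      # mean over all 810 inmates
```

All three agree with the matrix product `PriorConv %*% Probs`, which computes the same sum of products.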

Q.2f

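Calculate variance and standard deviation for prior convictions

The variance of the distribution weights each value by its probability: Var(X) = E[X^2] - (E[X])^2. Calling var() on the five possible values alone would ignore the frequencies and treat 0 through 4 as equally likely.

Code
Probs <- Freq / sum(Freq)
VarConv <- sum(PriorConv^2 * Probs) - sum(PriorConv * Probs)^2
VarConv
[1] 0.8562353
Code
sqrt(VarConv)
[1] 0.9253298

Double-check the variance against the full set of 810 individual observations

Code
AllConv <- rep(PriorConv, Freq)
all.equal(VarConv, mean((AllConv - mean(AllConv))^2))
[1] TRUE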