Homework 1

hw1

descriptive statistics

probability

The first homework on descriptive statistics and probability

Author

Karen Detter

Published

October 3, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Q.1a

##Read in data from Excel file

Code

library(readxl)
LungCapData <- read_excel('_data/LungCapData.xls')

Plot histogram with probability density on the y axis

Code

hist(LungCapData$LungCap, freq = FALSE)

The histogram suggests that the distribution is close to a normal distribution - most of the observations are close to the mean, with very few close to the margins (0 and 15).

Q.1b

Create boxplots separated by gender

Code

boxplot(LungCap ~ Gender, data = LungCapData, horizontal = TRUE)

The boxplots show that male lung capacity has a wider range than that of females; however, the minimum, median, and maximum values are all higher than those of females. This implies that, as a group, men are likely to have higher lung capacity than women.

Q.1c

Group by smoking status and summarize mean lung capacities

Code

library(dplyr)
LungCapData %>%
group_by(Smoke) %>%
summarize(mean = mean(LungCap), n = n())

# A tibble: 2 × 3
  Smoke  mean     n
  <chr> <dbl> <int>
1 no     7.77   648
2 yes    8.65    77

In this dataset, the mean lung capacity of smokers is actually higher than that of non-smokers. Since this is counter to what would be expected, there is likely another variable exerting a confounding effect on lung capacity.

Q.1d

Create new data frame with age group category variables

Code

LungCapData_AgeGroups <- LungCapData %>%
mutate(AgeGroup = case_when(Age <= 13 ~ "less than or equal to 13", 
            Age == 14 | Age == 15 ~ "14 to 15",
            Age == 16 | Age == 17 ~ "16 to 17",
            Age >= 18 ~ "greater than or equal to 18"))

Summarize mean lung capacities by age group and smoking status

Code

LungCapData_AgeGroups %>%
group_by(AgeGroup, Smoke) %>%
summarize(MeanLungCap = mean(LungCap), n = n())

`summarise()` has grouped output by 'AgeGroup'. You can override using the
`.groups` argument.

# A tibble: 8 × 4
# Groups:   AgeGroup [4]
  AgeGroup                    Smoke MeanLungCap     n
  <chr>                       <chr>       <dbl> <int>
1 14 to 15                    no           9.14   105
2 14 to 15                    yes          8.39    15
3 16 to 17                    no          10.5     77
4 16 to 17                    yes          9.38    20
5 greater than or equal to 18 no          11.1     65
6 greater than or equal to 18 yes         10.5     15
7 less than or equal to 13    no           6.36   401
8 less than or equal to 13    yes          7.20    27

Q.1e

When lung capacity data is further broken down by age group, the lung capacities of smokers and non-smokers appear to be more in line with expectations. The one exception is the 13 and under age category - here, mean lung capacity is actually higher for smokers. This anomaly could be due to the fact that the number of observations is significantly higher for this age group than any of the others, likely resulting in a wider range of lung capacities. Also, this age category, which includes ages 3 through 13, covers a broader scope of ages than any of the other categories, likely producing the paradox of a smaller number of smokers exhibiting higher lung capacities than their cohorts simply because they are older.

Q.1f

Calculate correlation and covariance between lung capacity and age

Code

cor(LungCapData$LungCap, LungCapData$Age)

[1] 0.8196749

Code

cov(LungCapData$LungCap, LungCapData$Age)

[1] 8.738289

Since the correlation coefficient is close to 1, there is a high degree of correlation between lung capacity and age. The covariance of 8.7, being a positive number, indicates that as age increases, lung capacity increases.

Q.2a

Create data frame

Code

PriorConv <- c(0,1,2,3,4)
Freq <- c(128,434,160,64,24)
PrisonerData <- data.frame (PriorConv, Freq)
PrisonerData

  PriorConv Freq
1         0  128
2         1  434
3         2  160
4         3   64
5         4   24

Calculate probability that an inmate has == 2 prior convictions

probability = frequency/n

Code

160/810

[1] 0.1975309

Q.2b

Calculate probability that an inmate has < 2 prior convictions

probability = frequency(0)/n + frequency(1)/n

Code

(128/810) + (434/810)

[1] 0.6938272

Q.2c

Calculate probability that an inmate has <= 2 prior convictions

probability = frequency(0)/n + frequency(1)/n + frequency(2)/n

Code

(128/810) + (434/810) + (160/810)

[1] 0.891358

Q.2d

Calculate probability that an inmate has > 2 prior convictions

probability = frequency(3)/n + frequency(4)/n

Code

(64/810) + (24/810)

[1] 0.108642

Q.2e

Calculate expected value for number of prior convictions

Create a matrix of prior conviction values and their probabilities

Code

PriorConv <- c(0,1,2,3,4)
Probs <- c(0.1580247, 0.5358025, 0.1975309, 0.07901235, 0.02962963)

Calculate expected value

Code

c(PriorConv %*% Probs)

[1] 1.28642

Q.2f

Calculate variance and standard deviation for prior convictions

Code

var(PriorConv)

[1] 2.5

Code

sd(PriorConv)

[1] 1.581139

Double-check values

Code

sqrt(var(PriorConv)) == sd(PriorConv)

[1] TRUE

--- title: "Homework 1" author: "Karen Detter" description: "The first homework on descriptive statistics and probability" date: "10/03/2022" format: html: toc: true code-fold: true code-copy: true code-tools: true categories: - hw1 - descriptive statistics - probability --- ```{r} #| label: setup #| warning: false library(tidyverse) knitr::opts_chunk$set(echo = TRUE) ``` # Q.1a ##Read in data from Excel file ```{r} library(readxl) LungCapData <- read_excel('_data/LungCapData.xls') ``` ## Plot histogram with probability density on the y axis ```{r} hist(LungCapData$LungCap, freq = FALSE) ``` The histogram suggests that the distribution is close to a normal distribution - most of the observations are close to the mean, with very few close to the margins (0 and 15). # Q.1b ## Create boxplots separated by gender ```{r} boxplot(LungCap ~ Gender, data = LungCapData, horizontal = TRUE) ``` The boxplots show that male lung capacity has a wider range than that of females; however, the minimum, median, and maximum values are all higher than those of females. This implies that, as a group, men are likely to have higher lung capacity than women. # Q.1c ## Group by smoking status and summarize mean lung capacities ```{r} library(dplyr) LungCapData %>% group_by(Smoke) %>% summarize(mean = mean(LungCap), n = n()) ``` In this dataset, the mean lung capacity of smokers is actually higher than that of non-smokers. Since this is counter to what would be expected, there is likely another variable exerting a confounding effect on lung capacity. # Q.1d ## Create new data frame with age group category variables ```{r} LungCapData_AgeGroups <- LungCapData %>% mutate(AgeGroup = case_when(Age <= 13 ~ "less than or equal to 13", Age == 14 | Age == 15 ~ "14 to 15", Age == 16 | Age == 17 ~ "16 to 17", Age >= 18 ~ "greater than or equal to 18")) ``` ## Summarize mean lung capacities by age group and smoking status ```{r} LungCapData_AgeGroups %>% group_by(AgeGroup, Smoke) %>% summarize(MeanLungCap = mean(LungCap), n = n()) ``` # Q.1e When lung capacity data is further broken down by age group, the lung capacities of smokers and non-smokers appear to be more in line with expectations. The one exception is the 13 and under age category - here, mean lung capacity is actually higher for smokers. This anomaly could be due to the fact that the number of observations is significantly higher for this age group than any of the others, likely resulting in a wider range of lung capacities. Also, this age category, which includes ages 3 through 13, covers a broader scope of ages than any of the other categories, likely producing the paradox of a smaller number of smokers exhibiting higher lung capacities than their cohorts simply because they are older. # Q.1f ## Calculate correlation and covariance between lung capacity and age ```{r} cor(LungCapData$LungCap, LungCapData$Age) cov(LungCapData$LungCap, LungCapData$Age) ``` Since the correlation coefficient is close to 1, there is a high degree of correlation between lung capacity and age. The covariance of 8.7, being a positive number, indicates that as age increases, lung capacity increases. # Q.2a ## Create data frame ```{r} PriorConv <- c(0,1,2,3,4) Freq <- c(128,434,160,64,24) PrisonerData <- data.frame (PriorConv, Freq) PrisonerData ``` ## Calculate probability that an inmate has == 2 prior convictions probability = frequency/n ```{r} 160/810 ``` # Q.2b ## Calculate probability that an inmate has < 2 prior convictions probability = frequency(0)/n + frequency(1)/n ```{r} (128/810) + (434/810) ``` # Q.2c ## Calculate probability that an inmate has <= 2 prior convictions probability = frequency(0)/n + frequency(1)/n + frequency(2)/n ```{r} (128/810) + (434/810) + (160/810) ``` # Q.2d ## Calculate probability that an inmate has > 2 prior convictions probability = frequency(3)/n + frequency(4)/n ```{r} (64/810) + (24/810) ``` # Q.2e ## Calculate expected value for number of prior convictions ## Create a matrix of prior conviction values and their probabilities ```{r} PriorConv <- c(0,1,2,3,4) Probs <- c(0.1580247, 0.5358025, 0.1975309, 0.07901235, 0.02962963) ``` ## Calculate expected value ```{r} c(PriorConv %*% Probs) ``` # Q.2f ## Calculate variance and standard deviation for prior convictions ```{r} var(PriorConv) sd(PriorConv) ``` ## Double-check values ```{r} sqrt(var(PriorConv)) == sd(PriorConv) ```