Homework 1 - Donny Snyder

hw1

desriptive statistics

probability

The first homework on descriptive statistics and probability

Author

Donny Snyder

Published

October 1, 2022

Question 1

a

First, let’s read in the data from the Excel file:

Code

library(readxl)
df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

Code

hist(df$LungCap)

The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).

##b

Code

library(ggplot2)
ggplot(df, aes(x = Gender, y = LungCap)) + geom_boxplot()

The probability distribution suggests that the lung capacity of males tends to be higher.

##c

Code

aggregate(data = df, LungCap~Smoke, mean)

  Smoke  LungCap
1    no 7.770188
2   yes 8.645455

The mean lung capacity of smokers vs nonsmokers appears to be higher for smokers. This doesn’t really make sense because I’ve been taught to think smokers tend to have reduced lung capacity.

##d and e

Code

x=1
df$AgeGroup <- rep(c("NA"),times=725)
while(x <= 725){
  if(df$Age[x] <= 13){
    df$AgeGroup[x] = "less than or equal to 13"
  }
  else if((df$Age[x] >= 14)&&(df$Age[x] <= 15)){
    df$AgeGroup[x] = "14 to 15"
  }
  else if((df$Age[x] >= 16)&&(df$Age[x] <= 17)){
    df$AgeGroup[x] = "16 to 17"
  }
  else if(df$Age[x] >= 18){
    df$AgeGroup[x] = "greater than 18"
  }
x = x + 1
}
aggregate(data = df, LungCap~AgeGroup+Smoke, mean)

                  AgeGroup Smoke   LungCap
1                 14 to 15    no  9.138810
2                 16 to 17    no 10.469805
3          greater than 18    no 11.068846
4 less than or equal to 13    no  6.358746
5                 14 to 15   yes  8.391667
6                 16 to 17   yes  9.383750
7          greater than 18   yes 10.513333
8 less than or equal to 13   yes  7.201852

Code

aggregate(data = df,LungCap~AgeGroup+Smoke,length)

                  AgeGroup Smoke LungCap
1                 14 to 15    no     105
2                 16 to 17    no      77
3          greater than 18    no      65
4 less than or equal to 13    no     401
5                 14 to 15   yes      15
6                 16 to 17   yes      20
7          greater than 18   yes      15
8 less than or equal to 13   yes      27

Code

aggregate(data = df,Age~Smoke,mean)

  Smoke      Age
1    no 12.03549
2   yes 14.77922

It seems like people tend to have a lung capacity that increases with age. However, nonsmokers have a higher lung capacity for each age break down besides less than or equal to 13. It seems like smokers just might tend to be older. I confirmed this by looking at the length and mean ages per group, where you can see a majority of smokers are older, whereas non smokers tend to be younger. The mean age for smokers also tends to be older.

##f

Code

cor(x= df$LungCap, y = df$Age)

[1] 0.8196749

Code

cov(x= df$LungCap, y = df$Age)

[1] 8.738289

Lung capacity appears to be quite correlated with age. This means that Lung capacity tends to go up as age goes up, and vice versa. This is confirmed also by the covariance.

#Question 2

##a

Code

print((160/810) * 100)

[1] 19.75309

The probability is 19.75309% that a randomly selected inmate has exactly 2 prior convictions.

##b

Code

print(((434+128)/810) * 100)

[1] 69.38272

The probability is 69.38272% that a randomly selected inmate has fewer than 2 prior convictions.

##c

Code

print(((160+434+128)/810) * 100)

[1] 89.1358

The probability is 89.1358% that a randomly selected inmate has 2 or fewer prior convictions.

##d

Code

print(((64+24)/810) * 100)

[1] 10.8642

The probability is 10.8642% that a randomly selected inmate has more than 2 prior convictions.

##e

Code

newDf <- NA
newDf[1:128] <- 0
newDf[129:562] <- 1
newDf[563:722] <- 2
newDf[723:786] <- 3
newDf[787:810] <- 4
newDf <- as.data.frame(newDf)
mean(newDf$newDf)

[1] 1.28642

The expected value, known as the “mean” when it deals in data that are not probability distributions, is 1.28642. Because I created a vector here, I took the mean, though I also could have calculated the expected value by multiplying the probabilities by the numbers. They are both the same value in this case.

##f

Code

sd(newDf$newDf)

[1] 0.9259016

Code

var(newDf$newDf)

[1] 0.8572937

The variance of prior convictions is 0.8572937, the standard deviation of prior convictions is 0.9259016.