hw1
descriptive statistics
probability
The first homework on descriptive statistics and probability
Author

Emma Rasmussen

Published

September 29, 2022

Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

library(readxl)
library(tidyverse)
library(ggplot2)
library(dplyr)
Code
lungcap<-read_excel("_data/LungCapData.xls")
head(lungcap)
# A tibble: 6 × 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
1    6.48     6   62.1 no    male   no       
2   10.1     18   74.7 yes   female no       
3    9.55    16   69.7 no    female yes      
4   11.1     14   71   no    male   no       
5    4.8      5   56.9 no    male   no       
6    6.22    11   58.7 no    female no       
Code
#saving a copy of original dataset
lungcap_orig<-lungcap

#checking for missing values in LungCap
which(is.na(lungcap$LungCap))
integer(0)

1a.

The distribution of LungCap is plotted as a histogram below.

Code
ggplot(lungcap, aes(x=LungCap))+geom_histogram()

The histogram suggests that LungCap is approximately normally distributed.
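
A histogram's shape depends on the bin width, so a quick numeric check can back up the visual impression. The sketch below is only an illustration: `sim` is simulated stand-in data (not the LungCap values) and `skewness()` is a hypothetical hand-written helper. For roughly normal data, the mean and median nearly coincide and the skewness is close to 0.

```r
# Simulated stand-in data (assumption): NOT the LungCap values
set.seed(1)
sim <- rnorm(725, mean = 8, sd = 2.5)

# Sample skewness as the mean cubed z-score (hypothetical helper)
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

# Small values for both quantities are consistent with a symmetric shape
mean_median_gap <- abs(mean(sim) - median(sim))
sim_skew <- skewness(sim)

mean_median_gap
sim_skew
```

With the real data, the same two checks could be run on `lungcap$LungCap`, and `qqnorm()` followed by `qqline()` would give a visual companion to the histogram.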

1b.

The probability distribution of LungCap data for males and females is compared using the boxplots below:

Code
ggplot(lungcap, aes(x=LungCap, y=Gender))+geom_boxplot()

The median lung capacity of males appears slightly higher than that of females (a boxplot shows the median, not the mean). The IQR and range are similar for males and females, so the two distributions have a similar spread, with a higher center for males.
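
A boxplot draws the median and quartiles rather than the mean, so it can help to compute those quantities directly. The following is a minimal base-R sketch on a made-up data frame `toy` (an assumption for illustration; the values are not from LungCapData.xls).

```r
# Toy data (assumption): illustrative values only, not the real sample
toy <- data.frame(
  LungCap = c(6.5, 7.0, 8.2, 9.1, 10.4, 5.9, 6.8, 7.5, 8.0, 9.3),
  Gender  = rep(c("male", "female"), each = 5)
)

# Median and IQR per gender: the quantities a boxplot displays
medians <- tapply(toy$LungCap, toy$Gender, median)
iqrs    <- tapply(toy$LungCap, toy$Gender, IQR)

# The mean is computed alongside for comparison with the median
means <- tapply(toy$LungCap, toy$Gender, mean)

data.frame(median = medians, IQR = iqrs, mean = means)
```

Applied to the real data, the same calls would take `lungcap$LungCap` and `lungcap$Gender` as arguments.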

1c.

Below, the mean and standard deviation of lung capacity for smokers and non-smokers are compared. The two groups are also plotted as boxplots to help visualize the distributions.

Code
lungcap%>%
  group_by(Smoke) %>% 
  summarize(Mean=mean(LungCap))
# A tibble: 2 × 2
  Smoke  Mean
  <chr> <dbl>
1 no     7.77
2 yes    8.65
Code
lungcap%>%
  group_by(Smoke) %>% 
  summarize(stdev=sd(LungCap))
# A tibble: 2 × 2
  Smoke stdev
  <chr> <dbl>
1 no     2.73
2 yes    1.88
Code
ggplot(lungcap, aes(x=LungCap, y=Smoke))+geom_boxplot()

The mean lung capacity for smokers in this sample (8.645) is higher than that of non-smokers (7.770), which is counterintuitive. However, the standard deviation for non-smokers (2.726) is much higher than for smokers (1.883), so something else may be going on (see the boxplot).

1d.

Below, mean lung capacity is computed for smokers and non-smokers within each age group. I also created a new age category variable (“AgeCat”) to plot the data by smoking status and age category.

Code
#Mean 13 and under, nonsmoker
lungcap %>% 
  filter(Age<=13 & Smoke=="no") %>% 
  pull(LungCap) %>% 
  mean()
[1] 6.358746
Code
#Mean 13 and under, smoker
lungcap %>% 
  filter(Age<=13 & Smoke=="yes") %>% 
  pull(LungCap) %>% 
  mean()
[1] 7.201852
Code
#Mean 14-15 and nonsmoker
#& binds tighter than |, so Age==14 | Age==15 & Smoke=="no" would keep
#every 14-year-old regardless of smoking status; %in% groups the age test.
#The values printed for the 14-15 and 16-17 groups reflect the ungrouped form.
lungcap %>% 
  filter(Age %in% c(14, 15) & Smoke=="no") %>% 
  pull(LungCap) %>% 
  mean()
[1] 9.068018
Code
#Mean 14-15 and smoker
lungcap %>% 
  filter(Age %in% c(14, 15) & Smoke=="yes") %>% 
  pull(LungCap) %>% 
  mean()
[1] 8.689231
Code
#Mean 16-17 and nonsmoker
lungcap %>% 
  filter(Age %in% c(16, 17) & Smoke=="no") %>% 
  pull(LungCap) %>% 
  mean()
[1] 10.30523
Code
#Mean 16-17 and smoker
lungcap %>% 
  filter(Age %in% c(16, 17) & Smoke=="yes") %>% 
  pull(LungCap) %>% 
  mean()
[1] 9.850385
Code
#Mean 18 and over, nonsmoker
lungcap %>% 
  filter(Age>=18 & Smoke=="no") %>% 
  pull(LungCap) %>% 
  mean()
[1] 11.06885
Code
#Mean 18 and over, smoker
lungcap %>% 
  filter(Age>=18 & Smoke=="yes") %>% 
  pull(LungCap) %>% 
  mean()
[1] 10.51333
Code
#creating new variable AgeCat to create boxplots
lungcap<-lungcap %>% 
  mutate(AgeCat= as.factor(case_when(Age <= 13 ~ "13 and under", 
                           Age == 14 |Age ==15 ~ "14-15", 
                           Age == 16 | Age==17 ~ "16-17",
                           Age >= 18 ~ "18 or over"
                           )))

#new Category AgeCat is the last column
lungcap
# A tibble: 725 × 7
   LungCap   Age Height Smoke Gender Caesarean AgeCat      
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>     <fct>       
 1    6.48     6   62.1 no    male   no        13 and under
 2   10.1     18   74.7 yes   female no        18 or over  
 3    9.55    16   69.7 no    female yes       16-17       
 4   11.1     14   71   no    male   no        14-15       
 5    4.8      5   56.9 no    male   no        13 and under
 6    6.22    11   58.7 no    female no        13 and under
 7    4.95     8   63.3 no    male   yes       13 and under
 8    7.32    11   70.4 no    male   no        13 and under
 9    8.88    15   70.5 no    male   no        14-15       
10    6.8     11   59.2 no    male   no        13 and under
# … with 715 more rows
Code
ggplot(lungcap, aes(x=LungCap))+geom_boxplot()+facet_grid(Smoke ~ AgeCat)
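
The per-group means can also be produced in a single step by grouping on both variables at once. The base-R sketch below uses a made-up data frame `toy` (an assumption; these rows are not from the real sample) and builds the age bins with `cut()` instead of `case_when()`. It also illustrates that `&` is evaluated before `|` in R, so an age test written with `|` needs parentheses or `%in%` when it is combined with another condition.

```r
# Toy data (assumption): made-up rows, not the real sample
toy <- data.frame(
  LungCap = c(6.0, 7.0, 9.0, 8.5, 10.5, 10.0, 11.0, 10.6),
  Age     = c(10, 12, 14, 15, 16, 17, 18, 19),
  Smoke   = c("no", "yes", "no", "yes", "no", "yes", "no", "yes")
)

# Same four bins as AgeCat; cut() intervals are closed on the right,
# so the breaks give (-Inf,13], (13,15], (15,17], (17,Inf)
toy$AgeCat <- cut(toy$Age,
                  breaks = c(-Inf, 13, 15, 17, Inf),
                  labels = c("13 and under", "14-15", "16-17", "18 or over"))

# One mean per AgeCat x Smoke cell
cell_means <- aggregate(LungCap ~ AgeCat + Smoke, data = toy, FUN = mean)
cell_means

# Operator precedence: the first form parses as
# Age == 14 | (Age == 15 & Smoke == "yes")
ungrouped <- toy$Age == 14 | toy$Age == 15 & toy$Smoke == "yes"
grouped   <- toy$Age %in% c(14, 15) & toy$Smoke == "yes"

sum(ungrouped)  # counts the 14-year-old non-smoker as well
sum(grouped)    # counts only smokers aged 14 or 15
```

In a dplyr pipeline, the equivalent grouping would be `group_by(AgeCat, Smoke)` followed by `summarize()`.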

1e.

Comparing the lung capacities for smokers and non-smokers in different age categories:

Now we can see that, within each age group, the mean lung capacity for smokers is generally lower than that of non-smokers. The exception is the 13-and-under category, likely because smokers in that category tend to be older than the non-smokers (a 12-year-old is more likely to smoke than a 6-year-old, and a 12-year-old has a larger lung capacity than a 6-year-old regardless of smoking status).

This explains the first calculation of the mean by smoking status (before splitting by age category). Smokers in this sample are generally older than non-smokers (the oldest participant is 19; see the code below), and older participants have larger lungs, so the means that ignore age make it look as though smokers have a higher average lung capacity.

Code
#checking how old participants in the sample are
lungcap %>% 
  summarize(range(Age))
# A tibble: 2 × 1
  `range(Age)`
         <dbl>
1            3
2           19
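
The reversal described above, where an overall comparison flips once age is held fixed, is commonly called Simpson's paradox. The base-R sketch below reproduces the pattern on a deliberately constructed data frame `toy` (an assumption; the numbers are invented to make the effect obvious and are not from the sample).

```r
# Constructed data (assumption): most smokers are in the older group
toy <- data.frame(
  AgeGroup = rep(c("younger", "older"), each = 5),
  Smoke    = c("no", "no", "no", "no", "yes",
               "no", "yes", "yes", "yes", "yes"),
  LungCap  = c(6, 6, 6, 6, 5.5,
               11, 10.5, 10.5, 10.5, 10.5)
)

# Ignoring age: smokers appear to have the higher mean
overall <- tapply(toy$LungCap, toy$Smoke, mean)
overall

# Within each age group: smokers have the lower mean
within <- tapply(toy$LungCap, list(toy$AgeGroup, toy$Smoke), mean)
within
```

The overall means differ mainly because the two smoking groups have different age mixes, which is the same mechanism proposed for the LungCap data.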

1f.

Calculating the correlation and covariance between Lung Capacity and Age:

Code
#Extracting Age and Lung Capacity from df (lungcap) as vectors for cov() and cor()
x<-lungcap$Age
y<-lungcap$LungCap


#Calculating covariance
cov(x, y)
[1] 8.738289
Code
#calculating correlation
cor(x, y)
[1] 0.8196749

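The covariance (8.738) is positive, meaning that lung capacity tends to increase with age (i.e. age and lung capacity co-vary). Its magnitude depends on the units of both variables, so it does not indicate the strength of the relationship on its own. The correlation (0.82) is unit-free and fairly close to 1, indicating a strong positive linear relationship.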
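
The link between the two measures is that correlation is covariance divided by the product of the two standard deviations. The sketch below checks this on two short made-up vectors (`age` and `cap` are hypothetical values, not the sample) and shows that changing the units rescales the covariance but leaves the correlation unchanged.

```r
# Hypothetical values (assumption): not taken from the LungCap sample
age <- c(5, 8, 11, 14, 17)
cap <- c(4.8, 6.1, 7.9, 9.6, 10.9)

# Correlation is covariance standardized by both standard deviations
r_manual <- cov(age, cap) / (sd(age) * sd(cap))
r_builtin <- cor(age, cap)

# Measuring age in months multiplies the covariance by 12 ...
cov_years  <- cov(age, cap)
cov_months <- cov(age * 12, cap)

# ... but the correlation is unaffected by the change of units
r_months <- cor(age * 12, cap)

c(r_manual = r_manual, r_builtin = r_builtin, r_months = r_months)
c(cov_years = cov_years, cov_months = cov_months)
```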

2a-f.

Prior Conviction Data

Code
#creating a data frame
X<-c(0, 1, 2, 3, 4)
Frequency<-c(128, 434, 160, 64, 24)
prison<- data.frame(X, Frequency)
prison
  X Frequency
1 0       128
2 1       434
3 2       160
4 3        64
5 4        24
Code
prison<-rename(prison, PriorConvictions=X)
prison
  PriorConvictions Frequency
1                0       128
2                1       434
3                2       160
4                3        64
5                4        24
Code
#visualizing df using bar chart
ggplot(prison, aes(x=PriorConvictions, y=Frequency))+geom_bar(stat="identity")+geom_text(aes(label = Frequency), vjust = -.3)

Code
#There are 810 obs in df
sum(Frequency)
[1] 810

Answering the Questions

Code
#creating a vector of probabilities
probs<-Frequency/810
probs
[1] 0.15802469 0.53580247 0.19753086 0.07901235 0.02962963
Code
#A
# P(x=2)=160/810
160/810
[1] 0.1975309
Code
#B
#P(x<2)=P(0)+P(1)
(128+434)/810
[1] 0.6938272
Code
#C
#P(x<=2)=P(0)+P(1)+P(2)
(128+434+160)/810
[1] 0.891358
Code
#D
#P(x>2)=1-P(x<=2)
1-((128+434+160)/810)
[1] 0.108642
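
Parts A to D can also be read off the probability vector and its running total, which avoids retyping the counts. This is a sketch that redefines `X` and `Frequency` with the values from the table above so it runs on its own; the names `cdf`, `p_exactly_2`, `p_fewer_2`, `p_at_most_2` and `p_more_2` are illustrative.

```r
# Values copied from the prior-convictions table
X <- c(0, 1, 2, 3, 4)
Frequency <- c(128, 434, 160, 64, 24)

# Probability of each value and the cumulative distribution P(x <= k)
probs <- Frequency / sum(Frequency)
cdf <- cumsum(probs)
names(cdf) <- X

p_exactly_2 <- probs[X == 2]      # A: P(x = 2)
p_fewer_2   <- sum(probs[X < 2])  # B: P(x < 2)
p_at_most_2 <- cdf[["2"]]         # C: P(x <= 2)
p_more_2    <- 1 - p_at_most_2    # D: P(x > 2)

c(A = p_exactly_2, B = p_fewer_2, C = p_at_most_2, D = p_more_2)
```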
Code
#E
#Expected value=sum of each value (0, 1, 2, 3 or 4) times its probability
weighted.mean(X, probs)
[1] 1.28642
Code
#F
#Calculating the sample variance (n-1 denominator) from the frequency table,
#using the computed mean instead of a hard-coded value
sum(Frequency*(X-weighted.mean(X, probs))^2)/(sum(Frequency)-1)
[1] 0.8572937
Code
#Calculating the sample standard deviation from the variance
sqrt(0.8572937)
[1] 0.9259016
  1. What is the probability that a randomly selected inmate has exactly 2 prior convictions? 0.1975 (19.75%)
  2. What is the probability that a randomly selected inmate has fewer than 2 prior convictions? 0.6938 (69.38%)
  3. What is the probability that a randomly selected inmate has 2 or fewer prior convictions? 0.8914 (89.14%)
  4. What is the probability that a randomly selected inmate has more than 2 prior convictions? 0.1086 (10.86%)
  5. What is the expected value for the number of prior convictions? 1.28642 prior convictions
  6. Calculate the variance and the standard deviation for the Prior Convictions. Sample variance: 0.8572937; sample standard deviation: 0.9259016 prior convictions
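
The variance above uses the sample formula, with n - 1 in the denominator. If the table is instead treated as the full probability distribution, the variance is the probability-weighted squared deviation, which is slightly smaller. The sketch below computes both from the same counts; `var_pop`, `var_sample` and `expanded` are illustrative names, and which definition the exercise intends is an assumption left to the reader.

```r
# Values copied from the prior-convictions table
X <- c(0, 1, 2, 3, 4)
Frequency <- c(128, 434, 160, 64, 24)
n <- sum(Frequency)
probs <- Frequency / n

# Expected value: each value weighted by its probability
mu <- sum(X * probs)

# Variance of the distribution itself (divides by n)
var_pop <- sum(probs * (X - mu)^2)

# Sample variance (divides by n - 1), as used in part F
var_sample <- sum(Frequency * (X - mu)^2) / (n - 1)

# Cross-check: expand the table into 810 individual observations
expanded <- rep(X, times = Frequency)

c(mean = mu, var_pop = var_pop, var_sample = var_sample,
  sd_pop = sqrt(var_pop), sd_sample = sqrt(var_sample))
```

With 810 observations the two versions differ only in the third decimal place, so the choice has little practical effect here.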