The first homework on descriptive statistics and probability
Ethan Campbell
September 21, 2022
Question 1
First, let’s read in the data from the Excel file:
The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).
# A tibble: 6 x 6
LungCap Age Height Smoke Gender Caesarean
<dbl> <dbl> <dbl> <chr> <chr> <chr>
1 6.48 6 62.1 no male no
2 10.1 18 74.7 yes female no
3 9.55 16 69.7 no female yes
4 11.1 14 71 no male no
5 4.8 5 56.9 no male no
6 6.22 11 58.7 no female no
(Comparing lung cap by gender)
Here we notice that males tend to have a higher lung capacity than females. The female average sits around 8, while the male average sits closer to 9.
(smoker vs non-smoker lung cap)
Interestingly, non-smokers have a lower mean lung capacity (7.77) than smokers (8.65). This does not make sense at first glance and goes against my expectation. I believe it may be due to age.
# A tibble: 2 x 2
Smoke mean
<chr> <dbl>
1 no 7.77
2 yes 8.65
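The table above is a grouped mean. A minimal sketch of how such a summary could be produced follows, assuming dplyr and using only the six rows printed earlier as stand-in data, so the numbers will not match the full data set. The name `lung` and the extra `mean_age` column are illustrative. Reporting mean age next to mean lung capacity is one way to check whether age explains the difference.

```r
library(dplyr)

# Stand-in data: only the six rows printed earlier, not the full data set.
lung <- tibble(
  LungCap = c(6.48, 10.1, 9.55, 11.1, 4.8, 6.22),
  Age     = c(6, 18, 16, 14, 5, 11),
  Smoke   = c("no", "yes", "no", "no", "no", "no")
)

# Mean lung capacity, mean age, and group size per smoking status.
smoke_means <- lung %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap), mean_age = mean(Age), n = n())

smoke_means
```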
(relation between smoking and lung cap at different age groups)
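Mean lung capacity is 9.63 for the first group, dips to 9.05 for ages 14 to 15, then rises to 10.25 for ages 16 to 17 and 11.26 for ages over 18. Note that the first filter is Age >= 13, so that group also contains every older participant (its mean age is 15.6), which is why it starts higher than the 14 to 15 group. I believe the trend is that lung capacity rises with age, at least up to a certain point.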
# lung cap is 9.62
df %>% select(Age, LungCap) %>% filter(Age >= 13) %>% colMeans()
Age LungCap
15.609290 9.628757
# lung cap is 9.04
df %>% select(Age, LungCap) %>% filter(Age >= 14 & Age <= 15) %>% colMeans()
Age LungCap
14.533333 9.045417
# lung cap is 10.24
df %>% select(Age, LungCap) %>% filter(Age >= 16 & Age <= 17) %>% colMeans()
Age LungCap
16.44330 10.24588
# lung cap is 11.26
df %>% select(Age, LungCap) %>% filter(Age > 18) %>% colMeans()
Age LungCap
19.00000 11.26149
(lung cap for smokers and non smokers broken into age groups)
Within age groups, smokers tend to have a lower lung capacity than non-smokers, although in the one group printed below the two means are equal to three significant figures (11.3).
# A tibble: 2 x 2
Smoke mean
<chr> <dbl>
1 no 11.3
2 yes 11.3
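The age-group comparisons can also be made in one pass by binning age and grouping on both variables. The sketch below assumes dplyr and again uses the six printed rows as stand-in data. The cut points (13 and under, 14 to 15, 16 to 17, 18 and over) are an assumption and differ from the filters used above, which start at Age >= 13 and end at Age > 18. The names `lung`, `AgeGroup`, and `by_age_smoke` are illustrative.

```r
library(dplyr)

# Stand-in data: the six rows printed at the top, not the full data set.
lung <- tibble(
  LungCap = c(6.48, 10.1, 9.55, 11.1, 4.8, 6.22),
  Age     = c(6, 18, 16, 14, 5, 11),
  Smoke   = c("no", "yes", "no", "no", "no", "no")
)

# Assumed, non-overlapping age bins. cut() intervals are right-closed by
# default: (-Inf, 13], (13, 15], (15, 17], (17, Inf).
by_age_smoke <- lung %>%
  mutate(AgeGroup = cut(Age,
                        breaks = c(-Inf, 13, 15, 17, Inf),
                        labels = c("13 and under", "14-15", "16-17", "18 and over"))) %>%
  group_by(AgeGroup, Smoke) %>%
  summarise(mean = mean(LungCap), n = n(), .groups = "drop")

by_age_smoke
```

With the full data set, each age group would show one row for smokers and one for non-smokers, which makes the within-group comparison direct.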
(correlation and covariance between lung capacity and age)
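The correlation is 0.82, which indicates a strong positive linear relationship: as age goes up, lung capacity tends to go up as well. Correlation is a unit-free coefficient between -1 and 1, not a percentage. The covariance (8.74) is also positive, but its size depends on the units of the two variables.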
cov(df$LungCap, df$Age)
[1] 8.738289
cor(df$LungCap, df$Age)
[1] 0.8196749
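The two statistics are linked: correlation is covariance divided by the product of the two standard deviations. A small sketch checks this with base R, using the six printed rows as stand-in data, so the values differ from the full-data results above. The variable names are illustrative.

```r
# Stand-in data: the six rows printed at the top, not the full data set.
lung_cap <- c(6.48, 10.1, 9.55, 11.1, 4.8, 6.22)
age      <- c(6, 18, 16, 14, 5, 11)

# Correlation is covariance rescaled by both standard deviations,
# so it is unit-free and always lies between -1 and 1.
r_manual <- cov(lung_cap, age) / (sd(lung_cap) * sd(age))

# TRUE if the manual value matches cor() up to floating-point tolerance.
all.equal(r_manual, cor(lung_cap, age))
```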
Question 2
# creating the tibble
df <- tibble(X = c(0, 1, 2, 3, 4), Freq = c(128, 434, 160, 64, 26))
# creating the probability of an event occurring
df1 <- df %>% select(X, Freq) %>% mutate(Probability = Freq / sum(Freq))
df1
# A tibble: 5 x 3
X Freq Probability
<dbl> <dbl> <dbl>
1 0 128 0.158
2 1 434 0.534
3 2 160 0.197
4 3 64 0.0788
5 4 26 0.0320
Probability of exactly 2 convictions: 160/812 = 19.7%
# A tibble: 5 x 4
X Freq Probability expected_value
<dbl> <dbl> <dbl> <dbl>
1 0 128 0.158 1.29
2 1 434 0.534 1.29
3 2 160 0.197 1.29
4 3 64 0.0788 1.29
5 4 26 0.0320 1.29
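The expected value in the last column can be checked by hand as the probability-weighted sum of X, using the frequencies from the table (they total 812):

```latex
E[X] = \sum_{x=0}^{4} x \, P(X = x)
     = \frac{0 \cdot 128 + 1 \cdot 434 + 2 \cdot 160 + 3 \cdot 64 + 4 \cdot 26}{812}
     = \frac{1050}{812} \approx 1.29
```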
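What are the variance and standard deviation of the prior convictions? The two values printed below (25810.8 and 160.6574) are the sample variance and standard deviation of the Freq column, not of the number of prior convictions X. Weighting each value of X by its probability gives a variance of about 0.87 and a standard deviation of about 0.93.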
[1] 25810.8
[1] 160.6574
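Treating the table as a probability distribution, the variance of X is the probability-weighted mean squared deviation from the expected value. The following is a minimal sketch of that calculation, assuming dplyr. The names `convictions`, `variance`, `std_dev`, and `var_of_freq` are illustrative. It also computes `var()` of the Freq column for comparison, which reproduces the 25810.8 printed above.

```r
library(dplyr)

# Frequency table copied from the document.
convictions <- tibble(X = c(0, 1, 2, 3, 4), Freq = c(128, 434, 160, 64, 26)) %>%
  mutate(Probability = Freq / sum(Freq))

# E[X] and Var(X) = E[(X - E[X])^2], both weighted by probability.
expected_value <- sum(convictions$X * convictions$Probability)
variance       <- sum((convictions$X - expected_value)^2 * convictions$Probability)
std_dev        <- sqrt(variance)

# For comparison only: the variance of the Freq column, which describes
# the spread of the counts rather than the spread of X.
var_of_freq <- var(convictions$Freq)

c(expected_value = expected_value, variance = variance, std_dev = std_dev)
```

This treats the 812 observations as the whole population. Dividing by 811 instead would give a slightly larger sample variance.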
