Homework 1

hw1

descriptive statistics

probability

DACSS_603_Homework 1 on Descriptive Statistics and Probability

Author

Saisrinivas Ambatipudi

Published

February 27, 2023

Code

library(tidyverse)
library(readxl)
library(ggplot2)
library(stats)

knitr::opts_chunk$set(echo = TRUE)

Question 1

Reading data

Code

Lc <- read_excel("C:/UMass/DACSS_603/603_Spring_2023/posts/_data/LungCapData.xls")
Lc

The data consists of 725 rows and 6 columns. It records the lung capacity of each subject along with their age, height, and other characteristics. The main classification that I can see is whether they smoke or not.

1a

The distribution of LungCap looks as follows:

Code

Lc %>%
  ggplot(aes(LungCap, after_stat(density))) +
  geom_histogram(bins = 25, color = "orange") +
  geom_density(color = "darkblue") +
  theme_classic() +
  labs(title = "Probability distribution of LungCap", x = "Lung Capacity", y = "Probability density")

The histogram and density curve show that LungCap is approximately normally distributed, with most observations clustered around the mean.

1b

The distribution of LungCap by gender looks as follows:

Code

Lc %>%
  ggplot(aes(x = Gender, y = LungCap, color = Gender)) +
  geom_boxplot() +
  theme_classic() +
  labs(title = "Distribution of LungCap by gender", x = "Gender", y = "Lung Capacity")

The box plot shows that males tend to have a somewhat higher lung capacity than females.

1c

Comparison of mean lung capacities between smokers and non-smokers:

Code

Mean_smoke <- Lc %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap))
Mean_smoke

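From the above table, we see that the mean lung capacity of those who smoke is greater than that of those who don’t smoke, which is counterintuitive. Lung capacity also depends on other factors, such as the age and biological characteristics of the people who smoke, so we can’t conclude from this comparison alone that smoking increases lung capacity.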

1d

Relationship between smoking and lung capacity within the given age categories:

Code

Lc <- mutate(Lc, AgeGrp = case_when(Age <= 13 ~ "less than or equal to 13",
                                    Age == 14 | Age == 15 ~ "14 to 15",
                                    Age == 16 | Age == 17 ~ "16 to 17",
                                    Age >= 18 ~ "greater than or equal to 18"))

Lc %>%
  ggplot(aes(y = LungCap, color = Smoke)) +
  geom_histogram(bins = 25) +
  facet_wrap(vars(AgeGrp)) +
  theme_classic() + 
  labs(title = "Relationship of LungCap and Smoke based on age categories", y = "Lung Capacity", x = "Frequency")

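From the above plot, we can draw two observations:
1. Within the age groups, the lung capacity of non-smokers is generally higher than that of smokers.
2. Very few people in the “less than or equal to 13” age group smoke.
Since smokers are concentrated in the older age groups, and lung capacity increases with age (see 1f), this helps explain why the overall mean in 1c is higher for smokers.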
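One way to see how age can distort the overall comparison in 1c is to compute the mean lung capacity both overall and within each age group. The following is only a minimal sketch on made-up numbers (the `toy` data frame and its values are hypothetical and are not taken from LungCapData); it shows how smokers can have the higher overall mean while having the lower mean inside every age group.

```r
# Made-up illustration data (NOT from LungCapData)
toy <- data.frame(
  AgeGrp  = c(rep("13 or under", 4), rep("16 to 17", 4)),
  Smoke   = c("no", "no", "no", "yes", "no", "yes", "yes", "yes"),
  LungCap = c(6.0, 7.0, 6.5, 6.0, 11.0, 10.0, 10.5, 9.5)
)

# Overall means by smoking status (age is ignored here)
overall <- tapply(toy$LungCap, toy$Smoke, mean)

# Means by smoking status within each age group
by_group <- tapply(toy$LungCap, list(toy$AgeGrp, toy$Smoke), mean)

overall
by_group
```

On the real data, the same comparison could be made by grouping on both AgeGrp and Smoke before summarising.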

1e

Relationship between smoking and lung capacity by age:

Code

Lc %>%
  ggplot(aes(x = Age, y = LungCap, color = Smoke)) +
  geom_line() +
  theme_classic() + 
  facet_wrap(vars(Smoke)) +
  labs(title = "Relationship of LungCap and Smoke based on age", y = "Lung Capacity", x = "Age")

Comparing this plot with 1d, we can say the results are pretty similar. Only people aged 10 and above smoke.

1f

Calculating the correlation and covariance between Lung Capacity and Age:

Code

Covariance <- cov(Lc$LungCap, Lc$Age)
Correlation <- cor(Lc$LungCap, Lc$Age)
Covariance

[1] 8.738289

Code

Correlation

[1] 0.8196749

The covariance is positive, which indicates that age and lung capacity move in the same direction. The correlation is also positive and, at about 0.82, fairly close to 1, which indicates a strong positive linear relationship. From these results we can say that as age increases, lung capacity tends to increase as well. Note that this describes an association, not strict proportionality.
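The link between the two measures is that correlation is the covariance rescaled by the two standard deviations, which is why it always lies between -1 and 1 while covariance depends on the units. A small sketch of that relationship follows; the `age` and `lung_cap` vectors are hypothetical toy values, not the homework data.

```r
# Hypothetical toy values, NOT taken from LungCapData
age      <- c(6, 9, 12, 15, 18)
lung_cap <- c(3.1, 5.4, 7.9, 9.2, 11.0)

covariance  <- cov(lung_cap, age)
correlation <- cor(lung_cap, age)

# Pearson correlation = covariance / (sd of x * sd of y)
manual_correlation <- covariance / (sd(lung_cap) * sd(age))

all.equal(correlation, manual_correlation)
```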

Question 2

Reading the table

Code

Prior_convitions <- 0:4
Inmate_count <- c(128, 434, 160, 64, 24)
# tibble() replaces the deprecated data_frame()
Pc <- tibble(Prior_convitions, Inmate_count)

Code

Pc

Code

Pc <- mutate(Pc, Probability = Inmate_count/sum(Inmate_count))
Pc

2a

Probability that a randomly selected inmate has exactly 2 prior convictions:

Code

Pc %>%
  filter(Prior_convitions == 2) %>%
  select(Probability)

2b

Probability that a randomly selected inmate has fewer than 2 convictions:

Code

temp <- Pc %>%
  filter(Prior_convitions < 2)
sum(temp$Probability)

[1] 0.6938272

2c

Probability that a randomly selected inmate has 2 or fewer prior convictions:

Code

temp <- Pc %>%
  filter(Prior_convitions <= 2)
sum(temp$Probability)

[1] 0.891358

2d

Probability that a randomly selected inmate has more than 2 prior convictions:

Code

temp <- Pc %>%
  filter(Prior_convitions > 2)
sum(temp$Probability)

[1] 0.108642
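The probabilities in 2b to 2d can also be read off the cumulative distribution, with 2d following from the complement rule P(X > 2) = 1 - P(X <= 2). A minimal base R sketch, using the counts from the table above (the variable names here are illustrative, not the ones used earlier), should reproduce the three values reported:

```r
# Counts taken from the table in Question 2
prior_convictions <- 0:4
inmate_count <- c(128, 434, 160, 64, 24)

# Probability mass function and cumulative distribution
pmf <- inmate_count / sum(inmate_count)
cdf <- cumsum(pmf)
names(cdf) <- prior_convictions

p_fewer_than_2 <- unname(cdf["1"])   # P(X < 2) = P(X <= 1)
p_2_or_fewer   <- unname(cdf["2"])   # P(X <= 2)
p_more_than_2  <- 1 - p_2_or_fewer   # complement rule: P(X > 2)

round(c(p_fewer_than_2, p_2_or_fewer, p_more_than_2), 7)
```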

2e

Expected value for the number of prior convictions:

Code

Pc <- mutate(Pc, Wm = Prior_convitions*Probability)
e <- sum(Pc$Wm)
e

[1] 1.28642
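As a cross-check, the expected value is simply the mean of the conviction counts weighted by the number of inmates, so base R's `weighted.mean()` should give the same result. A short sketch with illustrative variable names:

```r
# Counts taken from the table in Question 2
prior_convictions <- 0:4
inmate_count <- c(128, 434, 160, 64, 24)

# E[X] = sum(x * count) / sum(count)
expected_value <- weighted.mean(prior_convictions, w = inmate_count)
expected_value
```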

2f

Variance of the number of prior convictions:

Code

v <- sum(((Pc$Prior_convitions - e)^2) * Pc$Probability)
v

[1] 0.8562353

Standard deviation of the number of prior convictions:

Code

sqrt(v)

[1] 0.9253298
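The variance above is the variance of the probability distribution, E[(X - E[X])^2], which is not what `var()` computes on a sample (that divides by n - 1). It can be verified with the equivalent shortcut E[X^2] - (E[X])^2. A minimal sketch, again with illustrative variable names and the counts from the table:

```r
# Counts taken from the table in Question 2
prior_convictions <- 0:4
inmate_count <- c(128, 434, 160, 64, 24)
p <- inmate_count / sum(inmate_count)

mu <- sum(prior_convictions * p)                      # E[X]
var_definition <- sum((prior_convictions - mu)^2 * p) # E[(X - mu)^2]
var_shortcut   <- sum(prior_convictions^2 * p) - mu^2 # E[X^2] - mu^2

all.equal(var_definition, var_shortcut)
sqrt(var_shortcut)                                    # standard deviation
```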