hw1
desriptive statistics
probability
Descriptive Statistics and Probability functions
Author

Rahul Gundeti

Published

October 2, 2022

Code
library(tidyverse)
library(readxl)
library(ggplot2)
library(stats)

knitr::opts_chunk$set(echo = TRUE)

Question 1

Reading data

Code
lung <- read_excel("C:/Users/gunde/Downloads/LungCapData.xls")
lung

The Lung Capacity data contains 725 rows and 6 columns that determine age, height etc., The key classification parameter is based on smoker vs non-smoker.

1_A

The distribution of LungCap looks as follows:

Code
lung %>%
  ggplot(aes(LungCap, ..density..)) +
  geom_histogram(bins= 40, color = "red") +
  geom_density(color = "green") +
  theme_classic() + 
  labs(title = "LungCap Probability Distribution", x = "Lung Capcity", y = "Probability Density")

The observations plotted by histogram are closer to mean which suggests that it is a normal distribution.

1_B

The distribution of LungCap on basis of gender looks as follows:

Code
lung %>%
  ggplot(aes(y = dnorm(LungCap), color = Gender)) +
  geom_boxplot() +
  theme_classic() + 
  labs(title = "LungCap Probability Distribution based on gender", y = "Probability Density")

The box plot shows that the probability density of the male < female.

1_C

Comparison of mean lung capacities between smokers and non-smokers:

Code
Mean_smoke <- lung %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap))
Mean_smoke

The table contains the mean lung capacity. The observations suggest that the mean value is higher for smokers than non-smokers. This isn’t entirely correct as the individual biological factors plays a main role. So the data is inadequate to form an opinion.

1_D

Relationship between Smoke and Lung capacity on basis of given age categories:

Code
lung <- mutate(lung, AgeGrp = case_when(Age <= 13 ~ "less than or equal to 13",
                                    Age == 14 | Age == 15 ~ "14 to 15",
                                    Age == 16 | Age == 17 ~ "16 to 17",
                                    Age >= 18 ~ "greater than or equal to 18"))

lung %>%
  ggplot(aes(y = LungCap, color = Smoke)) +
  geom_histogram(bins = 40) +
  facet_wrap(vars(AgeGrp)) +
  theme_classic() + 
  labs(title = "Relationship of LungCap and Smoke based on age categories", y = "Lung Capacity", x = "Frequency")

From the above plot, we can derive two important observations: 1. The lung capacity of non-smokers is more than smokers. 2. The people who smoke are less in age group of “less than or equal to 13”. So as the result as age increases the lung capacity decreases.

1_E

Relationship between Smoke and Lung capacity on basis of age:

Code
lung %>%
  ggplot(aes(x = Age, y = LungCap, color = Smoke)) +
  geom_line() +
  theme_classic() + 
  facet_wrap(vars(Smoke)) +
  labs(title = "Relationship of LungCap and Smoke based on age", y = "Lung Capacity", x = "Age")

Comparing 1_D and 1_E we can find similarity which points that only 10 and above age group smoke.

1_F

Calculating the correlation and covariance between Lung Capacity and Age:

Code
Covariance <- cov(lung$LungCap, lung$Age)
Correlation <- cor(lung$LungCap, lung$Age)
Covariance
[1] 8.738289
Code
Correlation
[1] 0.8196749

The comparison shows that the covariance is positive, indicating that lung capacity and age have a direct relationship. As a result, they are moving in the same direction due to the positive correlation as well. This means that as age increases, lung capacity increases as well, which means they are directly proportional.

Question 2

Reading the table

Code
Prior_convitions <- c(0:4)
Inmate_count <- c(128, 434, 160, 64, 24)
prior <- data_frame(Prior_convitions, Inmate_count)
Warning: `data_frame()` was deprecated in tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Code
prior
Code
prior <- mutate(prior, Probability = Inmate_count/sum(Inmate_count))
prior

2_A

Probability that a randomly selected inmate has exactly 2 prior convictions:

Code
prior %>%
  filter(Prior_convitions == 2) %>%
  select(Probability)

2_B

Probability that a randomly selected inmate has fewer than 2 convictions:

Code
random <- prior %>%
  filter(Prior_convitions < 2)
sum(random$Probability)
[1] 0.6938272

2_C

Probability that a randomly selected inmate has 2 or fewer prior convictions:

Code
random <- prior %>%
  filter(Prior_convitions <= 2)
sum(random$Probability)
[1] 0.891358

2_D

Probability that a randomly selected inmate has more than 2 prior convictions:

Code
random <- prior %>%
  filter(Prior_convitions > 2)
sum(random$Probability)
[1] 0.108642

2_E

Expected value for the number of prior convictions:

Code
prior <- mutate(prior, Wm = Prior_convitions*Probability)
ev <- sum(prior$Wm)
ev
[1] 1.28642

2_F

Variance for the Prior Convictions:

Code
variance <-sum(((prior$Prior_convitions-ev)^2)*prior$Probability)
variance
[1] 0.8562353

standard deviation for the Prior Convictions:

Code
sqrt(variance)
[1] 0.9253298