hw1
descriptive statistics
probability
Homework 1 on descriptive statistics and probability
Author

Niharika Pola

Published

October 2, 2022


Code
library(tidyverse)
library(readxl)
library(ggplot2)
library(stats)

knitr::opts_chunk$set(echo = TRUE)

Question 1

Reading data

Code
Lc <- read_excel("LungCapData.xls")
Error: `path` does not exist: 'LungCapData.xls'
Code
Lc
Error in eval(expr, envir, enclos): object 'Lc' not found

The data consists of 725 rows and 6 columns. It records the lung capacity of each individual along with their age, height, and other characteristics. The main classification I can see is whether or not they smoke.

1a

The distribution of LungCap looks as follows:

Code
Lc %>%
  ggplot(aes(LungCap, ..density..)) +
  geom_histogram(bins= 25, color = "orange") +
  geom_density(color = "darkblue") +
  theme_classic() + 
  labs(title = "Probability distribution of LungCap", x = "Lung Capacity", y = "Probability density")
Error in ggplot(., aes(LungCap, ..density..)): object 'Lc' not found

The histogram and density curve show that LungCap is close to normally distributed, with most of the observations concentrated near the mean.

1b

The distribution of LungCap on basis of gender looks as follows:

Code
Lc %>%
  ggplot(aes(y = dnorm(LungCap), color = Gender)) +
  geom_boxplot() +
  theme_classic() + 
  labs(title = "Probability distribution of LungCap based on gender", y = "Probability density")
Error in ggplot(., aes(y = dnorm(LungCap), color = Gender)): object 'Lc' not found

The box plot shows that the plotted density values, dnorm(LungCap), are lower for males than for females.
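A note on reading this plot: dnorm(LungCap) evaluates the standard normal density (mean 0, standard deviation 1) at each observation, which is not the same as the density of LungCap itself. Because that curve falls as positive inputs grow, a larger lung capacity maps to a smaller plotted value. The sketch below illustrates this with two made-up values that are not taken from the data set.

```r
# Hypothetical lung-capacity values, chosen only for illustration
small_cap <- 6
large_cap <- 9

# dnorm() with default arguments is the N(0, 1) density
d_small <- dnorm(small_cap)
d_large <- dnorm(large_cap)

# The larger capacity receives the smaller density value
d_large < d_small
```

If the goal is to compare lung capacities directly, one option would be to map x = Gender and y = LungCap in the box plot instead.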

1c

Comparison of mean lung capacities between smokers and non-smokers:

Code
Mean_smoke <- Lc %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap))
Error in group_by(., Smoke): object 'Lc' not found
Code
Mean_smoke
Error in eval(expr, envir, enclos): object 'Mean_smoke' not found

From the table above, the mean lung capacity of smokers is greater than that of non-smokers, which is counterintuitive. Lung capacity also depends on the biological characteristics of the people who smoke, so this comparison alone does not support a conclusion about the effect of smoking.

1d

Relationship between Smoke and Lung capacity on basis of given age categories:

Code
Lc <- mutate(Lc, AgeGrp = case_when(Age <= 13 ~ "less than or equal to 13",
                                    Age == 14 | Age == 15 ~ "14 to 15",
                                    Age == 16 | Age == 17 ~ "16 to 17",
                                    Age >= 18 ~ "greater than or equal to 18"))
Error in mutate(Lc, AgeGrp = case_when(Age <= 13 ~ "less than or equal to 13", : object 'Lc' not found
Code
Lc %>%
  ggplot(aes(y = LungCap, color = Smoke)) +
  geom_histogram(bins = 25) +
  facet_wrap(vars(AgeGrp)) +
  theme_classic() + 
  labs(title = "Relationship of LungCap and Smoke based on age categories", y = "Lung Capacity", x = "Frequency")
Error in ggplot(., aes(y = LungCap, color = Smoke)): object 'Lc' not found

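From the plot above, we can draw two observations:

1. Within the age groups, the lung capacity of non-smokers is greater than that of smokers.
2. There are few smokers in the “less than or equal to 13” age group.

So smokers are mostly found in the older age groups. Since lung capacity increases with age (see 1f), the higher overall mean for smokers in 1c reflects their age rather than their smoking.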
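The contrast between 1c and 1d, where smokers have the higher overall mean while non-smokers do better within age groups, can arise when the groups are unevenly spread across ages. The sketch below reproduces the effect on a small made-up data frame; the names toy, age_group, smoke, and lung_cap are hypothetical and the numbers are not from LungCapData.

```r
# Made-up toy data (NOT the LungCap data): few young smokers, many old smokers
toy <- data.frame(
  age_group = c("young", "young", "young", "old", "old", "old"),
  smoke     = c("no", "no", "yes", "no", "yes", "yes"),
  lung_cap  = c(5, 5, 4, 10, 9, 9)
)

# Overall mean lung capacity by smoking status
overall <- tapply(toy$lung_cap, toy$smoke, mean)

# Mean lung capacity by smoking status within each age group
within <- tapply(toy$lung_cap, list(toy$age_group, toy$smoke), mean)

overall
within
```

In this toy example smokers have the higher overall mean, yet non-smokers have the higher mean in both age groups, because most smokers sit in the older group.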

1e

Relationship between Smoke and Lung capacity on basis of age:

Code
Lc %>%
  ggplot(aes(x = Age, y = LungCap, color = Smoke)) +
  geom_line() +
  theme_classic() + 
  facet_wrap(vars(Smoke)) +
  labs(title = "Relationship of LungCap and Smoke based on age", y = "Lung Capacity", x = "Age")
Error in ggplot(., aes(x = Age, y = LungCap, color = Smoke)): object 'Lc' not found

From the plot above, the results of 1e are similar to those of 1d. Only people aged 10 and above smoke.

1f

Calculating the correlation and covariance between Lung Capacity and Age:

Code
Covariance <- cov(Lc$LungCap, Lc$Age)
Error in is.data.frame(y): object 'Lc' not found
Code
Correlation <- cor(Lc$LungCap, Lc$Age)
Error in is.data.frame(y): object 'Lc' not found
Code
Covariance
Error in eval(expr, envir, enclos): object 'Covariance' not found
Code
Correlation
Error in eval(expr, envir, enclos): object 'Correlation' not found

The covariance is positive, which indicates a direct relationship between age and lung capacity. The correlation is also positive, so the two variables move in the same direction: as age increases, lung capacity tends to increase as well.
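Covariance and correlation always share the same sign because the correlation is the covariance rescaled by the two standard deviations. A minimal sketch of that relationship follows, using made-up vectors (age and lung_cap here are hypothetical values, not the columns of Lc):

```r
# Hypothetical values for illustration only (not the LungCap data)
age      <- c(8, 10, 12, 14, 16, 18)
lung_cap <- c(4.1, 5.6, 6.9, 8.2, 9.0, 10.4)

covariance  <- cov(lung_cap, age)
correlation <- cor(lung_cap, age)

# Correlation is the covariance divided by both standard deviations
rescaled <- covariance / (sd(lung_cap) * sd(age))

all.equal(correlation, rescaled)
```

The covariance depends on the units of both variables, while the correlation is unit-free and lies between -1 and 1, which makes it easier to interpret as a measure of strength.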

Question 2

Reading the table

Code
Prior_convitions <- c(0:4)
Inmate_count <- c(128, 434, 160, 64, 24)
Pc <- tibble(Prior_convitions, Inmate_count)
Code
Pc
Code
Pc <- mutate(Pc, Probability = Inmate_count/sum(Inmate_count))
Pc

2a

Probability that a randomly selected inmate has exactly 2 prior convictions:

Code
Pc %>%
  filter(Prior_convitions == 2) %>%
  select(Probability)
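As a cross-check, the same probability can be computed directly from the counts in base R. This is only a sketch; the names inmate_count and p_two are introduced here for illustration.

```r
# Counts from the table above, labelled by number of prior convictions (0 to 4)
inmate_count <- c(128, 434, 160, 64, 24)
names(inmate_count) <- 0:4

# P(X = 2) = inmates with exactly 2 prior convictions / total inmates
p_two <- inmate_count[["2"]] / sum(inmate_count)
p_two
```

This works out to 160/810, or roughly 0.1975.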

2b

Probability that a randomly selected inmate has fewer than 2 convictions:

Code
temp <- Pc %>%
  filter(Prior_convitions < 2)
sum(temp$Probability)
[1] 0.6938272

2c

Probability that a randomly selected inmate has 2 or fewer prior convictions:

Code
temp <- Pc %>%
  filter(Prior_convitions <= 2)
sum(temp$Probability)
[1] 0.891358

2d

Probability that a randomly selected inmate has more than 2 prior convictions:

Code
temp <- Pc %>%
  filter(Prior_convitions > 2)
sum(temp$Probability)
[1] 0.108642
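The answers to 2c and 2d are complements: every inmate has either at most 2 or more than 2 prior convictions, so the two probabilities must sum to 1. A short base R sketch of that check follows, with variable names chosen here for illustration:

```r
# Counts from the table above for 0 to 4 prior convictions
prior        <- 0:4
inmate_count <- c(128, 434, 160, 64, 24)
prob         <- inmate_count / sum(inmate_count)

p_at_most_2   <- sum(prob[prior <= 2])
p_more_than_2 <- sum(prob[prior > 2])

# Complement rule: P(X > 2) = 1 - P(X <= 2)
all.equal(p_more_than_2, 1 - p_at_most_2)
```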

2e

Expected value for the number of prior convictions:

Code
Pc <- mutate(Pc, Wm = Prior_convitions*Probability)
e <- sum(Pc$Wm)
e
[1] 1.28642
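The expected value is a weighted mean of the conviction counts, with the inmate counts as weights. As an alternative sketch, base R's weighted.mean() gives the same number without building the probability column first (the variable names below are illustrative):

```r
# Counts from the table above for 0 to 4 prior convictions
prior        <- 0:4
inmate_count <- c(128, 434, 160, 64, 24)

# weighted.mean() normalises the weights, so raw counts can be used directly
expected <- weighted.mean(prior, inmate_count)
expected
```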

2f

Variance of the number of prior convictions:

Code
v <- sum(((Pc$Prior_convitions - e)^2) * Pc$Probability)
v
[1] 0.8562353

Standard deviation of the number of prior convictions:

Code
sqrt(v)
[1] 0.9253298
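The variance can also be checked with the shortcut formula Var(X) = E[X^2] - (E[X])^2. The sketch below recomputes both quantities from the counts; the variable names are illustrative and not part of the original code.

```r
# Counts from the table above for 0 to 4 prior convictions
prior        <- 0:4
inmate_count <- c(128, 434, 160, 64, 24)
prob         <- inmate_count / sum(inmate_count)

mean_x  <- sum(prior * prob)    # E[X]
mean_x2 <- sum(prior^2 * prob)  # E[X^2]

# Shortcut formula for the variance of a discrete distribution
var_shortcut <- mean_x2 - mean_x^2
sd_shortcut  <- sqrt(var_shortcut)

var_shortcut
sd_shortcut
```

Note that calling var() on the five conviction values would not give this result: it ignores the probabilities and uses the sample formula with an n - 1 denominator.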