hw1
descriptive statistics
probability
HW1 course blog qmd file
Author

Rahul Somu

Published

February 27, 2023

Question 1

a

First, let’s read in the data from the Excel file:

Code
library(readxl)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(ggplot2)
getwd()
[1] "/Users/rahulsomu/Documents/DACSS_601/603_repo/posts"
Code
df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

1a) The histogram suggests that LungCap is approximately normally distributed. Most observations cluster around the mean, and very few fall near the extremes (0 and 15).

Code
hist(df$LungCap)
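
As a rough numeric companion to the histogram, one could check what share of observations falls within one and two standard deviations of the mean; for a normal distribution these shares are about 68% and 95%. The sketch below runs on simulated data rather than LungCap, and `share_within_sd` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: share of values within k standard deviations of the mean
share_within_sd <- function(v, k = 1) {
  v <- v[!is.na(v)]
  mean(abs(v - mean(v)) <= k * stats::sd(v))
}

# Simulated stand-in for a roughly normal variable (NOT the LungCap data)
set.seed(42)
sim <- rnorm(1000, mean = 8, sd = 2.5)

share_1sd <- share_within_sd(sim)     # near 0.68 if roughly normal
share_2sd <- share_within_sd(sim, 2)  # near 0.95 if roughly normal
cat("Within 1 SD:", round(share_1sd, 2), " Within 2 SD:", round(share_2sd, 2), "\n")
```

On the real data the call would be `share_within_sd(df$LungCap)`; shares far from 68% and 95% would argue against the normal reading of the histogram.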

1b) The median lung capacity of males is greater than that of females.

Code
boxplot(LungCap ~ Gender, data = df, xlab = "Gender", ylab = "Lung Capacity",
        main = "Distribution of Lung Capacity by Gender")

1c) One would expect non-smokers to have a higher mean lung capacity than smokers, but in this data it is the other way round: smokers average 8.65 and non-smokers 7.77.

Code
# Calculate the mean lung capacity for smokers and non-smokers
mean_lungcap_smokers <- mean(df$LungCap[df$Smoke == "yes"])
mean_lungcap_non_smokers <- mean(df$LungCap[df$Smoke == "no"])

# Print the mean lung capacities
cat("Mean lung capacity for smokers:", round(mean_lungcap_smokers, 2), "\n")
Mean lung capacity for smokers: 8.65 
Code
cat("Mean lung capacity for non-smokers:", round(mean_lungcap_non_smokers, 2), "\n")
Mean lung capacity for non-smokers: 7.77 

1d) Mean lung capacity increases gradually with age for both groups, and it grows more for non-smokers (6.36 to 11.1) than for smokers (7.20 to 10.5). In every age group above 13, non-smokers have the higher mean.

1e) The discrepancy in 1c comes from the 13-or-younger age group, the only one where non-smokers have a lower mean lung capacity than smokers. There are also more data points for non-smokers aged 13 or younger, which pulls down the non-smoker mean for the whole sample.

Code
# Define the age groups
df <- df %>%
  mutate(age_groups = cut(Age, c(0, 13, 15, 17, Inf), labels = c("<= 13", "14-15", "16-17", ">= 18")))

# Compare the probability distribution of lung capacity by gender
df %>%
  ggplot(aes(x = Gender, y = LungCap)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Lung Capacity", 
       title = "Lung Capacity by Gender")

Code
# Compare the mean lung capacities for smokers and non-smokers
df %>%
  group_by(Smoke) %>%
  summarize(mean_lungcap = mean(LungCap)) %>%
  print()
# A tibble: 2 × 2
  Smoke mean_lungcap
  <chr>        <dbl>
1 no            7.77
2 yes           8.65
Code
# Examine the relationship between smoking and lung capacity within age groups
df %>%
  filter(Smoke %in% c("yes", "no")) %>%
  group_by(age_groups, Smoke) %>%
  summarize(mean_lungcap = mean(LungCap)) %>%
  print()
`summarise()` has grouped output by 'age_groups'. You can override using the
`.groups` argument.
# A tibble: 8 × 3
# Groups:   age_groups [4]
  age_groups Smoke mean_lungcap
  <fct>      <chr>        <dbl>
1 <= 13      no            6.36
2 <= 13      yes           7.20
3 14-15      no            9.14
4 14-15      yes           8.39
5 16-17      no           10.5 
6 16-17      yes           9.38
7 >= 18      no           11.1 
8 >= 18      yes          10.5 
Code
# Compare the lung capacities for smokers and non-smokers within each age group
df %>%
  filter(Smoke %in% c("yes", "no")) %>%
  ggplot(aes(x = age_groups, y = LungCap, fill = Smoke)) +
  geom_boxplot() +
  labs(x = "Age Group", y = "Lung Capacity", 
       title = "Lung Capacity by Smoking Status and Age Group")
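
The reversal described in 1e can be reproduced with a small weighted-mean calculation: an overall mean is the group means weighted by group size, so a large, low-capacity group of young non-smokers can drag the overall non-smoker mean below the smoker mean. The numbers below are toy values chosen for illustration, not the LungCap counts or means, and the column names are assumptions.

```r
# Toy summary table (hypothetical values, NOT taken from LungCapData.xls)
toy <- data.frame(
  age_group = c("young", "young", "old", "old"),
  smoke     = c("no", "yes", "no", "yes"),
  n         = c(90, 10, 10, 40),   # group sizes
  mean_cap  = c(6, 7, 11, 10)      # group means
)

# Overall mean per smoking status = group means weighted by group size
overall <- sapply(split(toy, toy$smoke),
                  function(g) weighted.mean(g$mean_cap, g$n))
print(overall)
```

In this toy table the older non-smokers have the higher mean (11 vs 10), yet the overall non-smoker mean is lower because 90 of the 100 non-smokers sit in the young, low-capacity group. On the real data, `df %>% count(age_groups, Smoke)` would show the actual group sizes.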

Question 2

2a) probability of having exactly 2 prior convictions: 0.1975

Code
# Define the values
x <- c(0, 1, 2, 3, 4)

p2 <- 160/810
p2
[1] 0.1975309

2b) probability of having fewer than 2 prior convictions: 0.6938272

Code
p_less2 <- (128 + 434) / 810
p_less2
[1] 0.6938272

2c) probability of having 2 or fewer prior convictions: 0.891358

Code
p_2less <- (128 + 434 + 160) / 810
p_2less
[1] 0.891358

2d) probability of having more than 2 prior convictions: 0.108642

Code
p_more2 <- (64 + 24) / 810
p_more2
[1] 0.108642

2e) expected value for the number of prior convictions: 1.28642

Code
ex <- sum(c(0, 1, 2, 3, 4) * c(128, 434, 160, 64, 24) / 810)
ex
[1] 1.28642
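
The answers to 2a through 2e can also be derived from a single probability vector instead of retyping the counts each time. This is a minimal sketch using the same counts as above; the variable names (`counts`, `pmf`, `cdf`, and the `p_*` results) are illustrative choices.

```r
# Number of prior convictions and the count observed for each value
x      <- c(0, 1, 2, 3, 4)
counts <- c(128, 434, 160, 64, 24)

# Probability mass function and cumulative distribution
pmf <- counts / sum(counts)
cdf <- cumsum(pmf)

p_exactly_2  <- pmf[x == 2]       # 2a: P(X = 2)
p_fewer_2    <- sum(pmf[x < 2])   # 2b: P(X < 2)
p_2_or_fewer <- cdf[x == 2]       # 2c: P(X <= 2)
p_more_2     <- 1 - cdf[x == 2]   # 2d: P(X > 2)
expected     <- sum(x * pmf)      # 2e: E[X]

round(c(p_exactly_2, p_fewer_2, p_2_or_fewer, p_more_2, expected), 4)
```

Working from `pmf` keeps every answer tied to the same counts, so a typo in one probability cannot slip in unnoticed.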

2f) Variance: 0.8562 & Standard deviation: 0.9253

Code
# Exact probabilities from the counts, and the expected value from 2e
x <- c(0, 1, 2, 3, 4)
p_x <- c(128, 434, 160, 64, 24) / 810
mu <- sum(x * p_x)

# Calculate variance and standard deviation
variance <- sum((x - mu)^2 * p_x)
std_dev <- sqrt(variance)

# Print results
cat("Variance:", round(variance, 4), "\n")
Variance: 0.8562 
Code
cat("Standard deviation:", round(std_dev, 4))
Standard deviation: 0.9253