Homework1

hw1

descriptive statistics

probability

HW1 course blog qmd file

Author

Rahul Somu

Published

February 27, 2023

Question 1

a

First, let’s read in the data from the Excel file:

Code

library(readxl)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

library(ggplot2)
getwd()

[1] "/Users/rahulsomu/Documents/DACSS_601/603_repo/posts"

Code

df <- read_excel("_data/LungCapData.xls")

The distribution of LungCap looks as follows:

1a) The histogram suggests that the distribution is close to a normal distribution. Most of the observations are close to the mean. Very few observations are close to the margins (0 and 15).

Code

hist(df$LungCap)
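The visual impression from the histogram can be backed up with a few numbers. The sketch below is one possible check, not part of the original analysis: `shape_summary` is a hypothetical helper that compares the mean with the median and computes a simple moment-based skewness. For a roughly symmetric, normal-looking distribution, the mean and median should be close and the skewness near zero.

```r
# Hypothetical helper (not from the original post): summarize the shape
# of a numeric vector with its mean, median, and moment-based skewness.
shape_summary <- function(values) {
  values <- values[!is.na(values)]          # drop missing values
  m <- mean(values)
  s <- sqrt(mean((values - m)^2))           # population standard deviation
  c(mean     = m,
    median   = median(values),
    skewness = mean((values - m)^3) / s^3)  # near 0 for symmetric data
}

# Small symmetric example vector (illustrative values only)
shape_summary(c(1, 2, 3, 4, 5))
```

With the data loaded as above, the same helper could be applied as `shape_summary(df$LungCap)`.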

1b) The median lung capacity of males is greater than that of females.

Code

boxplot(LungCap ~ Gender, data = df, xlab = "Gender", ylab = "Lung Capacity",
        main = "Distribution of Lung Capacity by Gender")

1c) Intuitively, the mean lung capacity of non-smokers should be higher than that of smokers, but in this data it is the other way round: smokers average 8.65 and non-smokers 7.77.

Code

# Calculate the mean lung capacity for smokers and non-smokers
mean_lungcap_smokers <- mean(df$LungCap[df$Smoke == "yes"])
mean_lungcap_non_smokers <- mean(df$LungCap[df$Smoke == "no"])

# Print the mean lung capacities
cat("Mean lung capacity for smokers:", round(mean_lungcap_smokers, 2), "\n")

Mean lung capacity for smokers: 8.65

Code

cat("Mean lung capacity for non-smokers:", round(mean_lungcap_non_smokers, 2), "\n")

Mean lung capacity for non-smokers: 7.77

1d) Lung capacity increases gradually with age in both groups, and the increase is larger for non-smokers (mean 6.36 at 13 or younger, 11.1 at 18 or older) than for smokers (7.20 to 10.5).

1e) The discrepancy in 1c comes from the 13-or-younger group, the only age group in which the mean lung capacity of non-smokers is lower than that of smokers. This group also contributes more non-smoker data points than the other age groups, so its low values pull the overall non-smoker mean down.

Code

# Define the age groups
df <- df %>%
  mutate(age_groups = cut(Age, c(0, 13, 15, 17, Inf), labels = c("<= 13", "14-15", "16-17", ">= 18")))

# Compare the probability distribution of lung capacity by gender
df %>%
  ggplot(aes(x = Gender, y = LungCap)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Lung Capacity", 
       title = "Lung Capacity by Gender")

Code

# Compare the mean lung capacities for smokers and non-smokers
df %>%
  group_by(Smoke) %>%
  summarize(mean_lungcap = mean(LungCap)) %>%
  print()

# A tibble: 2 × 2
  Smoke mean_lungcap
  <chr>        <dbl>
1 no            7.77
2 yes           8.65

Code

# Examine the relationship between smoking and lung capacity within age groups
df %>%
  filter(Smoke %in% c("yes", "no")) %>%
  group_by(age_groups, Smoke) %>%
  summarize(mean_lungcap = mean(LungCap)) %>%
  print()

`summarise()` has grouped output by 'age_groups'. You can override using the
`.groups` argument.

# A tibble: 8 × 3
# Groups:   age_groups [4]
  age_groups Smoke mean_lungcap
  <fct>      <chr>        <dbl>
1 <= 13      no            6.36
2 <= 13      yes           7.20
3 14-15      no            9.14
4 14-15      yes           8.39
5 16-17      no           10.5 
6 16-17      yes           9.38
7 >= 18      no           11.1 
8 >= 18      yes          10.5

Code

# Compare the lung capacities for smokers and non-smokers within each age group
df %>%
  filter(Smoke %in% c("yes", "no")) %>%
  ggplot(aes(x = age_groups, y = LungCap, fill = Smoke)) +
  geom_boxplot() +
  labs(x = "Age Group", y = "Lung Capacity", 
       title = "Lung Capacity by Smoking Status and Age Group")
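The reversal described in 1e, where an overall comparison points the opposite way from the within-group comparisons, can be reproduced with a small toy example. The sketch below uses made-up numbers (not values from LungCapData.xls) and base R only. Non-smokers have the higher mean in every age group, yet the lower mean overall, because most of them sit in the youngest group where lung capacities are small.

```r
# Toy data: hypothetical values chosen for illustration only,
# NOT taken from LungCapData.xls
toy <- data.frame(
  age_group = c(rep("<= 13", 8), rep(">= 14", 4)),
  smoke     = c(rep("no", 7), "yes", "no", rep("yes", 3)),
  lungcap   = c(rep(6.5, 7), 6, 11, rep(10, 3))
)

# Mean lung capacity within each age group, split by smoking status
group_means <- tapply(toy$lungcap, list(toy$age_group, toy$smoke), mean)

# Overall mean by smoking status, ignoring age
overall_means <- tapply(toy$lungcap, toy$smoke, mean)

# Number of observations behind each cell: the imbalance drives the reversal
cell_counts <- table(toy$age_group, toy$smoke)

print(group_means)
print(overall_means)
print(cell_counts)
```

On the real data, the analogous count table would be `table(df$age_groups, df$Smoke)`, which shows how the smokers and non-smokers are spread across the age groups.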

Question 2

2a) probability of having exactly 2 prior convictions: 0.1975

Code

# Define the values
x <- c(0, 1, 2, 3, 4)

p2 <- 160/810
p2

[1] 0.1975309

2b) probability of having fewer than 2 prior convictions: 0.6938272

Code

p_less2 <- (128 + 434) / 810
p_less2

[1] 0.6938272

2c) probability of having 2 or fewer prior convictions: 0.891358

Code

p_2less <- (128 + 434 + 160) / 810
p_2less

[1] 0.891358

2d) probability of having more than 2 prior convictions: 0.108642

Code

p_more2 <- (64 + 24) / 810
p_more2

[1] 0.108642

2e) expected value for the number of prior convictions: 1.28642

Code

ex <- sum(c(0, 1, 2, 3, 4) * c(128, 434, 160, 64, 24) / 810)
ex

[1] 1.28642
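Answers 2a to 2e all come from the same table of counts, so they can also be derived from a single probability mass function instead of retyping the counts each time. The following is a minimal sketch of that approach; the names `convictions`, `counts`, and `pmf` are illustrative and not from the original code.

```r
# Counts from the question: number of observations with 0, 1, 2, 3, 4
# prior convictions (810 in total)
convictions <- 0:4
counts <- c(128, 434, 160, 64, 24)

# Probability mass function: each count divided by the total
pmf <- counts / sum(counts)

# Sanity checks: the counts add up to 810 and the probabilities to 1
stopifnot(sum(counts) == 810, isTRUE(all.equal(sum(pmf), 1)))

p_exactly_2 <- pmf[convictions == 2]        # 2a
p_fewer_2   <- sum(pmf[convictions < 2])    # 2b
p_at_most_2 <- sum(pmf[convictions <= 2])   # 2c
p_more_2    <- 1 - p_at_most_2              # 2d, complement of 2c
expected    <- sum(convictions * pmf)       # 2e

round(c(p_exactly_2, p_fewer_2, p_at_most_2, p_more_2, expected), 4)
```

Computing 2d as the complement of 2c ties the two answers together, so they cannot drift apart if a count is mistyped.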

2f) Variance: 0.8562353 and standard deviation: 0.9253298

Code

# Probabilities from the observed counts (810 observations in total)
p_x <- c(128, 434, 160, 64, 24) / 810
# Mean from 2e, computed from the same probabilities rather than hard-coded
mu <- sum(x * p_x)

# Calculate variance and standard deviation
variance <- sum((x - mu)^2 * p_x)
sd <- sqrt(variance)

# Print results
cat("Variance:", variance, "\n")

Variance: 0.8562353

Code

cat("Standard deviation:", sd)

Standard deviation: 0.9253298
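As a cross-check on 2f, the variance can be recomputed directly from the counts in two ways: from the definition, and from the identity Var(X) = E[X^2] - (E[X])^2, with the mean derived from the same counts as in 2e. This is a minimal sketch; the names `convictions`, `counts`, `pmf`, and the `_check` suffixes are illustrative and not from the original code.

```r
# Counts for 0, 1, 2, 3, 4 prior convictions (810 observations in total)
convictions <- 0:4
counts <- c(128, 434, 160, 64, 24)
pmf <- counts / sum(counts)

# Mean, as in 2e
mu_check <- sum(convictions * pmf)

# Variance from the definition: E[(X - mu)^2]
var_definition <- sum((convictions - mu_check)^2 * pmf)

# Variance from the shortcut identity: E[X^2] - mu^2
var_shortcut <- sum(convictions^2 * pmf) - mu_check^2

# Both routes must agree up to floating-point error
stopifnot(isTRUE(all.equal(var_definition, var_shortcut)))

sd_check <- sqrt(var_definition)
cat("Mean:", mu_check, "\n")
cat("Variance:", var_definition, "\n")
cat("Standard deviation:", sd_check, "\n")
```

Deriving the mean and the probabilities from the counts, rather than typing rounded values, keeps 2e and 2f consistent with each other.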