Homework1 - EDA of LungCap Data

hw1
Adithya Parupudi
HW1 submission
Author

Adithya Parupudi

Published

February 23, 2023

Libraries

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2
Warning: package 'ggplot2' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(hrbrthemes)
NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
      Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
      if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
Code
library(viridis)
Warning: package 'viridis' was built under R version 4.2.2
Loading required package: viridisLite
Code
library(readxl)

Read data

Code
df <- read_excel("_data/LungCapData.xls")

Answering Questions

Question 1

1a) The distribution of LungCap looks as follows:

Code
hist(df$LungCap)

The histogram suggests that LungCap is approximately normally distributed. Most observations lie close to the mean, and very few fall near the ends of the range (around 0 and 15).
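A histogram alone is only a visual check. As a rough numerical complement, the sketch below computes the share of values within one and two standard deviations of the mean, which should be near 0.68 and 0.95 for normal data. The helper `share_within_sd()` is a hypothetical name introduced here, and the input is simulated data rather than the LungCap values.

```r
# Hypothetical helper: share of values within k standard deviations of the mean.
share_within_sd <- function(x, k = 1) {
  x <- x[!is.na(x)]
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Simulated stand-in for illustration only; this is NOT the LungCap data.
set.seed(1)
sim <- rnorm(1000, mean = 8, sd = 2.5)

share_within_sd(sim, 1)  # should be near 0.68 for normal data
share_within_sd(sim, 2)  # should be near 0.95 for normal data
```

On the real data the same check would be `share_within_sd(df$LungCap)`. A normal Q-Q plot via `qqnorm(df$LungCap)` followed by `qqline(df$LungCap)` is another common way to assess the same question.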

1b) Compare the probability distribution of LungCap for males and females. (Hint: make boxplots separated by gender using the boxplot() function.)

Code
# Boxplot of LungCap by Gender

df %>%
  ggplot(aes(x = Gender, y = LungCap, fill = Gender)) +
  geom_boxplot() +
  theme_ipsum() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 12)
  ) +
  ggtitle("Lungcap vs Gender") +
  xlab("Gender") +
  ylab("Lung Cap")
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
found in Windows font database


1c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code
mean_smoke <- df %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap))
mean_smoke
# A tibble: 2 × 2
  Smoke  mean
  <chr> <dbl>
1 no     7.77
2 yes    8.65

The mean lung capacity of smokers (8.65) is higher than that of non-smokers (7.77). Taken at face value this does not make sense, since smoking would be expected to lower lung capacity rather than raise it.

1d) Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code
df <- mutate(df, AgeGrp = case_when(Age <= 13 ~ "less than or equal to 13",
                                    Age == 14 | Age == 15 ~ "14 to 15",
                                    Age == 16 | Age == 17 ~ "16 to 17",
                                    Age >= 18 ~ "greater than or equal to 18"))

df %>%
  ggplot(aes(y = LungCap, color = Smoke)) +
  geom_histogram(bins = 25) +
  facet_wrap(vars(AgeGrp)) +
  labs(title = "Relationship of LungCap and Smoke based on Age", y = "Lung Capacity", x = "Frequency")
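Because the age-group labels are plain character strings, the facets will sort alphabetically ("14 to 15" first, "less than or equal to 13" last). One possible alternative, sketched here with base R's `cut()`, builds the groups as a factor whose levels follow age order. The `age` vector is hypothetical and stands in for `df$Age`.

```r
# Hypothetical ages standing in for df$Age.
age <- c(10, 13, 14, 15, 16, 17, 18, 19)

# cut() intervals are right-closed by default:
# (-Inf, 13], (13, 15], (15, 17], (17, Inf]
age_grp <- cut(
  age,
  breaks = c(-Inf, 13, 15, 17, Inf),
  labels = c("less than or equal to 13", "14 to 15", "16 to 17",
             "greater than or equal to 18")
)

levels(age_grp)  # levels keep the order given in `labels`
table(age_grp)   # counts per age group, in age order
```

If Age holds only whole numbers, this assigns the same groups as the `case_when()` call. The two differ for fractional ages: an age of 13.5 would fall in "14 to 15" here, whereas the equality tests in `case_when()` would leave it as NA.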

1e) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part c? What could possibly be going on here?

Code
df %>%
  ggplot(aes(x = Age, y = LungCap, color = Smoke)) +
  geom_line() +
  theme_classic() + 
  facet_wrap(vars(Smoke)) +
  labs(title = "Relationship of LungCap and Smoke v/s Age", y = "Lung Capacity", x = "Age")
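A direct way to compare smokers and non-smokers within each age group is a table of group means. The sketch below uses a small made-up data frame (`toy`, not the LungCap data) built so that smokers have the lower mean inside every age group but the higher mean overall. This is meant to illustrate how a result like the one in part c could arise if smokers are concentrated among older children, who have larger lungs; whether that holds for the real data would need to be checked by running the same summary on `df`.

```r
library(dplyr)

# Made-up values for illustration only; NOT taken from LungCapData.
toy <- data.frame(
  AgeGrp  = c(rep("13 or younger", 6), rep("18 or older", 6)),
  Smoke   = c("no", "no", "no", "no", "no", "yes",
              "no", "yes", "yes", "yes", "yes", "yes"),
  LungCap = c(6.0, 6.5, 7.0, 6.2, 6.8, 6.0,
              11.5, 10.5, 10.8, 11.0, 10.2, 10.6)
)

# Overall means, ignoring age (the comparison made in part c).
overall <- toy %>%
  group_by(Smoke) %>%
  summarise(mean_cap = mean(LungCap))

# Means within each age group, with group sizes.
by_age <- toy %>%
  group_by(AgeGrp, Smoke) %>%
  summarise(mean_cap = mean(LungCap), n = n(), .groups = "drop")

overall
by_age
```

In the toy data the overall comparison and the within-group comparison point in opposite directions, a pattern often called Simpson's paradox. Replacing `toy` with `df` would give the corresponding table for the homework data.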

Question 2

Reading the table

Code
Prior_convitions <- 0:4
Inmate_count <- c(128, 434, 160, 64, 24)
# tibble() replaces the deprecated data_frame()
Pc <- tibble(Prior_convitions, Inmate_count)
Code
Pc
# A tibble: 5 × 2
  Prior_convitions Inmate_count
             <int>        <dbl>
1                0          128
2                1          434
3                2          160
4                3           64
5                4           24
Code
Pc <- mutate(Pc, Probability = Inmate_count/sum(Inmate_count))
Pc
# A tibble: 5 × 3
  Prior_convitions Inmate_count Probability
             <int>        <dbl>       <dbl>
1                0          128      0.158 
2                1          434      0.536 
3                2          160      0.198 
4                3           64      0.0790
5                4           24      0.0296

2a - Probability that a randomly selected inmate has exactly 2 prior convictions:

Code
Pc %>%
  filter(Prior_convitions == 2) %>%
  select(Probability)
# A tibble: 1 × 1
  Probability
        <dbl>
1       0.198

2b - Probability that a randomly selected inmate has fewer than 2 convictions:

Code
temp <- Pc %>%
  filter(Prior_convitions < 2)
sum(temp$Probability)
[1] 0.6938272

2c - Probability that a randomly selected inmate has 2 or fewer prior convictions:

Code
temp <- Pc %>%
  filter(Prior_convitions <= 2)
sum(temp$Probability)
[1] 0.891358

2d - Probability that a randomly selected inmate has more than 2 prior convictions:

Code
temp <- Pc %>%
  filter(Prior_convitions > 2)
sum(temp$Probability)
[1] 0.108642
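Parts 2b to 2d are all statements about the cumulative distribution, so they can be cross-checked from a single `cumsum()` call. This is a minimal sketch that rebuilds the table from the counts above; the variable names (`prior`, `count`, `cdf`, and the `p_*` results) are introduced here for illustration.

```r
prior <- 0:4
count <- c(128, 434, 160, 64, 24)
prob  <- count / sum(count)

# Cumulative probabilities P(X <= x) for x = 0, ..., 4, named by x.
cdf <- cumsum(prob)
names(cdf) <- prior

p_fewer_than_2 <- cdf[["1"]]      # P(X < 2) = P(X <= 1)
p_at_most_2    <- cdf[["2"]]      # P(X <= 2)
p_more_than_2  <- 1 - cdf[["2"]]  # complement rule: P(X > 2) = 1 - P(X <= 2)

c(p_fewer_than_2, p_at_most_2, p_more_than_2)
```

The complement rule makes explicit why the answers to 2c and 2d sum to 1.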

2e - Expected value for the number of prior convictions:

Code
Pc <- mutate(Pc, Wm = Prior_convitions*Probability)
e <- sum(Pc$Wm)
e
[1] 1.28642
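The expected value is the probability-weighted sum of the outcomes, which is the same as the mean of the conviction counts weighted by the number of inmates: (0·128 + 1·434 + 2·160 + 3·64 + 4·24) / 810 = 1042 / 810. As a cross-check, the sketch below gets the same number from base R's `weighted.mean()`; the variable names are chosen here for illustration.

```r
prior <- 0:4
count <- c(128, 434, 160, 64, 24)

# Expected value as a probability-weighted sum.
ev_sum <- sum(prior * count / sum(count))

# Same quantity via weighted.mean(), using the raw counts as weights.
ev_wm <- weighted.mean(prior, w = count)

c(ev_sum, ev_wm)
```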

2f - Variance of the number of prior convictions:

Code
v <- sum(((Pc$Prior_convitions - e)^2) * Pc$Probability)
v
[1] 0.8562353

Standard deviation of the number of prior convictions:

Code
sqrt(v)
[1] 0.9253298
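The variance above treats the table as a probability distribution, so it is E[(X − E[X])²] with no n − 1 correction. The sketch below checks it against the shortcut formula E[X²] − (E[X])² and shows how it relates to R's `var()`, which computes the sample variance and therefore differs by a factor of n / (n − 1). Variable names are chosen here for illustration.

```r
prior <- 0:4
count <- c(128, 434, 160, 64, 24)
n     <- sum(count)
prob  <- count / n

ex  <- sum(prior * prob)    # E[X]
ex2 <- sum(prior^2 * prob)  # E[X^2]

v_def      <- sum((prior - ex)^2 * prob)  # definition, as used above
v_shortcut <- ex2 - ex^2                  # shortcut formula

# var() on the expanded data uses the n - 1 denominator.
v_sample <- var(rep(prior, times = count))

c(v_def, v_shortcut, v_sample * (n - 1) / n)  # all three agree
sqrt(v_def)                                   # standard deviation
```

With 810 inmates the two versions of the variance are close, but the distinction matters for small tables.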