Homework 1 - EDA of LungCap Data

hw1

Adithya Parupudi

HW1 submission

Author

Adithya Parupudi

Published

February 23, 2023

Libraries

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(hrbrthemes)

NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
      Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
      if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

Code

library(viridis)

Warning: package 'viridis' was built under R version 4.2.2

Loading required package: viridisLite

Code

library(readxl)

Read data

Code

df <- read_excel("_data/LungCapData.xls")

Answering Questions

Question 1

1a) The distribution of LungCap looks as follows:

Code

hist(df$LungCap)

The histogram suggests that LungCap is approximately normally distributed: most observations cluster around the mean, and very few fall near the extremes of the range (close to 0 or 15).
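A histogram alone can be hard to judge, so a few summary numbers can back up the "approximately normal" reading. The sketch below is only an illustration: `normality_summary()` is a hypothetical helper, not part of the original submission, and it is demonstrated on simulated values rather than on the LungCap data.

```r
# Hypothetical helper: summary statistics that complement a histogram
# when judging whether a numeric vector is roughly normal.
normality_summary <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)          # standardized values
  c(
    mean        = mean(x),
    median      = median(x),          # close to the mean if symmetric
    skewness    = mean(z^3),          # near 0 for a symmetric distribution
    within_1_sd = mean(abs(z) <= 1)   # about 0.68 for a normal distribution
  )
}

# Illustration on simulated data (NOT the LungCap values)
set.seed(1)
simulated <- rnorm(500, mean = 8, sd = 2.5)
normality_summary(simulated)

# With the homework data loaded as `df`, the call would be:
# normality_summary(df$LungCap)
```

A mean close to the median, skewness near 0, and roughly 68% of values within one standard deviation would all be consistent with the shape seen in the histogram.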

1b) Compare the probability distribution of LungCap for males and females. (Hint: make boxplots separated by gender using the boxplot() function.)

Code

# Boxplot of lung capacity by gender

df %>%
  ggplot(aes(x = Gender, y = LungCap, fill = Gender)) +
  geom_boxplot() +
  theme_ipsum() +
  theme(
    legend.position = "none",   # the x axis already identifies the groups
    plot.title = element_text(size = 12)
  ) +
  ggtitle("LungCap vs Gender") +
  xlab("Gender") +
  ylab("Lung Capacity")

Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
found in Windows font database

1c) Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code

mean_smoke <- df %>%
  group_by(Smoke) %>%
  summarise(mean = mean(LungCap))
mean_smoke

# A tibble: 2 × 2
  Smoke  mean
  <chr> <dbl>
1 no     7.77
2 yes    8.65

Taken at face value, this does not make sense: smokers have a higher mean lung capacity (8.65) than non-smokers (7.77), the opposite of what one would expect.

1d) Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code

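# Bin ages into the four groups named in the question
df <- mutate(df, AgeGrp = case_when(Age <= 13 ~ "less than or equal to 13",
                                    Age == 14 | Age == 15 ~ "14 to 15",
                                    Age == 16 | Age == 17 ~ "16 to 17",
                                    Age >= 18 ~ "greater than or equal to 18"))

# Horizontal histograms of LungCap (mapped to y), one panel per age group;
# `fill` rather than `color` so the bars themselves are shaded by smoking status
df %>%
  ggplot(aes(y = LungCap, fill = Smoke)) +
  geom_histogram(bins = 25) +
  facet_wrap(vars(AgeGrp)) +
  labs(title = "Relationship of LungCap and Smoke based on Age", y = "Lung Capacity", x = "Frequency")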

1e) Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part 1c? What could possibly be going on here?

Code

df %>%
  ggplot(aes(x = Age, y = LungCap, color = Smoke)) +
  geom_line() +
  theme_classic() + 
  facet_wrap(vars(Smoke)) +
  labs(title = "Relationship of LungCap and Smoke v/s Age", y = "Lung Capacity", x = "Age")
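A table of group means is one direct way to compare smokers and non-smokers within each age group. The sketch below is an assumption-laden illustration, not the original analysis: `toy`, `age_levels`, and `toy_summary` are hypothetical names, and the rows are made-up values that only mimic the columns of the LungCap data, so the numbers say nothing about the real result.

```r
library(dplyr)

# Hypothetical toy rows, NOT the LungCap data; they only mimic its columns
toy <- tibble(
  Age     = c(10, 12, 11, 14, 15, 15, 17, 16, 18, 19),
  Smoke   = c("no", "no", "no", "no", "yes", "no", "yes", "no", "yes", "no"),
  LungCap = c(6.1, 7.0, 6.5, 9.2, 8.8, 9.6, 10.1, 10.9, 11.0, 12.2)
)

# Age group labels in their natural order (so tables and facets are not alphabetical)
age_levels <- c("less than or equal to 13", "14 to 15",
                "16 to 17", "greater than or equal to 18")

toy_summary <- toy %>%
  mutate(
    AgeGrp = case_when(
      Age <= 13 ~ age_levels[1],
      Age <= 15 ~ age_levels[2],
      Age <= 17 ~ age_levels[3],
      TRUE      ~ age_levels[4]
    ),
    AgeGrp = factor(AgeGrp, levels = age_levels)
  ) %>%
  group_by(AgeGrp, Smoke) %>%
  summarise(n = n(), mean_lungcap = mean(LungCap), .groups = "drop")

toy_summary
```

Applied to `df` instead of `toy`, the same pipeline would show both the group means and the group sizes. If smokers are concentrated in the older age groups, where lung capacity is larger overall, the pooled mean in part 1c can favor smokers even when non-smokers have the higher mean within each age group.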

Question 2

Reading the table

Code

Prior_convitions <- 0:4
Inmate_count <- c(128, 434, 160, 64, 24)
# tibble() replaces data_frame(), which is deprecated
Pc <- tibble(Prior_convitions, Inmate_count)

Code

Pc

# A tibble: 5 × 2
  Prior_convitions Inmate_count
             <int>        <dbl>
1                0          128
2                1          434
3                2          160
4                3           64
5                4           24

Code

Pc <- mutate(Pc, Probability = Inmate_count/sum(Inmate_count))
Pc

# A tibble: 5 × 3
  Prior_convitions Inmate_count Probability
             <int>        <dbl>       <dbl>
1                0          128      0.158 
2                1          434      0.536 
3                2          160      0.198 
4                3           64      0.0790
5                4           24      0.0296

2a - Probability that a randomly selected inmate has exactly 2 prior convictions:

Code

Pc %>%
  filter(Prior_convitions == 2) %>%
  select(Probability)

# A tibble: 1 × 1
  Probability
        <dbl>
1       0.198

2b - Probability that a randomly selected inmate has fewer than 2 convictions:

Code

temp <- Pc %>%
  filter(Prior_convitions < 2)
sum(temp$Probability)

[1] 0.6938272

2c - Probability that a randomly selected inmate has 2 or fewer prior convictions:

Code

temp <- Pc %>%
  filter(Prior_convitions <= 2)
sum(temp$Probability)

[1] 0.891358

2d - Probability that a randomly selected inmate has more than 2 prior convictions:

Code

temp <- Pc %>%
  filter(Prior_convitions > 2)
sum(temp$Probability)

[1] 0.108642
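Parts 2b to 2d can all be read off the cumulative distribution, which gives a quick cross-check of the filtered sums. The sketch below uses base R and the counts from the table; the variable names (`counts`, `p`, `cdf`, and the `p_*` results) are chosen here for illustration.

```r
counts <- c(128, 434, 160, 64, 24)   # inmate counts for 0 to 4 prior convictions
p <- counts / sum(counts)            # probability of each count
names(p) <- 0:4

cdf <- cumsum(p)                     # cdf[["k"]] is P(X <= k)

p_fewer_than_2 <- cdf[["1"]]         # P(X < 2)  = P(X <= 1)
p_2_or_fewer   <- cdf[["2"]]         # P(X <= 2)
p_more_than_2  <- 1 - cdf[["2"]]     # P(X > 2)  = 1 - P(X <= 2)

c(p_fewer_than_2, p_2_or_fewer, p_more_than_2)
```

Because P(X > 2) is the complement of P(X <= 2), the answers to 2c and 2d should sum to 1, which they do (0.891358 + 0.108642).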

2e - Expected value for the number of prior convictions:

Code

Pc <- mutate(Pc, Wm = Prior_convitions*Probability)
e <- sum(Pc$Wm)
e

[1] 1.28642

2f - Variance for the Prior Convictions:

Code

v <- sum((Pc$Prior_convitions - e)^2 * Pc$Probability)
v

[1] 0.8562353

Standard deviation of the number of prior convictions:

Code

sqrt(v)

[1] 0.9253298
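As a cross-check, the variance can also be computed with the shortcut Var(X) = E[X^2] - (E[X])^2, which should agree with the definition-based calculation. The sketch below is self-contained base R using the counts from the table; the names `x`, `ex`, `ex2`, `variance`, and `std_dev` are illustrative.

```r
x <- 0:4                              # number of prior convictions
counts <- c(128, 434, 160, 64, 24)    # inmate counts from the table
p <- counts / sum(counts)

ex  <- sum(x * p)                     # E[X]
ex2 <- sum(x^2 * p)                   # E[X^2]

variance <- ex2 - ex^2                # shortcut formula for Var(X)
std_dev  <- sqrt(variance)

c(expected = ex, variance = variance, std_dev = std_dev)
```

Both routes treat the table as the full probability distribution, so no n - 1 correction applies; base R's `var()` and `sd()` compute sample statistics and would not be appropriate here.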