HOME WORK1 603

hw1

descriptive statistics

megha joseph

probability

Author

Megha Joseph

Published

October 3, 2022

Code

library(tidyverse)
library(readxl)
library(ggplot2)
library(stats)

knitr::opts_chunk$set(echo = TRUE)

Answer 1

Code

readD <- read_excel("_data/LungCapData.xls")
readD

# A tibble: 725 × 6
   LungCap   Age Height Smoke Gender Caesarean
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
 1    6.48     6   62.1 no    male   no       
 2   10.1     18   74.7 yes   female no       
 3    9.55    16   69.7 no    female yes      
 4   11.1     14   71   no    male   no       
 5    4.8      5   56.9 no    male   no       
 6    6.22    11   58.7 no    female no       
 7    4.95     8   63.3 no    male   yes      
 8    7.32    11   70.4 no    male   no       
 9    8.88    15   70.5 no    male   no       
10    6.8     11   59.2 no    male   no       
# … with 715 more rows

Answer 1 (a)

Distribution of LungCap:

Code

hist(readD$LungCap)

The distribution is a normal distribution. ## Answer 1 (b)

Code

boxplot(readD$LungCap ~ readD$Gender)

The mean of males appear higher than females.

Answer 1 (c)

Code

readD%>%
  group_by(Smoke) %>% 
  summarize(Mean=mean(LungCap))

# A tibble: 2 × 2
  Smoke  Mean
  <chr> <dbl>
1 no     7.77
2 yes    8.65

Code

readD%>%
  group_by(Smoke) %>% 
  summarize(stdev=sd(LungCap))

# A tibble: 2 × 2
  Smoke stdev
  <chr> <dbl>
1 no     2.73
2 yes    1.88

Code

ggplot(readD, aes(x=LungCap, y=Smoke))+geom_boxplot()

The mean of smokers is higher than the mean of non smokers and therefore it is not sensible.

Answer 1 (d)

Code

class(readD$Age)

[1] "numeric"

Code

readD <- mutate(readD, AgeGroup = case_when(Age <= 13 ~ "13 and below", 
                                            Age == 14 | Age == 15 ~ "14 to 15", 
                                            Age == 16 | Age == 17 ~ "16 to 17", 
                                            Age >= 18 ~ "18 and above"))
ggplot(readD, aes(x = LungCap)) +
  geom_histogram() +
  facet_grid(AgeGroup~Smoke)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Code

readD %>%
  ggplot(aes(x = Age, y = LungCap, color = Smoke)) +
  geom_line() +
  facet_wrap(vars(Smoke)) +
  labs(y = "Lung Capacity", x = "Age")

From the above results we can say that people from age group 10 and above smoke.

Answer 1 (f)

Code

cor(readD$LungCap,readD$Age)

[1] 0.8196749

Code

cov(readD$LungCap,readD$Age)

[1] 8.738289

From the data we can see that the covariance is positive and it shows that there is a direct relationship between age and lung capacity. And the correlation is also positive, so they move in same direction. Therefore as the age increases, the lung capacity also increases that is they are directly proportional to each other.

Answer 2

Code

X<-c(0, 1, 2, 3, 4)
Frequency<-c(128, 434, 160, 64, 24)
C<- data.frame(X, Frequency)
C

  X Frequency
1 0       128
2 1       434
3 2       160
4 3        64
5 4        24

Code

C<-rename(C, PriorConvictions=X)
C

  PriorConvictions Frequency
1                0       128
2                1       434
3                2       160
4                3        64
5                4        24

Code

#visualizing df using bar chart
ggplot(C, aes(x=PriorConvictions, y=Frequency))+geom_bar(stat="identity")+geom_text(aes(label = Frequency), vjust = -.3)

Code

#There are 810 obs in df
sum(Frequency)

[1] 810

Code

PO<-Frequency/810
PO

[1] 0.15802469 0.53580247 0.19753086 0.07901235 0.02962963

Code

#A
# P(x=2)=160/810
160/810

[1] 0.1975309

Code

#B
#P(x<2)=P(0)+P(1)
(128+434)/810

[1] 0.6938272

Code

#C
#P(x<=2)=P(0)+P(1)+P(2)
(128+434+160)/810

[1] 0.891358

Code

#D
#1-P(above)
1-((128+434+160)/810)

[1] 0.108642

Code

#E
#Expected value=sum of probabilities*each value (0, 1, 2, 3 or 4)
weighted.mean(X, PO)

[1] 1.28642

Code

#F
#Calculating the Variance using the formula for variance
(sum(Frequency*((X-1.28642)^2)))/(sum(Frequency)-1)

[1] 0.8572937

Code

#Calculating the sample standard deviation from the variance
sqrt(0.8572937)

[1] 0.9259016

Answer

a: 19.75% b :9.38% c :89.14% d :10.86% e :1.28642 f: variance: 0.8572937 standard deviation: 0.9259016