Challenge 5

challenge_5

pathogen

cereal

Author

Thrishul

Published

March 22, 2023

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(ggplot2)
library(readxl)
library(ggrepel)

Warning: package 'ggrepel' was built under R version 4.2.3

Code

library(here)

Warning: package 'here' was built under R version 4.2.3

here() starts at C:/Users/polat/OneDrive/Documents/GitHub/dacssqa/601_Spring_2023

Code

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

The dataset covers the top 15 pathogens in 2018 and records, for each pathogen, the total number of cases and the estimated associated cost.

Code

# Read the spreadsheet, skipping the first 5 rows and supplying our own column names
pathogen <- here("posts", "_data", "Total_cost_for_top_15_pathogens_2018.xlsx") %>%
  readxl::read_excel(
    skip = 5,
    n_max = 16,
    col_names = c("pathogens", "Cases", "Cost")
  )

pathogen

# A tibble: 15 × 3
   pathogens                                                        Cases   Cost
   <chr>                                                            <dbl>  <dbl>
 1 Campylobacter spp. (all species)                                8.45e5 2.18e9
 2 Clostridium perfringens                                         9.66e5 3.84e8
 3 Cryptosporidium spp. (all species)                              5.76e4 5.84e7
 4 Cyclospora cayetanensis                                         1.14e4 2.57e6
 5 Listeria monocytogenes                                          1.59e3 3.19e9
 6 Norovirus                                                       5.46e6 2.57e9
 7 Salmonella (non-typhoidal species)                              1.03e6 4.14e9
 8 Shigella (all species)                                          1.31e5 1.59e8
 9 Shiga toxin-producing Escherichia coli O157 (STEC O157)         6.32e4 3.11e8
10 non-O157 Shiga toxin-producing Escherichia coli (STEC non-O157) 1.13e5 3.17e7
11 Toxoplasma gondii                                               8.67e4 3.74e9
12 Vibrio parahaemolyticus                                         3.47e4 4.57e7
13 Vibrio vulnificus                                               9.6 e1 3.59e8
14 Vibrio non-cholera species other than V. parahaemolyticus and … 1.76e4 8.17e7
15 Yersinia enterocolitica                                         9.77e4 3.13e8

Univariate Visualizations

The dataset has only 15 observations, which is even fewer than the number of cereals in some datasets, and its distribution is highly skewed. It is therefore worth checking whether the visualizations that work for larger or less skewed datasets are still useful here.

Code

ggplot(pathogen, aes(x=Cases)) +
  geom_histogram()

Code

ggplot(pathogen, aes(x=Cases)) +
  geom_histogram()+
  scale_x_continuous(trans = "log10")

Code

ggplot(pathogen, aes(x=Cases)) +
  geom_boxplot()

Code

ggplot(pathogen, aes(x=Cases)) +
  geom_boxplot()+
  scale_x_continuous(trans = "log10")

The unscaled histogram is not the best choice for visualizing this distribution: it highlights the single outlier but gives little insight into the pathogens with lower case counts. A suggestion was made to rescale the number of cases using a logarithmic or other scaling function. As the log-scaled plots above show, a logarithmic x-axis is more informative in revealing the underlying patterns of the data.
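The effect of the log scale described above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes the case counts can be copied from the printed tibble (where they are rounded to three significant digits), and the names `cases`, `raw_ratio`, `n_small`, and `log_range` are introduced here for illustration only.

```r
# Case counts copied from the printed tibble above
# (assumption: rounded to three significant digits, in table order)
cases <- c(8.45e5, 9.66e5, 5.76e4, 1.14e4, 1.59e3, 5.46e6, 1.03e6, 1.31e5,
           6.32e4, 1.13e5, 8.67e4, 3.47e4, 9.6e1, 1.76e4, 9.77e4)

# On the raw scale, the largest count dwarfs the typical (median) count
raw_ratio <- max(cases) / median(cases)

# Number of pathogens squeezed into the lowest 10% of the raw axis
n_small <- sum(cases < 0.1 * max(cases))

# On the log10 scale, the same values spread over only a few units
log_range <- range(log10(cases))

raw_ratio
n_small
log_range
```

Because most pathogens sit in a small slice of the raw axis, a histogram on that scale puts nearly everything in the first few bins, while the log10 scale spreads the observations out.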

Next, we explore the distribution of costs with the same kind of histogram.

Code

ggplot(pathogen, aes(x=Cost)) +
  geom_histogram()

Code

ggplot(pathogen, aes(x=Cost)) +
  geom_histogram()+
  scale_x_continuous(trans = "log10")

Bivariate Visualization

To investigate the relationship between cases and costs, the scatterplots below show the two variables on both the original and the log scale.

Code

ggplot(pathogen, aes(x=Cases, y=Cost, label=pathogens)) +
  geom_point() +
  scale_x_continuous(labels = scales::comma)+
  geom_text()

Code

ggplot(pathogen, aes(x=Cases, y=Cost, label=pathogens)) +
  geom_point()+
  scale_x_continuous(trans = "log10", labels = scales::comma)+
  scale_y_continuous(trans = "log10", labels = scales::comma)+
  ggrepel::geom_label_repel()
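As a numeric complement to the scatterplots, the relationship between cases and costs can also be summarized directly. This is a rough sketch under stated assumptions rather than part of the original analysis: the values are copied from the rounded tibble output, the shortened pathogen labels and the names `pathogen_sketch`, `cor_raw`, `cor_log`, and `cost_per_case` are hypothetical, and the correlation on 15 points should be read with caution.

```r
# Values copied from the printed tibble above (assumption: rounded figures).
# The short labels are abbreviations introduced for this sketch only.
pathogen_sketch <- data.frame(
  pathogen = c("Campylobacter", "C. perfringens", "Cryptosporidium",
               "Cyclospora", "Listeria", "Norovirus", "Salmonella",
               "Shigella", "STEC O157", "STEC non-O157", "Toxoplasma",
               "V. parahaemolyticus", "V. vulnificus", "Vibrio (other)",
               "Yersinia"),
  Cases = c(8.45e5, 9.66e5, 5.76e4, 1.14e4, 1.59e3, 5.46e6, 1.03e6, 1.31e5,
            6.32e4, 1.13e5, 8.67e4, 3.47e4, 9.6e1, 1.76e4, 9.77e4),
  Cost = c(2.18e9, 3.84e8, 5.84e7, 2.57e6, 3.19e9, 2.57e9, 4.14e9, 1.59e8,
           3.11e8, 3.17e7, 3.74e9, 4.57e7, 3.59e8, 8.17e7, 3.13e8),
  stringsAsFactors = FALSE
)

# Pearson correlation on the raw scale and on the log10 scale
cor_raw <- cor(pathogen_sketch$Cases, pathogen_sketch$Cost)
cor_log <- cor(log10(pathogen_sketch$Cases), log10(pathogen_sketch$Cost))

# Cost per case singles out pathogens that are rare but expensive
pathogen_sketch$cost_per_case <- pathogen_sketch$Cost / pathogen_sketch$Cases
top <- pathogen_sketch[order(-pathogen_sketch$cost_per_case), ]

c(raw = cor_raw, log10 = cor_log)
head(top[, c("pathogen", "cost_per_case")], 3)
```

Sorting by cost per case is one way to read the upper-left corner of the log-scale scatterplot, where a few pathogens combine low case counts with high total costs.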

Although the raw and log-scale scatterplots provide some insight, they may not be particularly informative for a layperson. The dataset is probably most useful as a reference for someone with expertise in this field.