Challenge 5

challenge_5
cereal
Introduction to Visualization
Author

Paarth Tandon

Published

January 9, 2022

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • cereal ⭐
  • pathogen cost ⭐
  • Australian Marriage ⭐⭐
  • AB_NYC_2019.csv ⭐⭐⭐
  • railroads ⭐⭐⭐
  • Public School Characteristics ⭐⭐⭐⭐
  • USA Households ⭐⭐⭐⭐⭐
set.seed(42)
# read in the data using readr
cereal <- read_csv("_data/cereal.csv")
head(cereal, 5)
Cereal Sodium Sugar Type
Frosted Mini Wheats 0 11 A
Raisin Bran 340 18 A
All Bran 70 5 A
Apple Jacks 140 14 C
Captain Crunch 200 12 C

Briefly describe the data

This data set contains four columns:

  • Cereal <chr>: The name of the cereal
  • Sodium <dbl>: The amount of sodium in a serving of the cereal
  • Sugar <dbl>: The amount of sugar in a serving of cereal
  • Type <chr>: The type of the cereal (Child or Adult)

Tidy Data (as needed)

Data is already tidy. Going to mutate Type for visualization reasons.

cereal <- mutate(cereal, `Type` = recode(`Type`,
    "A" = "Adult",
    "C" = "Child"
))
head(cereal, 5)
Cereal Sodium Sugar Type
Frosted Mini Wheats 0 11 Adult
Raisin Bran 340 18 Adult
All Bran 70 5 Adult
Apple Jacks 140 14 Child
Captain Crunch 200 12 Child

Univariate Visualizations

ggplot(cereal, aes(`Sodium`)) +
geom_histogram(binwidth = 100) +
facet_grid(vars(`Type`)) +
labs(title = "Distribution of Sodium based on Type")

ggplot(cereal, aes(`Sugar`)) +
geom_histogram(binwidth = 5) +
facet_grid(vars(`Type`)) +
labs(title = "Distribution of Sugar based on Type")

I chose histograms, since we are working with univariate, continuous data. I wanted to compare by type, so I applied a facet_grid.

Bivariate Visualization(s)

ggplot(cereal, aes(`Type`, `Sugar`)) +
geom_boxplot() +
labs(title = "Amount of Sugar in Cereal by Type")

Here we can see that children’s cereal has more sugar in it than adult’s cereal, on average. I chose a box plot, since the data is very sparse. With only 20 points, we can only extrapolate so much.