library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 5 Akhilesh
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- create at least two univariate visualizations
- try to make them “publication” ready
- Explain why you choose the specific graph type
- Create at least one bivariate visualization
- try to make them “publication” ready
- Explain why you choose the specific graph type
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- cereal ⭐
- pathogen cost ⭐
- Australian Marriage ⭐⭐
- AB_NYC_2019.csv ⭐⭐⭐
- railroads ⭐⭐⭐
- Public School Characteristics ⭐⭐⭐⭐
- USA Households ⭐⭐⭐⭐⭐
<-read_csv("_data/cereal.csv", show_col_types = FALSE)
cerealnames(cereal)
[1] "Cereal" "Sodium" "Sugar" "Type"
dim(cereal)
[1] 20 4
summary(cereal)
Cereal Sodium Sugar Type
Length:20 Min. : 0.0 Min. : 0.00 Length:20
Class :character 1st Qu.:137.5 1st Qu.: 4.00 Class :character
Mode :character Median :180.0 Median : 9.50 Mode :character
Mean :167.0 Mean : 8.75
3rd Qu.:202.5 3rd Qu.:12.50
Max. :340.0 Max. :18.00
View(cereal)
Briefly describe the data
print(summarytools::dfSummary(cereal,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.50,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
Data Frame Summary
cereal
Dimensions: 20 x 4Duplicates: 0
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cereal [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sodium [numeric] |
|
15 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sugar [numeric] |
|
15 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Type [character] |
|
|
0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-04
- The dataset ‘cereal’ contains sodium, sugar value along with ‘type of cereal’ for each cereal type.
- There are 20 different type of cereals
- Type column variable has two values: ‘A’ and ‘C’, both have frequency of 10
- ‘Sodium’ column variable has min, max, mean values of 0, 340, 167 respectively
- ‘Sugar’ column variable has min, max, mean values of 0, 18, 8.75 respectively
- For ‘A’ type cereals min, max, mean values of ‘Sodium’ is 0, 340, 149 respectively
- For ‘A’ type cereals min, max, mean values of ‘Sugar’ is 0, 18, 9.22 respectively
- For ‘C’ type cereals min, max, mean values of ‘Sodium’ is 130, 290, 185 respectively
- For ‘C’ type cereals min, max, mean values of ‘Sugar’ is 1, 14, 9.2 respectively
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
<- pivot_longer(cereal, cols= c('Sodium', 'Sugar'), names_to='Nutrient_type', values_to ='Value') cereal
Sanity check
head(cereal)
# A tibble: 6 × 4
Cereal Type Nutrient_type Value
<chr> <chr> <chr> <dbl>
1 Frosted Mini Wheats A Sodium 0
2 Frosted Mini Wheats A Sugar 11
3 Raisin Bran A Sodium 340
4 Raisin Bran A Sugar 18
5 All Bran A Sodium 70
6 All Bran A Sugar 5
str(cereal)
tibble [40 × 4] (S3: tbl_df/tbl/data.frame)
$ Cereal : chr [1:40] "Frosted Mini Wheats" "Frosted Mini Wheats" "Raisin Bran" "Raisin Bran" ...
$ Type : chr [1:40] "A" "A" "A" "A" ...
$ Nutrient_type: chr [1:40] "Sodium" "Sugar" "Sodium" "Sugar" ...
$ Value : num [1:40] 0 11 340 18 70 5 140 14 200 12 ...
summary(cereal)
Cereal Type Nutrient_type Value
Length:40 Length:40 Length:40 Min. : 0.00
Class :character Class :character Class :character 1st Qu.: 8.50
Mode :character Mode :character Mode :character Median : 17.00
Mean : 87.88
3rd Qu.:180.00
Max. :340.00
View(cereal)
NA primary check
<- sum(is.na(cereal))
NA_SUM View(cereal)
NA_SUM is zero, so there is no NA value in the dataset
Mutate to change class of ‘Cereal’ column from character to factor
<- cereal %>%
cereal mutate_at(c('Cereal', 'Type', 'Nutrient_type'), factor)
‘Cereal’, ‘Type’& ‘Nutrient_type’ columns are character class, need to be converted to factor and reorder for graphic visualization.
Remaining columns are numeric and don’t need mutation,
String values in the column, can be represented numerically by setting numeric labels
Factors are important for indicating subsets of dataset for categorical variables, also Reordering helps in alignment of graphs
Univariate Visualizations
<- cereal %>%
b group_by(Nutrient_type, Type) %>%
summarise(sum(Value),.groups = 'keep') %>%
rename(nutrient_value = `sum(Value)`)
ggplot(b, aes(nutrient_value, fill = Type)) +
geom_histogram(binwidth = 20) +
labs(title = "Sodium/Sugar Values") +
theme_bw() +
facet_wrap(Nutrient_type~Type, scales = "free_x")
Bivariate Visualization(s)
ggplot(cereal, aes(x=Cereal, y=Value)) +
geom_bar(stat = "identity")+theme(axis.text.x=element_text(angle=90,hjust=1))+
facet_wrap(~'Nutrient_type')+
labs(title = "Bar Graph on Cereal wise Value")
Any additional comments?