library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 5: Introduction to Visualization of Data
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- create at least two univariate visualizations
- try to make them “publication” ready
- Explain why you choose the specific graph type
- Create at least one bivariate visualization
- try to make them “publication” ready
- Explain why you choose the specific graph type
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.
(be sure to only include the category tags for the data you use!)
Read in data
#read in cereal data csv file
<- read_csv("~/GIT/601_Fall_2022/posts/_data/cereal.csv") cereal
Error: '~/GIT/601_Fall_2022/posts/_data/cereal.csv' does not exist.
#print to confirm successfully read in data
print(cereal)
Error in print(cereal): object 'cereal' not found
##Data Summary
#get column names
colnames(cereal)
Error in is.data.frame(x): object 'cereal' not found
#get column types
spec(cereal)
Error in stopifnot(inherits(x, "tbl_df")): object 'cereal' not found
#get data set dimensions
dim(cereal)
Error in eval(expr, envir, enclos): object 'cereal' not found
#print to view dataset
print(cereal)
Error in print(cereal): object 'cereal' not found
The Cereal dataset consists of 4 variables (columns) and 20 cases (rows). The variables consist of:
‘Cereal’: a qualitative/categorical nominal variable, listing of individual cereal brands
‘Sodium’: a quantitative continuous variable of sodium concentrations, likely per serving, measured in milligrams.
‘Sugar’: a quantitative continuous variable of sugar concentrations, likely per serving, measured in grams.
‘Type’: a quantitative ordinal variable, 2 options of A or C
This dataset appears to be a collection of specific nutritional information for each brand of cereal. Likely the measurements are determined per a serving of cereal, with serving sizes likely varying per brand and size and/or amount of pieces of cereal. The ‘Type’ variable is difficult to determine, but one could surmise, from the other information in the data set, that the values may represent a grading system based upon nutritional content of each cereal.
For this dataset it is expected that some columns may need to be renamed for ease of analysis. To examine the relationship between sodium content and cereal type, through graphical modeling and statistical analysis, the ‘Type’ column may need to be rearranged (pivot_wider) to separate the cereal data by Type A or C, if this cannot be accomplished by other means.
##Tidying/Wrangling Data and Graphing Univariate Data
#rename 'Cereal' column to avoid conflict from object and column sharing designation
<- rename(cereal, "Product" = "Cereal") cereal
Error in rename(cereal, Product = "Cereal"): object 'cereal' not found
#print dataset to confirm name change
print(cereal)
Error in print(cereal): object 'cereal' not found
#get number of brands that fall under Type A and C
table(select( cereal, 'Type'))
Error in select(cereal, "Type"): object 'cereal' not found
#Make histogram of 'Type' content data
ggplot(cereal, aes(x = Type)) + geom_bar(color = "black", fill = "lightgray") + labs(title = 'Type Grade of Cereals', y = "Count", x = "Cereal Type") + theme_minimal()
Error in ggplot(cereal, aes(x = Type)): object 'cereal' not found
There are 10 cereal brands that fall under the A type and 10 that fall under the C type. This is confirmed in a histogram model of the Type data. This illustrates a relatively even split between the two sets of Type data. While more data points would assist in normalizing the data for greater precision in statistical analysis of the sample data, to reflect the greater cereal population, this is suitable for further examination.
#get summary statistics for Sodium content values
summary(cereal)
Error in summary(cereal): object 'cereal' not found
#Make histogram of 'Sodium' Content data
ggplot(cereal, aes(x = Sodium)) + geom_histogram(binwidth = 20, color = "black", fill = "lightgray") + labs(title = 'Sodium Content of Cereals', y = "Count", x = "Sodium Content (mg/serving)") + theme_minimal()
Error in ggplot(cereal, aes(x = Sodium)): object 'cereal' not found
From a graph of Sodium content, we can observe that the total sodium content across the total 20 samples ranges widely from a minimum of 0mg per serving to a maximum of 340mg per serving, with a mean of 167.0mg per serving. To further examine this data by cereal type statistical data and graphical models of the sodium content by cereal type must be rendered.
##Tidying/Wrangling Data and Graphing Bivariate Data
#want to get summary data statistics for sodium content for both types A and C
#Try to filter out each type separately to create new objects to put into vectors to create new data frame to do summary with
<- filter(cereal, Type == 'A') sodium_typeA
Error in filter(cereal, Type == "A"): object 'cereal' not found
<- filter(cereal, Type == 'C') sodium_typeC
Error in filter(cereal, Type == "C"): object 'cereal' not found
#print sodium_typeA to confirm changes
print(sodium_typeA)
Error in print(sodium_typeA): object 'sodium_typeA' not found
#print sodium_typeC to confirm changes
print(sodium_typeC)
Error in print(sodium_typeC): object 'sodium_typeC' not found
<- select(sodium_typeA, 'Sodium' ) Sodium_Content_TypeA
Error in select(sodium_typeA, "Sodium"): object 'sodium_typeA' not found
<- select(sodium_typeC, 'Sodium') Sodium_Content_TypeC
Error in select(sodium_typeC, "Sodium"): object 'sodium_typeC' not found
<- as.data.frame(c(Sodium_Content_TypeA, Sodium_Content_TypeC)) sodium_type_data
Error in as.data.frame(c(Sodium_Content_TypeA, Sodium_Content_TypeC)): object 'Sodium_Content_TypeA' not found
<- rename(sodium_type_data, 'Sodium_Content_Type_A' = "Sodium", 'Sodium_Content_Type_C' = "Sodium.1") sodium_type_data
Error in rename(sodium_type_data, Sodium_Content_Type_A = "Sodium", Sodium_Content_Type_C = "Sodium.1"): object 'sodium_type_data' not found
#print sodium content data to confirm column name changes
print(sodium_type_data)
Error in print(sodium_type_data): object 'sodium_type_data' not found
#print sodium content typeA to check data intact
print(Sodium_Content_TypeA)
Error in print(Sodium_Content_TypeA): object 'Sodium_Content_TypeA' not found
#print sodium content typeC to confirm data intact
print(Sodium_Content_TypeC)
Error in print(Sodium_Content_TypeC): object 'Sodium_Content_TypeC' not found
#get summary of sodium data for each type from new data frame
summary(sodium_type_data)
Error in summary(sodium_type_data): object 'sodium_type_data' not found
#make boxplot of cereal 'type' and 'sodium' content
ggplot(cereal, aes(Type, Sodium)) + geom_boxplot(color = "black", fill = "lightgray") + labs(title = 'Sodium Content of Type A and C Cereals', y = "Sodium Content (mg/serving)", x = "Cereal Type") + theme_minimal()
Error in ggplot(cereal, aes(Type, Sodium)): object 'cereal' not found
To further examine the sodium content data by cereal type, a statistical analysis was performed and a bivariate boxplot model was generated to display the differences in sodium content for each type. In comparison to the initial conjoined data for sodium content, we see that the Type A and C show clear differences in sodium content for each. By range, Type A had a minimum sodium content of 0mg per serving and a maximum of 340mg per serving, where Type C cereals ranged a minimum of 130mg per serving to a maximum sodium content of 290mg per serving. But despite its greater range, we see that the mean for Type A cereals is 149mg per serving, being several milligrams lower than the Type C cereal mean sodium content of 185mg per serving. If we were to conclude the nutritional value of each cereal based upon sodium content, alone, we might conclude that the Type A cereals were the healthier option for having a lower sodium content on average. But this data may be skewed by the wide range in sodium contents for Type A cereals, and it may be more efficient to examine additional factors in determining the nutritional quality of cereals. That is, if the Types were a form of nutritional grading, which has proven difficult to confirm. Regardless, if sodium content is being examined, it would be better to collect a greater number of cereal brand samples to add to the sample population to improve precision in estimating population values, as a smaller sample population may tend towards skewing with widely ranging (potential) outliers arising in the type sample populations.