DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 5: Introduction to Visualization of Data


challenge_5
cereal
Exercising skills in visualizing data via graphical models
Author

Christa Bonney

Published

November 24, 2022

library(tidyverse)
library(ggplot2)


knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. create at least two univariate visualizations
     • try to make them “publication” ready
     • Explain why you choose the specific graph type
  5. Create at least one bivariate visualization
     • try to make them “publication” ready
     • Explain why you choose the specific graph type

R Graph Gallery (https://r-graph-gallery.com/) is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

(be sure to only include the category tags for the data you use!)

Read in data

# read in the cereal data CSV file
# note: the original absolute path, "~/GIT/601_Fall_2022/posts/_data/cereal.csv", did not exist on the machine that rendered this page
# a path relative to the post's folder is assumed here instead

cereal <- read_csv("_data/cereal.csv")
# print to confirm the data were read in successfully

print(cereal)
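A hard-coded path is easy to get wrong when a post is rendered on a different machine, and every later step then fails because the `cereal` object was never created. The following is a minimal sketch of a guarded read, not part of the original post: `read_csv_checked` is a hypothetical helper name, and the temporary file holds a made-up row purely so the example runs on its own.

```r
library(readr)

# Hypothetical helper: stop early with a clear message if the file is missing
read_csv_checked <- function(path) {
  if (!file.exists(path)) {
    stop("Data file not found: ", path, " (working directory: ", getwd(), ")")
  }
  read_csv(path, show_col_types = FALSE)
}

# Self-contained demonstration: write a temporary CSV containing one
# made-up row (NOT taken from the cereal dataset) and read it back
demo_path <- tempfile(fileext = ".csv")
writeLines(c("Cereal,Sodium,Sugar,Type", "Example Flakes,100,5,A"), demo_path)
demo <- read_csv_checked(demo_path)
print(demo)
```

Failing at the read step with the working directory in the message makes it clear that the path, rather than any later analysis step, is what needs fixing.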

Data Summary

# get column names

colnames(cereal)
# get column types

spec(cereal)
# get data set dimensions

dim(cereal)
# print to view the data set

print(cereal)

The Cereal dataset consists of 4 variables (columns) and 20 cases (rows). The variables are:

‘Cereal’: a qualitative (categorical) nominal variable listing the individual cereal brands.

‘Sodium’: a quantitative continuous variable giving the sodium content, likely per serving, measured in milligrams.

‘Sugar’: a quantitative continuous variable giving the sugar content, likely per serving, measured in grams.

‘Type’: a qualitative (categorical) variable with two levels, A and C. It would be ordinal only if the levels represent grades, which the data do not confirm.

This dataset appears to be a collection of nutritional information for individual cereal brands. The measurements are most likely reported per serving, and serving sizes probably vary by brand and by the size or number of pieces in a serving. The meaning of the ‘Type’ variable is not documented in the data, but one could surmise from the other variables that its values represent a grading system based on the nutritional content of each cereal.

Some columns may need to be renamed for ease of analysis. To examine the relationship between sodium content and cereal type through graphs and summary statistics, the data may need to be reshaped (for example with pivot_wider) so that the sodium values are separated by Type A or C, if this cannot be accomplished by other means.
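One of the "other means" of separating values by Type is to group the rows instead of reshaping them. The sketch below is an illustration under stated assumptions rather than the approach taken in this post: the `toy` data frame contains made-up values (not rows from the cereal dataset), and the summary column names are chosen only for this example.

```r
library(dplyr)

# Illustrative values only; these are NOT rows from the cereal dataset
toy <- tibble(
  Sodium = c(100, 200, 150, 250),
  Type   = c("A", "A", "C", "C")
)

# One row of summary statistics per Type, with no reshaping needed
sodium_by_type <- toy %>%
  group_by(Type) %>%
  summarise(
    n           = n(),
    min_sodium  = min(Sodium),
    mean_sodium = mean(Sodium),
    max_sodium  = max(Sodium),
    .groups     = "drop"
  )

print(sodium_by_type)
```

Because the data stay in long form, the same data frame can be passed straight to ggplot with Type on one axis, and the grouping still works if the two types have different numbers of rows.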

Tidying/Wrangling Data and Graphing Univariate Data

# rename the 'Cereal' column so that the data frame and the column no longer share a name

cereal <- rename(cereal, Product = Cereal)
# print the data set to confirm the name change

print(cereal)
# count the number of brands that fall under Type A and Type C

table(cereal$Type)
# make a bar chart of the 'Type' counts

ggplot(cereal, aes(x = Type)) +
  geom_bar(color = "black", fill = "lightgray") +
  labs(title = "Type Grade of Cereals", x = "Cereal Type", y = "Count") +
  theme_minimal()

There are 10 cereal brands of Type A and 10 of Type C, as the bar chart of the Type data confirms. A bar chart was chosen because Type is categorical, so the natural display is a count per level. The split between the two types is exactly even. A larger sample would give more precise estimates for the wider cereal population, but this balanced sample is suitable for further examination.

# get summary statistics for the Sodium content values

summary(cereal$Sodium)
# make a histogram of the 'Sodium' content data

ggplot(cereal, aes(x = Sodium)) +
  geom_histogram(binwidth = 20, color = "black", fill = "lightgray") +
  labs(title = "Sodium Content of Cereals", x = "Sodium Content (mg/serving)", y = "Count") +
  theme_minimal()

From the histogram of sodium content, chosen because Sodium is a quantitative variable whose distribution is of interest, we can observe that sodium content across all 20 cereals ranges widely, from a minimum of 0mg per serving to a maximum of 340mg per serving, with a mean of 167.0mg per serving. To examine these data by cereal type, summary statistics and a graph of sodium content by type are needed.
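With only 20 observations spread over a 0 to 340mg range, a bin width of 20 produces roughly as many bins as there are cereals. As a rough cross-check, and not the method used in this post, Sturges' rule is one common heuristic for a starting number of bins. The sketch below uses only the sample size and range reported above; the variable names are chosen for illustration.

```r
# Values reported in the text: 20 cereals, sodium from 0 to 340 mg per serving
n_obs        <- 20
sodium_range <- c(0, 340)

# Bins implied by the bin width of 20 used in the histogram (approximate,
# since the exact count depends on where the bin edges fall)
bins_at_width_20 <- ceiling(diff(sodium_range) / 20)

# Sturges' rule: ceiling(log2(n) + 1) bins
sturges_bins <- ceiling(log2(n_obs) + 1)

# Bin width that would spread the observed range over that many bins
suggested_binwidth <- diff(sodium_range) / sturges_bins

bins_at_width_20
sturges_bins
suggested_binwidth
```

Heuristics like this only give a starting point; the bin width that best shows the shape of the distribution is still a judgment call for the analyst.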

Tidying/Wrangling Data and Graphing Bivariate Data

# goal: summary statistics of sodium content for Type A and Type C separately
# filter each type into its own object, then combine the Sodium columns into a new data frame to summarise

sodium_typeA <- filter(cereal, Type == "A")
sodium_typeC <- filter(cereal, Type == "C")
# print sodium_typeA to confirm the filter

print(sodium_typeA)
# print sodium_typeC to confirm the filter
print(sodium_typeC)
# keep only the Sodium column for each type
Sodium_Content_TypeA <- select(sodium_typeA, Sodium)
Sodium_Content_TypeC <- select(sodium_typeC, Sodium)
# place the two Sodium columns side by side; this only works because both types have the same number of rows (10 each)
# as.data.frame() makes the repeated column name unique, giving "Sodium" and "Sodium.1"
sodium_type_data <- as.data.frame(c(Sodium_Content_TypeA, Sodium_Content_TypeC))
sodium_type_data <- rename(sodium_type_data, Sodium_Content_Type_A = Sodium, Sodium_Content_Type_C = Sodium.1)
# print the combined sodium data to confirm the column name changes
print(sodium_type_data)
# print the Type A sodium values to check the data are intact
print(Sodium_Content_TypeA)
# print the Type C sodium values to check the data are intact
print(Sodium_Content_TypeC)
# get a summary of the sodium data for each type from the new data frame
summary(sodium_type_data)
# make a boxplot of sodium content by cereal type

ggplot(cereal, aes(x = Type, y = Sodium)) +
  geom_boxplot(color = "black", fill = "lightgray") +
  labs(title = "Sodium Content of Type A and C Cereals", x = "Cereal Type", y = "Sodium Content (mg/serving)") +
  theme_minimal()

To examine sodium content by cereal type, summary statistics were computed for each type and a boxplot was generated. A boxplot suits this comparison because it shows the distribution of a quantitative variable (Sodium) within each level of a categorical variable (Type). Compared with the pooled sodium data, Types A and C differ noticeably. Type A ranged from a minimum of 0mg per serving to a maximum of 340mg per serving, whereas Type C ranged from 130mg to 290mg per serving. Despite its wider range, Type A has a mean of 149mg per serving, 36mg lower than the Type C mean of 185mg per serving.

If we judged nutritional value on sodium content alone, we might conclude that Type A cereals are the healthier option because their average sodium content is lower. However, the Type A mean may be distorted by the wide spread of its values, and it would be more informative to examine additional factors when judging the nutritional quality of cereals. This reasoning also assumes that Type is a form of nutritional grading, which has proven difficult to confirm.

Regardless, with only 10 cereals per type, a few extreme values (potential outliers) can strongly influence each group's statistics, so collecting a larger sample of cereal brands would improve the precision of estimates for the wider population.
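The figures quoted above can be cross-checked with a little arithmetic: with equal group sizes, the pooled mean should equal the size-weighted average of the two group means. The sketch below uses only the counts, means, and ranges reported in the text (which may be rounded), so it is a consistency check on the write-up rather than a re-analysis of the data.

```r
# Group sizes, means, and ranges as reported in the text (possibly rounded)
n_by_type    <- c(A = 10, C = 10)
mean_by_type <- c(A = 149, C = 185)
min_by_type  <- c(A = 0, C = 130)
max_by_type  <- c(A = 340, C = 290)

# The pooled mean is the size-weighted average of the group means;
# the text reports a pooled mean of 167.0 mg per serving
pooled_mean <- weighted.mean(mean_by_type, w = n_by_type)

# Difference between the Type C and Type A means, in mg per serving
mean_diff <- unname(mean_by_type["C"] - mean_by_type["A"])

# The pooled range should span the smallest minimum and largest maximum;
# the text reports 0 to 340 mg per serving
pooled_range <- c(min(min_by_type), max(max_by_type))

pooled_mean
mean_diff
pooled_range
```

Agreement between these derived values and the pooled statistics reported earlier suggests the per-type summaries were computed on the same 20 rows as the overall summary.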
