DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 7:

challenge_7
hotel_bookings
australian_marriage
air_bnb
eggs
abc_poll
faostat
usa_households
Visualizing Multiple Dimensions
Author

Christa Bonney

Published

November 29, 2022

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

#read in cereal data csv file

cereal <- read_csv("~/GIT/601_Fall_2022/posts/_data/cereal.csv")
Error: '~/GIT/601_Fall_2022/posts/_data/cereal.csv' does not exist.
#print to confirm successfully read in data

print(cereal)
Error in print(cereal): object 'cereal' not found
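
The error above means `read_csv()` could not find the file at the hard-coded path, so `cereal` is never created and every later chunk that uses it fails as well. A minimal sketch of a more defensive read follows. The temporary file and its two rows are hypothetical stand-ins so the example runs anywhere; they are not the real cereal data.

```r
library(readr)

# Hypothetical stand-in for cereal.csv so the sketch is self-contained;
# the two rows are made up and are NOT the real data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Cereal,Sodium,Sugar,Type",
             "Brand X,100,5,A",
             "Brand Y,200,10,C"), csv_path)

# Stop with a clear message before any downstream chunk can fail
if (!file.exists(csv_path)) {
  stop("Cannot find the cereal data at: ", csv_path)
}

# Declare column types up front so parsing surprises surface here
cereal_demo <- read_csv(csv_path, col_types = cols(
  Cereal = col_character(),
  Sodium = col_double(),
  Sugar  = col_double(),
  Type   = col_character()
))

print(cereal_demo)
```

In a real post, `csv_path` would point at the actual data file, ideally via a path relative to the project rather than to one user's home directory.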

Data Summary

#get column names

colnames(cereal)
Error in is.data.frame(x): object 'cereal' not found
#get column types

spec(cereal)
Error in stopifnot(inherits(x, "tbl_df")): object 'cereal' not found
#get data set dimensions 

dim(cereal)
Error in eval(expr, envir, enclos): object 'cereal' not found
#print to view dataset

print(cereal)
Error in print(cereal): object 'cereal' not found

The Cereal dataset consists of 4 variables (columns) and 20 cases (rows). The variables consist of:

‘Cereal’: a qualitative/categorical nominal variable listing the individual cereal brands.

‘Sodium’: a quantitative continuous variable giving sodium content, likely per serving, measured in milligrams.

‘Sugar’: a quantitative continuous variable giving sugar content, likely per serving, measured in grams.

‘Type’: a qualitative/categorical variable with two levels, A or C. It would only be ordinal if the letters turn out to be ranked grades, which the data alone cannot confirm.

This dataset appears to be a collection of nutritional information for each brand of cereal. The measurements are most likely reported per serving, and serving sizes probably vary by brand and by the size or number of pieces of cereal. The meaning of the ‘Type’ variable is not documented, but one could surmise from the other information in the data set that its values represent a grading system based on the nutritional content of each cereal.

Some columns may need to be renamed for ease of analysis. To examine the relationship between sodium content and cereal type through plots and summary statistics, the ‘Type’ column may need to be rearranged (pivot_wider) to separate the cereal data into Type A and Type C, if this cannot be accomplished by other means.
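
To see what that reshaping would involve, here is a minimal sketch of `pivot_wider()` applied to a table with the same layout. The rows are made-up values for illustration only, and the `Sodium_` prefix is just a naming choice for this example.

```r
library(tidyr)

# Hypothetical rows for illustration only (not the real cereal data)
cereal_demo <- data.frame(
  Cereal = c("Brand W", "Brand X", "Brand Y", "Brand Z"),
  Sodium = c(0, 200, 150, 250),
  Type   = c("A", "A", "C", "C")
)

# One Sodium column per Type; each brand belongs to a single type,
# so the other type's column is NA for that row
cereal_wide <- pivot_wider(
  cereal_demo,
  names_from   = Type,
  values_from  = Sodium,
  names_prefix = "Sodium_"
)

print(cereal_wide)
```

Because every brand has only one type, half of the wide table is missing values, which suggests that filtering or grouping on `Type` in the original long form may be the simpler route.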

For Challenge 7 I chose to reuse this dataset and recreate graphs I had made in a prior challenge. The goal was to expand those graphs to display additional variables, which I had not managed to do at my earlier skill level.

Tidying/Wrangling Data and Graphing Univariate Data

#rename 'Cereal' column to avoid conflict from object and column sharing designation

cereal <- rename(cereal, "Product" = "Cereal")
Error in rename(cereal, Product = "Cereal"): object 'cereal' not found
#print dataset to confirm name change

print(cereal)
Error in print(cereal): object 'cereal' not found
#get number of brands that fall under Type A and C

table(select( cereal, 'Type'))
Error in select(cereal, "Type"): object 'cereal' not found
#Make bar chart of 'Type' counts

ggplot(cereal, aes(x = Type)) + geom_bar(aes(fill = Product), colour = "black") + labs(title = 'Type Grade of Cereals', y = "Count", x = "Cereal Type") + theme_minimal()
Error in ggplot(cereal, aes(x = Type)): object 'cereal' not found
#New: set fill to Product

There are 10 cereal brands of Type A and 10 of Type C, which the bar chart of the Type data confirms. The split between the two types is exactly even. A larger sample would give more precise estimates for the wider cereal population, but these 20 cases are suitable for further examination.

This is the first graph I chose to recreate and refine. Mapping the fill aesthetic to the Product variable displays the cereal brands on the graph alongside their Type A or C grade. It also highlights that no obvious pattern in the brand names or groupings distinguishes the two cereal types, so further exploration is required to establish what sets these brands apart into either type.
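
When building a bar chart like this one, a constant such as an outline colour belongs outside `aes()`, while a data column such as `Product` belongs inside it. Writing `colour = "black"` inside `aes()` makes ggplot2 treat the string as a one-level variable and add a legend entry for it. A minimal sketch of the distinction, using made-up rows rather than the real cereal data:

```r
library(ggplot2)

# Hypothetical rows for illustration only (not the real cereal data)
cereal_demo <- data.frame(
  Product = c("Brand W", "Brand X", "Brand Y", "Brand Z"),
  Type    = c("A", "A", "C", "C")
)

# Counts per type, the same numbers the bar heights encode
type_counts <- table(cereal_demo$Type)
print(type_counts)

p <- ggplot(cereal_demo, aes(x = Type)) +
  # fill varies with the data, so it is mapped inside aes();
  # the outline is a fixed value, so it is set outside aes()
  geom_bar(aes(fill = Product), colour = "black") +
  labs(title = "Type Grade of Cereals", x = "Cereal Type", y = "Count") +
  theme_minimal()

print(p)
```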

Tidying/Wrangling Data and Graphing Bivariate Data

#want to get summary data statistics for sodium content for both types A and C
#Try to filter out each type separately to create new objects to put into vectors to create new data frame to do summary with

sodium_typeA <- filter(cereal, Type == 'A')
Error in filter(cereal, Type == "A"): object 'cereal' not found
sodium_typeC <- filter(cereal, Type == 'C')
Error in filter(cereal, Type == "C"): object 'cereal' not found
#print sodium_typeA to confirm changes

print(sodium_typeA)
Error in print(sodium_typeA): object 'sodium_typeA' not found
#print sodium_typeC to confirm changes
print(sodium_typeC)
Error in print(sodium_typeC): object 'sodium_typeC' not found
Sodium_Content_TypeA <- select(sodium_typeA, 'Sodium' )
Error in select(sodium_typeA, "Sodium"): object 'sodium_typeA' not found
Sodium_Content_TypeC <- select(sodium_typeC, 'Sodium')
Error in select(sodium_typeC, "Sodium"): object 'sodium_typeC' not found
sodium_type_data <- as.data.frame(c(Sodium_Content_TypeA, Sodium_Content_TypeC))
Error in as.data.frame(c(Sodium_Content_TypeA, Sodium_Content_TypeC)): object 'Sodium_Content_TypeA' not found
sodium_type_data <- rename(sodium_type_data, 'Sodium_Content_Type_A' = "Sodium", 'Sodium_Content_Type_C' = "Sodium.1")
Error in rename(sodium_type_data, Sodium_Content_Type_A = "Sodium", Sodium_Content_Type_C = "Sodium.1"): object 'sodium_type_data' not found
#print sodium content data to confirm column name changes
print(sodium_type_data)
Error in print(sodium_type_data): object 'sodium_type_data' not found
#print sodium content typeA to check data intact
print(Sodium_Content_TypeA)
Error in print(Sodium_Content_TypeA): object 'Sodium_Content_TypeA' not found
#print sodium content typeC to confirm data intact
print(Sodium_Content_TypeC)
Error in print(Sodium_Content_TypeC): object 'Sodium_Content_TypeC' not found
#get summary of sodium data for each type from new data frame
summary(sodium_type_data)
Error in summary(sodium_type_data): object 'sodium_type_data' not found
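
The same per-type summary can be produced without building a separate wide data frame. The sketch below groups by `Type` and summarises `Sodium` directly. It assumes the column names used in this post, and the four rows are made-up values, not the real data. One practical difference: combining the two filtered columns with `as.data.frame(c(...))` relies on both types having the same number of rows, whereas grouping does not.

```r
library(dplyr)

# Hypothetical rows for illustration only (not the real cereal data)
cereal_demo <- data.frame(
  Product = c("Brand W", "Brand X", "Brand Y", "Brand Z"),
  Sodium  = c(0, 200, 150, 250),
  Type    = c("A", "A", "C", "C")
)

# One row of summary statistics per cereal type
sodium_summary <- cereal_demo %>%
  group_by(Type) %>%
  summarise(
    n             = n(),
    min_sodium    = min(Sodium),
    median_sodium = median(Sodium),
    mean_sodium   = mean(Sodium),
    max_sodium    = max(Sodium)
  )

print(sodium_summary)
```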
#make boxplot of cereal 'type' and 'sodium' content

ggplot(cereal, aes(Type, Sodium)) + geom_boxplot(color = "black", fill = "white") + geom_point(aes(color = Product, shape = Type)) + labs(title = 'Sodium Content of Type A and C Cereals', y = "Sodium Content (mg/serving)", x = "Cereal Type") + theme_minimal()
Error in ggplot(cereal, aes(Type, Sodium)): object 'cereal' not found
#NEW: add point graph to display cereal brands by type and sodium 
#New: discern products by color aes
#New: discern product type (A or C) by shape aes
#New: change box plot fill color to better display points

To examine sodium content by cereal type, summary statistics were computed and a boxplot was generated to display the sodium content of each type. Compared with the combined sodium data, Type A and Type C show clear differences. Type A ranged from a minimum of 0 mg per serving to a maximum of 340 mg per serving (a range of 340 mg), whereas Type C ranged from 130 mg to 290 mg per serving (a range of 160 mg). Despite its greater range, Type A has a mean of 149 mg per serving, 36 mg lower than the Type C mean of 185 mg per serving.

If we were to judge the nutritional value of each cereal on sodium content alone, we might conclude that the Type A cereals are the healthier option because their sodium content is lower on average. But the Type A mean may be skewed by the wide spread of its values, and it would be more informative to examine additional factors when determining the nutritional quality of cereals. That is, if the Types are a form of nutritional grading, which has proven difficult to confirm.

Regardless, if sodium content is being examined, it would be better to collect more cereal brands, because in a small sample a few extreme values (potential outliers) within a type can skew the estimates of population values.

This is the second graph that I chose to recreate and refine. Learning how to combine multiple geoms on a single plot, alongside aesthetic mappings, allowed me to add a layer of information to the original plot. Specifically, adding a point layer with color mapped to the Product variable displays the cereal brands on the graph in association with their sodium content and Type A or C grade. Each brand can now be pinpointed within the sodium content range and compared by type. Changing the fill of the boxes to white made the product points easier to see.
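
The layering described here can be sketched as follows. This is one possible variation rather than the exact plot above: it assumes the same column names, uses made-up rows, and adds two optional tweaks that are not in the original call (hiding the boxplot's own outlier dots and jittering the points slightly so overlapping brands stay visible).

```r
library(ggplot2)

# Hypothetical rows for illustration only (not the real cereal data)
cereal_demo <- data.frame(
  Product = c("Brand W", "Brand X", "Brand Y", "Brand Z"),
  Sodium  = c(0, 200, 150, 250),
  Type    = c("A", "A", "C", "C")
)

p <- ggplot(cereal_demo, aes(x = Type, y = Sodium)) +
  # First layer: white boxes; outlier.shape = NA avoids drawing
  # outlying cereals twice (once by the box, once as a point)
  geom_boxplot(colour = "black", fill = "white", outlier.shape = NA) +
  # Second layer: one point per cereal; default point shapes take
  # their colour from the colour aesthetic, not from fill
  geom_point(
    aes(colour = Product, shape = Type),
    position = position_jitter(width = 0.1, height = 0, seed = 1)
  ) +
  labs(
    title = "Sodium Content of Type A and C Cereals",
    x = "Cereal Type",
    y = "Sodium Content (mg/serving)"
  ) +
  theme_minimal()

print(p)
```

Layers are drawn in the order they are added, so placing `geom_point()` after `geom_boxplot()` keeps the points on top of the boxes.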
