Challenge 4 Instructions

Categories: challenge_4, abc_poll, eggs, fed_rates, hotel_bookings, debt
Author: Meredith Rolfe
Published: August 18, 2022

Code
library(tidyverse)
library(lubridate)
library(readxl)
library(ggplot2)
library(tidyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command (a brief sketch matching each file type to its reader function follows the list).

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
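
Each file extension above maps to its own reader: readr (loaded with the tidyverse) handles delimited text such as .csv, while readxl handles Excel workbooks (.xls/.xlsx). The following is a minimal sketch of the matching commands; the object names are only for illustration, and the paths assume the challenge's _data folder.

Code
# Sketch only: which reader function goes with which file format.
# readr::read_csv() for comma-separated text, readxl::read_excel() for Excel workbooks.
abc_poll       <- read_csv("_data/abc_poll.csv")
fed_rates      <- read_csv("_data/FedFundsRate.csv")
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
poultry        <- read_excel("_data/poultry_tidy.xlsx")
debt           <- read_excel("_data/debt_in_trillions.xlsx")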
Code
# Read the Excel file into a dataframe
poultry_data <- read_excel("_data/poultry_tidy.xlsx")

# Printing the top 5 rows
head(poultry_data, 5)
# A tibble: 5 × 4
  Product  Year Month    Price_Dollar
  <chr>   <dbl> <chr>           <dbl>
1 Whole    2013 January          2.38
2 Whole    2013 February         2.38
3 Whole    2013 March            2.38
4 Whole    2013 April            2.38
5 Whole    2013 May              2.38
Code
dim(poultry_data)
[1] 600   4
Code
# Getting some initial statistics
unique(poultry_data$Product)
[1] "Whole"          "B/S Breast"     "Bone-in Breast" "Whole Legs"    
[5] "Thighs"        
Code
min(poultry_data$Year)
[1] 2004
Code
max(poultry_data$Year)
[1] 2013

Briefly describe the data

The data consists of monthly prices, in dollars, for different poultry products (whole birds, breasts, thighs, etc.) from 2004 through 2013. The dataset has 600 rows and 4 columns: Product, Year, Month, and Price_Dollar.
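
As a quick check on that description (a minimal sketch using the poultry_data object read above), the 600 rows should break down as 5 products observed monthly over 10 years.

Code
# Sketch: verify that 5 products x 10 years x 12 months accounts for all 600 rows.
poultry_data %>%
  count(Product, Year) %>%
  summarise(
    n_products = n_distinct(Product),  # expect 5
    n_years    = n_distinct(Year),     # expect 10 (2004-2013)
    total_rows = sum(n)                # expect 600
  )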

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
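
Before mutating anything, here is a tiny sanity-check sketch (added for illustration): tidying should only add a derived year-month column, so the row count must stay at 600 and the column count should go from 4 to 5.

Code
# Sanity-check sketch: shape of the data before tidying.
stopifnot(nrow(poultry_data) == 600, ncol(poultry_data) == 4)
# After adding the Date column below, we expect 600 rows and 5 columns.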

Code
# Transforming the Year and Month columns into a year-month "Date" column
poultry_data$Date <- format(as.Date(paste(poultry_data$Year, poultry_data$Month, "01"), "%Y %B %d"), "%Y-%m")

head(poultry_data, 5)
# A tibble: 5 × 5
  Product  Year Month    Price_Dollar Date   
  <chr>   <dbl> <chr>           <dbl> <chr>  
1 Whole    2013 January          2.38 2013-01
2 Whole    2013 February         2.38 2013-02
3 Whole    2013 March            2.38 2013-03
4 Whole    2013 April            2.38 2013-04
5 Whole    2013 May              2.38 2013-05
Code
# Get rows with missing values in the "Price_Dollar" column
poultry_data[is.na(poultry_data$Price_Dollar), ]
# A tibble: 7 × 5
  Product         Year Month    Price_Dollar Date   
  <chr>          <dbl> <chr>           <dbl> <chr>  
1 Bone-in Breast  2004 January            NA 2004-01
2 Bone-in Breast  2004 February           NA 2004-02
3 Bone-in Breast  2004 March              NA 2004-03
4 Bone-in Breast  2004 April              NA 2004-04
5 Bone-in Breast  2004 May                NA 2004-05
6 Bone-in Breast  2004 June               NA 2004-06
7 Thighs          2004 January            NA 2004-01
Code
# Replace missing Price_Dollar values with 0
poultry_data <- poultry_data %>%
  mutate(
    Price_Dollar = ifelse(is.na(Price_Dollar), 0 , Price_Dollar)
  )

# Grouping data by Product types
poultry_grouped <- poultry_data %>%
  group_by(Product) %>%
  summarise(
    price_mean = mean(Price_Dollar),
    price_median = median(Price_Dollar)
  )

poultry_grouped
# A tibble: 5 × 3
  Product        price_mean price_median
  <chr>               <dbl>        <dbl>
1 B/S Breast           6.55         6.46
2 Bone-in Breast       3.71         3.90
3 Thighs               2.16         2.22
4 Whole                2.31         2.35
5 Whole Legs           2.03         2.04
Code
ggplot(poultry_grouped) + 
  geom_point(aes(x=Product, y=price_mean)) +
  ylab("Price Mean (in Dollars)") + 
  xlab("Product type") + 
  ggtitle("Graph to compare product type to mean price by product")

Code
# Boxplot to find price variations in each product type
ggplot(poultry_data, aes(x=Product, y=Price_Dollar, group=Product)) + 
  geom_boxplot() +
  ylab("Price (in dollars)") + 
  ggtitle("Boxplot for product type price comparison")

Code
# Data grouped by date
poultry_grouped_date <- poultry_data %>%
  group_by(Date) %>%
  summarise(
    price_mean = mean(Price_Dollar),
    price_median = median(Price_Dollar)
  )

poultry_grouped_date
# A tibble: 120 × 3
   Date    price_mean price_median
   <chr>        <dbl>        <dbl>
 1 2004-01       2.07         1.94
 2 2004-02       2.47         1.98
 3 2004-03       2.50         2.03
 4 2004-04       2.50         2.03
 5 2004-05       2.51         2.03
 6 2004-06       2.52         2.02
 7 2004-07       3.31         2.17
 8 2004-08       3.31         2.17
 9 2004-09       3.31         2.17
10 2004-10       3.31         2.17
# … with 110 more rows
Code
options(repr.plot.width = 6, repr.plot.height =3)

graph <- ggplot(poultry_grouped_date, aes(x=Date, y=price_mean)) + 
  geom_point() + 
  ylab("Price Mean (in Dollars)") + 
  theme_minimal() +
  theme(text = element_text(size = 6), line = element_line(size = 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  ggtitle("Price mean per month (2004 - 2013)")

graph

Code
filter(poultry_data, Year == 2013 & Month == "November")
# A tibble: 5 × 5
  Product         Year Month    Price_Dollar Date   
  <chr>          <dbl> <chr>           <dbl> <chr>  
1 Whole           2013 November         2.38 2013-11
2 B/S Breast      2013 November         7.04 2013-11
3 Bone-in Breast  2013 November         3.90 2013-11
4 Whole Legs      2013 November         2.04 2013-11
5 Thighs          2013 November         2.16 2013-11

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

For this data, I used the Month and Year values to create a new “Date” column, and then grouped the data by that column and by Product to compute mean and median prices. The resulting graphs are not especially informative on their own, but the same grouping approach could be much more informative on a larger or richer dataset.
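
One further mutation worth flagging (a hedged sketch, not something used in the analysis above): the new “Date” column is stored as character text, so if date arithmetic or a time axis were needed it could be rebuilt as a proper Date, and Product could be converted to a factor ordered by mean price to make the earlier plots easier to read.

Code
# Sketch: a Date-class column via lubridate and a Product factor ordered by mean price.
# Works on the poultry_data object created above; assigned to a new object to leave it unchanged.
poultry_data_mutated <- poultry_data %>%
  mutate(
    Date_parsed = lubridate::my(paste(Month, Year)),                   # "January 2013" -> 2013-01-01
    Product = forcats::fct_reorder(Product, Price_Dollar, .fun = mean) # order levels by mean price
  )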

Document your work here.

Code
# Reading the debt data into a dataframe
data <- read_excel("_data/debt_in_trillions.xlsx")
data <- data %>%
  separate(`Year and Quarter`, into = c("Year", "Quarter"), sep = ":")
data
# A tibble: 74 × 9
   Year  Quarter Mortgage `HE Revolving` `Auto Loan` Credi…¹ Stude…² Other Total
   <chr> <chr>      <dbl>          <dbl>       <dbl>   <dbl>   <dbl> <dbl> <dbl>
 1 03    Q1          4.94          0.242       0.641   0.688   0.241 0.478  7.23
 2 03    Q2          5.08          0.26        0.622   0.693   0.243 0.486  7.38
 3 03    Q3          5.18          0.269       0.684   0.693   0.249 0.477  7.56
 4 03    Q4          5.66          0.302       0.704   0.698   0.253 0.449  8.07
 5 04    Q1          5.84          0.328       0.72    0.695   0.260 0.446  8.29
 6 04    Q2          5.97          0.367       0.743   0.697   0.263 0.423  8.46
 7 04    Q3          6.21          0.426       0.751   0.706   0.33  0.41   8.83
 8 04    Q4          6.36          0.468       0.728   0.717   0.346 0.423  9.04
 9 05    Q1          6.51          0.502       0.725   0.71    0.364 0.394  9.21
10 05    Q2          6.70          0.528       0.774   0.717   0.374 0.402  9.49
# … with 64 more rows, and abbreviated variable names ¹​`Credit Card`,
#   ²​`Student Loan`
Code
# Group data by year, and get total and mean debt per year
data_grouped <- data %>%
  group_by(Year) %>%
  summarise(total_debt_by_year = sum(Total),
            mean_debt_by_year = mean(Total))

# Plot the mean debt per year
ggplot(data_grouped, aes(x=Year, y=mean_debt_by_year, group=1)) + ylab("Mean debt (in Trillions)") + 
  geom_point() + 
  geom_line() + 
  ggtitle("Year vs Mean debt (in Trillions) ")

Code
data_grouped
# A tibble: 19 × 3
   Year  total_debt_by_year mean_debt_by_year
   <chr>              <dbl>             <dbl>
 1 03                  30.2              7.56
 2 04                  34.6              8.66
 3 05                  38.5              9.62
 4 06                  43.6             10.9 
 5 07                  47.8             12.0 
 6 08                  50.5             12.6 
 7 09                  49.4             12.3 
 8 10                  47.6             11.9 
 9 11                  46.7             11.7 
10 12                  45.5             11.4 
11 13                  45.2             11.3 
12 14                  46.8             11.7 
13 15                  47.9             12.0 
14 16                  49.5             12.4 
15 17                  51.7             12.9 
16 18                  53.6             13.4 
17 19                  55.6             13.9 
18 20                  57.5             14.4 
19 21                  29.6             14.8 
Code
# Group data by year and get mean value statistics for each debt type
data_by_debt_type <- data %>%
   group_by(Year) %>% 
   summarise(mort_mean = mean(Mortgage),
             herev_mean = mean(`HE Revolving`),
             auto_mean = mean(`Auto Loan`),
             cred_mean = mean(`Credit Card`),
             stud_mean = mean(`Student Loan`))

data_by_debt_type
# A tibble: 19 × 6
   Year  mort_mean herev_mean auto_mean cred_mean stud_mean
   <chr>     <dbl>      <dbl>     <dbl>     <dbl>     <dbl>
 1 03         5.22      0.268     0.663     0.693     0.246
 2 04         6.09      0.397     0.736     0.704     0.300
 3 05         6.80      0.534     0.780     0.724     0.377
 4 06         7.87      0.595     0.806     0.746     0.450
 5 07         8.79      0.626     0.808     0.804     0.524
 6 08         9.26      0.685     0.804     0.853     0.604
 7 09         9.00      0.710     0.742     0.818     0.688
 8 10         8.65      0.680     0.707     0.742     0.777
 9 11         8.43      0.633     0.721     0.697     0.859
10 12         8.10      0.584     0.759     0.676     0.935
11 13         7.93      0.539     0.829     0.671     1.02 
12 14         8.14      0.517     0.917     0.677     1.13 
13 15         8.20      0.497     1.02      0.708     1.20 
14 16         8.39      0.477     1.12      0.742     1.28 
15 17         8.74      0.45      1.20      0.798     1.36 
16 18         9.05      0.426     1.25      0.839     1.43 
17 19         9.41      0.398     1.31      0.881     1.49 
18 20         9.85      0.368     1.36      0.834     1.54 
19 21        10.3       0.328     1.40      0.778     1.58 
Code
# Preprocess the data to pivot it for graphing
col_names <- names(data_by_debt_type)
col_names <- col_names[!col_names %in% c("Year")]

col_names 
[1] "mort_mean"  "herev_mean" "auto_mean"  "cred_mean"  "stud_mean" 
Code
# Pivot the data so each case is the mean of one debt type for a given year
pivoted_debt_data <- pivot_longer(data_by_debt_type, cols = all_of(col_names),
                          names_to = "debt_type",
                          values_to = "debt_mean_value")

# Graphing the data
ggplot(pivoted_debt_data, aes(x = Year)) +
  geom_line(aes(y = debt_mean_value, group=debt_type, color=debt_type)) + 
  xlab("Year") + 
  ylab("Debt mean (in trillions) ") + 
  ggtitle("Mean debt value across years, per debt type") + 
  guides(color = guide_legend(title="Debt type"))

Code
pivoted_debt_data
# A tibble: 95 × 3
   Year  debt_type  debt_mean_value
   <chr> <chr>                <dbl>
 1 03    mort_mean            5.22 
 2 03    herev_mean           0.268
 3 03    auto_mean            0.663
 4 03    cred_mean            0.693
 5 03    stud_mean            0.246
 6 04    mort_mean            6.09 
 7 04    herev_mean           0.397
 8 04    auto_mean            0.736
 9 04    cred_mean            0.704
10 04    stud_mean            0.300
# … with 85 more rows

Any additional comments?

We can see that mortgage debt has consistently been the largest debt category across all the years covered by the data; it is almost five times the average of the other debt types.
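
As a follow-up (a sketch built on the pivoted_debt_data object above, with illustrative object names), the comparison can be quantified by computing, for each year, the ratio of mean mortgage debt to the average of the other debt types.

Code
# Sketch: per-year ratio of mean mortgage debt to the mean of the other categories.
mortgage_ratio <- pivoted_debt_data %>%
  group_by(Year) %>%
  summarise(
    mortgage_mean = debt_mean_value[debt_type == "mort_mean"],
    other_mean    = mean(debt_mean_value[debt_type != "mort_mean"]),
    ratio         = mortgage_mean / other_mean
  )

mortgage_ratio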
