---
title: "Challenge 4 Submission"
author: "Tanmay Agrawal"
description: "More data wrangling: mutate"
date: "1/11/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
    df-print: paged
categories:
  - challenge_4
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.csv⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions ⭐⭐⭐⭐⭐
```{r}
data <- read_csv("_data/hotel_bookings.csv")
nrow(data)
head(data)
```
### Briefly describe the data
The dataset is a record of customer bookings for a chain of hotels across the world. It contains a total of `119390` booking records, and the columns describe attributes of each booking that are largely self-explanatory.
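As a quick, optional sanity check on that description, we can glimpse each column's name and type as read in by `read_csv`:

```{r}
# sanity check: column names and types of the tibble read above
glimpse(data)
```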
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
The data isn't already tidy. We observe a lot of `NULL` values in the dataset during manual inspection, as well as just by looking at the first few rows with the `head` command above. Since I have worked with this dataset before, I expect the columns `agent` and `company` to have a lot of garbage values. The catch is that the `NULL` values are stored as literal strings rather than as a separate datatype (such as `NA`), so we can use the `stringr` library to find those strings and get rid of them.
First we rank the columns by the percentage of garbage values in them.
```{r}
# we use the stringr package for this
library(stringr)
# get the null percentages
# coerce each column to character so str_detect() also works on numeric and date columns
null_percentages <- sapply(data, function(x) sum(str_detect(as.character(x), "NULL")) / length(x))
# put it into a tibble so that it's visually appealing
null_tibble <- tibble(column = names(data), null_percentage = null_percentages)
null_tibble %>%
  arrange(desc(null_percentage))
```
Since >94% of the values under `company` are NULL, we can drop that column. We can keep `agent` because it might still have some explanatory power.
```{r}
# drop company
data <- data %>% select(-company)
```
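As an optional sanity check (not strictly needed for the analysis), we can confirm the column is gone:

```{r}
# sanity check: company should no longer appear among the column names
"company" %in% names(data)
```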
Any additional comments?
The data is cleaner now: we have a tibble with only the necessary columns, in formats that are easier to work with during analysis.
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
- We want to change the `"NULL"` values under the column `agent` to an integer such as `0`. We will replace the string `"NULL"` with `"0"` and then convert the whole column to an integer type, which works because every remaining value is a natural number. This also makes sense semantically: each value in the column is a positive integer identifying an agent, so `0` can mean that the booking was made without an agent. This could be useful information.
- We can change the names of the arrival date months to their month numbers, making the data types in the arrival date columns consistent.
Document your work here.
```{r}
# change "NULL" under agent to 0 then convert the datatype to integer
data <- data %>% mutate(agent = as.integer(str_replace(agent, "NULL", "0")))
head(data)
# create a vector of month names and their corresponding numbers
month_names <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
month_numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
# use the match() function to replace the month names with their corresponding numbers
data$arrival_date_month <- month_numbers[match(data$arrival_date_month, month_names)]
head(data$arrival_date_month)
```
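As an optional sanity check on both mutations, we can confirm that `agent` is now an integer column and that `arrival_date_month` only contains month numbers. (As an aside, base R's built-in `month.name` constant matches the hand-written `month_names` vector above and could be used instead.)

```{r}
# sanity check the mutations above
class(data$agent)              # should be "integer"
sum(is.na(data$agent))         # conversion should not have introduced NAs
range(data$arrival_date_month) # should fall within 1..12
sum(data$agent == 0)           # bookings recorded without an agent
```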
Any additional comments?
We now have a clean dataset: we applied several techniques to mutate and wrangle the data, making it more appropriate and valuable for a variety of downstream purposes such as plotting, EDA, machine learning, and analytics.