DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 4

challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
Author

Sanjana Jhaveri

Published

August 18, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
library(readr)
library(knitr)
library(summarytools)
abc_poll_2021 <- read_csv("_data/abc_poll_2021.csv")
# View() opens an interactive viewer, so only call it in an interactive session
if (interactive()) View(abc_poll_2021)

Briefly describe the data

The data set is wide (31 columns), so the first step is to trim the columns I do not need. The columns ppeduc5 and ppeducat look identical, so one of them can be removed. I'm not sure what weight_pid indicates, so I will remove that as well. Column 3 (complete_status) seems irrelevant to the questions I want to explore, so I will drop it too.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
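
Before dropping a column because it looks like a copy of another, it helps to confirm that the two really carry the same information. The sketch below is only an illustration: it uses a small made-up tibble (`toy_poll`) with invented values in place of the real poll file, and borrows just the column names `ppeduc5` and `ppeducat` from the description above. A cross-tabulation shows how the categories of one column map onto the other, and `identical()` gives a strict yes/no answer.

```r
library(dplyr)

# Toy stand-in for the poll data; the values are invented for illustration.
toy_poll <- tibble(
  ppeduc5  = c("High school", "Some college", "Bachelors", "Masters"),
  ppeducat = c("High school", "Some college",
               "Bachelors or higher", "Bachelors or higher")
)

# Cross-tabulate: each row is one observed pairing of the two columns.
pairing <- count(toy_poll, ppeduc5, ppeducat)
print(pairing)

# Strict check: TRUE only if every value matches row by row.
same_values <- identical(toy_poll$ppeduc5, toy_poll$ppeducat)

# Looser check: does each ppeduc5 category map to exactly one ppeducat category?
# If so, ppeducat can be derived from ppeduc5.
maps_cleanly <- n_distinct(pairing$ppeduc5) == nrow(pairing)
```

If the strict check is FALSE but the looser check is TRUE, the second column is a coarser recoding of the first, not a duplicate, and which one to keep depends on the level of detail the analysis needs.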

Code
abc_poll_2021 <- read_csv("_data/abc_poll_2021.csv",
                          skip = 1,  # skip the original header row; new names are supplied below
                          # columns named "delete" are dropped in the next step
                          col_names = c("id", "delete", "delete", "delete", "delete",
                                        "education", "gender", "ethnicity", "household_size", "income",
                                        "marital status", "delete", "delete",
                                        "rental", "state", "employment",
                                        rep("delete", 10),
                                        "feeling", "party", "age_range", "delete", "delete")) %>%
  select(!starts_with("delete")) %>%
  # recode the "Skipped" placeholder as NA, column by column
  # (newer dplyr versions do not accept a whole data frame in na_if())
  mutate(across(where(is.character), ~ na_if(.x, "Skipped"))) %>%
  # shorten the long housing labels
  mutate(rental = fct_recode(rental,
                             "Owned" = "Owned or being bought by you or someone in your household",
                             "Rental" = "Rented for cash",
                             "Other" = "Occupied without payment of cash rent"))

Any additional comments?
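
The tidy step asks for a sanity check on the result. A minimal sketch of what such checks could look like follows. It runs on a tiny invented tibble (`toy_raw`) and not on the real file, so the row and column counts apply to the toy data only. It repeats the two cleaning steps from the pipeline above (turning "Skipped" into `NA` and shortening the `rental` labels) and then asserts that the outcome matches what was anticipated.

```r
library(dplyr)
library(forcats)

# Invented stand-in for a few rows of the poll data.
toy_raw <- tibble(
  id      = 1:4,
  delete  = c("x", "x", "x", "x"),
  feeling = c("Optimistic", "Skipped", "Pessimistic", "Optimistic"),
  rental  = c("Owned or being bought by you or someone in your household",
              "Rented for cash",
              "Occupied without payment of cash rent",
              "Rented for cash")
)

toy_tidy <- toy_raw %>%
  select(!starts_with("delete")) %>%
  # Replace the "Skipped" placeholder with NA in every character column.
  mutate(across(where(is.character), ~ na_if(.x, "Skipped"))) %>%
  mutate(rental = fct_recode(rental,
                             "Owned"  = "Owned or being bought by you or someone in your household",
                             "Rental" = "Rented for cash",
                             "Other"  = "Occupied without payment of cash rent"))

# Sanity checks: each stops with an error if the expectation does not hold.
stopifnot(
  nrow(toy_tidy) == nrow(toy_raw),                          # no rows lost
  ncol(toy_tidy) == 3,                                      # only kept columns remain
  !any(toy_tidy$feeling == "Skipped", na.rm = TRUE),        # placeholder removed
  setequal(levels(toy_tidy$rental), c("Owned", "Rental", "Other"))
)

# Missing values per column, for comparison with a summary table.
missing_counts <- summarise(toy_tidy, across(everything(), ~ sum(is.na(.x))))
```

On the real data, the same assertions would use the anticipated dimensions of the tidied poll in place of the toy numbers.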

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

I only mutated the rental column, recoding its long response labels into shorter, more sensible categories. It now has just three values: Owned, Rental, and Other.

Code
print(summarytools::dfSummary(abc_poll_2021,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

abc_poll_2021

Dimensions: 527 x 13
Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| id [numeric] | Mean (sd) : 7230264 (152.3) | 527 distinct values | 0 (0.0%) |
| | min ≤ med ≤ max: 7230001 ≤ 7230264 ≤ 7230527 | | |
| | IQR (CV) : 263 (0) | | |
| education [character] | 1. Bachelors degree or highe | 207 (39.3%) | 0 (0.0%) |
| | 2. High school | 133 (25.2%) | |
| | 3. Less than high school | 29 (5.5%) | |
| | 4. Some college | 158 (30.0%) | |
| gender [character] | 1. Female | 254 (48.2%) | 0 (0.0%) |
| | 2. Male | 273 (51.8%) | |
| ethnicity [character] | 1. 2+ Races, Non-Hispanic | 21 (4.0%) | 0 (0.0%) |
| | 2. Black, Non-Hispanic | 27 (5.1%) | |
| | 3. Hispanic | 51 (9.7%) | |
| | 4. Other, Non-Hispanic | 24 (4.6%) | |
| | 5. White, Non-Hispanic | 404 (76.7%) | |
| household_size [character] | 1. 1 | 80 (15.2%) | 0 (0.0%) |
| | 2. 2 | 219 (41.6%) | |
| | 3. 3 | 102 (19.4%) | |
| | 4. 4 | 76 (14.4%) | |
| | 5. 5 | 35 (6.6%) | |
| | 6. 6 or more | 15 (2.8%) | |
| income [character] | 1. $10,000 to $24,999 | 32 (6.1%) | 0 (0.0%) |
| | 2. $100,000 to $149,999 | 105 (19.9%) | |
| | 3. $150,000 or more | 137 (26.0%) | |
| | 4. $25,000 to $49,999 | 82 (15.6%) | |
| | 5. $50,000 to $74,999 | 85 (16.1%) | |
| | 6. $75,000 to $99,999 | 69 (13.1%) | |
| | 7. Less than $10,000 | 17 (3.2%) | |
| marital status [character] | 1. Divorced | 43 (8.2%) | 0 (0.0%) |
| | 2. Never married | 111 (21.1%) | |
| | 3. Now Married | 337 (63.9%) | |
| | 4. Separated | 8 (1.5%) | |
| | 5. Widowed | 28 (5.3%) | |
| rental [factor] | 1. Other | 10 (1.9%) | 0 (0.0%) |
| | 2. Owned | 406 (77.0%) | |
| | 3. Rental | 111 (21.1%) | |
| state [character] | 1. California | 51 (9.7%) | 0 (0.0%) |
| | 2. Texas | 42 (8.0%) | |
| | 3. Florida | 34 (6.5%) | |
| | 4. Pennsylvania | 28 (5.3%) | |
| | 5. Illinois | 23 (4.4%) | |
| | 6. New Jersey | 21 (4.0%) | |
| | 7. Ohio | 21 (4.0%) | |
| | 8. Michigan | 18 (3.4%) | |
| | 9. New York | 18 (3.4%) | |
| | 10. Washington | 18 (3.4%) | |
| | [ 39 others ] | 253 (48.0%) | |
| employment [character] | 1. Currently laid off | 13 (2.5%) | 0 (0.0%) |
| | 2. Employed full-time (by so | 220 (41.7%) | |
| | 3. Employed part-time (by so | 31 (5.9%) | |
| | 4. Full Time Student | 8 (1.5%) | |
| | 5. Homemaker | 37 (7.0%) | |
| | 6. On furlough | 1 (0.2%) | |
| | 7. Other | 20 (3.8%) | |
| | 8. Retired | 165 (31.3%) | |
| | 9. Self-employed | 32 (6.1%) | |
| feeling [character] | 1. Optimistic | 229 (43.7%) | 3 (0.6%) |
| | 2. Pessimistic | 295 (56.3%) | |
| party [character] | 1. A Democrat | 176 (33.6%) | 3 (0.6%) |
| | 2. A Republican | 152 (29.0%) | |
| | 3. An Independent | 168 (32.1%) | |
| | 4. Something else | 28 (5.3%) | |
| age_range [character] | 1. 18-29 | 60 (11.4%) | 0 (0.0%) |
| | 2. 30-49 | 148 (28.1%) | |
| | 3. 50-64 | 157 (29.8%) | |
| | 4. 65+ | 162 (30.7%) | |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20
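
The challenge prompt also asks whether any variables should become factors and be reordered. In the summary above, `income` is stored as character, so its brackets sort alphabetically ("$10,000 to $24,999" comes first and "Less than $10,000" last). One possible follow-up mutation, sketched here on a few invented rows (`toy_income`), is to convert it to an ordered factor whose levels run from lowest to highest. The level labels are copied from the summary; the step itself is a suggestion and was not part of the original analysis.

```r
library(dplyr)

# Income brackets in ascending order; labels copied from the summary table.
income_levels <- c("Less than $10,000",
                   "$10,000 to $24,999",
                   "$25,000 to $49,999",
                   "$50,000 to $74,999",
                   "$75,000 to $99,999",
                   "$100,000 to $149,999",
                   "$150,000 or more")

# A few invented rows standing in for the tidied poll data.
toy_income <- tibble(
  id     = 1:4,
  income = c("$150,000 or more", "Less than $10,000",
             "$50,000 to $74,999", "$10,000 to $24,999")
)

toy_income <- toy_income %>%
  mutate(income = factor(income, levels = income_levels, ordered = TRUE))

# Sanity check: a mistyped level label would silently produce NA.
stopifnot(!anyNA(toy_income$income))

# Sorting now follows bracket order and not alphabetical order.
sorted_ids <- arrange(toy_income, income)$id
```

The same pattern could be applied to other ordered categories in the table, such as `education` or `age_range`.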

Any additional comments?
