DACSS 601 HW3

Data Wrangling

Snehal Prabhu
2/20/2022

R Markdown

Read the organic egg poultry data set

egg <- read_excel("~/organiceggpoultry.xls", sheet = 1, skip = 4)
head(egg)
# A tibble: 6 x 11
  ...1     `Extra Large \nDozen` `Extra Large 1/2 Do~` `Large \nDozen`
  <chr>                    <dbl>                 <dbl>           <dbl>
1 Jan 2004                  230                   132             230 
2 February                  230                   134.            226.
3 March                     230                   137             225 
4 April                     234.                  137             225 
5 May                       236                   137             225 
6 June                      241                   137             231.
# ... with 7 more variables: `Large \n1/2 Doz.` <dbl>, ...6 <lgl>,
#   Whole <dbl>, `B/S Breast` <dbl>, `Bone-in Breast` <chr>,
#   `Whole Legs` <dbl>, Thighs <chr>
#colnames(egg)
egg <- select(egg,"...1", "Extra Large \nDozen", "Extra Large 1/2 Doz.\n1/2 Dozen", "Large \nDozen", "Large \n1/2 Doz.")
colnames(egg) <- c("month","extra_large_dozen","extra_large_halfdozen","large_dozen","large_halfdozen")
head(egg, 20)
# A tibble: 20 x 5
   month extra_large_doz~ extra_large_hal~ large_dozen large_halfdozen
   <chr>            <dbl>            <dbl>       <dbl>           <dbl>
 1 Jan ~             230              132         230             126 
 2 Febr~             230              134.        226.            128.
 3 March             230              137         225             131 
 4 April             234.             137         225             131 
 5 May               236              137         225             131 
 6 June              241              137         231.            134.
 7 July              241              137         234.            134.
 8 Augu~             241              137         234.            134.
 9 Sept~             241              136.        234.            130.
10 Octo~             241              136.        234.            128.
11 Nove~             241              136.        234.            128.
12 Dece~             241              136.        234.            128.
13 Jan ~             241              136.        234.            128.
14 Febr~             241              136.        234.            128.
15 March             241              136.        234.            128.
16 April             241              136.        234.            128.
17 May               241              136.        234.            128.
18 June              241              136.        234.            128.
19 July              241              136.        234.            128.
20 Augu~             241              136.        234.            128.
# gsub substitutes the unwanted characters 
egg$month <- gsub(" /1", "", as.character(egg$month))
egg <- separate(egg, col=month, into=c('month', 'year'), sep=' ')
#egg <- replace_na(egg, list("year", "/1"))
egg
# A tibble: 120 x 6
   month     year  extra_large_dozen extra_large_halfdozen large_dozen
   <chr>     <chr>             <dbl>                 <dbl>       <dbl>
 1 Jan       2004               230                   132         230 
 2 February  <NA>               230                   134.        226.
 3 March     <NA>               230                   137         225 
 4 April     <NA>               234.                  137         225 
 5 May       <NA>               236                   137         225 
 6 June      <NA>               241                   137         231.
 7 July      <NA>               241                   137         234.
 8 August    <NA>               241                   137         234.
 9 September <NA>               241                   136.        234.
10 October   <NA>               241                   136.        234.
# ... with 110 more rows, and 1 more variable: large_halfdozen <dbl>
# fill() populates the cells with the values from the previous cell
clean_eggs <- egg %>% 
  fill(year)
clean_eggs
# A tibble: 120 x 6
   month     year  extra_large_dozen extra_large_halfdozen large_dozen
   <chr>     <chr>             <dbl>                 <dbl>       <dbl>
 1 Jan       2004               230                   132         230 
 2 February  2004               230                   134.        226.
 3 March     2004               230                   137         225 
 4 April     2004               234.                  137         225 
 5 May       2004               236                   137         225 
 6 June      2004               241                   137         231.
 7 July      2004               241                   137         234.
 8 August    2004               241                   137         234.
 9 September 2004               241                   136.        234.
10 October   2004               241                   136.        234.
# ... with 110 more rows, and 1 more variable: large_halfdozen <dbl>

We can now study the changes in the prices of eggs throughout the years.

eggs_year <- clean_eggs %>%
  group_by(year) %>%
  select(ends_with("dozen")) %>%
  summarise_all(mean)

eggs_year
# A tibble: 10 x 5
   year  extra_large_doz~ extra_large_hal~ large_dozen large_halfdozen
   <chr>            <dbl>            <dbl>       <dbl>           <dbl>
 1 2004              237.             136.        230.            130.
 2 2005              241              136.        234.            128.
 3 2006              241.             136.        234.            128.
 4 2007              245.             139.        237.            132.
 5 2008              269.             166.        261.            157.
 6 2009              286.             186.        275             174.
 7 2010              286.             186.        268.            174.
 8 2011              286.             186.        269.            174.
 9 2012              288.             186.        268.            174.
10 2013              290              188.        268.            178 

Plotting a graph to visualize these changes. We see a sudden rise in the prices of eggs in the year 2008 and 2009.

ggplot(eggs_year, aes(x = year)) +
  geom_line(aes(y = extra_large_dozen, group = 1), color = "red") +
  geom_point(aes(y = extra_large_dozen, group = 1), color = "red") +
  geom_line(aes(y = extra_large_halfdozen, group = 1), color = "green") +
  geom_point(aes(y = extra_large_halfdozen, group = 1), color = "green") +
  geom_line(aes(y = large_dozen, group = 1), color = "blue") +
  geom_point(aes(y = large_dozen, group = 1), color = "blue") +
  geom_line(aes(y = large_halfdozen, group = 1), color = "purple") +
  geom_point(aes(y = large_halfdozen, group = 1), color = "purple") +
  labs(x = "Year", y = "Cost") +
  ggtitle("Change in Egg Cost")

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Prabhu (2022, Feb. 27). Data Analytics and Computational Social Science: DACSS 601 HW3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomsnehalhw3/

BibTeX citation

@misc{prabhu2022dacss,
  author = {Prabhu, Snehal},
  title = {Data Analytics and Computational Social Science: DACSS 601 HW3},
  url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomsnehalhw3/},
  year = {2022}
}