Data Wrangling
Read the organic egg poultry data set
egg <- read_excel("~/organiceggpoultry.xls", sheet = 1, skip = 4)
head(egg)
# A tibble: 6 x 11
...1 `Extra Large \nDozen` `Extra Large 1/2 Do~` `Large \nDozen`
<chr> <dbl> <dbl> <dbl>
1 Jan 2004 230 132 230
2 February 230 134. 226.
3 March 230 137 225
4 April 234. 137 225
5 May 236 137 225
6 June 241 137 231.
# ... with 7 more variables: `Large \n1/2 Doz.` <dbl>, ...6 <lgl>,
# Whole <dbl>, `B/S Breast` <dbl>, `Bone-in Breast` <chr>,
# `Whole Legs` <dbl>, Thighs <chr>
#colnames(egg)
egg <- select(egg,"...1", "Extra Large \nDozen", "Extra Large 1/2 Doz.\n1/2 Dozen", "Large \nDozen", "Large \n1/2 Doz.")
colnames(egg) <- c("month","extra_large_dozen","extra_large_halfdozen","large_dozen","large_halfdozen")
head(egg, 20)
# A tibble: 20 x 5
month extra_large_doz~ extra_large_hal~ large_dozen large_halfdozen
<chr> <dbl> <dbl> <dbl> <dbl>
1 Jan ~ 230 132 230 126
2 Febr~ 230 134. 226. 128.
3 March 230 137 225 131
4 April 234. 137 225 131
5 May 236 137 225 131
6 June 241 137 231. 134.
7 July 241 137 234. 134.
8 Augu~ 241 137 234. 134.
9 Sept~ 241 136. 234. 130.
10 Octo~ 241 136. 234. 128.
11 Nove~ 241 136. 234. 128.
12 Dece~ 241 136. 234. 128.
13 Jan ~ 241 136. 234. 128.
14 Febr~ 241 136. 234. 128.
15 March 241 136. 234. 128.
16 April 241 136. 234. 128.
17 May 241 136. 234. 128.
18 June 241 136. 234. 128.
19 July 241 136. 234. 128.
20 Augu~ 241 136. 234. 128.
# gsub substitutes the unwanted characters
egg$month <- gsub(" /1", "", as.character(egg$month))
egg <- separate(egg, col=month, into=c('month', 'year'), sep=' ')
#egg <- replace_na(egg, list("year", "/1"))
egg
# A tibble: 120 x 6
month year extra_large_dozen extra_large_halfdozen large_dozen
<chr> <chr> <dbl> <dbl> <dbl>
1 Jan 2004 230 132 230
2 February <NA> 230 134. 226.
3 March <NA> 230 137 225
4 April <NA> 234. 137 225
5 May <NA> 236 137 225
6 June <NA> 241 137 231.
7 July <NA> 241 137 234.
8 August <NA> 241 137 234.
9 September <NA> 241 136. 234.
10 October <NA> 241 136. 234.
# ... with 110 more rows, and 1 more variable: large_halfdozen <dbl>
# fill() populates the cells with the values from the previous cell
clean_eggs <- egg %>%
fill(year)
clean_eggs
# A tibble: 120 x 6
month year extra_large_dozen extra_large_halfdozen large_dozen
<chr> <chr> <dbl> <dbl> <dbl>
1 Jan 2004 230 132 230
2 February 2004 230 134. 226.
3 March 2004 230 137 225
4 April 2004 234. 137 225
5 May 2004 236 137 225
6 June 2004 241 137 231.
7 July 2004 241 137 234.
8 August 2004 241 137 234.
9 September 2004 241 136. 234.
10 October 2004 241 136. 234.
# ... with 110 more rows, and 1 more variable: large_halfdozen <dbl>
We can now study the changes in the prices of eggs throughout the years.
eggs_year <- clean_eggs %>%
group_by(year) %>%
select(ends_with("dozen")) %>%
summarise_all(mean)
eggs_year
# A tibble: 10 x 5
year extra_large_doz~ extra_large_hal~ large_dozen large_halfdozen
<chr> <dbl> <dbl> <dbl> <dbl>
1 2004 237. 136. 230. 130.
2 2005 241 136. 234. 128.
3 2006 241. 136. 234. 128.
4 2007 245. 139. 237. 132.
5 2008 269. 166. 261. 157.
6 2009 286. 186. 275 174.
7 2010 286. 186. 268. 174.
8 2011 286. 186. 269. 174.
9 2012 288. 186. 268. 174.
10 2013 290 188. 268. 178
Plotting a graph to visualize these changes. We see a sudden rise in the prices of eggs in the year 2008 and 2009.
ggplot(eggs_year, aes(x = year)) +
geom_line(aes(y = extra_large_dozen, group = 1), color = "red") +
geom_point(aes(y = extra_large_dozen, group = 1), color = "red") +
geom_line(aes(y = extra_large_halfdozen, group = 1), color = "green") +
geom_point(aes(y = extra_large_halfdozen, group = 1), color = "green") +
geom_line(aes(y = large_dozen, group = 1), color = "blue") +
geom_point(aes(y = large_dozen, group = 1), color = "blue") +
geom_line(aes(y = large_halfdozen, group = 1), color = "purple") +
geom_point(aes(y = large_halfdozen, group = 1), color = "purple") +
labs(x = "Year", y = "Cost") +
ggtitle("Change in Egg Cost")
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Prabhu (2022, Feb. 27). Data Analytics and Computational Social Science: DACSS 601 HW3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomsnehalhw3/
BibTeX citation
@misc{prabhu2022dacss, author = {Prabhu, Snehal}, title = {Data Analytics and Computational Social Science: DACSS 601 HW3}, url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomsnehalhw3/}, year = {2022} }