HW3
Below is the example we did in class with some extra notes.
# OPTION 1: select names of column names that you want - not good for lots of data
trains_path <- paste(data_path, "StateCounty2012.xls", sep="")
trains <- read_excel(trains_path, skip=3) %>%
select(STATE, COUNTY, TOTAL)
# further improvements - insert column names and call certain ones delete to remove later
Notes, select is for columns and filter is for rows
# OPTION 2: name columns to make it easy to delete them - be sure to skip the old col names
trains <- read_excel(trains_path, skip=4,
col_names = c("State", "D1", "County", "D2", "Total")) %>%
select(!starts_with("d")) %>%
filter(!str_detect(State, "[Tt]otal")) #will look for both cases Total or total
trains
# A tibble: 2,933 x 3
State County Total
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# ... with 2,923 more rows
Now we want to take a unstructured dataset and convert it to “tidy” data. This example will use the organiceggpoultry.xlsx found in the "_data" folder. The first step is to define what the end result data should look like. We can figure this out by defining the case. A case encompasses the things that uniquely identify a value. For the Organic Poultry dataset, the case is the product, year, and month. For this particular dataset, it makes sense to further break down the product into two groups, chicken and egg products. The value for these two groups is the price for some quantity.
First let us read in the data by defining the data path from our base path, "C:/Users/ajb22/Documents/school/dacss_601/DACSS601Fall21/_data/".
# here we
# 1. skip the empty rows and the inaccurate header row
# 2. we rename the first column with the header "egg_month"
eggs_path <- paste(data_path, "organiceggpoultry.xls", sep="")
eggs_dirty <- read_excel(eggs_path, sheet=1, skip=4) %>%
rename(egg_month = 1)
# show the raw data
print(eggs_dirty, width = Inf)
# A tibble: 120 x 11
egg_month `Extra Large \nDozen` `Extra Large 1/2 Doz.\n1/2 Dozen`
<chr> <dbl> <dbl>
1 Jan 2004 230 132
2 February 230 134.
3 March 230 137
4 April 234. 137
5 May 236 137
6 June 241 137
7 July 241 137
8 August 241 137
9 September 241 136.
10 October 241 136.
`Large \nDozen` `Large \n1/2 Doz.` ...6 Whole `B/S Breast`
<dbl> <dbl> <lgl> <dbl> <dbl>
1 230 126 NA 198. 646.
2 226. 128. NA 198. 642.
3 225 131 NA 209 642.
4 225 131 NA 212 642.
5 225 131 NA 214. 642.
6 231. 134. NA 216. 641
7 234. 134. NA 217 642.
8 234. 134. NA 217 642.
9 234. 130. NA 217 642.
10 234. 128. NA 217 642.
`Bone-in Breast` `Whole Legs` Thighs
<chr> <dbl> <chr>
1 too few 194. too few
2 too few 194. 203
3 too few 194. 203
4 too few 194. 203
5 too few 194. 203
6 too few 202. 200.375
7 390.5 204. 199.5
8 390.5 204. 199.5
9 390.5 204. 199.5
10 390.5 204. 199.5
# ... with 110 more rows
After successfully reading in the data, it is time to start cleaning up the data. A good first step is to delete any unnecessary columns. First we will delete the columns that only contain “NA” as the data using the comparison of the amount of empty rows to the total number of rows.
# we then take out any columns that have only NA - compare #NA to #rows
eggs_dirty <- eggs_dirty %>% select_if(!colSums(is.na(eggs_dirty)) == nrow(eggs_dirty))
eggs_dirty
# A tibble: 120 x 10
egg_month `Extra Large \nDozen` `Extra Large 1/2 D~ `Large \nDozen`
<chr> <dbl> <dbl> <dbl>
1 Jan 2004 230 132 230
2 February 230 134. 226.
3 March 230 137 225
4 April 234. 137 225
5 May 236 137 225
6 June 241 137 231.
7 July 241 137 234.
8 August 241 137 234.
9 September 241 136. 234.
10 October 241 136. 234.
# ... with 110 more rows, and 6 more variables:
# Large
1/2 Doz. <dbl>, Whole <dbl>, B/S Breast <dbl>,
# Bone-in Breast <chr>, Whole Legs <dbl>, Thighs <chr>
Next we make everything the same data type, so that we can pivot. Then we pivot the columns to make them longer and to contain the products as a variable. Then we replace any response that is not valid, such as “too few” with 0.
# convert all types to characters so we can pivot, then we pivot the products to price and replace all of the "too few" responses in price to 0
eggs_dirty <- eggs_dirty %>% mutate(across(where(is.double), as.character)) %>%
pivot_longer(cols = contains("doz"),
names_to = "egg products",
values_to = "price per carton") %>%
pivot_longer(cols=contains(c("whole", "breast", "leg", "thigh")),
names_to = "chicken products",
values_to = "price per lb") %>%
mutate(`price per carton` = str_replace(`price per carton`, "[a-zA-Z ]+", "0")) %>%
mutate(`price per lb`= str_replace(`price per lb`, "[a-zA-Z ]+", "0"))
eggs_dirty
# A tibble: 2,400 x 5
egg_month `egg products` `price per carto~ `chicken produc~
<chr> <chr> <chr> <chr>
1 Jan 2004 "Extra Large \nDozen" 230 Whole
2 Jan 2004 "Extra Large \nDozen" 230 Whole Legs
3 Jan 2004 "Extra Large \nDozen" 230 B/S Breast
4 Jan 2004 "Extra Large \nDozen" 230 Bone-in Breast
5 Jan 2004 "Extra Large \nDozen" 230 Thighs
6 Jan 2004 "Extra Large 1/2 Doz.~ 132 Whole
7 Jan 2004 "Extra Large 1/2 Doz.~ 132 Whole Legs
8 Jan 2004 "Extra Large 1/2 Doz.~ 132 B/S Breast
9 Jan 2004 "Extra Large 1/2 Doz.~ 132 Bone-in Breast
10 Jan 2004 "Extra Large 1/2 Doz.~ 132 Thighs
# ... with 2,390 more rows, and 1 more variable: price per lb <chr>
Now we separate out the month and the year within the first column to make it into two separate columns.
# take out any of the "/1" that got filled in instead of the year (month (o)year|/1) to (month (o)year), then separate the column egg_month to month and year
eggs_dirty <- eggs_dirty %>%
mutate(egg_month = str_remove(egg_month, "/[0-9]+")) %>%
separate(egg_month, c("month", "year"),
extra = "drop", fill = "right")
# filling in the year from the jan month to the rest of the months, default direction is down
eggs_dirty <- eggs_dirty %>% fill(year)
eggs_dirty
# A tibble: 2,400 x 6
month year `egg products` `price per carto~ `chicken produc~
<chr> <chr> <chr> <chr> <chr>
1 Jan 2004 "Extra Large \nDoze~ 230 Whole
2 Jan 2004 "Extra Large \nDoze~ 230 Whole Legs
3 Jan 2004 "Extra Large \nDoze~ 230 B/S Breast
4 Jan 2004 "Extra Large \nDoze~ 230 Bone-in Breast
5 Jan 2004 "Extra Large \nDoze~ 230 Thighs
6 Jan 2004 "Extra Large 1/2 Do~ 132 Whole
7 Jan 2004 "Extra Large 1/2 Do~ 132 Whole Legs
8 Jan 2004 "Extra Large 1/2 Do~ 132 B/S Breast
9 Jan 2004 "Extra Large 1/2 Do~ 132 Bone-in Breast
10 Jan 2004 "Extra Large 1/2 Do~ 132 Thighs
# ... with 2,390 more rows, and 1 more variable: price per lb <chr>
Now we can return the value of price back to the double data type instead of a character. We can further process this data by breaking up the products. However, this might be excessive, since there is not a main pattern between the product types.
# further separate the products into type and amount?? not sure
eggs_clean <- eggs_dirty %>%
mutate(`price per carton`, `price per carton`= as.double(`price per carton`)) %>%
mutate(`price per lb`, `price per lb`= as.double(`price per lb`))
# show the clean data
eggs_clean
# A tibble: 2,400 x 6
month year `egg products` `price per carto~ `chicken produc~
<chr> <chr> <chr> <dbl> <chr>
1 Jan 2004 "Extra Large \nDoze~ 230 Whole
2 Jan 2004 "Extra Large \nDoze~ 230 Whole Legs
3 Jan 2004 "Extra Large \nDoze~ 230 B/S Breast
4 Jan 2004 "Extra Large \nDoze~ 230 Bone-in Breast
5 Jan 2004 "Extra Large \nDoze~ 230 Thighs
6 Jan 2004 "Extra Large 1/2 Do~ 132 Whole
7 Jan 2004 "Extra Large 1/2 Do~ 132 Whole Legs
8 Jan 2004 "Extra Large 1/2 Do~ 132 B/S Breast
9 Jan 2004 "Extra Large 1/2 Do~ 132 Bone-in Breast
10 Jan 2004 "Extra Large 1/2 Do~ 132 Thighs
# ... with 2,390 more rows, and 1 more variable: price per lb <dbl>
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Beach (2021, Oct. 1). DACSS 601 Fall 2021: HW3 with the Organic Poultry Dataset. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-10-01-hw3-allyson-beach/
BibTeX citation
@misc{beach2021hw3, author = {Beach, Allyson}, title = {DACSS 601 Fall 2021: HW3 with the Organic Poultry Dataset}, url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-10-01-hw3-allyson-beach/}, year = {2021} }