Two examples of reading in Excel Tables with the advanced data sets.
This post will take a closer look at some tools that can be used to read in tabular data (i.e., tables) that are often published based on government sources or by other organizations. Tabular data is often made available in Excel format (.xls or .xlsx) and is formatted for ease of reading - but this can make it tricky to read into R and reshape into a usable dataset.
Reading in tabular data will follow the same general work flow or work process regardless of formatting differences. We will work through the steps in detail below, but here is an overview. Note that not every step is needed for every file.
filter
(and stringr
package)mutate
new variables as required, using separate
, pivot_longer`, etcThe railroad data set is a fairly straightforward formatted table published by the Railroad Retirement Board. The value variable is a count of the number of employees in each county and state combination.
Looking at the excel file, we can see that there are only a few issues: 1. There are three rows at the top of the sheet that are not needed 2. There are blank columns that are not needed. 3. There are Total rows for each state that are not needed
For the first issue, we use the “skip” option on read_excel
from the readxl
package to skip the rows at the top.
read_excel("../../_data/StateCounty2012.xls",
skip = 3)
# A tibble: 2,990 × 5
STATE ...2 COUNTY ...4 TOTAL
<chr> <lgl> <chr> <lgl> <dbl>
1 AE NA APO NA 2
2 AE Total1 NA <NA> NA 2
3 AK NA ANCHORAGE NA 7
4 AK NA FAIRBANKS NORTH STAR NA 2
5 AK NA JUNEAU NA 3
6 AK NA MATANUSKA-SUSITNA NA 2
7 AK NA SITKA NA 1
8 AK NA SKAGWAY MUNICIPALITY NA 88
9 AK Total NA <NA> NA 103
10 AL NA AUTAUGA NA 102
# … with 2,980 more rows
For the second issue, I name the blank columns “delete” to make is easy to remove the unwanted columns. I then use select
(with the ! sign to designate the complement or NOT) to select columns we wish to keep in the dataset - the rest are removed. Note that I skip 4 rows this time as I do not need the original header row.
There are other approaches you could use for this task (e.g., remove all columns that have no valid volues), but hard coding of variable names and types during data read in is not considered a violation of best practices and - if used strategically - can often make later data cleaning much easier.
read_excel("../../_data/StateCounty2012.xls",
skip = 4,
col_names= c("State", "delete", "County", "delete", "Employees"))%>%
select(!contains("delete"))
# A tibble: 2,990 × 3
State County Employees
<chr> <chr> <dbl>
1 AE APO 2
2 AE Total1 <NA> 2
3 AK ANCHORAGE 7
4 AK FAIRBANKS NORTH STAR 2
5 AK JUNEAU 3
6 AK MATANUSKA-SUSITNA 2
7 AK SITKA 1
8 AK SKAGWAY MUNICIPALITY 88
9 AK Total <NA> 103
10 AL AUTAUGA 102
# … with 2,980 more rows
For the third issue, we are going to use filter
to identify (and drop the rows that have the word “Total” in the State column). str_detect
can be used to find specific rows within a column that have the designated “pattern”, while the “!” designates the complement of the selected rows (i.e., those without the “pattern” we are searching for.)
The str_detect
command is from the stringr
package, and is a powerful and easy to use implementation of grep and regex in the tidyverse - the base R functions (grep, gsub, etc) are classic but far more difficult to use, particularly for those not in practice. Be sure to explore the stringr
package on your own.
railroad<-read_excel("../../_data/StateCounty2012.xls",
skip = 4,
col_names= c("State", "delete", "County", "delete", "Employees"))%>%
select(!contains("delete"))%>%
filter(!str_detect(State, "Total"))
railroad
# A tibble: 2,933 × 3
State County Employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,923 more rows
Tables often have notes in the last few table rows. You can check table limits and use this information during data read-in to not read the notes by setting the n-max
option at the total number of rows to read, or less commonly, the range
option to specify the spreadsheet range in standard excel naming (e.g., “B4:R142”). If you didn’t handle this on read in, you can use the tail
command to check for notes and either tail
or head
to keep only the rows that you need.
tail(railroad, 10)
# A tibble: 10 × 3
State County Employees
<chr> <chr> <dbl>
1 WY PLATTE 129
2 WY SHERI… 252
3 WY SUBLE… 3
4 WY SWEET… 196
5 WY UINTA 49
6 WY WASHA… 10
7 WY WESTON 37
8 CANADA <NA> 662
9 1 Military designation. <NA> NA
10 NOTE: Excludes 2,896 employees without an address. <NA> NA
# A tibble: 10 × 3
State County Employees
<chr> <chr> <dbl>
1 WY NIOBRARA 51
2 WY PARK 29
3 WY PLATTE 129
4 WY SHERIDAN 252
5 WY SUBLETTE 3
6 WY SWEETWATER 196
7 WY UINTA 49
8 WY WASHAKIE 10
9 WY WESTON 37
10 CANADA <NA> 662
And that is all it takes! The data are now ready for analysis. For example, suppose we wished to recover the information about state totals. This is easy to do using group_by
.
# A tibble: 54 × 2
State `State Employees`
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CANADA 662
9 CO 3650
10 CT 2592
# … with 44 more rows
This is another government published tabular data source. In 2017, Australia conducted a postal survey to gauge citizens’ opinions towards same sex marriage. The survey questions was straightforward: “Should the law be changed to allow same-sex couples to marry?” Here is a quick image showing the original table format.
While similar in some respect to the State Railroad data above, the Australian survey data are clearly more complex in several respects. - There are two values (vote count and percentage) in the dataset - The values appear to be redundent (percentage is easy to recover from vote count data) - There may be other redundant information in - Grouped information instead of individual observations where variables appear elsewhere - Redundent vvariables (Total:Response Clear on the left and ResponeClear)
If we decide to temporarily ignore the proportions data (as suggested above) and the “totals” columns, we can identify four potentially distinct pieces of information in addition to the vote count columns: County, Division, response(yes, no, response unclear, and non-responding) and vote count. Our goal is to create this desirable data set.
We will once again use skip
and col_names to read in the data, select
to get rid of unneeded columns, and filter
to get rid of unneeded rows. We also use the drop_na
function to filter unwanted rows.
votes <- read_excel("../../_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
sheet="Table 2",
skip=7,
col_names = c("Town", "Yes", "d", "No", rep("d", 6), "Illegible", "d", "No Response", rep("d", 3)))%>%
select(!starts_with("d"))%>%
drop_na(Town)%>%
filter(!str_detect(Town, "(Total)"))%>%
filter(!str_starts(Town, "\\("))
votes
# A tibble: 160 × 5
Town Yes No Illegible `No Response`
<chr> <dbl> <dbl> <dbl> <dbl>
1 New South Wales Divisions NA NA NA NA
2 Banks 37736 46343 247 20928
3 Barton 37153 47984 226 24008
4 Bennelong 42943 43215 244 19973
5 Berowra 48471 40369 212 16038
6 Blaxland 20406 57926 220 25883
7 Bradfield 53681 34927 202 17261
8 Calare 54091 35779 285 25342
9 Chifley 32871 46702 263 28180
10 Cook 47505 38804 229 18713
# … with 150 more rows
At this point, you can see we are REALLY close. We have yes and no variable plus illegible and no response. That said, tthe current step is more complicated. Each observation (county) needs a variable for administrative “division”, but this is displayed at the top of each block.
The following code uses case_when to make a new “Divisions” variables with an entry (e.g., New South Wales Division) where there is a Division name in the town column, and otherwise create just an empty space.
At that point, the following loop (with seq_along) can be used to fill in empty spaces with the most recent Divisions name, and then filter out rows with only the title information.
votes<- votes%>%
mutate(Divisions = case_when(
str_ends(Town, "Divisions") ~ Town,
TRUE ~ NA_character_
))
for(i in seq_along(votes$Divisions)){
votes$Divisions[i]<-ifelse(is.na(votes$Divisions[i]),votes$Divisions[i-1], votes$Divisions[i])
}
votes<- filter(votes,!str_detect(Town, "Divisions|Australia"))
votes
# A tibble: 150 × 6
Town Yes No Illegible `No Response` Divisions
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Banks 37736 46343 247 20928 New South Wales Divi…
2 Barton 37153 47984 226 24008 New South Wales Divi…
3 Bennelong 42943 43215 244 19973 New South Wales Divi…
4 Berowra 48471 40369 212 16038 New South Wales Divi…
5 Blaxland 20406 57926 220 25883 New South Wales Divi…
6 Bradfield 53681 34927 202 17261 New South Wales Divi…
7 Calare 54091 35779 285 25342 New South Wales Divi…
8 Chifley 32871 46702 263 28180 New South Wales Divi…
9 Cook 47505 38804 229 18713 New South Wales Divi…
10 Cowper 57493 38317 315 25197 New South Wales Divi…
# … with 140 more rows
Supposed we wanted to create a stacked bar chart to compare the % who votes Yes to the people who either said No or didn’t vote. The easiest way is to pivot longer into the original data format: State, Division, surveyResponse, count.
votes%>%
pivot_longer(
cols = Yes:`No Response`,
names_to = "Response",
values_to = "Count"
)
# A tibble: 600 × 4
Town Divisions Response Count
<chr> <chr> <chr> <dbl>
1 Banks New South Wales Divisions Yes 37736
2 Banks New South Wales Divisions No 46343
3 Banks New South Wales Divisions Illegible 247
4 Banks New South Wales Divisions No Response 20928
5 Barton New South Wales Divisions Yes 37153
6 Barton New South Wales Divisions No 47984
7 Barton New South Wales Divisions Illegible 226
8 Barton New South Wales Divisions No Response 24008
9 Bennelong New South Wales Divisions Yes 42943
10 Bennelong New South Wales Divisions No 43215
# … with 590 more rows
Special thanks to Karl, Shih-Yen, Mohit, and the other students in the advanced group for allowing me to use their blog submissions as a starting point for this demonstration!
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Rolfe (2021, Aug. 18). DACSS 601 August 2021: Reading in Tabular Data. Retrieved from https://mrolfe.github.io/DACSS601August2021/posts/2021-08-18-reading-in-tabular-data/
BibTeX citation
@misc{rolfe2021reading, author = {Rolfe, Meredith}, title = {DACSS 601 August 2021: Reading in Tabular Data}, url = {https://mrolfe.github.io/DACSS601August2021/posts/2021-08-18-reading-in-tabular-data/}, year = {2021} }