reading in 2 clean data sets and one untidy dataset
This is an example of reading in 3 files of data:
The first step is verifying that working directory and library directory are set to the same path.
The next step is to install the necessary packages and load the necessary libraries.
Then load in the railroad datasets.
There are 2 bits of data manipuation performed on the data.
- ARRANGE()
- FILTER()
To confirm: The data in both railroad files is:
# Verify library path
.libPaths()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"
# set the working directory to be the same as the library path
setwd("C:/Users/theresa/Documents/R/win-library/4.1")
# verify the working directory
getwd()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error.
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
package 'tidyverse' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\theresa\AppData\Local\Temp\Rtmpwp0pBK\downloaded_packages
install.packages("readxl", repos = "http://cran.us.r-project.org")
package 'readxl' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\theresa\AppData\Local\Temp\Rtmpwp0pBK\downloaded_packages
# load the necessary libraries for the processing
library(tidyverse)
library(dbplyr)
library(readxl)
library(readr)
# Load in the files and display them for clarification.
railroad_csv <- read_csv("c:/users/theresa/Documents/DACSS Local/DataSets/railroad_2012_clean_state.csv", show_col_types = TRUE, skip=0)
railroad_exl <- read_excel("c:/users/theresa/Documents/DACSS Local/DataSets/railroad_2012_clean_state.xlsx", skip = 0)
railroad_csv
# A tibble: 53 x 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# ... with 43 more rows
railroad_exl
# A tibble: 53 x 2
state total_employees
<chr> <dbl>
1 AE 2
2 AK 103
3 AL 4257
4 AP 1
5 AR 3871
6 AZ 3153
7 CA 13137
8 CO 3650
9 CT 2592
10 DC 279
# ... with 43 more rows
# Arrange the resulting CSV by number of employees - Demonstration of the use of ARRANGE()
by_num_employee <- railroad_csv %>% arrange(desc(total_employees))
# Display the new order
by_num_employee
# A tibble: 53 x 2
state total_employees
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# ... with 43 more rows
# See only MA data - Demonstrate the use of of FILTER()
Massachusetts <- railroad_csv %>% filter(state == 'MA')
Massachusetts
# A tibble: 1 x 2
state total_employees
<chr> <dbl>
1 MA 3379
# multi dplyr functions in 1 statement
highest_employee_count <- railroad_csv %>%
filter(total_employees > 10000) %>%
arrange(state)
highest_employee_count
# A tibble: 6 x 2
state total_employees
<chr> <dbl>
1 CA 13137
2 IL 19131
3 NE 13176
4 NY 17050
5 PA 12769
6 TX 19839
This is loading in the Australian Marriage Data file with multiple tabs. I did this with a user defined function and lapply. The data in these files is all loaded as characters. This loaded all 4 sheets. Not ideal because the first and last sheets do not contain data.
# create a function that will take in the file and load the separate sheets into tibbles.
multiplesheets <- function(fname) {
# getting info about all excel sheet tab names
sheet_names <- readxl::excel_sheets(fname)
# display the tab names for easy understanding
print(sheet_names)
# load the data from each sheet into its own data set using lapply.
data_sheets <- lapply(sheet_names, function(x)
{
readxl::read_excel(fname, sheet = x)
}
)
list_data <- lapply(data_sheets, as.data.frame)
# display the data in all the generated data sheets
print(data_sheets)
}
# specifying the path and file named to be used in my function.
# execute the function
path <- "c:/users/theresa/Documents/DACSS Local/DataSets/australian_marriage_law_postal_survey_2017_-_response_final_mod.xls"
multiplesheets(path)
[1] "Contents" "Table 1" "Table 2"
[4] "Explanatory Notes"
[[1]]
# A tibble: 23 x 3
`Australian Bureau of Statistics` ...2 ...3
<chr> <chr> <chr>
1 1800.0 Australian Marriage Law Postal Survey, 2017 <NA> <NA>
2 Released on 15 November 2017 <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> Contents <NA>
5 <NA> Tables <NA>
6 <NA> Table 1 Resp~
7 <NA> Table 2 Resp~
8 <NA> <NA> <NA>
9 <NA> Explanato~ <NA>
10 <NA> <NA> <NA>
# ... with 13 more rows
[[2]]
# A tibble: 21 x 16
`Australian Burea~` ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
1 1800.0 Australian ~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
2 Released on 15 Nov~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
3 Table 1 Response b~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
4 <NA> Resp~ <NA> <NA> <NA> <NA> <NA> NA Elig~
5 <NA> Yes <NA> No <NA> Total <NA> NA Resp~
6 Country no. % no. % no. % NA no.
7 New South Wales 2374~ 57.7~ 1736~ 42.2~ 4111~ 100 NA 4111~
8 Victoria 2145~ 64.9~ 1161~ 35.1~ 3306~ 100 NA 3306~
9 Queensland 1487~ 60.7~ 9610~ 39.2~ 2448~ 100 NA 2448~
10 South Australia 5925~ 62.5 3562~ 37.5 9487~ 100 NA 9487~
# ... with 11 more rows, and 7 more variables: ...10 <chr>,
# ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
# ...16 <chr>
[[3]]
# A tibble: 190 x 16
`Australian Burea~` ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
1 1800.0 Australian ~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
2 Released on 15 Nov~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
3 Table 2 Response b~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
4 <NA> Resp~ <NA> <NA> <NA> <NA> <NA> NA Elig~
5 <NA> Yes <NA> No <NA> Total <NA> NA Resp~
6 Region no. % no. % no. % NA no.
7 New South Wales Di~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
8 Banks 37736 44.8~ 46343 55.1~ 84079 100 NA 84079
9 Barton 37153 43.6~ 47984 56.3~ 85137 100 NA 85137
10 Bennelong 42943 49.7~ 43215 50.2~ 86158 100 NA 86158
# ... with 180 more rows, and 7 more variables: ...10 <chr>,
# ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
# ...16 <chr>
[[4]]
# A tibble: 31 x 2
`Australian Bureau of Statistics` ...2
<chr> <chr>
1 1800.0 Australian Marriage Law Postal Survey, 2017 <NA>
2 Released on 15 November 2017 <NA>
3 <NA> <NA>
4 <NA> Explanatory Not~
5 <NA> <NA>
6 <NA> Australian Marr~
7 <NA> <NA>
8 <NA> <NA>
9 <NA> Definitions
10 <NA> <NA>
# ... with 21 more rows
This is loading only the 2 sheets within the Australian Marriage Data file that contain data. Then take the resulting data and perform a multi-step clean up using:
# getting info about all excel sheets
sheet_names <- readxl::excel_sheets(path)
print(sheet_names)
[1] "Contents" "Table 1" "Table 2"
[4] "Explanatory Notes"
# moving data from just the interesting sheets to their own data tables
table_1_data <- readxl::read_excel(path, sheet = "Table 1")
table_2_data <- readxl::read_excel(path, sheet = "Table 2")
# Print the data as loaded to check the values
print(table_1_data)
# A tibble: 21 x 16
`Australian Burea~` ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
1 1800.0 Australian ~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
2 Released on 15 Nov~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
3 Table 1 Response b~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
4 <NA> Resp~ <NA> <NA> <NA> <NA> <NA> NA Elig~
5 <NA> Yes <NA> No <NA> Total <NA> NA Resp~
6 Country no. % no. % no. % NA no.
7 New South Wales 2374~ 57.7~ 1736~ 42.2~ 4111~ 100 NA 4111~
8 Victoria 2145~ 64.9~ 1161~ 35.1~ 3306~ 100 NA 3306~
9 Queensland 1487~ 60.7~ 9610~ 39.2~ 2448~ 100 NA 2448~
10 South Australia 5925~ 62.5 3562~ 37.5 9487~ 100 NA 9487~
# ... with 11 more rows, and 7 more variables: ...10 <chr>,
# ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
# ...16 <chr>
print(table_2_data)
# A tibble: 190 x 16
`Australian Burea~` ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
1 1800.0 Australian ~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
2 Released on 15 Nov~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
3 Table 2 Response b~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
4 <NA> Resp~ <NA> <NA> <NA> <NA> <NA> NA Elig~
5 <NA> Yes <NA> No <NA> Total <NA> NA Resp~
6 Region no. % no. % no. % NA no.
7 New South Wales Di~ <NA> <NA> <NA> <NA> <NA> <NA> NA <NA>
8 Banks 37736 44.8~ 46343 55.1~ 84079 100 NA 84079
9 Barton 37153 43.6~ 47984 56.3~ 85137 100 NA 85137
10 Bennelong 42943 49.7~ 43215 50.2~ 86158 100 NA 86158
# ... with 180 more rows, and 7 more variables: ...10 <chr>,
# ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
# ...16 <chr>
# Beginning the clean up process
# 1) Remove the rows with no values using the complete.cases command.
# Print the data after the removal of the empty rows.
table_1_data <- table_1_data[complete.cases(table_1_data[ , 3]),]
print(table_1_data)
# A tibble: 10 x 16
`Australian Burea~` ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
1 Country no. % no. % no. % NA no.
2 New South Wales 2374~ 57.7~ 1736~ 42.2~ 4111~ 100 NA 4111~
3 Victoria 2145~ 64.9~ 1161~ 35.1~ 3306~ 100 NA 3306~
4 Queensland 1487~ 60.7~ 9610~ 39.2~ 2448~ 100 NA 2448~
5 South Australia 5925~ 62.5 3562~ 37.5 9487~ 100 NA 9487~
6 Western Australia 8015~ 63.7~ 4559~ 36.2~ 1257~ 100 NA 1257~
7 Tasmania 1919~ 63.6~ 1096~ 36.3~ 3016~ 100 NA 3016~
8 Northern Territory~ 48686 60.6~ 31690 39.3~ 80376 100 NA 80376
9 Australian Capital~ 1754~ 74 61520 26 2369~ 100 NA 2369~
10 Australia 7817~ 61.6~ 4873~ 38.3~ 1269~ 100 NA 1269~
# ... with 7 more variables: ...10 <chr>, ...11 <chr>, ...12 <chr>,
# ...13 <chr>, ...14 <chr>, ...15 <chr>, ...16 <chr>
table_2_data <- table_2_data[complete.cases(table_2_data[ , 3]),]
print(table_2_data)
# A tibble: 160 x 16
`Australian Burea~` ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
1 Region no. % no. % no. % NA no.
2 Banks 37736 44.8~ 46343 55.1~ 84079 100 NA 84079
3 Barton 37153 43.6~ 47984 56.3~ 85137 100 NA 85137
4 Bennelong 42943 49.7~ 43215 50.2~ 86158 100 NA 86158
5 Berowra 48471 54.6~ 40369 45.3~ 88840 100 NA 88840
6 Blaxland 20406 26.1~ 57926 73.9~ 78332 100 NA 78332
7 Bradfield 53681 60.6~ 34927 39.3~ 88608 100 NA 88608
8 Calare 54091 60.2~ 35779 39.7~ 89870 100 NA 89870
9 Chifley 32871 41.2~ 46702 58.7~ 79573 100 NA 79573
10 Cook 47505 55 38804 45 86309 100 NA 86309
# ... with 150 more rows, and 7 more variables: ...10 <chr>,
# ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
# ...16 <chr>
# 2) Remove the empty column which happens to be column 8 on both sheets
table_1_data <- select(table_1_data, -8)
table_2_data <- select(table_2_data, -8)
# 3) Rename the column headers to make the data clear.
names(table_1_data) <- c('Region 1','Num Yes 1' , '% Yes 1', 'Num No 1', '% No 1', 'Num Total 1', '% Total 1','Clear Response 1', '% Clear Response 1', 'Not Clear Response 1', '% Not Clear Response 1','Total Response 1', '% Total 1')
names(table_2_data) <- c('Region 2','Num Yes 2' , '% Yes 2', 'Num No 2', '% No 2', 'Num Total 2', '% Total 2','Clear Response 2', '% Clear Response 2', 'Not Clear Response 2', '% Not Clear Response 2', 'Not Responding 2', '% Not Responding 2', 'Total Response 2', '% Total 2')
# Print the data after the removal of column and renaming of the columns.
print(table_1_data)
# A tibble: 10 x 15
`Region 1` `Num Yes 1` `% Yes 1` `Num No 1` `% No 1` `Num Total 1`
<chr> <chr> <chr> <chr> <chr> <chr>
1 Country no. % no. % no.
2 New South ~ 2374362 57.79999~ 1736838 42.2000~ 4111200
3 Victoria 2145629 64.90000~ 1161098 35.1000~ 3306727
4 Queensland 1487060 60.70000~ 961015 39.2999~ 2448075
5 South Aust~ 592528 62.5 356247 37.5 948775
6 Western Au~ 801575 63.70000~ 455924 36.2999~ 1257499
7 Tasmania 191948 63.60000~ 109655 36.3999~ 301603
8 Northern T~ 48686 60.60000~ 31690 39.3999~ 80376
9 Australian~ 175459 74 61520 26 236979
10 Australia 7817247 61.60000~ 4873987 38.3999~ 12691234
# ... with 9 more variables: `% Total 1` <chr>,
# `Clear Response 1` <chr>, `% Clear Response 1` <chr>,
# `Not Clear Response 1` <chr>, `% Not Clear Response 1` <chr>,
# `Total Response 1` <chr>, `% Total 1` <chr>, `` <chr>, `` <chr>
print(table_2_data)
# A tibble: 160 x 15
`Region 2` `Num Yes 2` `% Yes 2` `Num No 2` `% No 2` `Num Total 2`
<chr> <chr> <chr> <chr> <chr> <chr>
1 Region no. % no. % no.
2 Banks 37736 44.899999~ 46343 55.1000~ 84079
3 Barton 37153 43.600000~ 47984 56.3999~ 85137
4 Bennelong 42943 49.799999~ 43215 50.2000~ 86158
5 Berowra 48471 54.600000~ 40369 45.3999~ 88840
6 Blaxland 20406 26.100000~ 57926 73.9000~ 78332
7 Bradfield 53681 60.600000~ 34927 39.3999~ 88608
8 Calare 54091 60.200000~ 35779 39.7999~ 89870
9 Chifley 32871 41.299999~ 46702 58.7000~ 79573
10 Cook 47505 55 38804 45 86309
# ... with 150 more rows, and 9 more variables: `% Total 2` <chr>,
# `Clear Response 2` <chr>, `% Clear Response 2` <chr>,
# `Not Clear Response 2` <chr>, `% Not Clear Response 2` <chr>,
# `Not Responding 2` <chr>, `% Not Responding 2` <chr>,
# `Total Response 2` <chr>, `% Total 2` <chr>
# 3) remove the total lines on table 2
# Having trouble getting this to work.
# table_2_data %>% filter(grepl('total', 1))
# remove any rows with the word "Total" in them
# table_2_data <- (!grep("Total", table_2_data))
# print(table_2_data)
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
TLamkin (2022, Feb. 13). Data Analytics and Computational Social Science: HW2. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin862725/
BibTeX citation
@misc{tlamkin2022hw2, author = {TLamkin, }, title = {Data Analytics and Computational Social Science: HW2}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin862725/}, year = {2022} }