Data Analytics and Computational Social Science: HW2

TLamkin

INTRODUCTION

This is an example of reading in 3 files of data:

Clean CSV
Clean Excel
Uncleaned Excel

Load Necessary Packages & Libraries and Load Clean CSV & Excel

The first step is verifying that working directory and library directory are set to the same path.

The next step is to install the necessary packages and load the necessary libraries.

Then load in the railroad datasets.

There are 2 bits of data manipuation performed on the data.

- ARRANGE()
- FILTER()

To confirm: The data in both railroad files is:

Column 1 - character
Column 2 - Double class (dbl - double percision float)

# Verify library path

.libPaths()

[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"

# set the working directory to be the same as the library path

setwd("C:/Users/theresa/Documents/R/win-library/4.1")

# verify the working directory

getwd()

[1] "C:/Users/theresa/Documents/R/win-library/4.1"

# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error. 

install.packages("tidyverse", repos = "http://cran.us.r-project.org")

package 'tidyverse' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\theresa\AppData\Local\Temp\Rtmpwp0pBK\downloaded_packages

install.packages("readxl", repos = "http://cran.us.r-project.org")

package 'readxl' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\theresa\AppData\Local\Temp\Rtmpwp0pBK\downloaded_packages

# load the necessary libraries for the processing

library(tidyverse)
library(dbplyr)
library(readxl)
library(readr)

# Load in the files and display them for clarification. 

railroad_csv <- read_csv("c:/users/theresa/Documents/DACSS Local/DataSets/railroad_2012_clean_state.csv", show_col_types = TRUE, skip=0)
railroad_exl <- read_excel("c:/users/theresa/Documents/DACSS Local/DataSets/railroad_2012_clean_state.xlsx", skip = 0)
railroad_csv

# A tibble: 53 x 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                103
 3 AL               4257
 4 AP                  1
 5 AR               3871
 6 AZ               3153
 7 CA              13137
 8 CO               3650
 9 CT               2592
10 DC                279
# ... with 43 more rows

railroad_exl

# A tibble: 53 x 2
   state total_employees
   <chr>           <dbl>
 1 AE                  2
 2 AK                103
 3 AL               4257
 4 AP                  1
 5 AR               3871
 6 AZ               3153
 7 CA              13137
 8 CO               3650
 9 CT               2592
10 DC                279
# ... with 43 more rows

# Arrange the resulting CSV by number of employees - Demonstration of the use of ARRANGE()

by_num_employee <- railroad_csv %>% arrange(desc(total_employees))

# Display the new order 

by_num_employee

# A tibble: 53 x 2
   state total_employees
   <chr>           <dbl>
 1 TX              19839
 2 IL              19131
 3 NY              17050
 4 NE              13176
 5 CA              13137
 6 PA              12769
 7 OH               9056
 8 GA               8605
 9 IN               8537
10 MO               8419
# ... with 43 more rows

# See only MA data - Demonstrate the use of of FILTER()

Massachusetts <- railroad_csv %>% filter(state == 'MA')

Massachusetts

# A tibble: 1 x 2
  state total_employees
  <chr>           <dbl>
1 MA               3379

# multi dplyr functions in 1 statement

highest_employee_count <- railroad_csv %>% 
    filter(total_employees > 10000) %>%
    arrange(state)

highest_employee_count

# A tibble: 6 x 2
  state total_employees
  <chr>           <dbl>
1 CA              13137
2 IL              19131
3 NE              13176
4 NY              17050
5 PA              12769
6 TX              19839

Load Unclean Excel Approach 1 - Load All Sheets - Data not cleaned After Loading

This is loading in the Australian Marriage Data file with multiple tabs. I did this with a user defined function and lapply. The data in these files is all loaded as characters. This loaded all 4 sheets. Not ideal because the first and last sheets do not contain data.

# create a function that will take in the file and load the separate sheets into tibbles.  

multiplesheets <- function(fname) {
   
# getting info about all excel sheet tab names
  
  sheet_names <- readxl::excel_sheets(fname)
  
# display the tab names for easy understanding
  
  print(sheet_names)
  
# load the data from each sheet into its own data set using lapply. 

  data_sheets <- lapply(sheet_names, function(x) 
          { 
          readxl::read_excel(fname, sheet = x)
          }
          )
 
 list_data <- lapply(data_sheets, as.data.frame)

# display the data in all the generated data sheets
  
  print(data_sheets)

}
  
# specifying the path and file named to be used in my function.  
# execute the function


path <- "c:/users/theresa/Documents/DACSS Local/DataSets/australian_marriage_law_postal_survey_2017_-_response_final_mod.xls"
multiplesheets(path)

[1] "Contents"          "Table 1"           "Table 2"          
[4] "Explanatory Notes"
[[1]]
# A tibble: 23 x 3
   `Australian Bureau of Statistics`                  ...2       ...3 
   <chr>                                              <chr>      <chr>
 1 1800.0 Australian Marriage Law Postal Survey, 2017 <NA>       <NA> 
 2 Released on 15 November 2017                       <NA>       <NA> 
 3 <NA>                                               <NA>       <NA> 
 4 <NA>                                               Contents   <NA> 
 5 <NA>                                               Tables     <NA> 
 6 <NA>                                               Table 1    Resp~
 7 <NA>                                               Table 2    Resp~
 8 <NA>                                               <NA>       <NA> 
 9 <NA>                                               Explanato~ <NA> 
10 <NA>                                               <NA>       <NA> 
# ... with 13 more rows

[[2]]
# A tibble: 21 x 16
   `Australian Burea~` ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 1800.0 Australian ~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 2 Released on 15 Nov~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 3 Table 1 Response b~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 4 <NA>                Resp~ <NA>  <NA>  <NA>  <NA>  <NA>  NA    Elig~
 5 <NA>                Yes   <NA>  No    <NA>  Total <NA>  NA    Resp~
 6 Country             no.   %     no.   %     no.   %     NA    no.  
 7 New South Wales     2374~ 57.7~ 1736~ 42.2~ 4111~ 100   NA    4111~
 8 Victoria            2145~ 64.9~ 1161~ 35.1~ 3306~ 100   NA    3306~
 9 Queensland          1487~ 60.7~ 9610~ 39.2~ 2448~ 100   NA    2448~
10 South Australia     5925~ 62.5  3562~ 37.5  9487~ 100   NA    9487~
# ... with 11 more rows, and 7 more variables: ...10 <chr>,
#   ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
#   ...16 <chr>

[[3]]
# A tibble: 190 x 16
   `Australian Burea~` ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 1800.0 Australian ~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 2 Released on 15 Nov~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 3 Table 2 Response b~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 4 <NA>                Resp~ <NA>  <NA>  <NA>  <NA>  <NA>  NA    Elig~
 5 <NA>                Yes   <NA>  No    <NA>  Total <NA>  NA    Resp~
 6 Region              no.   %     no.   %     no.   %     NA    no.  
 7 New South Wales Di~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 8 Banks               37736 44.8~ 46343 55.1~ 84079 100   NA    84079
 9 Barton              37153 43.6~ 47984 56.3~ 85137 100   NA    85137
10 Bennelong           42943 49.7~ 43215 50.2~ 86158 100   NA    86158
# ... with 180 more rows, and 7 more variables: ...10 <chr>,
#   ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
#   ...16 <chr>

[[4]]
# A tibble: 31 x 2
   `Australian Bureau of Statistics`                  ...2            
   <chr>                                              <chr>           
 1 1800.0 Australian Marriage Law Postal Survey, 2017 <NA>            
 2 Released on 15 November 2017                       <NA>            
 3 <NA>                                               <NA>            
 4 <NA>                                               Explanatory Not~
 5 <NA>                                               <NA>            
 6 <NA>                                               Australian Marr~
 7 <NA>                                               <NA>            
 8 <NA>                                               <NA>            
 9 <NA>                                               Definitions     
10 <NA>                                               <NA>            
# ... with 21 more rows

Load and Clean Only Select Sheets Within Unclean Excel

This is loading only the 2 sheets within the Australian Marriage Data file that contain data. Then take the resulting data and perform a multi-step clean up using:

complete.case to remove rows with NA
SELECT() to remove the column with no data
C() to rename the columns

# getting info about all excel sheets

  sheet_names <- readxl::excel_sheets(path)
  print(sheet_names)

[1] "Contents"          "Table 1"           "Table 2"          
[4] "Explanatory Notes"

# moving data from just the interesting sheets to their own data tables
  
  table_1_data <- readxl::read_excel(path, sheet = "Table 1")
  table_2_data <- readxl::read_excel(path, sheet = "Table 2")
  
# Print the data as loaded to check the values
  
  print(table_1_data)

# A tibble: 21 x 16
   `Australian Burea~` ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 1800.0 Australian ~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 2 Released on 15 Nov~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 3 Table 1 Response b~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 4 <NA>                Resp~ <NA>  <NA>  <NA>  <NA>  <NA>  NA    Elig~
 5 <NA>                Yes   <NA>  No    <NA>  Total <NA>  NA    Resp~
 6 Country             no.   %     no.   %     no.   %     NA    no.  
 7 New South Wales     2374~ 57.7~ 1736~ 42.2~ 4111~ 100   NA    4111~
 8 Victoria            2145~ 64.9~ 1161~ 35.1~ 3306~ 100   NA    3306~
 9 Queensland          1487~ 60.7~ 9610~ 39.2~ 2448~ 100   NA    2448~
10 South Australia     5925~ 62.5  3562~ 37.5  9487~ 100   NA    9487~
# ... with 11 more rows, and 7 more variables: ...10 <chr>,
#   ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
#   ...16 <chr>

  print(table_2_data)

# A tibble: 190 x 16
   `Australian Burea~` ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 1800.0 Australian ~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 2 Released on 15 Nov~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 3 Table 2 Response b~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 4 <NA>                Resp~ <NA>  <NA>  <NA>  <NA>  <NA>  NA    Elig~
 5 <NA>                Yes   <NA>  No    <NA>  Total <NA>  NA    Resp~
 6 Region              no.   %     no.   %     no.   %     NA    no.  
 7 New South Wales Di~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  NA    <NA> 
 8 Banks               37736 44.8~ 46343 55.1~ 84079 100   NA    84079
 9 Barton              37153 43.6~ 47984 56.3~ 85137 100   NA    85137
10 Bennelong           42943 49.7~ 43215 50.2~ 86158 100   NA    86158
# ... with 180 more rows, and 7 more variables: ...10 <chr>,
#   ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
#   ...16 <chr>

  # Beginning the clean up process
  
# 1) Remove the rows with no values using the complete.cases command.
#    Print the data after the removal of the empty rows.
    
  table_1_data <- table_1_data[complete.cases(table_1_data[ , 3]),]
  print(table_1_data)

# A tibble: 10 x 16
   `Australian Burea~` ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 Country             no.   %     no.   %     no.   %     NA    no.  
 2 New South Wales     2374~ 57.7~ 1736~ 42.2~ 4111~ 100   NA    4111~
 3 Victoria            2145~ 64.9~ 1161~ 35.1~ 3306~ 100   NA    3306~
 4 Queensland          1487~ 60.7~ 9610~ 39.2~ 2448~ 100   NA    2448~
 5 South Australia     5925~ 62.5  3562~ 37.5  9487~ 100   NA    9487~
 6 Western Australia   8015~ 63.7~ 4559~ 36.2~ 1257~ 100   NA    1257~
 7 Tasmania            1919~ 63.6~ 1096~ 36.3~ 3016~ 100   NA    3016~
 8 Northern Territory~ 48686 60.6~ 31690 39.3~ 80376 100   NA    80376
 9 Australian Capital~ 1754~ 74    61520 26    2369~ 100   NA    2369~
10 Australia           7817~ 61.6~ 4873~ 38.3~ 1269~ 100   NA    1269~
# ... with 7 more variables: ...10 <chr>, ...11 <chr>, ...12 <chr>,
#   ...13 <chr>, ...14 <chr>, ...15 <chr>, ...16 <chr>

  table_2_data <- table_2_data[complete.cases(table_2_data[ , 3]),]
  print(table_2_data)

# A tibble: 160 x 16
   `Australian Burea~` ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 Region              no.   %     no.   %     no.   %     NA    no.  
 2 Banks               37736 44.8~ 46343 55.1~ 84079 100   NA    84079
 3 Barton              37153 43.6~ 47984 56.3~ 85137 100   NA    85137
 4 Bennelong           42943 49.7~ 43215 50.2~ 86158 100   NA    86158
 5 Berowra             48471 54.6~ 40369 45.3~ 88840 100   NA    88840
 6 Blaxland            20406 26.1~ 57926 73.9~ 78332 100   NA    78332
 7 Bradfield           53681 60.6~ 34927 39.3~ 88608 100   NA    88608
 8 Calare              54091 60.2~ 35779 39.7~ 89870 100   NA    89870
 9 Chifley             32871 41.2~ 46702 58.7~ 79573 100   NA    79573
10 Cook                47505 55    38804 45    86309 100   NA    86309
# ... with 150 more rows, and 7 more variables: ...10 <chr>,
#   ...11 <chr>, ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
#   ...16 <chr>

# 2) Remove the empty column which happens to be column 8 on both sheets
  
  table_1_data <- select(table_1_data, -8)
  table_2_data <- select(table_2_data, -8)
  
# 3) Rename the column headers to make the data clear. 
  
  names(table_1_data) <- c('Region 1','Num Yes 1' , '% Yes 1', 'Num No 1', '% No 1', 'Num Total 1', '% Total 1','Clear Response 1', '% Clear Response 1', 'Not Clear Response 1', '% Not Clear Response 1','Total Response 1', '% Total 1')
  
  names(table_2_data) <- c('Region 2','Num Yes 2' , '% Yes 2', 'Num No 2', '% No 2', 'Num Total 2', '% Total 2','Clear Response 2', '% Clear Response 2', 'Not Clear Response 2', '% Not Clear Response 2', 'Not Responding 2', '% Not Responding 2', 'Total Response 2', '% Total 2')
  
#    Print the data after the removal of column and renaming of the columns.
  
  print(table_1_data)

# A tibble: 10 x 15
   `Region 1`  `Num Yes 1` `% Yes 1` `Num No 1` `% No 1` `Num Total 1`
   <chr>       <chr>       <chr>     <chr>      <chr>    <chr>        
 1 Country     no.         %         no.        %        no.          
 2 New South ~ 2374362     57.79999~ 1736838    42.2000~ 4111200      
 3 Victoria    2145629     64.90000~ 1161098    35.1000~ 3306727      
 4 Queensland  1487060     60.70000~ 961015     39.2999~ 2448075      
 5 South Aust~ 592528      62.5      356247     37.5     948775       
 6 Western Au~ 801575      63.70000~ 455924     36.2999~ 1257499      
 7 Tasmania    191948      63.60000~ 109655     36.3999~ 301603       
 8 Northern T~ 48686       60.60000~ 31690      39.3999~ 80376        
 9 Australian~ 175459      74        61520      26       236979       
10 Australia   7817247     61.60000~ 4873987    38.3999~ 12691234     
# ... with 9 more variables: `% Total 1` <chr>,
#   `Clear Response 1` <chr>, `% Clear Response 1` <chr>,
#   `Not Clear Response 1` <chr>, `% Not Clear Response 1` <chr>,
#   `Total Response 1` <chr>, `% Total 1` <chr>, `` <chr>, `` <chr>

  print(table_2_data)

# A tibble: 160 x 15
   `Region 2` `Num Yes 2` `% Yes 2`  `Num No 2` `% No 2` `Num Total 2`
   <chr>      <chr>       <chr>      <chr>      <chr>    <chr>        
 1 Region     no.         %          no.        %        no.          
 2 Banks      37736       44.899999~ 46343      55.1000~ 84079        
 3 Barton     37153       43.600000~ 47984      56.3999~ 85137        
 4 Bennelong  42943       49.799999~ 43215      50.2000~ 86158        
 5 Berowra    48471       54.600000~ 40369      45.3999~ 88840        
 6 Blaxland   20406       26.100000~ 57926      73.9000~ 78332        
 7 Bradfield  53681       60.600000~ 34927      39.3999~ 88608        
 8 Calare     54091       60.200000~ 35779      39.7999~ 89870        
 9 Chifley    32871       41.299999~ 46702      58.7000~ 79573        
10 Cook       47505       55         38804      45       86309        
# ... with 150 more rows, and 9 more variables: `% Total 2` <chr>,
#   `Clear Response 2` <chr>, `% Clear Response 2` <chr>,
#   `Not Clear Response 2` <chr>, `% Not Clear Response 2` <chr>,
#   `Not Responding 2` <chr>, `% Not Responding 2` <chr>,
#   `Total Response 2` <chr>, `% Total 2` <chr>

# 3) remove the total lines on table 2
  
  # Having trouble getting this to work.  
  # table_2_data %>% filter(grepl('total', 1))
  
 
  
    # remove any rows with the word "Total" in them
    # table_2_data <- (!grep("Total", table_2_data))
    # print(table_2_data)

Comment on this article Share:

HW2

INTRODUCTION

Load Necessary Packages & Libraries and Load Clean CSV & Excel

Load Unclean Excel Approach 1 - Load All Sheets - Data not cleaned After Loading

Load and Clean Only Select Sheets Within Unclean Excel

Reuse

Citation