DACSS 601 August 2021: Using A Function to Read Active Duty Marital Data

Sean Conway and Meredith Rolfe

This is a quick piece of example code to read in the active duty marital status tables from Excel. This example code builds on the earlier examples of reading in Excel tables by creating a function and applying it to multiple sheets.

Cleaning a single sheet

We will start off by cleaning a single sheet from the workbook, trying to create a set of generic functions that can be used to iterate through all sheets in the workbook to read them one after the other. To read the single sheet, follow the same process we followed in *Reading in Tabular Data."

Identify grouping variables and values to extract from the table

Lets first look at an example sheet from the workbook.

We can see a few things from this example sheet. First, we will need to skip 8 or 9 rows - the data first appears in row 10. Second, the tabular cells represent count values that capture the number of employees falling into subcategories created by 6 distinct grouping values: 1) Pay Grade Type: Enlisted/Officer/Warrent Officer 2) Pay Grade Level: 1-10 (fewer for non-Enlisted) 3) Marital status: Married/Single 4) Parent: Kids/noKids (Single only) 5) Spouse affiliation: Civilian/Military (Married only) 6) Gender: Male/Female

Our goal is to recover cases that have these 6 (or really 5, if we collapse parent and spouse variables as we don’t have complete information) grouping variables to identify the case and the single value (count of active duty employees who fall into each of the resulting subcategories.)

Looking back at the original excel sheet, we can see that we will need to not just skip the top rows, we will also need to delete several columns, and also rename variables in order to make it easy to separate out the three pieces of information contained in the column names. First, I create a vector with column names (to make it easier to reuse later in the functional programming) then I read in the data and inspect it to see if the columns worked as intended.

marital <-c("d", "payGrade_payLevel",
            "single_nokids_male", "single_nokids_female", "d",
            "single_kids_male", "single_kids_female", "d",
            "married_military_male", "married_military_female", "d",
            "married_civilian_male", "married_civilian_female", "d",
            rep("d", 3))
read_excel("../../_data/ActiveDuty_MaritalStatus.xls", 
           skip=8,
           col_names = marital
           )

# A tibble: 31 × 17
   d...1 payGrade_payLevel single_nokids_male single_nokids_fem… d...5
   <chr> <chr>             <chr>              <chr>              <chr>
 1 <NA>  Pay Grade         Male               Female             Total
 2 <NA>  E-1               31229              5717               36946
 3 <NA>  E-2               53094              8388               61482
 4 <NA>  E-3               131091             21019              1521…
 5 <NA>  E-4               112710             16381              1290…
 6 <NA>  E-5               57989              11021              69010
 7 <NA>  E-6               19125              4654               23779
 8 <NA>  E-7               5446               1913               7359 
 9 <NA>  E-8               1009               438                1447 
10 <NA>  E-9               381                202                583  
# … with 21 more rows, and 12 more variables: single_kids_male <chr>,
#   single_kids_female <chr>, d...8 <chr>,
#   married_military_male <chr>, married_military_female <chr>,
#   d...11 <chr>, married_civilian_male <chr>,
#   married_civilian_female <chr>, d...14 <chr>, d...15 <chr>,
#   d...16 <chr>, d...17 <chr>

I can see that the variable names worked well, so this time I will read in the data and omit the original header row, and also filter out the various “TOTAL” rows that we don’t need to keep.

military<-read_excel("../../_data/ActiveDuty_MaritalStatus.xls", 
           skip=9,
           col_names = marital
           )%>%
  select(!starts_with("d"))%>%
  filter(str_detect(payGrade_payLevel, "TOTAL", negate=TRUE))
military

# A tibble: 24 × 9
   payGrade_payLev… single_nokids_m… single_nokids_f… single_kids_male
   <chr>                       <dbl>            <dbl>            <dbl>
 1 E-1                         31229             5717              563
 2 E-2                         53094             8388             1457
 3 E-3                        131091            21019             4264
 4 E-4                        112710            16381             9491
 5 E-5                         57989            11021            10937
 6 E-6                         19125             4654            10369
 7 E-7                          5446             1913             6530
 8 E-8                          1009              438             1786
 9 E-9                           381              202              579
10 O-1                         13495             3081              402
# … with 14 more rows, and 5 more variables:
#   single_kids_female <dbl>, married_military_male <dbl>,
#   married_military_female <dbl>, married_civilian_male <dbl>,
#   married_civilian_female <dbl>

It looks like this worked well! Now we just need to pivot_longer with 3 columns (similar to what we did in the Tabular Data example). Then we will separate out the information in the payGrade_payLevel variable and do a quick mutate to make paygrade easier to remember.

military_long <-military %>%
  pivot_longer(cols = -1,
               names_to = c("Marital", "Other", "Gender"),
               names_sep = "_",
               values_to = "count")%>%
  separate(payGrade_payLevel, 
           into = c("payGrade", "payLevel"),
           sep="-")%>%
  mutate(payGrade = case_when(
    payGrade == "E" ~ "Enlisted",
    payGrade == "O" ~ "Officer",
    payGrade == "W" ~ "Warrant Officer"
  ))
military_long

# A tibble: 192 × 6
   payGrade payLevel Marital Other    Gender count
   <chr>    <chr>    <chr>   <chr>    <chr>  <dbl>
 1 Enlisted 1        single  nokids   male   31229
 2 Enlisted 1        single  nokids   female  5717
 3 Enlisted 1        single  kids     male     563
 4 Enlisted 1        single  kids     female   122
 5 Enlisted 1        married military male     139
 6 Enlisted 1        married military female   141
 7 Enlisted 1        married civilian male    5060
 8 Enlisted 1        married civilian female   719
 9 Enlisted 2        single  nokids   male   53094
10 Enlisted 2        single  nokids   female  8388
# … with 182 more rows

This all looks like it works well. So now we will go on to creating a function with the steps, then applying it to multiple sheets.

Create a new function

We will call our new function read_military, and we will basically use the exact same commands as above. The big difference is that we will have a placeholder name (or argument) for the data sheet that will be passed to the new function.

Next we have to use read_excel(), but we have to specify a number of arguments. We first specify the path (note - file_path is a variable that I created ahead of time to be specific to my computer. Yours will be different). Next, we specify sheet, the number of the sheet we wish to read in (we can also specify the sheet name). We also need to manually specify col_names from our marital vector that we created above.

mutate() creates a new column called branch, which comes from our sheet name. select(!starts_with("d")) removes all columns that start with "d". We also filter out the word “Total” from payGrade_payLevel. pivot_longer()

read_military<-function(sheet_name){
  read_excel("../../_data/ActiveDuty_MaritalStatus.xls", 
             sheet = sheet_name,
             skip=9,
             col_names = marital
             )%>%
  mutate("branch"=sheet_name) %>%
  select(!starts_with("d"))%>%
  filter(str_detect(payGrade_payLevel, "TOTAL", negate=TRUE))%>%
  pivot_longer(cols = contains(c("male", "female")),
               names_to = c("Marital", "Other", "Gender"),
               names_sep = "_",
               values_to = "count")%>%
  separate(payGrade_payLevel, 
           into = c("payGrade", "payLevel"),
           sep="-")%>%
  mutate(payGrade = case_when(
    payGrade == "E" ~ "Enlisted",
    payGrade == "O" ~ "Officer",
    payGrade == "W" ~ "Warrant Officer"
  ))
}

We now have a function that is customized to read in the mmilitary active duty marital status sheets. We just need to use purrr - a package that is part of tidyverse but which may need to be installed and loaded on its own - to iterate through the list of sheets in the workbook.

excel_sheets("../../_data/ActiveDuty_MaritalStatus.xls")

[1] "TotalDoD"    "AirForce"    "MarineCorps" "Navy"       
[5] "Army"

map_dfr(
  excel_sheets("../../_data/ActiveDuty_MaritalStatus.xls")[2:5],
  read_military)

# A tibble: 720 × 7
   payGrade payLevel branch   Marital Other    Gender count
   <chr>    <chr>    <chr>    <chr>   <chr>    <chr>  <dbl>
 1 Enlisted 1        AirForce single  nokids   male    7721
 2 Enlisted 1        AirForce single  nokids   female  1550
 3 Enlisted 1        AirForce single  kids     male      27
 4 Enlisted 1        AirForce single  kids     female     5
 5 Enlisted 1        AirForce married military male      49
 6 Enlisted 1        AirForce married military female    27
 7 Enlisted 1        AirForce married civilian male    1064
 8 Enlisted 1        AirForce married civilian female   178
 9 Enlisted 2        AirForce single  nokids   male    4380
10 Enlisted 2        AirForce single  nokids   female  1010
# … with 710 more rows

Comment on this article Share:

Using A Function to Read Active Duty Marital Data

Cleaning a single sheet

Identify grouping variables and values to extract from the table

Create a new function

Reuse

Citation