Data Analytics and Computational Social Science: Challenge 1 Solution

Sean Conway

library(dplyr)
library(readxl)
library(tidyr)
library(stringr)
file_path <- "../data/ActiveDuty_MaritalStatus.xls"

Before we get started…

This solution makes frequent use of the “pipe”, the %>% operator. Introduced in the magrittr package (but now commonly associated with dplyr, which it comes preloaded in), the pipe allows the user to pass the results of operations to further lines of code.

For example, below we work with the dataframe mtcars. We first use the pipe to pass the dataframe downstream. Then, we select the columns mpg and wt, passing a dataframe with only these two columns to the next line of code, which filters the data to only include rows where mpg>=15.

mtcars <- as_tibble(mtcars)
mtcars %>% 
  select(mpg, wt) %>%
  filter(mpg >= 15)

# A tibble: 27 × 2
     mpg    wt
   <dbl> <dbl>
 1  21    2.62
 2  21    2.88
 3  22.8  2.32
 4  21.4  3.22
 5  18.7  3.44
 6  18.1  3.46
 7  24.4  3.19
 8  22.8  3.15
 9  19.2  3.44
10  17.8  3.44
# … with 17 more rows

For more info on the pipe, watch the video on piping, read the chapter on piping in R4DS, or visit https://magrittr.tidyverse.org/.

The data

Reading in the data - First Excel Sheet

While eventually we will want to read in all the sheets at once, we will start out by reading in the first Excel Sheet, TotalDoD (see above).

We will use the read_excel() function from the readxl package.

First, we’re going to manually specify the names of our columns. This involves doing a bit hard coding (reading in messy data is the only time hard coding is recommended!).

col_names_dod <- c("pay_grade",
                   "single_withoutchildren_male",
                    "single_withoutchildren_female",
                    "single_withoutchildren_total",
                    "single_withchildren_male",
                    "single_withchildren_female",
                    "single_withchildren_total",
                    "married_jointservice_male",
                    "married_jointservice_female",
                    "married_jointservice_total",
                    "married_civilian_female",
                    "married_civilian_male",
                    "married_civilian_total",
                    "married_male_total",
                    "married_female_total",
                    "married_total_total")

Note that we named these columns so they can be adequately separated later on.

Next we have to use read_excel(), but we have to specify a number of arguments. We first specify the path (note - file_path is a variable that I created ahead of time to be specific to my computer. Yours will be different). Next, we specify sheet, the number of the sheet we wish to read in (we can also specify the sheet name). Next, we carefully chooose the range of cells in the file we read in, based on our visual inspection of the file. This is another case where we must hard-code it. Here, we want the range to go from B10 to Q37. We also need to manually specify col_names from our col_names_dod vector that we created above. Once we read it in, we will remove any rows containing the word “total” from the column pay_grade. We do so using the function str_detect() from the stringr package, which takes looks for the pattern “total” in the column pay_grade. We use the ! operator to tell filter() we do NOT want these rows.

file_path

[1] "../data/ActiveDuty_MaritalStatus.xls"

marital_dod_1 <- read_excel(path=file_path,
                      sheet = 1, 
                      range = "B10:Q37",
                      col_names = col_names_dod) %>%
  filter(!str_detect(pay_grade, "total"))#regex("total",ignore_case = TRUE),negate = T))
marital_dod_1

# A tibble: 28 × 16
   pay_grade      single_withoutch… single_withoutch… single_withoutc…
   <chr>                      <dbl>             <dbl>            <dbl>
 1 E-1                        31229              5717            36946
 2 E-2                        53094              8388            61482
 3 E-3                       131091             21019           152110
 4 E-4                       112710             16381           129091
 5 E-5                        57989             11021            69010
 6 E-6                        19125              4654            23779
 7 E-7                         5446              1913             7359
 8 E-8                         1009               438             1447
 9 E-9                          381               202              583
10 TOTAL ENLISTED            412074             69733           481807
# … with 18 more rows, and 12 more variables:
#   single_withchildren_male <dbl>, single_withchildren_female <dbl>,
#   single_withchildren_total <dbl>, married_jointservice_male <dbl>,
#   married_jointservice_female <dbl>,
#   married_jointservice_total <dbl>, married_civilian_female <dbl>,
#   married_civilian_male <dbl>, married_civilian_total <dbl>,
#   married_male_total <dbl>, married_female_total <dbl>, …

We’ve read in the data!

This tibble looks okay, but there’s still much work to be done. First, we need to remove any of the columns that contain the word "total". We don’t need these aggregated totals, as they will only muddle the data (plus we can calculate them ourselves if needed).

marital_dod_2 <- marital_dod_1 %>%
  select(c(pay_grade,!contains("total")))
marital_dod_2

# A tibble: 28 × 9
   pay_grade      single_withoutch… single_withoutch… single_withchil…
   <chr>                      <dbl>             <dbl>            <dbl>
 1 E-1                        31229              5717              563
 2 E-2                        53094              8388             1457
 3 E-3                       131091             21019             4264
 4 E-4                       112710             16381             9491
 5 E-5                        57989             11021            10937
 6 E-6                        19125              4654            10369
 7 E-7                         5446              1913             6530
 8 E-8                         1009               438             1786
 9 E-9                          381               202              579
10 TOTAL ENLISTED            412074             69733            45976
# … with 18 more rows, and 5 more variables:
#   single_withchildren_female <dbl>,
#   married_jointservice_male <dbl>,
#   married_jointservice_female <dbl>, married_civilian_female <dbl>,
#   married_civilian_male <dbl>

Next, we’ll use pivot_longer() to combine the column names (except for pay_grade) into a single column, status. We do this because the variable status is currently spread across columns (it is wide). We want our data to be tidy - where each row is a single observation.

We specify cols as every column except pay_grade with !contains(pay_grade). We also specify that the column names will be moved to status and the column values will be moved to count.

marital_dod_3 <- marital_dod_2 %>%
  pivot_longer(cols = !contains("pay_grade"),
               names_to = "status", values_to = "count")
marital_dod_3

# A tibble: 224 × 3
   pay_grade status                        count
   <chr>     <chr>                         <dbl>
 1 E-1       single_withoutchildren_male   31229
 2 E-1       single_withoutchildren_female  5717
 3 E-1       single_withchildren_male        563
 4 E-1       single_withchildren_female      122
 5 E-1       married_jointservice_male       139
 6 E-1       married_jointservice_female     141
 7 E-1       married_civilian_female        5060
 8 E-1       married_civilian_male           719
 9 E-2       single_withoutchildren_male   53094
10 E-2       single_withoutchildren_female  8388
# … with 214 more rows

Our data is now tidy! Now we have just a bit more to do. We should use separate() to separate pay_grade into two columns (enlisted and pay_grade), as well as separate status into three columns (relationship, family_status, and gender).

marital_dod_tidy <- marital_dod_3 %>%
  separate(col=pay_grade, into=c("enlisted","pay_grade"),
           sep="-") %>% 
  separate(col=status, into = c("relationship", "family_status","gender"),
           sep = "_")
marital_dod_tidy

# A tibble: 224 × 6
   enlisted pay_grade relationship family_status   gender count
   <chr>    <chr>     <chr>        <chr>           <chr>  <dbl>
 1 E        1         single       withoutchildren male   31229
 2 E        1         single       withoutchildren female  5717
 3 E        1         single       withchildren    male     563
 4 E        1         single       withchildren    female   122
 5 E        1         married      jointservice    male     139
 6 E        1         married      jointservice    female   141
 7 E        1         married      civilian        female  5060
 8 E        1         married      civilian        male     719
 9 E        2         single       withoutchildren male   53094
10 E        2         single       withoutchildren female  8388
# … with 214 more rows

Whala! We successfully read in and cleaned a very messy Excel spreadsheet.

Comment on this article Share:

Challenge 1 Solution

Before we get started…

The data

Reading in the data - First Excel Sheet

Reuse

Citation