Reading in the first sheet from the Active Duty Marital Data.
This solution makes frequent use of the “pipe”, the %>%
operator. Introduced in the magrittr
package (but now commonly associated with dplyr
, which it comes preloaded in), the pipe allows the user to pass the results of operations to further lines of code.
For example, below we work with the dataframe mtcars
. We first use the pipe to pass the dataframe downstream. Then, we select the columns mpg
and wt
, passing a dataframe with only these two columns to the next line of code, which filters the data to only include rows where mpg
>=15.
# A tibble: 27 × 2
mpg wt
<dbl> <dbl>
1 21 2.62
2 21 2.88
3 22.8 2.32
4 21.4 3.22
5 18.7 3.44
6 18.1 3.46
7 24.4 3.19
8 22.8 3.15
9 19.2 3.44
10 17.8 3.44
# … with 17 more rows
For more info on the pipe, watch the video on piping, read the chapter on piping in R4DS, or visit https://magrittr.tidyverse.org/.
While eventually we will want to read in all the sheets at once, we will start out by reading in the first Excel Sheet, TotalDoD (see above).
We will use the read_excel()
function from the readxl
package.
First, we’re going to manually specify the names of our columns. This involves doing a bit hard coding (reading in messy data is the only time hard coding is recommended!).
col_names_dod <- c("pay_grade",
"single_withoutchildren_male",
"single_withoutchildren_female",
"single_withoutchildren_total",
"single_withchildren_male",
"single_withchildren_female",
"single_withchildren_total",
"married_jointservice_male",
"married_jointservice_female",
"married_jointservice_total",
"married_civilian_female",
"married_civilian_male",
"married_civilian_total",
"married_male_total",
"married_female_total",
"married_total_total")
Note that we named these columns so they can be adequately separated later on.
Next we have to use read_excel()
, but we have to specify a number of arguments. We first specify the path
(note - file_path
is a variable that I created ahead of time to be specific to my computer. Yours will be different). Next, we specify sheet
, the number of the sheet we wish to read in (we can also specify the sheet name). Next, we carefully chooose the range
of cells in the file we read in, based on our visual inspection of the file. This is another case where we must hard-code it. Here, we want the range to go from B10
to Q37
. We also need to manually specify col_names
from our col_names_dod
vector that we created above. Once we read it in, we will remove any rows containing the word “total” from the column pay_grade
. We do so using the function str_detect()
from the stringr
package, which takes looks for the pattern “total” in the column pay_grade
. We use the !
operator to tell filter()
we do NOT want these rows.
file_path
[1] "../data/ActiveDuty_MaritalStatus.xls"
marital_dod_1 <- read_excel(path=file_path,
sheet = 1,
range = "B10:Q37",
col_names = col_names_dod) %>%
filter(!str_detect(pay_grade, "total"))#regex("total",ignore_case = TRUE),negate = T))
marital_dod_1
# A tibble: 28 × 16
pay_grade single_withoutch… single_withoutch… single_withoutc…
<chr> <dbl> <dbl> <dbl>
1 E-1 31229 5717 36946
2 E-2 53094 8388 61482
3 E-3 131091 21019 152110
4 E-4 112710 16381 129091
5 E-5 57989 11021 69010
6 E-6 19125 4654 23779
7 E-7 5446 1913 7359
8 E-8 1009 438 1447
9 E-9 381 202 583
10 TOTAL ENLISTED 412074 69733 481807
# … with 18 more rows, and 12 more variables:
# single_withchildren_male <dbl>, single_withchildren_female <dbl>,
# single_withchildren_total <dbl>, married_jointservice_male <dbl>,
# married_jointservice_female <dbl>,
# married_jointservice_total <dbl>, married_civilian_female <dbl>,
# married_civilian_male <dbl>, married_civilian_total <dbl>,
# married_male_total <dbl>, married_female_total <dbl>, …
We’ve read in the data!
This tibble
looks okay, but there’s still much work to be done. First, we need to remove any of the columns that contain the word "total"
. We don’t need these aggregated totals, as they will only muddle the data (plus we can calculate them ourselves if needed).
# A tibble: 28 × 9
pay_grade single_withoutch… single_withoutch… single_withchil…
<chr> <dbl> <dbl> <dbl>
1 E-1 31229 5717 563
2 E-2 53094 8388 1457
3 E-3 131091 21019 4264
4 E-4 112710 16381 9491
5 E-5 57989 11021 10937
6 E-6 19125 4654 10369
7 E-7 5446 1913 6530
8 E-8 1009 438 1786
9 E-9 381 202 579
10 TOTAL ENLISTED 412074 69733 45976
# … with 18 more rows, and 5 more variables:
# single_withchildren_female <dbl>,
# married_jointservice_male <dbl>,
# married_jointservice_female <dbl>, married_civilian_female <dbl>,
# married_civilian_male <dbl>
Next, we’ll use pivot_longer()
to combine the column names (except for pay_grade
) into a single column, status
. We do this because the variable status
is currently spread across columns (it is wide). We want our data to be tidy - where each row is a single observation.
We specify cols
as every column except pay_grade
with !contains(pay_grade)
. We also specify that the column names will be moved to status
and the column values will be moved to count
.
marital_dod_3 <- marital_dod_2 %>%
pivot_longer(cols = !contains("pay_grade"),
names_to = "status", values_to = "count")
marital_dod_3
# A tibble: 224 × 3
pay_grade status count
<chr> <chr> <dbl>
1 E-1 single_withoutchildren_male 31229
2 E-1 single_withoutchildren_female 5717
3 E-1 single_withchildren_male 563
4 E-1 single_withchildren_female 122
5 E-1 married_jointservice_male 139
6 E-1 married_jointservice_female 141
7 E-1 married_civilian_female 5060
8 E-1 married_civilian_male 719
9 E-2 single_withoutchildren_male 53094
10 E-2 single_withoutchildren_female 8388
# … with 214 more rows
Our data is now tidy! Now we have just a bit more to do. We should use separate()
to separate pay_grade
into two columns (enlisted
and pay_grade
), as well as separate status
into three columns (relationship
, family_status
, and gender
).
marital_dod_tidy <- marital_dod_3 %>%
separate(col=pay_grade, into=c("enlisted","pay_grade"),
sep="-") %>%
separate(col=status, into = c("relationship", "family_status","gender"),
sep = "_")
marital_dod_tidy
# A tibble: 224 × 6
enlisted pay_grade relationship family_status gender count
<chr> <chr> <chr> <chr> <chr> <dbl>
1 E 1 single withoutchildren male 31229
2 E 1 single withoutchildren female 5717
3 E 1 single withchildren male 563
4 E 1 single withchildren female 122
5 E 1 married jointservice male 139
6 E 1 married jointservice female 141
7 E 1 married civilian female 5060
8 E 1 married civilian male 719
9 E 2 single withoutchildren male 53094
10 E 2 single withoutchildren female 8388
# … with 214 more rows
Whala! We successfully read in and cleaned a very messy Excel spreadsheet.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Conway (2021, Dec. 28). Data Analytics and Computational Social Science: Challenge 1 Solution. Retrieved from https://github.com/DACSS/dacss_course_website/posts/challenge-1-solution/
BibTeX citation
@misc{conway2021challenge, author = {Conway, Sean}, title = {Data Analytics and Computational Social Science: Challenge 1 Solution}, url = {https://github.com/DACSS/dacss_course_website/posts/challenge-1-solution/}, year = {2021} }