Manipulating Active Duty Data
This homework demonstrates the tidy-ing and manipulation of the Total Department of Defense active miliary marital data. It shows the use of functions from the following packages:
I like to verify that my working directory and library paths are the same and that I have the correct packages installed.
# Verify library path
.libPaths()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"
# set the working directory to be the same as the library path
setwd("C:/Users/theresa/Documents/R/win-library/4.1")
# verify the working directory
getwd()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error.
# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("readxl", repos = "http://cran.us.r-project.org")
# load the necessary libraries for the processing
library(tidyverse)
library(dplyr)
library(readxl)
library(readr)
library(stringr)
library(ggplot2)
library(quantreg)
Load in the Totals tab of the dataset. The read-in also performs the following data clean up tasks:
Skipping the header rows
Skipping the first column
Skipping any column that contains the TOTAL of other columns
Skipping the final detail columns (…15 and …16) where they contain totals
Keeping only rows contain ‘-’ (this will not grab rows containing totals)
Renaming the columns with names that are more descriptive and easier to manipulate (identifying gender in a uniform and predictable location is key)
Pay Grade field contained 2 bits of information - Category and level. The loading step also cleaned this field by:
Adding a column that contains category
Adding a column that just contains level
Moving both of theses new columns to the beginning of the dataset for easier pivoting
raw_active_duty_data <-read_excel("c:/users/theresa/Documents/DACSS Local/DataSets/ActiveDuty_MaritalStatus.xls", skip = 8) %>%
select(!contains('Total')) %>%
select(-1) %>%
select(!contains('Male...15')) %>%
select(!contains('Female...16')) %>%
filter(str_detect(`Pay Grade`,"-")) %>%
rename(singwochild_male = `Male...3`, singwochild_female = `Female...4`, singwchild_male = `Male...6`, singwchild_female = `Female...7`, marjointserv_male = `Male...9`, marjointserv_female = `Female...10`, civilmar_male = `Male...12`,civilmar_female = `Female...13`) %>%
mutate(category = case_when(str_detect(`Pay Grade`, 'E') ~ 'ENLISTED',
str_detect(`Pay Grade`, 'O') ~ 'OFFICER',
str_detect(`Pay Grade`, 'W') ~ 'WARRANT')) %>%
relocate(category)%>%
mutate(final_pay_grade = str_sub(`Pay Grade`, start = 3, end = 3)) %>%
relocate(final_pay_grade)
colnames(raw_active_duty_data)
[1] "final_pay_grade" "category" "Pay Grade"
[4] "singwochild_male" "singwochild_female" "singwchild_male"
[7] "singwchild_female" "marjointserv_male" "marjointserv_female"
[10] "civilmar_male" "civilmar_female"
raw_active_duty_data
# A tibble: 24 x 11
final_pay_grade category `Pay Grade` singwochild_male
<chr> <chr> <chr> <dbl>
1 1 ENLISTED E-1 31229
2 2 ENLISTED E-2 53094
3 3 ENLISTED E-3 131091
4 4 ENLISTED E-4 112710
5 5 ENLISTED E-5 57989
6 6 ENLISTED E-6 19125
7 7 ENLISTED E-7 5446
8 8 ENLISTED E-8 1009
9 9 ENLISTED E-9 381
10 1 OFFICER O-1 13495
# ... with 14 more rows, and 7 more variables:
# singwochild_female <dbl>, singwchild_male <dbl>,
# singwchild_female <dbl>, marjointserv_male <dbl>,
# marjointserv_female <dbl>, civilmar_male <dbl>,
# civilmar_female <dbl>
#raw_active_duty_data <- raw_active_duty_data %>%
# str_remove("*_")
# raw_active_duty_data
The columnar data contains transaction information - specifically:
The designation of male or female is a response option value for the category gender
Single w/children, single w/o children, etc are also transactional responses to the category of marital status
This pivots the data to expand each row into a single instance of the scenario.
changed_data <- raw_active_duty_data %>%
pivot_longer(
cols = singwochild_male:civilmar_female,
names_to = c("marital", "gender"),
names_sep = "_",
values_to = "count")
colnames(changed_data)
[1] "final_pay_grade" "category" "Pay Grade"
[4] "marital" "gender" "count"
changed_data
# A tibble: 192 x 6
final_pay_grade category `Pay Grade` marital gender count
<chr> <chr> <chr> <chr> <chr> <dbl>
1 1 ENLISTED E-1 singwochild male 31229
2 1 ENLISTED E-1 singwochild female 5717
3 1 ENLISTED E-1 singwchild male 563
4 1 ENLISTED E-1 singwchild female 122
5 1 ENLISTED E-1 marjointserv male 139
6 1 ENLISTED E-1 marjointserv female 141
7 1 ENLISTED E-1 civilmar male 5060
8 1 ENLISTED E-1 civilmar female 719
9 2 ENLISTED E-2 singwochild male 53094
10 2 ENLISTED E-2 singwochild female 8388
# ... with 182 more rows
Just in case there are any non-numberics in the data, doing a quick check and population of an NAs.
Then grouping the data by the final pay grade and graphing by paygrade.
# Globally replace any value that wasn't provided (was a non-numeric), replace it with the median of all the other values in that instance.
changed_data <- changed_data %>%
mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = T), x))
changed_data
# A tibble: 192 x 6
final_pay_grade category `Pay Grade` marital gender count
<chr> <chr> <chr> <chr> <chr> <dbl>
1 1 ENLISTED E-1 singwochild male 31229
2 1 ENLISTED E-1 singwochild female 5717
3 1 ENLISTED E-1 singwchild male 563
4 1 ENLISTED E-1 singwchild female 122
5 1 ENLISTED E-1 marjointserv male 139
6 1 ENLISTED E-1 marjointserv female 141
7 1 ENLISTED E-1 civilmar male 5060
8 1 ENLISTED E-1 civilmar female 719
9 2 ENLISTED E-2 singwochild male 53094
10 2 ENLISTED E-2 singwochild female 8388
# ... with 182 more rows
by_category <- changed_data %>%
group_by(final_pay_grade)
# Graph 1: Attempt to plot that to find anything interesting
ggplot(by_category, aes(x = `category`, y = count)) + geom_col(fill = "turquoise3",color = "black") + facet_wrap(vars(final_pay_grade))
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
TLamkin (2022, March 6). Data Analytics and Computational Social Science: HW5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin874336/
BibTeX citation
@misc{tlamkin2022hw5, author = {TLamkin, }, title = {Data Analytics and Computational Social Science: HW5}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin874336/}, year = {2022} }