HW5

Manipulating Active Duty Data

TLamkin
2022-03-06

INTRODUCTION

This homework demonstrates the tidy-ing and manipulation of the Total Department of Defense active miliary marital data. It shows the use of functions from the following packages:

Checking the Environment

I like to verify that my working directory and library paths are the same and that I have the correct packages installed.

# Verify library path

.libPaths()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"          
# set the working directory to be the same as the library path

setwd("C:/Users/theresa/Documents/R/win-library/4.1")

# verify the working directory

getwd()
[1] "C:/Users/theresa/Documents/R/win-library/4.1"
# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error. 

# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("readxl", repos = "http://cran.us.r-project.org")

# load the necessary libraries for the processing

library(tidyverse)
library(dplyr)
library(readxl)
library(readr)
library(stringr)
library(ggplot2)
library(quantreg)

Loading the Data

Load in the Totals tab of the dataset. The read-in also performs the following data clean up tasks:

Pay Grade field contained 2 bits of information - Category and level. The loading step also cleaned this field by:

raw_active_duty_data <-read_excel("c:/users/theresa/Documents/DACSS Local/DataSets/ActiveDuty_MaritalStatus.xls", skip = 8) %>%
  select(!contains('Total')) %>%
  select(-1) %>%
  select(!contains('Male...15')) %>%
  select(!contains('Female...16')) %>%
  filter(str_detect(`Pay Grade`,"-")) %>%
  rename(singwochild_male = `Male...3`, singwochild_female = `Female...4`, singwchild_male = `Male...6`, singwchild_female = `Female...7`, marjointserv_male = `Male...9`, marjointserv_female = `Female...10`, civilmar_male = `Male...12`,civilmar_female = `Female...13`) %>%
  mutate(category = case_when(str_detect(`Pay Grade`, 'E') ~ 'ENLISTED',
                              str_detect(`Pay Grade`, 'O') ~ 'OFFICER',
                              str_detect(`Pay Grade`, 'W') ~ 'WARRANT')) %>%
  relocate(category)%>%
  mutate(final_pay_grade = str_sub(`Pay Grade`, start = 3, end = 3)) %>%
  relocate(final_pay_grade)

colnames(raw_active_duty_data)
 [1] "final_pay_grade"     "category"            "Pay Grade"          
 [4] "singwochild_male"    "singwochild_female"  "singwchild_male"    
 [7] "singwchild_female"   "marjointserv_male"   "marjointserv_female"
[10] "civilmar_male"       "civilmar_female"    
raw_active_duty_data
# A tibble: 24 x 11
   final_pay_grade category `Pay Grade` singwochild_male
   <chr>           <chr>    <chr>                  <dbl>
 1 1               ENLISTED E-1                    31229
 2 2               ENLISTED E-2                    53094
 3 3               ENLISTED E-3                   131091
 4 4               ENLISTED E-4                   112710
 5 5               ENLISTED E-5                    57989
 6 6               ENLISTED E-6                    19125
 7 7               ENLISTED E-7                     5446
 8 8               ENLISTED E-8                     1009
 9 9               ENLISTED E-9                      381
10 1               OFFICER  O-1                    13495
# ... with 14 more rows, and 7 more variables:
#   singwochild_female <dbl>, singwchild_male <dbl>,
#   singwchild_female <dbl>, marjointserv_male <dbl>,
#   marjointserv_female <dbl>, civilmar_male <dbl>,
#   civilmar_female <dbl>
#raw_active_duty_data <- raw_active_duty_data %>%
 #  str_remove("*_")
# raw_active_duty_data

Pivoting the Data

The columnar data contains transaction information - specifically:

This pivots the data to expand each row into a single instance of the scenario.

changed_data <- raw_active_duty_data %>%
pivot_longer(
    cols = singwochild_male:civilmar_female,
    names_to = c("marital", "gender"),
    names_sep = "_",
    values_to = "count")

colnames(changed_data)
[1] "final_pay_grade" "category"        "Pay Grade"      
[4] "marital"         "gender"          "count"          
changed_data
# A tibble: 192 x 6
   final_pay_grade category `Pay Grade` marital      gender count
   <chr>           <chr>    <chr>       <chr>        <chr>  <dbl>
 1 1               ENLISTED E-1         singwochild  male   31229
 2 1               ENLISTED E-1         singwochild  female  5717
 3 1               ENLISTED E-1         singwchild   male     563
 4 1               ENLISTED E-1         singwchild   female   122
 5 1               ENLISTED E-1         marjointserv male     139
 6 1               ENLISTED E-1         marjointserv female   141
 7 1               ENLISTED E-1         civilmar     male    5060
 8 1               ENLISTED E-1         civilmar     female   719
 9 2               ENLISTED E-2         singwochild  male   53094
10 2               ENLISTED E-2         singwochild  female  8388
# ... with 182 more rows

Tidy-ing the Numeric Data and Grouping:

Just in case there are any non-numberics in the data, doing a quick check and population of an NAs.

Then grouping the data by the final pay grade and graphing by paygrade.

# Globally replace any value that wasn't provided (was a non-numeric), replace it with the median of all the other values in that instance. 

changed_data <- changed_data %>%
    mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = T), x))

changed_data
# A tibble: 192 x 6
   final_pay_grade category `Pay Grade` marital      gender count
   <chr>           <chr>    <chr>       <chr>        <chr>  <dbl>
 1 1               ENLISTED E-1         singwochild  male   31229
 2 1               ENLISTED E-1         singwochild  female  5717
 3 1               ENLISTED E-1         singwchild   male     563
 4 1               ENLISTED E-1         singwchild   female   122
 5 1               ENLISTED E-1         marjointserv male     139
 6 1               ENLISTED E-1         marjointserv female   141
 7 1               ENLISTED E-1         civilmar     male    5060
 8 1               ENLISTED E-1         civilmar     female   719
 9 2               ENLISTED E-2         singwochild  male   53094
10 2               ENLISTED E-2         singwochild  female  8388
# ... with 182 more rows
by_category <- changed_data %>%
  group_by(final_pay_grade)


# Graph 1: Attempt to plot that to find anything interesting

ggplot(by_category, aes(x = `category`, y = count)) + geom_col(fill = "turquoise3",color = "black") + facet_wrap(vars(final_pay_grade))

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

TLamkin (2022, March 6). Data Analytics and Computational Social Science: HW5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin874336/

BibTeX citation

@misc{tlamkin2022hw5,
  author = {TLamkin, },
  title = {Data Analytics and Computational Social Science: HW5},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtlamkin874336/},
  year = {2022}
}