Data Analytics and Computational Social Science: HW5

TLamkin

INTRODUCTION

This homework demonstrates the tidy-ing and manipulation of the Total Department of Defense active miliary marital data. It shows the use of functions from the following packages:

STRINGR
TIDYVERSE
DPLYR
GGPLOT2

Checking the Environment

I like to verify that my working directory and library paths are the same and that I have the correct packages installed.

# Verify library path

.libPaths()

[1] "C:/Users/theresa/Documents/R/win-library/4.1"
[2] "C:/Program Files/R/R-4.1.2/library"

# set the working directory to be the same as the library path

setwd("C:/Users/theresa/Documents/R/win-library/4.1")

# verify the working directory

getwd()

[1] "C:/Users/theresa/Documents/R/win-library/4.1"

# Installing Tidyverse and readxl packages with explicitly defining the URL of where it lives. This is to get around a Mirror error. 

# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("readxl", repos = "http://cran.us.r-project.org")

# load the necessary libraries for the processing

library(tidyverse)
library(dplyr)
library(readxl)
library(readr)
library(stringr)
library(ggplot2)
library(quantreg)

Loading the Data

Load in the Totals tab of the dataset. The read-in also performs the following data clean up tasks:

Skipping the header rows
Skipping the first column
Skipping any column that contains the TOTAL of other columns
Skipping the final detail columns (…15 and …16) where they contain totals
Keeping only rows contain ‘-’ (this will not grab rows containing totals)
Renaming the columns with names that are more descriptive and easier to manipulate (identifying gender in a uniform and predictable location is key)

Pay Grade field contained 2 bits of information - Category and level. The loading step also cleaned this field by:

Adding a column that contains category
Adding a column that just contains level
Moving both of theses new columns to the beginning of the dataset for easier pivoting

raw_active_duty_data <-read_excel("c:/users/theresa/Documents/DACSS Local/DataSets/ActiveDuty_MaritalStatus.xls", skip = 8) %>%
  select(!contains('Total')) %>%
  select(-1) %>%
  select(!contains('Male...15')) %>%
  select(!contains('Female...16')) %>%
  filter(str_detect(`Pay Grade`,"-")) %>%
  rename(singwochild_male = `Male...3`, singwochild_female = `Female...4`, singwchild_male = `Male...6`, singwchild_female = `Female...7`, marjointserv_male = `Male...9`, marjointserv_female = `Female...10`, civilmar_male = `Male...12`,civilmar_female = `Female...13`) %>%
  mutate(category = case_when(str_detect(`Pay Grade`, 'E') ~ 'ENLISTED',
                              str_detect(`Pay Grade`, 'O') ~ 'OFFICER',
                              str_detect(`Pay Grade`, 'W') ~ 'WARRANT')) %>%
  relocate(category)%>%
  mutate(final_pay_grade = str_sub(`Pay Grade`, start = 3, end = 3)) %>%
  relocate(final_pay_grade)

colnames(raw_active_duty_data)

 [1] "final_pay_grade"     "category"            "Pay Grade"          
 [4] "singwochild_male"    "singwochild_female"  "singwchild_male"    
 [7] "singwchild_female"   "marjointserv_male"   "marjointserv_female"
[10] "civilmar_male"       "civilmar_female"

raw_active_duty_data

# A tibble: 24 x 11
   final_pay_grade category `Pay Grade` singwochild_male
   <chr>           <chr>    <chr>                  <dbl>
 1 1               ENLISTED E-1                    31229
 2 2               ENLISTED E-2                    53094
 3 3               ENLISTED E-3                   131091
 4 4               ENLISTED E-4                   112710
 5 5               ENLISTED E-5                    57989
 6 6               ENLISTED E-6                    19125
 7 7               ENLISTED E-7                     5446
 8 8               ENLISTED E-8                     1009
 9 9               ENLISTED E-9                      381
10 1               OFFICER  O-1                    13495
# ... with 14 more rows, and 7 more variables:
#   singwochild_female <dbl>, singwchild_male <dbl>,
#   singwchild_female <dbl>, marjointserv_male <dbl>,
#   marjointserv_female <dbl>, civilmar_male <dbl>,
#   civilmar_female <dbl>

#raw_active_duty_data <- raw_active_duty_data %>%
 #  str_remove("*_")
# raw_active_duty_data

Pivoting the Data

The columnar data contains transaction information - specifically:

The designation of male or female is a response option value for the category gender
Single w/children, single w/o children, etc are also transactional responses to the category of marital status

This pivots the data to expand each row into a single instance of the scenario.

changed_data <- raw_active_duty_data %>%
pivot_longer(
    cols = singwochild_male:civilmar_female,
    names_to = c("marital", "gender"),
    names_sep = "_",
    values_to = "count")

colnames(changed_data)

[1] "final_pay_grade" "category"        "Pay Grade"      
[4] "marital"         "gender"          "count"

changed_data

# A tibble: 192 x 6
   final_pay_grade category `Pay Grade` marital      gender count
   <chr>           <chr>    <chr>       <chr>        <chr>  <dbl>
 1 1               ENLISTED E-1         singwochild  male   31229
 2 1               ENLISTED E-1         singwochild  female  5717
 3 1               ENLISTED E-1         singwchild   male     563
 4 1               ENLISTED E-1         singwchild   female   122
 5 1               ENLISTED E-1         marjointserv male     139
 6 1               ENLISTED E-1         marjointserv female   141
 7 1               ENLISTED E-1         civilmar     male    5060
 8 1               ENLISTED E-1         civilmar     female   719
 9 2               ENLISTED E-2         singwochild  male   53094
10 2               ENLISTED E-2         singwochild  female  8388
# ... with 182 more rows

Tidy-ing the Numeric Data and Grouping:

Just in case there are any non-numberics in the data, doing a quick check and population of an NAs.

Then grouping the data by the final pay grade and graphing by paygrade.

# Globally replace any value that wasn't provided (was a non-numeric), replace it with the median of all the other values in that instance. 

changed_data <- changed_data %>%
    mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = T), x))

changed_data

# A tibble: 192 x 6
   final_pay_grade category `Pay Grade` marital      gender count
   <chr>           <chr>    <chr>       <chr>        <chr>  <dbl>
 1 1               ENLISTED E-1         singwochild  male   31229
 2 1               ENLISTED E-1         singwochild  female  5717
 3 1               ENLISTED E-1         singwchild   male     563
 4 1               ENLISTED E-1         singwchild   female   122
 5 1               ENLISTED E-1         marjointserv male     139
 6 1               ENLISTED E-1         marjointserv female   141
 7 1               ENLISTED E-1         civilmar     male    5060
 8 1               ENLISTED E-1         civilmar     female   719
 9 2               ENLISTED E-2         singwochild  male   53094
10 2               ENLISTED E-2         singwochild  female  8388
# ... with 182 more rows

by_category <- changed_data %>%
  group_by(final_pay_grade)


# Graph 1: Attempt to plot that to find anything interesting

ggplot(by_category, aes(x = `category`, y = count)) + geom_col(fill = "turquoise3",color = "black") + facet_wrap(vars(final_pay_grade))

Comment on this article Share:

HW5

INTRODUCTION

Checking the Environment

Loading the Data

Pivoting the Data

Tidy-ing the Numeric Data and Grouping:

Reuse

Citation