hw2
ActiveDuty_MartialStatus
Author

Mani Shanker Kamarapu

Published

August 21, 2022

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

The data set is about the marital status of the active duty DOD service. The data set focuses on four types of services(Air Force, Marine Corps, Navy and Army). The data set gives us the detailed description of marital status of the DOD service people and divided them on basis of gender and four types of statuses(Single without Children(SWC), Single with Children(SC), Joint Service Marriage(JSM) and Civilian Marriage(CM)). The enrolled people are also divided on the basis of pay grade into 3 groups(Enlisted, Officer and Warrant).

Code
AirForce_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls", 
    sheet = "AirForce", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "Male", "Female", "Total")) %>%
  select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
  drop_na("Pay Grade") %>%
  filter(!grepl("TOTAL", `Pay Grade`)) %>%
    mutate(DOD_type="AirForce")

MarineCorps_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls", 
    sheet = "MarineCorps", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "M_Male", "M_Female", "M_Total")) %>%
  select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
  drop_na("Pay Grade") %>%
  filter(!grepl("TOTAL", `Pay Grade`)) %>%
    mutate(DOD_type="MarineCorps")

Navy_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls", 
    sheet = "Navy", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "N_Male", "N_Female", "N_Total")) %>%
  select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
  drop_na("Pay Grade") %>%
  filter(!grepl("TOTAL", `Pay Grade`)) %>%
    mutate(DOD_type="Navy")

Army_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls", 
    sheet = "Army", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "Army_Male", "Army_Female", "Army_Total")) %>%
  select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
  drop_na("Pay Grade") %>%
  filter(!grepl("TOTAL", `Pay Grade`)) %>%
    mutate(DOD_type="Army")

TotalDOD_MaritalStatus <- rbind(AirForce_MaritalStatus,MarineCorps_MaritalStatus, Navy_MaritalStatus, Army_MaritalStatus)

TotalDOD_MaritalStatus
# A tibble: 90 × 10
   `Pay Grade` SWC_Male SWC_Fe…¹ SC_Male SC_Fe…² JSM_M…³ JSM_F…⁴ CM_Male CM_Fe…⁵
   <chr>          <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 E-1             7721     1550      27       5      49      27    1064     178
 2 E-2             4380     1010      33       9      97     105     802     163
 3 E-3            29725     7108     396     266    1258    1687   10436    1631
 4 E-4            20805     4756     987     842    3036    3207   15363    1769
 5 E-5            14623     4104    2755    2171    6154    5519   31711    2889
 6 E-6             3660     1377    2446    1449    3654    3263   23868    2026
 7 E-7             1441      617    1539     734    2118    1419   17290    1188
 8 E-8              182      139     236     110     505     241    3655     228
 9 E-9               83       48     106      39     204      73    1975     108
10 O-1             3831     1068      45      43     182     265    1693     211
# … with 80 more rows, 1 more variable: DOD_type <chr>, and abbreviated
#   variable names ¹​SWC_Female, ²​SC_Female, ³​JSM_Male, ⁴​JSM_Female, ⁵​CM_Female
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Briefly describe the data

I have read the data sets and there were many rows empty, so I used the skip to remove first nine rows of data and also removed some columns to reduce redundancy in data set by using select function and then I removed empty rows by drop function and filtered out the values which can calculated if wanted and added a new column using mutate function in regards to combine the data sets and at last combined multiple data sets into one data set using rbind function.

Mutate function

It is used to create new columns with the values of old columns.

Rbind function

It is used to select to combine two or more data sets into one data set vertically. Catch is that they should have same columns.

Code
dim(TotalDOD_MaritalStatus)
[1] 90 10

These are the present dimensions of the data set after reading and transforming it, It has 90 rows and 10 columns.

Code
summary(TotalDOD_MaritalStatus)
  Pay Grade            SWC_Male          SWC_Female        SC_Male       
 Length:90          Min.   :    0.00   Min.   :   0.0   Min.   :   0.00  
 Class :character   1st Qu.:   19.25   1st Qu.:   7.0   1st Qu.:  25.25  
 Mode  :character   Median :  442.50   Median : 166.5   Median : 122.50  
                    Mean   : 5084.57   Mean   : 931.5   Mean   : 571.42  
                    3rd Qu.: 4434.00   3rd Qu.:1091.2   3rd Qu.: 471.00  
                    Max.   :50758.00   Max.   :7108.0   Max.   :7010.00  
   SC_Female          JSM_Male        JSM_Female        CM_Male       
 Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Min.   :    3.0  
 1st Qu.:   2.25   1st Qu.:  17.0   1st Qu.:   6.0   1st Qu.:  439.5  
 Median :  42.50   Median : 114.0   Median :  81.0   Median : 2496.5  
 Mean   : 271.91   Mean   : 546.4   Mean   : 507.9   Mean   : 7266.8  
 3rd Qu.: 191.75   3rd Qu.: 608.0   3rd Qu.: 497.5   3rd Qu.: 9517.8  
 Max.   :2599.00   Max.   :6154.0   Max.   :5519.0   Max.   :58317.0  
   CM_Female         DOD_type        
 Min.   :   0.00   Length:90         
 1st Qu.:   9.25   Class :character  
 Median : 154.00   Mode  :character  
 Mean   : 541.13                     
 3rd Qu.: 533.75                     
 Max.   :6010.00                     

Anticipate the End Result

As you might see the data set now there is lot of data which is still untidy, that is I can see that each observation doesn’t have it’s own row and need to be tidy further. First we need decrease the number of columns and use pivot_longer() function to convert the columns into rows so each row can have different and unique observation. And then after that make the “Status” column separate and split it into two columns Status and gender so we can have a separate variable for gender.

Pivot the Data

Code
TotalDOD_MaritalStatus <-  pivot_longer(TotalDOD_MaritalStatus, SWC_Male:CM_Female, names_to = "Status", values_to = "Count") %>% 
  separate(col = "Status", into= c("Status", "Gender"), "_") %>%
  arrange("Pay Grade")

TotalDOD_MaritalStatus
# A tibble: 720 × 5
   `Pay Grade` DOD_type Status Gender Count
   <chr>       <chr>    <chr>  <chr>  <dbl>
 1 E-1         AirForce SWC    Male    7721
 2 E-1         AirForce SWC    Female  1550
 3 E-1         AirForce SC     Male      27
 4 E-1         AirForce SC     Female     5
 5 E-1         AirForce JSM    Male      49
 6 E-1         AirForce JSM    Female    27
 7 E-1         AirForce CM     Male    1064
 8 E-1         AirForce CM     Female   178
 9 E-2         AirForce SWC    Male    4380
10 E-2         AirForce SWC    Female  1010
# … with 710 more rows
# ℹ Use `print(n = ...)` to see more rows
Pivot longer function

It is used to convert columns into rows so we can have each observation for each row.

Separate function

It is an important function used to split one column into two or more columns by using sep parameter.

Now from the table we can see that each variable has it’s own column and each observation has it’s own row, so now the data set is tidy.

Describe the final dimensions

Code
dim(TotalDOD_MaritalStatus)
[1] 720   5

Now from the tidy data set, the final dimensions are 720 rows and 5 columns. The column variables are “Pay Grade”, “DOD_type”, “Status”, “Gender” and “Count”. The pay grade represents the level and stats of the people, the DOD_type tells us about the type of service the person is enrolled in and Status gives us the Marital status of the person and Count represents the count of people enrolled in different services based on their marital status and distinguishing by their pay grade.