Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Mani Shanker Kamarapu
August 21, 2022
The data set is about the marital status of the active duty DOD service. The data set focuses on four types of services(Air Force, Marine Corps, Navy and Army). The data set gives us the detailed description of marital status of the DOD service people and divided them on basis of gender and four types of statuses(Single without Children(SWC), Single with Children(SC), Joint Service Marriage(JSM) and Civilian Marriage(CM)). The enrolled people are also divided on the basis of pay grade into 3 groups(Enlisted, Officer and Warrant).
AirForce_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls",
sheet = "AirForce", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "Male", "Female", "Total")) %>%
select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
drop_na("Pay Grade") %>%
filter(!grepl("TOTAL", `Pay Grade`)) %>%
mutate(DOD_type="AirForce")
MarineCorps_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls",
sheet = "MarineCorps", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "M_Male", "M_Female", "M_Total")) %>%
select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
drop_na("Pay Grade") %>%
filter(!grepl("TOTAL", `Pay Grade`)) %>%
mutate(DOD_type="MarineCorps")
Navy_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls",
sheet = "Navy", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "N_Male", "N_Female", "N_Total")) %>%
select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
drop_na("Pay Grade") %>%
filter(!grepl("TOTAL", `Pay Grade`)) %>%
mutate(DOD_type="Navy")
Army_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls",
sheet = "Army", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "Army_Male", "Army_Female", "Army_Total")) %>%
select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
drop_na("Pay Grade") %>%
filter(!grepl("TOTAL", `Pay Grade`)) %>%
mutate(DOD_type="Army")
TotalDOD_MaritalStatus <- rbind(AirForce_MaritalStatus,MarineCorps_MaritalStatus, Navy_MaritalStatus, Army_MaritalStatus)
TotalDOD_MaritalStatus
# A tibble: 90 × 10
`Pay Grade` SWC_Male SWC_Fe…¹ SC_Male SC_Fe…² JSM_M…³ JSM_F…⁴ CM_Male CM_Fe…⁵
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 E-1 7721 1550 27 5 49 27 1064 178
2 E-2 4380 1010 33 9 97 105 802 163
3 E-3 29725 7108 396 266 1258 1687 10436 1631
4 E-4 20805 4756 987 842 3036 3207 15363 1769
5 E-5 14623 4104 2755 2171 6154 5519 31711 2889
6 E-6 3660 1377 2446 1449 3654 3263 23868 2026
7 E-7 1441 617 1539 734 2118 1419 17290 1188
8 E-8 182 139 236 110 505 241 3655 228
9 E-9 83 48 106 39 204 73 1975 108
10 O-1 3831 1068 45 43 182 265 1693 211
# … with 80 more rows, 1 more variable: DOD_type <chr>, and abbreviated
# variable names ¹SWC_Female, ²SC_Female, ³JSM_Male, ⁴JSM_Female, ⁵CM_Female
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
I have read the data sets and there were many rows empty, so I used the skip
to remove first nine rows of data and also removed some columns to reduce redundancy in data set by using select
function and then I removed empty rows by drop
function and filtered out the values which can calculated if wanted and added a new column using mutate
function in regards to combine the data sets and at last combined multiple data sets into one data set using rbind
function.
These are the present dimensions of the data set after reading and transforming it, It has 90 rows and 10 columns.
Pay Grade SWC_Male SWC_Female SC_Male
Length:90 Min. : 0.00 Min. : 0.0 Min. : 0.00
Class :character 1st Qu.: 19.25 1st Qu.: 7.0 1st Qu.: 25.25
Mode :character Median : 442.50 Median : 166.5 Median : 122.50
Mean : 5084.57 Mean : 931.5 Mean : 571.42
3rd Qu.: 4434.00 3rd Qu.:1091.2 3rd Qu.: 471.00
Max. :50758.00 Max. :7108.0 Max. :7010.00
SC_Female JSM_Male JSM_Female CM_Male
Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 3.0
1st Qu.: 2.25 1st Qu.: 17.0 1st Qu.: 6.0 1st Qu.: 439.5
Median : 42.50 Median : 114.0 Median : 81.0 Median : 2496.5
Mean : 271.91 Mean : 546.4 Mean : 507.9 Mean : 7266.8
3rd Qu.: 191.75 3rd Qu.: 608.0 3rd Qu.: 497.5 3rd Qu.: 9517.8
Max. :2599.00 Max. :6154.0 Max. :5519.0 Max. :58317.0
CM_Female DOD_type
Min. : 0.00 Length:90
1st Qu.: 9.25 Class :character
Median : 154.00 Mode :character
Mean : 541.13
3rd Qu.: 533.75
Max. :6010.00
As you might see the data set now there is lot of data which is still untidy, that is I can see that each observation doesn’t have it’s own row and need to be tidy further. First we need decrease the number of columns and use pivot_longer()
function to convert the columns into rows so each row can have different and unique observation. And then after that make the “Status” column separate and split it into two columns Status and gender so we can have a separate variable for gender.
# A tibble: 720 × 5
`Pay Grade` DOD_type Status Gender Count
<chr> <chr> <chr> <chr> <dbl>
1 E-1 AirForce SWC Male 7721
2 E-1 AirForce SWC Female 1550
3 E-1 AirForce SC Male 27
4 E-1 AirForce SC Female 5
5 E-1 AirForce JSM Male 49
6 E-1 AirForce JSM Female 27
7 E-1 AirForce CM Male 1064
8 E-1 AirForce CM Female 178
9 E-2 AirForce SWC Male 4380
10 E-2 AirForce SWC Female 1010
# … with 710 more rows
# ℹ Use `print(n = ...)` to see more rows
Now from the table we can see that each variable has it’s own column and each observation has it’s own row, so now the data set is tidy.
Now from the tidy data set, the final dimensions are 720 rows and 5 columns. The column variables are “Pay Grade”, “DOD_type”, “Status”, “Gender” and “Count”. The pay grade represents the level and stats of the people, the DOD_type tells us about the type of service the person is enrolled in and Status gives us the Marital status of the person and Count represents the count of people enrolled in different services based on their marital status and distinguishing by their pay grade.
---
title: "Homework2"
author: "Mani Shanker Kamarapu"
desription: "Reading in Data"
date: "08/21/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw2
- ActiveDuty_MartialStatus
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
The data set is about the marital status of the active duty DOD service. The data set focuses on four types of services(Air Force, Marine Corps, Navy and Army). The data set gives us the detailed description of marital status of the DOD service people and divided them on basis of gender and four types of statuses(Single without Children(SWC), Single with Children(SC), Joint Service Marriage(JSM) and Civilian Marriage(CM)). The enrolled people are also divided on the basis of pay grade into 3 groups(Enlisted, Officer and Warrant).
```{r}
AirForce_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls",
sheet = "AirForce", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "Male", "Female", "Total")) %>%
select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
drop_na("Pay Grade") %>%
filter(!grepl("TOTAL", `Pay Grade`)) %>%
mutate(DOD_type="AirForce")
MarineCorps_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls",
sheet = "MarineCorps", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "M_Male", "M_Female", "M_Total")) %>%
select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
drop_na("Pay Grade") %>%
filter(!grepl("TOTAL", `Pay Grade`)) %>%
mutate(DOD_type="MarineCorps")
Navy_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls",
sheet = "Navy", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "N_Male", "N_Female", "N_Total")) %>%
select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
drop_na("Pay Grade") %>%
filter(!grepl("TOTAL", `Pay Grade`)) %>%
mutate(DOD_type="Navy")
Army_MaritalStatus <- read_excel("_data/ActiveDuty_MaritalStatus.xls",
sheet = "Army", skip = 9, col_names = c("remove", "Pay Grade", "SWC_Male", "SWC_Female", "SWC_Total", "SC_Male", "SC_Female", "SC_Total", "JSM_Male", "JSM_Female", "JSM_Total", "CM_Male", "CM_Female", "CM_Total", "Army_Male", "Army_Female", "Army_Total")) %>%
select("Pay Grade", "SWC_Male", "SWC_Female", "SC_Male", "SC_Female", "JSM_Male", "JSM_Female", "CM_Male", "CM_Female") %>%
drop_na("Pay Grade") %>%
filter(!grepl("TOTAL", `Pay Grade`)) %>%
mutate(DOD_type="Army")
TotalDOD_MaritalStatus <- rbind(AirForce_MaritalStatus,MarineCorps_MaritalStatus, Navy_MaritalStatus, Army_MaritalStatus)
TotalDOD_MaritalStatus
```
### Briefly describe the data
I have read the data sets and there were many rows empty, so I used the `skip` to remove first nine rows of data and also removed some columns to reduce redundancy in data set by using `select` function and then I removed empty rows by `drop` function and filtered out the values which can calculated if wanted and added a new column using `mutate` function in regards to combine the data sets and at last combined multiple data sets into one data set using `rbind` function.
:::{.callout-tip}
### Mutate function
It is used to create new columns with the values of old columns.
:::
:::{.callout-important}
### Rbind function
It is used to select to combine two or more data sets into one data set vertically. Catch is that they should have same columns.
:::
```{r}
dim(TotalDOD_MaritalStatus)
```
These are the present dimensions of the data set after reading and transforming it, It has 90 rows and 10 columns.
```{r}
summary(TotalDOD_MaritalStatus)
```
## Anticipate the End Result
As you might see the data set now there is lot of data which is still untidy, that is I can see that each observation doesn't have it's own row and need to be tidy further. First we need decrease the number of columns and use `pivot_longer()` function to convert the columns into rows so each row can have different and unique observation. And then after that make the "Status" column separate and split it into two columns Status and gender so we can have a separate variable for gender.
## Pivot the Data
```{r}
TotalDOD_MaritalStatus <- pivot_longer(TotalDOD_MaritalStatus, SWC_Male:CM_Female, names_to = "Status", values_to = "Count") %>%
separate(col = "Status", into= c("Status", "Gender"), "_") %>%
arrange("Pay Grade")
TotalDOD_MaritalStatus
```
:::{.callout-tip}
### Pivot longer function
It is used to convert columns into rows so we can have each observation for each row.
:::
:::{.callout-important}
### Separate function
It is an important function used to split one column into two or more columns by using `sep` parameter.
:::
Now from the table we can see that each variable has it's own column and each observation has it's own row, so now the data set is tidy.
### Describe the final dimensions
```{r}
dim(TotalDOD_MaritalStatus)
```
Now from the tidy data set, the final dimensions are 720 rows and 5 columns. The column variables are "Pay Grade", "DOD_type", "Status", "Gender" and "Count". The pay grade represents the level and stats of the people, the DOD_type tells us about the type of service the person is enrolled in and Status gives us the Marital status of the person and Count represents the count of people enrolled in different services based on their marital status and distinguishing by their pay grade.