Gender Wage Gap Across Age and Occupation

DACSS 603_Final Project_MDT
Author

Meredith Derian-Toth

Published

May 21, 2023

Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Code
#reading in data
jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")
earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv") 
employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv") 

#bringing in libraries
library(knitr) #provides kable(), used for the tables below
library(dplyr)
library(vtable)
library(ggplot2)
library(readr)
library(tidyverse)
library(stringr)
library(lubridate)

Introduction

This research exploration takes a close look at the gender wage gap across age groups and occupations. While the wage gap has narrowed overall since 1980, women still earn less than men. In 2022, women made, on average, about 82 cents for every dollar men made (Aragao, 2023). The line graph below (Female Salary Percent of Male Salary) shows the trend from 1979 to 2011.

Code
#wage gap by year
#Would also like to add a legend and labels to this graph
earnings_female%>%
  filter(str_detect(group,"Total, 16 years and older")) %>%
  ggplot(aes(x = Year)) +
  geom_line(aes(y = percent)) +
  geom_point(aes(y = percent)) +
  labs(title = "Female Salary Percent of Male Salary", 
       x="Year", 
       y="Percent of Salary") +
  ylim(0,100)

The Data set

The data used in this exploration and analysis originated from the Bureau of Labor Statistics and the Census Bureau; the three datasets themselves came from a Tidy Tuesday data dive. These data describe women in the workforce across time, ages, and occupations. The three datasets are: (1) earnings_female, historical data giving women’s earnings as a percent of men’s, broken down by age group, from 1979 to 2011; (2) employed_gender, another historical dataset giving workforce information (percent of women and men working full time and part time) by year, from 1968 to 2016; and (3) jobs_gender, detailed data on occupations, earnings for those occupations by gender, and women’s earnings as a percent of men’s, from 2013 to 2016.

The tables below provide the descriptive statistics for the dependent variable, female wage percent of male earnings. The first table is the dependent variable across occupations and year (from 2013 to 2016). The second table shows the dependent variable for women who are 16 years or older from 1979 to 2011.
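
As a small illustration of how the dependent variable is read, the sketch below computes a female percent of male earnings from two made-up median earnings figures. The numbers and the names `male_median` and `female_median` are hypothetical; the formula (female earnings divided by male earnings, times 100) is assumed from the variable's description rather than taken from the original code.

```r
# Hypothetical median annual earnings for one occupation (made-up numbers)
male_median <- 50000
female_median <- 41000

# Female earnings as a percent of male earnings
wage_percent_of_male <- 100 * female_median / male_median

# The "wage gap" in percentage points is the distance from parity (100)
wage_gap_points <- 100 - wage_percent_of_male

print(c(percent_of_male = wage_percent_of_male, gap_points = wage_gap_points))
```

A value of 82 therefore corresponds to the "82 cents for every dollar" phrasing used in the introduction.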

Code
#Descriptive Statistics for the dependent variable: wage percent of male 
#select() with renaming replaces the deprecated data_frame() call
SumWage <- jobs_gender %>%
  select(Year = year, wage_percent_of_male)
sumtable(SumWage)  
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
Year 2088 2014 1.1 2013 2014 2015 2016
wage_percent_of_male 1242 84 9.4 51 78 91 117
Code
#Descriptive Statistics for the historical data of the Dependent variable: wage percent of male
earnings_female %>%
  group_by(group) %>%
  filter(str_detect(group,"Total, 16 years and older")) %>%
  sumtable()
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
Year 33 1995 9.7 1979 1987 2003 2011
group 33
... Total, 16 years and older 33 100%
percent 33 74 5.7 62 70 79 82

Research Question

The purpose of the following exploration and analysis is to answer the question: Which factors contribute to a wider or narrower gender wage gap? The factors explored in this particular analysis are age and occupation. Occupation is categorized into broad and detailed categories, and will be explored by comparing female-dominated and male-dominated occupations.

These factors have been explored in the literature. Many have found that the wage gap is smaller for younger women than for older women (Aragao, 2023; Blau & Kahn, 2016; Kochhar, 2023), and Wrohlich (2017) found a smaller wage gap in the public sector than in the private sector. This analysis goes beyond comparing private and public occupations to compare male-dominated and female-dominated occupations. This comparison has been made considerably less often and is an important part of knowing where to target efforts to close the wage gap.

Hypotheses

According to the research, the wage gap for a woman widens as she gets older (Aragao, 2023; Blau & Kahn, 2016; Kochhar, 2023), and the wage gap is wider for women in the private sector than in the public sector (Wrohlich, 2017). Based on this literature, we hypothesize that younger women and women working in female-dominated occupations will experience a smaller wage gap than older women and those working in male-dominated occupations.

Wage Gap by Age Group

The graph below shows the average wage gap across age groups. These data show that older women make a smaller percent of what men make. The wage gap widens with age, with a particular dip at around 25 - 44 years old.

Code
#wage gap by age
earnings_female %>%
  group_by(group) %>%
  filter(!str_detect(group,"Total, 16 years and older")) %>%
  summarize(Mean_Percent_salary = mean(percent))%>%
  ggplot(aes(group,Mean_Percent_salary)) +
  geom_col(aes(fill = group)) +
  labs(title = "Female Salary Percent of Male Salary by Age Group",
       x="Age Group", 
       y="Average Percent of Salary") +
  geom_text(aes(y=Mean_Percent_salary, 
                label=sprintf("%0.2f", round(Mean_Percent_salary, digits = 2)))) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

One hypothesis for this is that the younger generation is experiencing a narrower wage gap. In that case, we would expect to see the wage gap narrow across time.

The table and graph below show the wage gap across time, using the jobs_gender data for 2013 to 2016. Over these four years the average stayed near 84%, so we conclude that the wage gap did not narrow. We can therefore attribute the widening of the wage gap to age group.

Code
#wage gap by year
SummaryYear <- jobs_gender %>%
  group_by(year)%>%
  drop_na(wage_percent_of_male)%>%
  summarise(Average_Percent_of_Salary = mean(wage_percent_of_male))

kable(SummaryYear)
year Average_Percent_of_Salary
2013 83.77102
2014 83.70873
2015 84.45372
2016 84.19526
Code
jobs_gender %>%
  group_by(year)%>%
  drop_na(wage_percent_of_male)%>%
  summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))%>%
  ggplot(aes(fill = year, y = `Mean_wage_percent_of_males`, x=year)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(title = "Female Salary Percent of Male Salary by Year",
       x="Year", 
       y="Average Percent of Salary") +
  geom_text(aes(y=Mean_wage_percent_of_males, 
                label=sprintf("%0.2f", round(Mean_wage_percent_of_males, digits = 2)))) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Wage Gap by Occupation

To investigate the wage gap by occupation, we compare male-dominated and female-dominated occupations. The table below shows the average number of female workers and male workers, averaged across the four years of data (2013 - 2016), by job category. The table is sorted in descending order by average female workers.

Code
#creating a table of just total workers by gender and category
#select() with renaming replaces the deprecated data_frame() call
JobsCategoryGen <- jobs_gender %>%
  select(Year = year,
         `Occupation Category (Minor)` = minor_category,
         `Total Male Worker` = workers_male,
         `Total Female Workers` = workers_female)

SummaryJobsCatGen <- JobsCategoryGen%>%
  group_by(`Occupation Category (Minor)`)%>%
  summarise(AverageMaleWorkers = mean(`Total Male Worker`), 
            AverageFemaleWorkers = mean(`Total Female Workers`))

kable(SummaryJobsCatGen[order(SummaryJobsCatGen$AverageFemaleWorkers, decreasing=TRUE),])
Occupation Category (Minor) AverageMaleWorkers AverageFemaleWorkers
Education, Training, and Library 138707.95 336849.955
Sales and Related 324403.33 226652.667
Building and Grounds Cleaning and Maintenance 377726.67 188702.792
Office and Administrative Support 73641.55 182680.466
Healthcare Support 28363.98 169199.227
Management 269406.44 167326.808
Community and Social Service 87166.78 148718.469
Healthcare Practitioners and Technical 54026.68 139826.234
Legal 137979.85 138897.400
Food Preparation and Serving Related 149773.50 128210.135
Business and Financial Operations 96380.12 113663.170
Personal Care and Service 33068.60 97494.512
Computer and mathematical 169581.77 56163.156
Arts, Design, Entertainment, Sports, and Media 56255.49 41269.278
Material Moving 146246.23 32834.804
Protective Service 113246.58 27095.264
Production 58897.09 20524.166
Life, Physical, and Social Science 25409.11 19563.148
Transportation 171202.30 18166.075
Architecture and Engineering 96430.32 15617.167
Farming, Fishing, and Forestry 66716.75 13853.938
Installation, Maintenance, and Repair 105593.08 3813.965
Construction and Extraction 137969.88 3553.678

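From this list, the following 10 occupation categories have the highest average number of female workers. This ranking reflects headcount rather than share (in Sales and Related, for example, men still outnumber women on average):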
(1) Education, Training, and Library
(2) Sales and Related
(3) Building and Grounds Cleaning and Maintenance
(4) Office and Administrative Support
(5) Healthcare Support
(6) Management
(7) Community and Social Service
(8) Healthcare Practitioners and Technical
(9) Legal
(10) Food Preparation and Serving Related
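
Because the ranking above is based on worker counts, it can help to look at the share of women within a category as well. The sketch below does this for four of the categories, using the average counts copied from the table above. The column names are illustrative, and the share of average counts is not the same quantity as the occupation-level `percent_female` variable used later.

```r
# Average worker counts copied from the table above (four of the 23 categories)
cats <- data.frame(
  category = c("Education, Training, and Library", "Sales and Related",
               "Healthcare Support", "Management"),
  avg_male = c(138707.95, 324403.33, 28363.98, 269406.44),
  avg_female = c(336849.955, 226652.667, 169199.227, 167326.808)
)

# Share of workers in each category who are women, in percent
cats$share_female <- 100 * cats$avg_female / (cats$avg_male + cats$avg_female)

# TRUE when women are the majority of workers in the category
cats$majority_female <- cats$share_female > 50

print(cats[, c("category", "share_female", "majority_female")])
```

In this subset, Education, Training, and Library and Healthcare Support are majority female, while Sales and Related and Management are not, even though all four rank in the top 10 by female headcount.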

Next we see the same occupation categories listed in ascending order of female percent of male earnings. Women who work in occupations in the “Community and Social Service” category make, on average, 91% of what a man in their field makes, whereas women working in “Production” make, on average, 77%.

Code
#wagegap by minor occupation category
SummaryMinorCat <- jobs_gender%>%
  group_by(minor_category)%>%
  drop_na(wage_percent_of_male) %>%
  summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))

kable(SummaryMinorCat[order(SummaryMinorCat$Mean_wage_percent_of_males, SummaryMinorCat$minor_category, decreasing=FALSE),])
minor_category Mean_wage_percent_of_males
Production 77.23993
Legal 77.81342
Sales and Related 77.83065
Building and Grounds Cleaning and Maintenance 79.31545
Farming, Fishing, and Forestry 79.61618
Transportation 79.88060
Business and Financial Operations 80.16418
Management 80.80490
Arts, Design, Entertainment, Sports, and Media 84.79955
Life, Physical, and Social Science 84.94294
Installation, Maintenance, and Repair 85.45982
Healthcare Practitioners and Technical 86.00865
Protective Service 86.02423
Office and Administrative Support 86.18804
Personal Care and Service 86.73739
Education, Training, and Library 87.19622
Food Preparation and Serving Related 87.44786
Architecture and Engineering 87.59434
Construction and Extraction 87.94221
Computer and mathematical 88.12798
Material Moving 89.10787
Healthcare Support 90.31624
Community and Social Service 91.43559
Code
#The visualization below is in VERY draft form, and cannot provide helpful information until it is reformatted. 
#jobs_gender %>%
  #group_by(minor_category)%>%
  #drop_na(wage_percent_of_male)%>%
  #summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))%>%
  #ggplot(aes(fill = minor_category, y = `Mean_wage_percent_of_males`, x=minor_category)) +
  #geom_bar(position = "dodge", stat = "identity") +
  #labs(title = "Female Salary Percent of Male Salary by Minor Occupation Category",
       #x="Occupation Cateogry", 
       #y="Average Percent of Salary") +
  #theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

When comparing the 10 occupation categories with the most female workers (Table: “Occupation Categories Dominated by Women”) with the first 10 rows of the wage table above, we can see some overlap. Because that table is sorted in ascending order of female percent of male earnings, its first 10 rows are the categories with the largest wage gap (Table: “Occupation Categories with the Largest Wage Gap”).

Occupation Categories Dominated by Women
(1) Education, Training, and Library
(2) Sales and Related
(3) Building and Grounds Cleaning and Maintenance
(4) Office and Administrative Support
(5) Healthcare Support
(6) Management
(7) Community and Social Service
(8) Healthcare Practitioners and Technical
(9) Legal
(10) Food Preparation and Serving Related

Occupation Categories with the Largest Wage Gap
(1) Production
(2) Legal
(3) Sales and Related
(4) Building and Grounds Cleaning and Maintenance
(5) Farming, Fishing, and Forestry
(6) Transportation
(7) Business and Financial Operations
(8) Management
(9) Arts, Design, Entertainment, Sports, and Media
(10) Life, Physical, and Social Science
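
The overlap between the two ten-item lists above can be checked directly rather than by eye. The sketch below copies both lists into character vectors (the vector names are illustrative) and uses base R's `intersect()` to find the categories that appear in both.

```r
# First list above: categories with the most female workers
most_female_workers <- c(
  "Education, Training, and Library", "Sales and Related",
  "Building and Grounds Cleaning and Maintenance",
  "Office and Administrative Support", "Healthcare Support", "Management",
  "Community and Social Service", "Healthcare Practitioners and Technical",
  "Legal", "Food Preparation and Serving Related"
)

# Second list above: first ten rows of the wage table
second_list <- c(
  "Production", "Legal", "Sales and Related",
  "Building and Grounds Cleaning and Maintenance",
  "Farming, Fishing, and Forestry", "Transportation",
  "Business and Financial Operations", "Management",
  "Arts, Design, Entertainment, Sports, and Media",
  "Life, Physical, and Social Science"
)

# Categories that appear in both lists, in the order of the first list
overlap <- intersect(most_female_workers, second_list)
print(overlap)
```

Four categories appear in both lists: Sales and Related, Building and Grounds Cleaning and Maintenance, Management, and Legal.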

Further analysis is necessary to confirm if there is a statistically significant relationship between gender-dominated occupation and wage gap.

Analysis

Relationship Between Age Group and Wage Gap

The visualization below shows the relationship between our first independent variable, age group, and our dependent variable, female wage percent of male. This visualization tells us that starting at age 25, women’s percent of men’s earnings already starts to decrease, with a continued drop at age 35. This drop is addressed in the literature and is believed to be due to women beginning to have children (Aragao, 2023; Kochhar, 2023).

Code
earnings_female %>%
  filter(!str_detect(group, "Total, 16 years and older"))%>%
  ggplot(aes(x = group, y = percent)) + 
  geom_boxplot() +
  geom_smooth(method = 'lm', se=F) +
  labs(title = "Female Salary Percent of Male Salary by Age", 
       x="Age Group", 
       y="Percent of Salary")

Analysis of Relationship Between Age Group and Wage Gap

An Analysis of Variance (ANOVA) was run to investigate the statistical relationship between age group and female earnings percent of male, controlling for year.

Code
summary(AgeANOVA<-aov(percent~group + Year, data = earnings_female))
             Df Sum Sq Mean Sq F value Pr(>F)    
group         7  21094    3013   308.2 <2e-16 ***
Year          1   4819    4819   492.9 <2e-16 ***
Residuals   255   2493      10                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of the ANOVA show a significant difference among the age groups (p < 2e-16). A Tukey post-hoc analysis was used to break down where the significant differences exist.

Code
TukeyHSD(AgeANOVA, conf.level = .95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = percent ~ group + Year, data = earnings_female)

$group
                                                    diff         lwr
20-24 years-16-19 years                       -1.0666667  -3.4191912
25-34 years-16-19 years                       -9.8242424 -12.1767670
35-44 years-16-19 years                      -20.6303030 -22.9828276
45-54 years-16-19 years                      -23.6696970 -26.0222215
55-64 years-16-19 years                      -24.1060606 -26.4585851
65 years and older-16-19 years               -17.2939394 -19.6464639
Total, 16 years and older-16-19 years        -16.8848485 -19.2373730
25-34 years-20-24 years                       -8.7575758 -11.1101003
35-44 years-20-24 years                      -19.5636364 -21.9161609
45-54 years-20-24 years                      -22.6030303 -24.9555548
55-64 years-20-24 years                      -23.0393939 -25.3919185
65 years and older-20-24 years               -16.2272727 -18.5797973
Total, 16 years and older-20-24 years        -15.8181818 -18.1707064
35-44 years-25-34 years                      -10.8060606 -13.1585851
45-54 years-25-34 years                      -13.8454545 -16.1979791
55-64 years-25-34 years                      -14.2818182 -16.6343427
65 years and older-25-34 years                -7.4696970  -9.8222215
Total, 16 years and older-25-34 years         -7.0606061  -9.4131306
45-54 years-35-44 years                       -3.0393939  -5.3919185
55-64 years-35-44 years                       -3.4757576  -5.8282821
65 years and older-35-44 years                 3.3363636   0.9838391
Total, 16 years and older-35-44 years          3.7454545   1.3929300
55-64 years-45-54 years                       -0.4363636  -2.7888882
65 years and older-45-54 years                 6.3757576   4.0232330
Total, 16 years and older-45-54 years          6.7848485   4.4323240
65 years and older-55-64 years                 6.8121212   4.4595967
Total, 16 years and older-55-64 years          7.2212121   4.8686876
Total, 16 years and older-65 years and older   0.4090909  -1.9434336
                                                     upr     p adj
20-24 years-16-19 years                        1.2858579 0.8631209
25-34 years-16-19 years                       -7.4717179 0.0000000
35-44 years-16-19 years                      -18.2777785 0.0000000
45-54 years-16-19 years                      -21.3171724 0.0000000
55-64 years-16-19 years                      -21.7535361 0.0000000
65 years and older-16-19 years               -14.9414149 0.0000000
Total, 16 years and older-16-19 years        -14.5323240 0.0000000
25-34 years-20-24 years                       -6.4050512 0.0000000
35-44 years-20-24 years                      -17.2111118 0.0000000
45-54 years-20-24 years                      -20.2505058 0.0000000
55-64 years-20-24 years                      -20.6868694 0.0000000
65 years and older-20-24 years               -13.8747482 0.0000000
Total, 16 years and older-20-24 years        -13.4656573 0.0000000
35-44 years-25-34 years                       -8.4535361 0.0000000
45-54 years-25-34 years                      -11.4929300 0.0000000
55-64 years-25-34 years                      -11.9292936 0.0000000
65 years and older-25-34 years                -5.1171724 0.0000000
Total, 16 years and older-25-34 years         -4.7080815 0.0000000
45-54 years-35-44 years                       -0.6868694 0.0025478
55-64 years-35-44 years                       -1.1232330 0.0002566
65 years and older-35-44 years                 5.6888882 0.0005509
Total, 16 years and older-35-44 years          6.0979791 0.0000542
55-64 years-45-54 years                        1.9161609 0.9992102
65 years and older-45-54 years                 8.7282821 0.0000000
Total, 16 years and older-45-54 years          9.1373730 0.0000000
65 years and older-55-64 years                 9.1646457 0.0000000
Total, 16 years and older-55-64 years          9.5737367 0.0000000
Total, 16 years and older-65 years and older   2.7616154 0.9994827

The Tukey post-hoc results are consistent with the boxplot above. When the youngest women (the 16-19 age group) are compared with each age group 25 years or older, the difference in percent of male earnings is significant (adjusted p < 0.0001); the 16-19 and 20-24 groups do not differ significantly from each other (adjusted p = 0.86). That is to say, once a woman turns 25, she already earns a significantly smaller percent of male earnings than women aged 16-24. With each step up in age group from 25-34 onward, women earn a significantly smaller percent than the group before (adjusted p < 0.01) until age 55: there is no significant difference between the 45-54 and 55-64 age groups (adjusted p = 0.999). When a woman turns 65, she begins to earn a significantly larger percent than women in the age group before her, yet she still makes, on average, about 73% of what her male peers make.
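
The Tukey output above also includes comparisons against “Total, 16 years and older”, which is an aggregate of the other groups rather than an age group of its own. One possible variation, sketched below on simulated data, is to drop that row before fitting so that only age groups are compared. The data frame `toy`, its values, and the reduced set of three age groups are made up for illustration; this is not a re-analysis of earnings_female.

```r
# Simulated stand-in for earnings_female (values are made up)
set.seed(1)
groups <- c("16-19 years", "25-34 years", "35-44 years",
            "Total, 16 years and older")
toy <- expand.grid(Year = 1979:1988, group = groups,
                   stringsAsFactors = FALSE)
base <- c(90, 80, 70, 75)[match(toy$group, groups)]
toy$percent <- base + 0.3 * (toy$Year - 1979) + rnorm(nrow(toy))

# Drop the aggregate row so only age groups are compared
age_only <- subset(toy, group != "Total, 16 years and older")
age_only$group <- factor(age_only$group)

# Same model form as above: age group, controlling for year
fit <- aov(percent ~ group + Year, data = age_only)

# Pairwise comparisons for the age-group factor only
tukey <- TukeyHSD(fit, which = "group", conf.level = 0.95)
print(tukey$group)
```

With three age groups this yields three pairwise comparisons, each with its own difference, confidence bounds, and adjusted p value.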

Relationship Between Occupation and Wage Gap

The second independent variable investigated was whether an occupation’s majority gender has a relationship to the female percent of male earnings. We hypothesized that the wage gap is wider in male dominated occupations, therefore the female percent of male earnings would be lower in those occupations. We looked at this in two different ways.

First, a new variable was created, gender_dominatedV. Occupations in which 50.1% or more of workers are female were coded as “0”, and occupations in which 50% or fewer are female were coded as “1”. In other words, “Male-Dominated” is coded as “1” and “Female-Dominated” is coded as “0”.
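
One detail of this coding rule is that a value strictly between 50 and 50.1 matches neither condition and would be left as NA. The summary of jobs_gender further below shows no NAs for gender_dominatedV, so this does not appear to affect these data. The sketch below illustrates the behaviour on a few made-up values, using nested `ifelse()` as a base-R stand-in for the two-condition rule, and compares it with a single-cutoff alternative (an assumption, not the author's code).

```r
# Made-up percent_female values around the cutoff
percent_female <- c(49.9, 50.0, 50.05, 50.1, 75.0)

# Two-threshold rule as described above: values between 50 and 50.1 get NA
two_thresholds <- ifelse(percent_female >= 50.1, 0,
                         ifelse(percent_female <= 50, 1, NA))

# Hypothetical alternative: a single cutoff leaves no gap
single_cutoff <- ifelse(percent_female > 50, 0, 1)

print(data.frame(percent_female, two_thresholds, single_cutoff))
```

The two rules agree everywhere except at 50.05, where the first gives NA and the second gives 0.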

Code
#Using minor_category and percent_female variables to create a new variable of "female dominated" vs. "male dominated"

#gender_dominated<-data_frame(jobs_gender$occupation, jobs_gender$percent_female)

#head(jobs_gender)
#kable(gender_dominated[order(gender_dominated$`jobs_gender$percent_female`, gender_dominated$`jobs_gender$occupation`, decreasing = TRUE),])


jobs_gender<-jobs_gender%>%
  mutate(gender_dominatedV = case_when(percent_female >=50.1 ~ 0,
                                       percent_female <=50 ~ 1))
#head(jobs_gender)

In the visualization below, we can see a negative relationship between male-dominated occupations and female percent of male earnings. This means women in male-dominated fields make a smaller percent of their male colleagues’ earnings than women in female-dominated fields do. The second visualization takes a different perspective.

Code
ggplot(data = jobs_gender, aes(x = gender_dominatedV, y = wage_percent_of_male)) + 
  geom_point() +
  geom_smooth(method = 'lm', se=F) +
  labs(title = "Female Salary Percent of Male Salary by Gender Dominated Occupation", 
       x="Gender Dominated", 
       y="Percent of Salary")

The second way of looking at the majority gender in an occupation was with the variable “percent_female”, which provides the percent of people in an occupation who identified as female. The visualization below uses this variable to compare the percent of women in an occupation with the female percent of male earnings. Here we can see a positive relationship: as the share of women in an occupation increases, their percent of male earnings also increases. Both visualizations show that women working in male-dominated occupations make a smaller percent of what men make than women working in female-dominated occupations.

Code
ggplot(data = jobs_gender, aes(x = percent_female, y = wage_percent_of_male)) + 
  geom_point() +
  geom_smooth(method = 'lm', se=F) +
  labs(title = "Female Salary Percent of Male Salary by Percent of Women in Occupations", 
       x="Percent of Women", 
       y="Percent of Salary") 

Analysis of Relationship Between Occupation and Wage Gap

To test whether this relationship is statistically significant, three regression analyses were run. Missing data were considered in this analysis. Values of wage_percent_of_male were coded as NA when the sample size for an occupation was too small (846 NAs in total). These data are therefore “missing not at random”. Rows with missing values were dropped when running the regressions. This is worth noting because it could bias the results: the regressions do not cover the data set’s full set of occupations.
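
One way to see what “missing not at random” looks like in practice is to compare the size of occupations with and without a recorded wage percent. The sketch below does this on simulated data in which small occupations are deliberately set to NA. The data frame `toy` and the cutoff of 20,000 workers are hypothetical and chosen only to mimic the pattern described above.

```r
# Simulated occupations (values are made up)
set.seed(7)
toy <- data.frame(
  total_workers = round(exp(rnorm(200, mean = 10, sd = 1.5)))
)
toy$wage_percent_of_male <- rnorm(200, mean = 84, sd = 9)

# Hypothetical rule: small occupations have no reported wage percent
toy$wage_percent_of_male[toy$total_workers < 20000] <- NA

# Compare occupation size between rows with and without the outcome
is_missing <- is.na(toy$wage_percent_of_male)
median_by_missing <- tapply(toy$total_workers, is_missing, median)
print(median_by_missing)
```

If the median number of workers is much lower among the rows with NA, as it is here by construction, then dropping those rows leaves a sample weighted toward larger occupations.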

Code
summary(jobs_gender)
      year       occupation        major_category     minor_category    
 Min.   :2013   Length:2088        Length:2088        Length:2088       
 1st Qu.:2014   Class :character   Class :character   Class :character  
 Median :2014   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2014                                                           
 3rd Qu.:2015                                                           
 Max.   :2016                                                           
                                                                        
 total_workers      workers_male     workers_female    percent_female  
 Min.   :    658   Min.   :      0   Min.   :      0   Min.   :  0.00  
 1st Qu.:  18687   1st Qu.:  10765   1st Qu.:   2364   1st Qu.: 10.73  
 Median :  58997   Median :  32302   Median :  15238   Median : 32.40  
 Mean   : 196055   Mean   : 111515   Mean   :  84540   Mean   : 36.00  
 3rd Qu.: 187415   3rd Qu.: 102644   3rd Qu.:  63326   3rd Qu.: 57.31  
 Max.   :3758629   Max.   :2570385   Max.   :2290818   Max.   :100.00  
                                                                       
 total_earnings   total_earnings_male total_earnings_female
 Min.   : 17266   Min.   : 12147      Min.   :  7447       
 1st Qu.: 32410   1st Qu.: 35702      1st Qu.: 28872       
 Median : 44437   Median : 46825      Median : 40191       
 Mean   : 49762   Mean   : 53138      Mean   : 44681       
 3rd Qu.: 61012   3rd Qu.: 65015      3rd Qu.: 54813       
 Max.   :201542   Max.   :231420      Max.   :166388       
                  NA's   :4           NA's   :65           
 wage_percent_of_male gender_dominatedV
 Min.   : 50.88       Min.   :0.0000   
 1st Qu.: 77.56       1st Qu.:0.0000   
 Median : 85.16       Median :1.0000   
 Mean   : 84.03       Mean   :0.6705   
 3rd Qu.: 90.62       3rd Qu.:1.0000   
 Max.   :117.40       Max.   :1.0000   
 NA's   :846                           
Code
summary(earnings_female)
      Year         group              percent     
 Min.   :1979   Length:264         Min.   :56.80  
 1st Qu.:1987   Class :character   1st Qu.:69.40  
 Median :1995   Mode  :character   Median :75.50  
 Mean   :1995                      Mean   :76.88  
 3rd Qu.:2003                      3rd Qu.:86.90  
 Max.   :2011                      Max.   :95.40  
Code
summary(employed_gender)
      year      total_full_time total_part_time full_time_female
 Min.   :1968   Min.   :80.30   Min.   :14.00   Min.   :71.90   
 1st Qu.:1980   1st Qu.:81.80   1st Qu.:16.80   1st Qu.:73.20   
 Median :1992   Median :82.60   Median :17.40   Median :73.90   
 Mean   :1992   Mean   :82.64   Mean   :17.36   Mean   :73.86   
 3rd Qu.:2004   3rd Qu.:83.20   3rd Qu.:18.20   3rd Qu.:74.70   
 Max.   :2016   Max.   :86.00   Max.   :19.70   Max.   :75.40   
 part_time_female full_time_male  part_time_male 
 Min.   :24.60    Min.   :86.60   Min.   : 7.80  
 1st Qu.:25.30    1st Qu.:89.00   1st Qu.: 9.60  
 Median :26.10    Median :89.50   Median :10.50  
 Mean   :26.14    Mean   :89.49   Mean   :10.51  
 3rd Qu.:26.80    3rd Qu.:90.40   3rd Qu.:11.00  
 Max.   :28.10    Max.   :92.20   Max.   :13.40  
Code
summary(GenDomReg<-lm(wage_percent_of_male ~ gender_dominatedV, data = jobs_gender))

Call:
lm(formula = wage_percent_of_male ~ gender_dominatedV, data = jobs_gender)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.390  -6.533   1.030   6.448  34.131 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        84.9569     0.3939 215.690  < 2e-16 ***
gender_dominatedV  -1.6920     0.5327  -3.176  0.00153 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.346 on 1240 degrees of freedom
  (846 observations deleted due to missingness)
Multiple R-squared:  0.00807,   Adjusted R-squared:  0.00727 
F-statistic: 10.09 on 1 and 1240 DF,  p-value: 0.001529
Code
summary(GenDomReg2<-lm(wage_percent_of_male ~ percent_female, data = jobs_gender))

Call:
lm(formula = wage_percent_of_male ~ percent_female, data = jobs_gender)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.416  -6.333   0.889   6.378  34.238 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    82.08101    0.56034  146.49  < 2e-16 ***
percent_female  0.04258    0.01078    3.95 8.27e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.325 on 1240 degrees of freedom
  (846 observations deleted due to missingness)
Multiple R-squared:  0.01242,   Adjusted R-squared:  0.01163 
F-statistic:  15.6 on 1 and 1240 DF,  p-value: 8.266e-05

According to the two regressions above, we can conclude that there is a statistically significant relationship between female percent of male earnings and percent of women in the occupation. That is, the more women in a field, the smaller the wage gap.

In the first regression (GenDomReg), we regressed female percent of male earnings (wage_percent_of_male) on our binary variable (gender_dominatedV) and found a significant, negative relationship, with a p value of 0.00153 and a coefficient of -1.6920 (adjusted R-squared of 0.00727). That is, the female percent of male earnings is, on average, about 1.7 percentage points lower in male-dominated occupations than in female-dominated occupations.
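
With a single 0/1 predictor, the regression coefficient is simply the difference between the two group means, and the intercept is the mean of the group coded 0. The sketch below checks this on simulated data (the data frame `toy` and its values are made up) and then applies the same reading to the coefficients reported for GenDomReg.

```r
# Simulated data with a binary predictor (values are made up)
set.seed(3)
toy <- data.frame(male_dominated = rep(c(0, 1), each = 30))
toy$pct <- 85 - 1.7 * toy$male_dominated + rnorm(60, sd = 9)

fit <- lm(pct ~ male_dominated, data = toy)
slope <- unname(coef(fit)["male_dominated"])

# Difference between the two group means
mean_diff <- mean(toy$pct[toy$male_dominated == 1]) -
  mean(toy$pct[toy$male_dominated == 0])

# The coefficient and the mean difference agree
print(all.equal(slope, mean_diff))

# Same reading applied to the reported GenDomReg coefficients:
# predicted percent for female-dominated (0) and male-dominated (1)
reported <- c(female_dominated = 84.9569, male_dominated = 84.9569 - 1.6920)
print(reported)
```

Read this way, the reported model predicts about 85.0% for female-dominated occupations and about 83.3% for male-dominated occupations.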

In the second regression (GenDomReg2), we regressed female percent of male earnings on the percent of women in an occupation and found a significant positive relationship, with a p value of 8.27e-05 and a coefficient of 0.04258 (adjusted R-squared of 0.01163). This means that as the percent of women in an occupation increases by one percentage point, the female percent of male earnings increases by about 0.04 percentage points.
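
To give a sense of scale for a coefficient of this size, the sketch below plugs two example values of percent_female into the fitted line, using the intercept and slope reported for GenDomReg2. The function name `predict_pct` and the example values of 10% and 90% are illustrative choices.

```r
# Coefficients reported for GenDomReg2 above
intercept <- 82.08101
slope <- 0.04258

# Predicted female percent of male earnings for a given percent_female
predict_pct <- function(percent_female) intercept + slope * percent_female

low <- predict_pct(10)    # occupation that is 10% female
high <- predict_pct(90)   # occupation that is 90% female

print(round(c(low = low, high = high, difference = high - low), 2))
```

Moving from an occupation that is 10% female to one that is 90% female corresponds to a predicted increase of roughly 3.4 percentage points (from about 82.5% to about 85.9%).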

From these analyses we can conclude that women working in male-dominated occupations make a significantly smaller percent of what men make than women working in female-dominated occupations, and that as the percent of women in an occupation increases, so does the female percent of male earnings.

A third regression was then run in order to control for total earnings and total workers (results below). We decided to use the percent of women variable rather than the binary variable because it yielded a stronger relationship to percent of earnings.

Code
summary(GenDomReg3<-lm(wage_percent_of_male ~ percent_female + total_earnings + total_workers, data = jobs_gender))

Call:
lm(formula = wage_percent_of_male ~ percent_female + total_earnings + 
    total_workers, data = jobs_gender)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.594  -6.323   0.881   6.551  33.669 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.428e+01  9.024e-01  93.389  < 2e-16 ***
percent_female  3.561e-02  1.104e-02   3.225  0.00129 ** 
total_earnings -2.727e-05  1.103e-05  -2.471  0.01359 *  
total_workers  -1.565e-06  5.843e-07  -2.678  0.00750 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.284 on 1238 degrees of freedom
  (846 observations deleted due to missingness)
Multiple R-squared:  0.02276,   Adjusted R-squared:  0.02039 
F-statistic: 9.611 on 3 and 1238 DF,  p-value: 2.833e-06

These regression results maintain the significant relationship between percent of women in an occupation and their percent of male earnings. However, the p value for percent_female is larger (0.00129, up from 8.27e-05), whereas the adjusted R-squared increases (0.02039). This means the model as a whole explains more variance, but the evidence for the relationship between our IV and DV is somewhat weaker. The coefficient of the predictor variable is 3.561e-02, which indicates a positive relationship between percent female and female percent of male earnings. As the percent of women in an occupation increases by one percentage point, the female percent of male earnings increases by about 0.036 percentage points (0.03561), holding total earnings and total workers constant. These results indicate that as the proportion of women in an occupation increases, the women in that occupation are paid more relative to their male peers than women working in an occupation with a lower proportion of females.

From these regressions we can conclude that the wage gap is statistically significantly greater for women in male-dominated occupations, than it is for women in female-dominated occupations.

Model Comparison and Diagnosis

Code
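#NOTE: the regression summaries above report 1242 complete observations
#(1240 residual df + 2 coefficients), which does not match the n used here;
#the values recorded below were computed with n = 1646
n<-1646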

#GenDomReg
RSS1<-deviance(GenDomReg)
GenDomRegAIC<-n*log(RSS1/n) + 2*1
#print(GenDomRegAIC)
#6893.216
GenDomRegBIC<-n*log(RSS1/n) + log(n)*1
#print(GenDomRegBIC)
#6898.623

#GenDomReg2
RSS2<-deviance(GenDomReg2)
GenDomReg2AIC<-n*log(RSS2/n) + 2*1
#print(GenDomReg2AIC)
#6885.975
GenDomReg2BIC<-n*log(RSS2/n) + log(n)*1
#print(GenDomReg2BIC)
#6891.381

#GenDomReg3
RSS3<-deviance(GenDomReg3)
GenDomReg3AIC<-n*log(RSS3/n) + 2*3
#print(GenDomReg3AIC)
#6872.658
GenDomReg3BIC<-n*log(RSS3/n) + log(n)*3
#print(GenDomReg3BIC)
#6888.876

#Creating a dataframe of my model comparison indicators
Model<- c('GenDomReg', 'GenDomReg2', 'GenDomReg3')
AIC<-c(6893.22, 6885.98, 6872.66)
BIC<-c(6898.62, 6891.38, 6888.88)
p_value<-c('0.00153**', '8.27e-05***', '0.00129**')
Adjusted_R_Squared<-c(0.00727, 0.01163, 0.02039)
Coefficient<-c(-1.692, 0.04258, 3.56E-02)
Model_Comparison<-data.frame(Model, AIC, BIC, p_value, Adjusted_R_Squared, Coefficient)

#Putting the model comparison indicators in a table
kable(Model_Comparison)
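| Model      | AIC     | BIC     | p_value     | Adjusted_R_Squared | Coefficient |
|:-----------|--------:|--------:|:------------|-------------------:|------------:|
| GenDomReg  | 6893.22 | 6898.62 | 0.00153**   | 0.00727            | -1.69200    |
| GenDomReg2 | 6885.98 | 6891.38 | 8.27e-05*** | 0.01163            | 0.04258     |
| GenDomReg3 | 6872.66 | 6888.88 | 0.00129**   | 0.02039            | 0.03560     |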

To compare the three models we consider the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), p value, and adjusted R-squared. For AIC and BIC, lower values indicate a better trade-off between fit and complexity. Across the three models, the AIC and BIC are not drastically different; however, model 1 (GenDomReg) has the highest value on both, possibly due to its binary predictor variable, and model 3 (GenDomReg3) has the lowest. The model comparison also shows that model 3 has the highest adjusted R-squared, which could be due to the added control variables, and model 2 (GenDomReg2) has the smallest p value.
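
As a cross-check on hand-computed criteria like those above, base R can produce the same quantities directly from a fitted model. The sketch below is an illustration on R's built-in mtcars data, used here as a stand-in (it is not one of the GenDomReg models). It reads the sample size from the model with nobs() rather than hard-coding it, and it counts the intercept in k, which is one common convention and differs from the parameter counts used above.

```r
# Stand-in model on built-in data (assumption: NOT the jobs_gender models)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)          # observations actually used in the fit
k   <- length(coef(fit))  # estimated coefficients, including the intercept
rss <- deviance(fit)      # residual sum of squares

# Hand-computed criteria in the n * log(RSS / n) form
aic_manual <- n * log(rss / n) + 2 * k
bic_manual <- n * log(rss / n) + log(n) * k

# extractAIC() returns c(edf, criterion) and uses the same formula
aic_extract <- extractAIC(fit)[2]
bic_extract <- extractAIC(fit, k = log(n))[2]

# AIC() and BIC() are likelihood-based: they include a constant that
# depends only on n and count the error variance as one more parameter
aic_loglik <- AIC(fit)
bic_loglik <- BIC(fit)
const <- n * (1 + log(2 * pi))

print(c(manual = aic_manual, extractAIC = aic_extract, AIC = aic_loglik))
```

Because the two forms differ only by terms that are the same for every model fitted to the same observations, they rank models identically as long as n is the same across the models being compared.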

Below are the results of the regression diagnostics for all three regression analyses described above.

Code
par(mfrow = c(2,3))
plot(GenDomReg, which = 1:6)

Based on the diagnostic results above, the first regression analysis (GenDomReg) has passed all of the assumptions.

Residuals vs Fitted
The line in this diagnostic plot is straight, which shows that the analysis passes the assumption of linearity. The straight line also indicates constant variance.

QQ-Plot
The points in the QQ plot fall along the line, which indicates that the model passes the normality assumption.

Scale-Location
The scale-location plot is another indicator of the constant variance assumption. The horizontal line in this plot indicates that the model has passed this assumption.

Cook’s Distance
Cook’s distance remains under 1 for every observation, which indicates that there are no unduly influential observations.

Residuals vs Leverage
This diagnostic shows no points in the area of the dashed lines. This is a further indicator that the analysis is free of unduly influential observations. We can assume that no single observation is substantially more influential than the others.

Cook’s Distance vs Leverage
This diagnostic plot shows that the model is within a healthy range: under 1 for Cook’s distance (y axis) and below 0.167 for leverage (x axis).
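
The influence checks read off the plots can also be done numerically. The sketch below is a minimal illustration on R's built-in mtcars data, used as a stand-in (an assumption; it is not the GenDomReg model). The Cook's distance cutoff of 1 matches the one used above, while the 2k/n leverage cutoff is a common rule of thumb rather than a value taken from this analysis.

```r
# Stand-in model on built-in data (assumption: NOT the jobs_gender models)
fit <- lm(mpg ~ wt + hp, data = mtcars)

cooks    <- cooks.distance(fit)  # influence of each observation
leverage <- hatvalues(fit)       # diagonal of the hat matrix

n <- nobs(fit)
k <- length(coef(fit))

# Observations exceeding the Cook's distance cutoff of 1
influential <- names(cooks)[cooks > 1]

# Observations exceeding the rule-of-thumb leverage cutoff of 2k/n
leverage_cutoff <- 2 * k / n
high_leverage   <- names(leverage)[leverage > leverage_cutoff]

print(max(cooks))
print(influential)
print(high_leverage)
```

An empty `influential` vector corresponds to the plot-based conclusion that Cook's distance stays under 1 for every observation.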

Code
par(mfrow = c(2,3))
plot(GenDomReg2, which = 1:6)

Based on the diagnostic results above, the second regression analysis (GenDomReg2) has also passed all of the assumptions.

Residuals vs Fitted
The line in this diagnostic plot is generally straight with a slight curve. The slight curve is not enough to violate the assumptions tested in this diagnostic. Therefore, the model passes the assumptions of linearity and constant variance.

QQ-Plot
The points in the QQ plot above fall along the line, which indicates that the model has passed the assumption of normality.

Scale-Location
The scale-location plot is another indicator of the constant variance assumption. The horizontal line in this plot indicates that the model has passed this assumption.

Cook’s Distance
Cook’s distance remains under 1 for every observation, which indicates that there are no unduly influential observations.

Residuals vs Leverage
This diagnostic shows no points outside of the area with the red line. This is a further indicator that the analysis is free of unduly influential observations. We can again assume that no single observation is substantially more influential than the others.

Cook’s Distance vs Leverage
This diagnostic plot shows that the model is within a healthy range: under 1 for Cook’s distance (y axis) and below 0.167 for leverage (x axis).

Code
par(mfrow = c(2,3))
plot(GenDomReg3, which = 1:6)

Based on the diagnostic results above, the third regression analysis (GenDomReg3) has also passed all of the assumptions.

Residuals vs Fitted
The line in this diagnostic plot is generally straight with a few slight curves. These curves are not enough to violate the assumptions tested in this diagnostic. Therefore, the model passes the assumptions of linearity and constant variance.

QQ-Plot
The points in the QQ plot above fall along the line, which indicates that the model has passed the assumption of normality.

Scale-Location
The scale-location plot is another indicator of the constant variance assumption. The roughly horizontal line in this plot indicates that the model has passed this assumption. Although the line has a few curves, these are not enough to violate the assumption.

Cook’s Distance
Cook’s distance remains under 1 for every observation, which indicates that there are no unduly influential observations.

Residuals vs Leverage
This diagnostic shows no points outside of the area with the red line. This is a further indicator that the model is free of unduly influential observations. We can, for a third time, assume that no single observation is substantially more influential than the others.

Cook’s Distance vs Leverage
This diagnostic plot shows that the model is within a healthy range: under 1 for Cook’s distance (y axis) and below 0.167 for leverage (x axis).
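
Visual judgments about normality and constant variance can be supplemented with formal tests. The sketch below is an illustration on R's built-in mtcars data, used as a stand-in (an assumption; it is not one of the GenDomReg models). It uses the Shapiro-Wilk test for residual normality and a hand-rolled studentized Breusch-Pagan statistic (n times the R-squared of an auxiliary regression of squared residuals on the predictors) for constant variance. These tests were not part of the analysis above and are shown only as one possible complement to the plots.

```r
# Stand-in model on built-in data (assumption: NOT the jobs_gender models)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Normality of residuals: Shapiro-Wilk test (null hypothesis: normal)
sw <- shapiro.test(residuals(fit))

# Constant variance: studentized Breusch-Pagan test computed by hand.
# Regress the squared residuals on the same predictors as the model.
aux_data <- transform(mtcars, sq_res = residuals(fit)^2)
aux      <- lm(sq_res ~ wt + hp, data = aux_data)

bp_stat <- nobs(aux) * summary(aux)$r.squared  # LM statistic
bp_df   <- length(coef(aux)) - 1               # predictors, excluding intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(sw$p.value)
print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))
```

In both tests a large p value means no evidence against the assumption. With samples as large as the ones in this analysis, small departures can still produce small p values, so the plots remain useful alongside the tests.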

Conclusions

Hypotheses

We hypothesized that younger women and women working in female-dominated occupations will experience a smaller wage gap than older women and those working in male-dominated occupations.

Based on the results of the analysis, we reject the null hypotheses. That is, there is a significant relationship between both independent variables (age group and gender_dominatedV) and the dependent variable (wage_percent_of_male). The results showed that as a woman gets older, she earns a smaller percent of male earnings. With every jump in age group, women made a significantly lower percent of male earnings than the prior age group.

We can also conclude that as the percent of women in an occupation increases, women's earnings as a percent of male earnings also increase. This means that women in female-dominated occupations experience a smaller wage gap than those in male-dominated occupations.

Limitations

This investigation had hoped to analyze the interaction between the two independent variables (age group and occupation); however, the occupation data ranged from 2013 to 2016 and the age data ranged from 1979 to 2011. Because the two datasets do not overlap in time, we were unable to look at the interaction between age and occupation, so the results capture only a portion of the influence of our independent variables on the wage gap.

References

Code
#citation("readr")
#citation("ggplot2")
#citation("tidyverse")
#citation("stringr")
#citation("dplyr")
#citation("vtable")

Aragão, C., 2023. Gender pay gap in U.S. hasn’t changed much in two decades, Pew Research Center. United States of America. Retrieved from https://policycommons.net/artifacts/3456468/gender-pay-gap-in-us/4256843/ on 29 Apr 2023. CID: 20.500.12592/8mzrnq.

Blau, F. D., & Kahn, L. M. (2017). The gender wage gap: Extent, trends, and explanations. Journal of Economic Literature, 55(3), 789-865.

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Huntington-Klein N (2023). vtable: Variable Table for Variable Documentation. R package version 1.4.2, https://CRAN.R-project.org/package=vtable.

Kochhar, R. (2023). The enduring grip of the gender pay gap.

Thomas Mock (2022). Tidy Tuesday: A weekly data project aimed at the R ecosystem. https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-05

Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.0, https://CRAN.R-project.org/package=stringr.

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.0, https://CRAN.R-project.org/package=dplyr.

Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.

Wrohlich, K. (2017). Gender pay gap varies greatly by occupation. DIW Economic Bulletin, 7(43), 429-435.