Code
::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) knitr
Meredith Derian-Toth
May 21, 2023
#reading in data
jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")
earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv")
employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv")
#bringing in libraries
library(dplyr)
library(vtable)
library(ggplot2)
library(readr)
library(tidyverse)
library(stringr)
library(lubridate)
This research exploration takes a close look at the gender wage gap across age group and occupation. While the wage gap has decreased overall since 1980, women still make less than 100% of every dollar men make. In 2022 women made on average, about 82 cents for every dollar men make (Aragao, 2023). This is shown in the line graph below (Female Salary Percent of Male Salary).
#wage gap by year
#Would also like to add a legend and labels to the this graphs
earnings_female%>%
filter(str_detect(group,"Total, 16 years and older")) %>%
ggplot(aes(x = Year)) +
geom_line(aes(y = percent)) +
geom_point(aes(y = percent)) +
labs(title = "Female Salary Percent of Male Salary",
x="Year",
y="Percent of Salary") +
ylim(0,100)
The data used in this exploration and analysis originated from the Bureau of Labor Statistics and the Census Bureau, however the three datasets came from a Tidy Tuesday data dive. These data describe different variables about women in the workforce across time, ages, and occupations. The three datasets are: (1) earnings_female is historical data, providing the percent of earnings women make in relation to men, broken down by age group and ranging from 1979 - 2011, (2) employed_gender is another historical dataset providing the workforce information (percent of women and men working full time and part time), by year, ranging from 1968 - 2016, (3) and lastly, jobs_gender is detailed data regarding occupation, earnings for those occupations by gender, and percent of earnings women make in relation to men, ranging from 2013 - 2016.
The tables below provide the descriptive statistics for the dependent variable, female wage percent of male earnings. The first table is the dependent variable across occupations and year (from 2013 to 2016). The second table shows the dependent variable for women who are 16 years or older from 1979 to 2011.
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
Year | 2088 | 2014 | 1.1 | 2013 | 2014 | 2015 | 2016 |
wage_percent_of_male | 1242 | 84 | 9.4 | 51 | 78 | 91 | 117 |
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
Year | 33 | 1995 | 9.7 | 1979 | 1987 | 2003 | 2011 |
group | 33 | ||||||
... Total, 16 years and older | 33 | 100% | |||||
percent | 33 | 74 | 5.7 | 62 | 70 | 79 | 82 |
The purpose of the following exploration and analysis is to answer the question: What are the contributing factors that lead to the gender wage gap to be greater or smaller? The factors explored in this particular analysis are age and occupation. Occupation is categorized into broad and detailed categories, and will be explored by comparing female dominated and male dominated occupations.
These factors have been explored in the literature. Many have found that the wage gap is smaller for younger women than older (Aragao, 2023; Blau & Kahn, 2016; Kochhar, 2023). And Wrohlick, 2017 found that there is less of a wage gap in the public sector when compared to the private sector. This analysis goes beyond comparing occupation by private versus public and into male dominated versus female dominated. This is something that has been done considerably less and is an important piece of knowing where to target closing the wage gap.
According to the research, the wage gap for a woman widens as she gets older (Aragao, 2023; Blau & Kahn, 2016; Kochhar, 2023), and the wage gap is wider for women in the private sector compared to the public sector (Wrohlich 2017). Based on this literature, we hypothesize that younger woman and women working in female-dominated occupations will experience a smaller wage gap than older women and those working in male-dominated occupations.
The graph below shows the average wage gap across age groups. These data show that as women age, women make less of a percent of what men make. Therefore, we see the wage gap widen and take a particular dip around at 25 - 44 years old.
#wage gap by age
earnings_female %>%
group_by(group) %>%
filter(!str_detect(group,"Total, 16 years and older")) %>%
summarize(Mean_Percent_salary = mean(percent))%>%
ggplot(aes(group,Mean_Percent_salary)) +
geom_col(aes(fill = group)) +
labs(title = "Female Salary Percent of Male Salary by Age Group",
x="Age Group",
y="Average Percent of Salary") +
geom_text(aes(y=Mean_Percent_salary,
label=sprintf("%0.2f", round(Mean_Percent_salary, digits = 2)))) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
One hypothesis for this is that the younger generation is experiencing a more narrow wage gap. And in that case, we would expect to the see wage gap narrow across time.
The graph and table below shows the the wage gap across time using the same data. From these visualizations we can conclude that the wage gap did not narrow. We can therefore attribute the widening of the wage gap to age group.
year | Average_Percent_of_Salary |
---|---|
2013 | 83.77102 |
2014 | 83.70873 |
2015 | 84.45372 |
2016 | 84.19526 |
jobs_gender %>%
group_by(year)%>%
drop_na(wage_percent_of_male)%>%
summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))%>%
ggplot(aes(fill = year, y = `Mean_wage_percent_of_males`, x=year)) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = "Female Salary Percent of Male Salary by Year",
x="Year",
y="Average Percent of Salary") +
geom_text(aes(y=Mean_wage_percent_of_males,
label=sprintf("%0.2f", round(Mean_wage_percent_of_males, digits = 2)))) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
To investigate wage gap by occupation, we are comparing male-dominated vs. female dominated occupations. The list below shows the average number of total female workers versus total male workers averaged across 3 year (2013- 2016), by job category. The table is sorted in descending order by total female workers.
#creating a table of just total workers by gender and category
JobsCategoryGen<- data_frame(jobs_gender$year, jobs_gender$minor_category, jobs_gender$workers_male, jobs_gender$workers_female)
JobsCategoryGen<- JobsCategoryGen%>%
rename('Year'='jobs_gender$year',
'Occupation Category (Minor)' = 'jobs_gender$minor_category',
'Total Male Worker' = 'jobs_gender$workers_male',
'Total Female Workers' = 'jobs_gender$workers_female')
SummaryJobsCatGen <- JobsCategoryGen%>%
group_by(`Occupation Category (Minor)`)%>%
summarise(AverageMaleWorkers = mean(`Total Male Worker`),
AverageFemaleWorkers = mean(`Total Female Workers`))
kable(SummaryJobsCatGen[order(SummaryJobsCatGen$AverageFemaleWorkers, decreasing=TRUE),])
Occupation Category (Minor) | AverageMaleWorkers | AverageFemaleWorkers |
---|---|---|
Education, Training, and Library | 138707.95 | 336849.955 |
Sales and Related | 324403.33 | 226652.667 |
Building and Grounds Cleaning and Maintenance | 377726.67 | 188702.792 |
Office and Administrative Support | 73641.55 | 182680.466 |
Healthcare Support | 28363.98 | 169199.227 |
Management | 269406.44 | 167326.808 |
Community and Social Service | 87166.78 | 148718.469 |
Healthcare Practitioners and Technical | 54026.68 | 139826.234 |
Legal | 137979.85 | 138897.400 |
Food Preparation and Serving Related | 149773.50 | 128210.135 |
Business and Financial Operations | 96380.12 | 113663.170 |
Personal Care and Service | 33068.60 | 97494.512 |
Computer and mathematical | 169581.77 | 56163.156 |
Arts, Design, Entertainment, Sports, and Media | 56255.49 | 41269.278 |
Material Moving | 146246.23 | 32834.804 |
Protective Service | 113246.58 | 27095.264 |
Production | 58897.09 | 20524.166 |
Life, Physical, and Social Science | 25409.11 | 19563.148 |
Transportation | 171202.30 | 18166.075 |
Architecture and Engineering | 96430.32 | 15617.167 |
Farming, Fishing, and Forestry | 66716.75 | 13853.938 |
Installation, Maintenance, and Repair | 105593.08 | 3813.965 |
Construction and Extraction | 137969.88 | 3553.678 |
From this list, we can see the following are the top 10 occupation categories that are on average more dominated by women:
(1) Education, Training, and Library
(2) Sales and Related
(3) Building and Grounds Cleaning and Maintenance
(4) Office and Administrative Support
(5) Healthcare Support
(6) Management
(7) Community and Social Service
(8) Healthcare Practitioners and Technical
(9) Legal
(10) Food Preparation and Serving Related
Next we see the same occupation categories listed in order of female percent of male earnings. Women who work in occupations that fall into the “Community and Social Services” category make, on average, 91% of every dollar a man in their field makes. Whereas a woman working in “Production” makes, on average, 77% of every dollar a man in their field makes.
#wagegap by minor occupation category
SummaryMinorCat <- jobs_gender%>%
group_by(minor_category)%>%
drop_na(wage_percent_of_male) %>%
summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))
kable(SummaryMinorCat[order(SummaryMinorCat$Mean_wage_percent_of_males, SummaryMinorCat$minor_category, decreasing=FALSE),])
minor_category | Mean_wage_percent_of_males |
---|---|
Production | 77.23993 |
Legal | 77.81342 |
Sales and Related | 77.83065 |
Building and Grounds Cleaning and Maintenance | 79.31545 |
Farming, Fishing, and Forestry | 79.61618 |
Transportation | 79.88060 |
Business and Financial Operations | 80.16418 |
Management | 80.80490 |
Arts, Design, Entertainment, Sports, and Media | 84.79955 |
Life, Physical, and Social Science | 84.94294 |
Installation, Maintenance, and Repair | 85.45982 |
Healthcare Practitioners and Technical | 86.00865 |
Protective Service | 86.02423 |
Office and Administrative Support | 86.18804 |
Personal Care and Service | 86.73739 |
Education, Training, and Library | 87.19622 |
Food Preparation and Serving Related | 87.44786 |
Architecture and Engineering | 87.59434 |
Construction and Extraction | 87.94221 |
Computer and mathematical | 88.12798 |
Material Moving | 89.10787 |
Healthcare Support | 90.31624 |
Community and Social Service | 91.43559 |
#The visualization below is in VERY draft form, and cannot provide helpful information until it is reformatted.
#jobs_gender %>%
#group_by(minor_category)%>%
#drop_na(wage_percent_of_male)%>%
#summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))%>%
#ggplot(aes(fill = minor_category, y = `Mean_wage_percent_of_males`, x=minor_category)) +
#geom_bar(position = "dodge", stat = "identity") +
#labs(title = "Female Salary Percent of Male Salary by Minor Occupation Category",
#x="Occupation Cateogry",
#y="Average Percent of Salary") +
#theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
When comparing the top 10 occupation categories that are more dominated by women (Table: “Occupation Categories Dominated by Women”) with the list of occupation categories of the smallest wage gap (Table: “Occupation Categories with the Smallest Wage Gap”), we can see some overlap.
Occupation Categories Dominated by Women
(1) Education, Training, and Library
(2) Sales and Related
(3) Building and Grounds Cleaning and Maintenance
(4) Office and Administrative Support
(5) Healthcare Support
(6) Management
(7) Community and Social Service
(8) Healthcare Practitioners and Technical
(9) Legal
(10) Food Preparation and Serving Related
Occupation Categories with the Smallest Wage Gap
(1) Production
(2) Legal
(3) Sales and Related
(4) Building and Grounds Cleaning and Maintenance
(5) Farming, Fishing, and Forestry
(6) Transportation
(7) Business and Financial Operations
(8) Management
(9) Arts, Design, Entertainment, Sports, and Media
(10) Life, Physical, and Social Science
Further analysis is necessary to confirm if there is a statistically significant relationship between gender-dominated occupation and wage gap.
The visualization below shows the relationship between our first independent variable of age group and our dependent variable of female wage percent of male. This visualization tells us that starting at age 25, womens’ percent of mens’ earnings already starts to decrease, with a continued drop at age 35. This drop is addressed in the literature and is believed to be due women beginning to have children (Aragao, 2023; Kochhar, 2023).
An Analysis of Variance (ANOVA) was run to investigate the statistical relationship between age group and female earnings percent of male, controlling for year.
Df Sum Sq Mean Sq F value Pr(>F)
group 7 21094 3013 308.2 <2e-16 ***
Year 1 4819 4819 492.9 <2e-16 ***
Residuals 255 2493 10
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results of the ANOVA show a significant relationship between the age groups with a p value of <2e-16. A Tukey post-hoc analysis was used to breakdown where the significant differences existed.
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = percent ~ group + Year, data = earnings_female)
$group
diff lwr
20-24 years-16-19 years -1.0666667 -3.4191912
25-34 years-16-19 years -9.8242424 -12.1767670
35-44 years-16-19 years -20.6303030 -22.9828276
45-54 years-16-19 years -23.6696970 -26.0222215
55-64 years-16-19 years -24.1060606 -26.4585851
65 years and older-16-19 years -17.2939394 -19.6464639
Total, 16 years and older-16-19 years -16.8848485 -19.2373730
25-34 years-20-24 years -8.7575758 -11.1101003
35-44 years-20-24 years -19.5636364 -21.9161609
45-54 years-20-24 years -22.6030303 -24.9555548
55-64 years-20-24 years -23.0393939 -25.3919185
65 years and older-20-24 years -16.2272727 -18.5797973
Total, 16 years and older-20-24 years -15.8181818 -18.1707064
35-44 years-25-34 years -10.8060606 -13.1585851
45-54 years-25-34 years -13.8454545 -16.1979791
55-64 years-25-34 years -14.2818182 -16.6343427
65 years and older-25-34 years -7.4696970 -9.8222215
Total, 16 years and older-25-34 years -7.0606061 -9.4131306
45-54 years-35-44 years -3.0393939 -5.3919185
55-64 years-35-44 years -3.4757576 -5.8282821
65 years and older-35-44 years 3.3363636 0.9838391
Total, 16 years and older-35-44 years 3.7454545 1.3929300
55-64 years-45-54 years -0.4363636 -2.7888882
65 years and older-45-54 years 6.3757576 4.0232330
Total, 16 years and older-45-54 years 6.7848485 4.4323240
65 years and older-55-64 years 6.8121212 4.4595967
Total, 16 years and older-55-64 years 7.2212121 4.8686876
Total, 16 years and older-65 years and older 0.4090909 -1.9434336
upr p adj
20-24 years-16-19 years 1.2858579 0.8631209
25-34 years-16-19 years -7.4717179 0.0000000
35-44 years-16-19 years -18.2777785 0.0000000
45-54 years-16-19 years -21.3171724 0.0000000
55-64 years-16-19 years -21.7535361 0.0000000
65 years and older-16-19 years -14.9414149 0.0000000
Total, 16 years and older-16-19 years -14.5323240 0.0000000
25-34 years-20-24 years -6.4050512 0.0000000
35-44 years-20-24 years -17.2111118 0.0000000
45-54 years-20-24 years -20.2505058 0.0000000
55-64 years-20-24 years -20.6868694 0.0000000
65 years and older-20-24 years -13.8747482 0.0000000
Total, 16 years and older-20-24 years -13.4656573 0.0000000
35-44 years-25-34 years -8.4535361 0.0000000
45-54 years-25-34 years -11.4929300 0.0000000
55-64 years-25-34 years -11.9292936 0.0000000
65 years and older-25-34 years -5.1171724 0.0000000
Total, 16 years and older-25-34 years -4.7080815 0.0000000
45-54 years-35-44 years -0.6868694 0.0025478
55-64 years-35-44 years -1.1232330 0.0002566
65 years and older-35-44 years 5.6888882 0.0005509
Total, 16 years and older-35-44 years 6.0979791 0.0000542
55-64 years-45-54 years 1.9161609 0.9992102
65 years and older-45-54 years 8.7282821 0.0000000
Total, 16 years and older-45-54 years 9.1373730 0.0000000
65 years and older-55-64 years 9.1646457 0.0000000
Total, 16 years and older-55-64 years 9.5737367 0.0000000
Total, 16 years and older-65 years and older 2.7616154 0.9994827
The results of the Tukey post-hoc show similar results to the boxplot visualization above. When younger women (age groups 16-19 years old) are compared to all age groups 25 years or older, there is a significant difference in percent of male earnings (p value of 0.0000). That is to say, once a women turns 25, she is already getting paid significantly less than her own gender at ages 16-24. Also shown in these results is with every age group jump, a woman makes significantly less than the age group before (p value 0.000) until a woman reaches 55 years old. There is not a significant difference between female earnings for women in the 45 - 54 and 55-64 age groups. Then women a woman turns 65+ she begins to make significantly more than the women in age group before her, yet she is still making an average of about 73% of her male peers.
The second independent variable investigated was whether an occupation’s majority gender has a relationship to the female percent of male earnings. We hypothesized that the wage gap is wider in male dominated occupations, therefore the female percent of male earnings would be lower in those occupations. We looked at this in two different ways.
First, a new variable was created, gender_dominatedV. For occupations with 50.1% or higher of females this variable was coded as “0”, for occupations made up of 50% or lower of females this variable was coded as “1”. In other words, “Male-domindated” is coded as “1” and “Female-Dominated” is coded as “0”.
#Using minor_category and percent_female variables to create a new variable of "female dominated" vs. "male dominated"
#gender_dominated<-data_frame(jobs_gender$occupation, jobs_gender$percent_female)
#head(jobs_gender)
#kable(gender_dominated[order(gender_dominated$`jobs_gender$percent_female`, gender_dominated$`jobs_gender$occupation`, decreasing = TRUE),])
jobs_gender<-jobs_gender%>%
mutate(gender_dominatedV = case_when(percent_female >=50.1 ~ 0,
percent_female <=50 ~ 1))
#head(jobs_gender)
In the visualization below, we can see that there is a negative relationship between male dominated occupations and female percent of male earnings. This means, women make less of a percent of their male colleagues’ earnings in male dominated fields, than do women working in female dominated fields. The second visualization is taking a different perspective.
The second way of looking at the majority gender in an occupation was with the variable “percent_female”, which provided the percent of people in an occupation who identified as female. The visualization below uses this variable to compare the percent of women in an occupation with the percent of male earnings. Here we can see a positive relationship, showing that as the amount of women in an occupation increase, their percent of earnings also increases. Both visualizations show that women working in in male-dominated occupations make less of a percent men than women working in female-dominated occupations.
To know if this relationship is statistically significant, three regression analyses were run. Missing data were considered in this analysis. Data from the variable wage_percent_of_male were coded as NAs if the sample size of an occupation was too low (with a count of 846 NAs). Therefore, these missing data were found to be “missing not at random”. These data were dropped when running the regressions. However is it noted since this could bias the data as it does not include the entire data-set’s set of occupations.
year occupation major_category minor_category
Min. :2013 Length:2088 Length:2088 Length:2088
1st Qu.:2014 Class :character Class :character Class :character
Median :2014 Mode :character Mode :character Mode :character
Mean :2014
3rd Qu.:2015
Max. :2016
total_workers workers_male workers_female percent_female
Min. : 658 Min. : 0 Min. : 0 Min. : 0.00
1st Qu.: 18687 1st Qu.: 10765 1st Qu.: 2364 1st Qu.: 10.73
Median : 58997 Median : 32302 Median : 15238 Median : 32.40
Mean : 196055 Mean : 111515 Mean : 84540 Mean : 36.00
3rd Qu.: 187415 3rd Qu.: 102644 3rd Qu.: 63326 3rd Qu.: 57.31
Max. :3758629 Max. :2570385 Max. :2290818 Max. :100.00
total_earnings total_earnings_male total_earnings_female
Min. : 17266 Min. : 12147 Min. : 7447
1st Qu.: 32410 1st Qu.: 35702 1st Qu.: 28872
Median : 44437 Median : 46825 Median : 40191
Mean : 49762 Mean : 53138 Mean : 44681
3rd Qu.: 61012 3rd Qu.: 65015 3rd Qu.: 54813
Max. :201542 Max. :231420 Max. :166388
NA's :4 NA's :65
wage_percent_of_male gender_dominatedV
Min. : 50.88 Min. :0.0000
1st Qu.: 77.56 1st Qu.:0.0000
Median : 85.16 Median :1.0000
Mean : 84.03 Mean :0.6705
3rd Qu.: 90.62 3rd Qu.:1.0000
Max. :117.40 Max. :1.0000
NA's :846
Year group percent
Min. :1979 Length:264 Min. :56.80
1st Qu.:1987 Class :character 1st Qu.:69.40
Median :1995 Mode :character Median :75.50
Mean :1995 Mean :76.88
3rd Qu.:2003 3rd Qu.:86.90
Max. :2011 Max. :95.40
year total_full_time total_part_time full_time_female
Min. :1968 Min. :80.30 Min. :14.00 Min. :71.90
1st Qu.:1980 1st Qu.:81.80 1st Qu.:16.80 1st Qu.:73.20
Median :1992 Median :82.60 Median :17.40 Median :73.90
Mean :1992 Mean :82.64 Mean :17.36 Mean :73.86
3rd Qu.:2004 3rd Qu.:83.20 3rd Qu.:18.20 3rd Qu.:74.70
Max. :2016 Max. :86.00 Max. :19.70 Max. :75.40
part_time_female full_time_male part_time_male
Min. :24.60 Min. :86.60 Min. : 7.80
1st Qu.:25.30 1st Qu.:89.00 1st Qu.: 9.60
Median :26.10 Median :89.50 Median :10.50
Mean :26.14 Mean :89.49 Mean :10.51
3rd Qu.:26.80 3rd Qu.:90.40 3rd Qu.:11.00
Max. :28.10 Max. :92.20 Max. :13.40
Call:
lm(formula = wage_percent_of_male ~ gender_dominatedV, data = jobs_gender)
Residuals:
Min 1Q Median 3Q Max
-32.390 -6.533 1.030 6.448 34.131
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 84.9569 0.3939 215.690 < 2e-16 ***
gender_dominatedV -1.6920 0.5327 -3.176 0.00153 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.346 on 1240 degrees of freedom
(846 observations deleted due to missingness)
Multiple R-squared: 0.00807, Adjusted R-squared: 0.00727
F-statistic: 10.09 on 1 and 1240 DF, p-value: 0.001529
Call:
lm(formula = wage_percent_of_male ~ percent_female, data = jobs_gender)
Residuals:
Min 1Q Median 3Q Max
-32.416 -6.333 0.889 6.378 34.238
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.08101 0.56034 146.49 < 2e-16 ***
percent_female 0.04258 0.01078 3.95 8.27e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.325 on 1240 degrees of freedom
(846 observations deleted due to missingness)
Multiple R-squared: 0.01242, Adjusted R-squared: 0.01163
F-statistic: 15.6 on 1 and 1240 DF, p-value: 8.266e-05
According to the two regressions above, we can conclude that there is a statistically significant relationship between female percent of male earnings and percent of women in the occupation. That is, the more women in a field, the smaller the wage gap.
In the first regression (GenDomReg), we compared our binary variable (gender_dominatedV) to the female percent of male earnings (wage_percent_of_male) to find a significant, negative relationship. This analysis resulted in a p value of 0.00153 and a coefficient of -1.6920 (adjusted R-squared of 0.00727). This means the relationship between male dominated occupations and female percent of male earnings is negative. That is, women working in male-dominated occupations make an average of 1.6% less percent of what men make compared to women working in female dominated occupations. From this we can conclude that .
In the second regression (GenDomReg2), we compared percent of women in an occupation with female percent of male earnings and found a significant positive relationship. The analysis resulted in a significant p value of 8.27e-05 and a coefficient of 0.04258 (adjusted R-squared of 0.01163). These results reflect a positive relationship between percent of women in the occupation and wage percent of male. This means, that as the percent of women in the occupation increases, the percent of earnings increases by .04%.
From these analyses we can conclude that women working in male dominated occupations make significantly less of a percent of what men make, compared to the women working in female dominated occupations; and as percent of women in an occupation increase, women make more percent of male earnings than women in male dominated occupations.
A third regression was then run in order to control for total earnings and total workers (results below). We decided to use the percent of women variable rather than the binary variable because that yeilded a stronger relationship to percent of earnings.
Call:
lm(formula = wage_percent_of_male ~ percent_female + total_earnings +
total_workers, data = jobs_gender)
Residuals:
Min 1Q Median 3Q Max
-31.594 -6.323 0.881 6.551 33.669
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.428e+01 9.024e-01 93.389 < 2e-16 ***
percent_female 3.561e-02 1.104e-02 3.225 0.00129 **
total_earnings -2.727e-05 1.103e-05 -2.471 0.01359 *
total_workers -1.565e-06 5.843e-07 -2.678 0.00750 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.284 on 1238 degrees of freedom
(846 observations deleted due to missingness)
Multiple R-squared: 0.02276, Adjusted R-squared: 0.02039
F-statistic: 9.611 on 3 and 1238 DF, p-value: 2.833e-06
These regression results maintain the significant relationship between percent of women in an occupation and their percent of male earnings. However, the significance level decreases (p value of 0.00129), whereas the adjusted R-squared increases (0.02039). This means that we have a stronger model as a whole, however the relationship between our IV and DV has lost strength. The coefficients of the predictor variable is 3.561e-02. This indicates a positive relationship between percent female and female percent of male earnings. This means, as the percent of women in an occupation increases by 1%, the female percent of male earnings increases by 3.561e-02% (0.03561 or about 0.04%). These results indicate that as the proportion of women in an occuption increases, the women in that occupation are paid more relative to their male peers when compared to women working in an occupation with a lower proportion of females.
From these regressions we can conclude that the wage gap is statistically significantly greater for women in male-dominated occupations, than it is for women in female-dominated occupations.
n<-1646
#GenDomReg
RSS1<-deviance(GenDomReg)
GenDomRegAIC<-n*log(RSS1/n) + 2*1
#print(GenDomRegAIC)
#6893.216
GenDomRegBIC<-n*log(RSS1/n) + log(n)*1
#print(GenDomRegBIC)
#6898.623
#GenDomReg2
RSS2<-deviance(GenDomReg2)
GenDomReg2AIC<-n*log(RSS2/n) + 2*1
#print(GenDomReg2AIC)
#6885.975
GenDomReg2BIC<-n*log(RSS2/n) + log(n)*1
#print(GenDomReg2BIC)
#6891.381
#GenDomReg3
RSS3<-deviance(GenDomReg3)
GenDomReg3AIC<-n*log(RSS3/n) + 2*3
#print(GenDomReg3AIC)
#6872.658
GenDomReg3BIC<-n*log(RSS3/n) + log(n)*3
#print(GenDomReg3BIC)
#6888.876
#Creating a dataframe of my model comparison indicators
Model<- c('GenDomReg', 'GenDomReg2', 'GenDomReg3')
AIC<-c(6893.22, 6885.98, 6872.66)
BIC<-c(6898.62, 6891.38, 6888.88)
p_value<-c('0.00153**', '8.27e-05***', '0.00129**')
Adjusted_R_Squared<-c(0.00727, 0.01163, 0.02039)
Coefficient<-c(-1.692, 0.04258, 3.56E-02)
Model_Comparison<-data.frame(Model, AIC, BIC, p_value, Adjusted_R_Squared, Coefficient)
#Putting the model comparison indicators in a table
kable(Model_Comparison)
Model | AIC | BIC | p_value | Adjusted_R_Squared | Coefficient |
---|---|---|---|---|---|
GenDomReg | 6893.22 | 6898.62 | 0.00153** | 0.00727 | -1.69200 |
GenDomReg2 | 6885.98 | 6891.38 | 8.27e-05*** | 0.01163 | 0.04258 |
GenDomReg3 | 6872.66 | 6888.88 | 0.00129** | 0.02039 | 0.03560 |
To compare the three models we consider the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), p value, and adjusted R-squared. Across the three models, the AIC and BIC are not drastically different, however we do see that model 1 (GenDomReg) has the highest in both, possibly due to the binary predictor variable. The model comparison also shows that model 3 (GenDomReg3) has the strongest adjusted R-squared, which could be due to added control variables. And model 2 (GenDomReg2) has the more significant p value.
Below are the results of the regression diagnostics for all three regression analyses described above.
The first regression (GenDomReg) analysis has passed all of the assumptions based on the diagnostic results above.
Residuals vs Fitted
The line in this diagnostic test is straight, this shows the analysis passing the assumption of linearity. The straight line also indicates a constant variance.
QQ-Plot
The results of the QQ Plot indicate the passing of the normality assumption. This is shown with the points falling along the line.
Scale-Location
The scale-location plot is another indicator of the constant variance assumption. The horizontal line resulting from this diagnostic test indicates that the model has passed this assumption.
Cook’s Distance
The results of this show that cook’s distance remains under 1, this indicates passing the assumption of influential observation.
Residuals vs Leverage
The results of this diagnostic show no points within the same area as the dashed lines. This is a further indicator that the analysis passed the assumption of influential observation. We can assume that there is not one observation that is particularly more influential than another.
Cook’s Distance vs Leverage
The results of this diagnostic plot show that the model is within a healthy area of under 1 for cooks distance (y axis) and below o.167 for leverage (x axis).
The second regression (GenDomReg2) analysis has also passed all of the assumptions based on the diagnostic results above.
Residuals vs Fitted
The line in this diagnostic test is generally straight with a slight curve. The slight curve is not enough to violate the assumptions tested in this diagnosis. Therefore the model passed the assumption of linearity and constant variance.
QQ-Plot
The results of the QQ Plot above shows the points falling along the line and therefore indicating that the model has passed the assumption of normality.
Scale-Location
The scale-location plot is another indicator of the constant variance assumption. The horizontal line resulting from this diagnostic test indicates that the model has passed this assumption.
Cook’s Distance
The results of this show that cook’s distance remains under 1, this indicates passing the assumption of influential observation.
Residuals vs Leverage
The results of this diagnostic show no points outside of the area with the red line. This is a further indicator that the analysis passed the assumption of influential observation. We can again assume that there is not one observation that is particularly more influential than another.
Cook’s Distance vs Leverage
The results of this diagnostic plot show that the model is within a healthy area of under 1 for cooks distance (y axis) and below o.167 for leverage (x axis).
The third regression analysis (GenDomReg3) has also passed all of the assumptions based on the diagnostic results above.
Residuals vs Fitted
The line in this diagnostic test is generally straight with a few slight curves. These curves are not enough to violate the assumptions tested in this diagnosis. Therefore the model passed the assumption of linearity and constant variance.
QQ-Plot
The results of the QQ Plot above shows the points falling along the line and therefore indicating that the model has passed the assumption of normality.
Scale-Location
The scale-location plot is another indicator of the constant variance assumption. The horizontal line resulting from this diagnostic test indicates that the model has passed this assumption. Although the line has a few curves, these are not enough to violate the assumption.
Cook’s Distance
The results of this show that cook’s distance remains under 1, this indicates passing the assumption of influential observation.
Residuals vs Leverage
The results of this diagnostic show no points outside of the area with the red line. This is a further indicator that the model passed the assumption of influential observation. We can, for a third time, assume that there is no single observation that is particularly more influential than another.
Cook’s Distance vs Leverage
The results of this diagnostic plot show that the model is within a healthy area of under 1 for cooks distance (y axis) and below o.167 for leverage (x axis).
We hypothesized that younger woman and women working in female-dominated occupations will experience a smaller wage gap than older women and those working in male-dominated occupations.
The results of the analysis show that we reject the null hypotheses. That is, there is a significant relationship between both independent variables (age group and gender_dominatedV) and the dependent variable (wage_percent_of_male). The results showed that as a woman gets older, she gets paid less percent of male earnings. With every age group jump, a woman made a significantly lower percent of male earnings than the prior age group.
We can also conclude that as the percent of women in an occupation increase, the percent of male earnings also increased. This means that women in female-dominated occupations experience a smaller wage gap than those in male-dominated occupations.
This investigation had hoped to analyze the interaction between the two independent variables (age group and occupation), however the occupation data ranged from 2011 to 2013 and the age data ranged from 1979 to 2011. Therefore, we were unable to look at the interaction between age and occupation, which tells us only a portion of the influence of our independent variables on wage gap.
Aragão, C., 2023. Gender pay gap in U.S. hasn’t changed much in two decades, Pew Research Center. United States of America. Retrieved from https://policycommons.net/artifacts/3456468/gender-pay-gap-in-us/4256843/ on 29 Apr 2023. CID: 20.500.12592/8mzrnq.
Blau, F. D., & Kahn, L. M. (2017). The gender wage gap: Extent, trends, and explanations. Journal of economic literature, 55(3), 789-865.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Huntington-Klein N (2023). vtable: Variable Table for Variable Documentation. R package version 1.4.2, https://CRAN.R-project.org/package=vtable.
Kochhar, R. (2023). The enduring grip of the gender pay gap.
Thomas Mock (2022). Tidy Tuesday: A weekly data project aimed at the R ecosystem. https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-05
Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.0, https://CRAN.R-project.org/package=stringr.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.0, https://CRAN.R-project.org/package=dplyr.
Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.
Wrohlich, K. (2017). Gender pay gap varies greatly by occupation. DIW Economic Bulletin, 7(43), 429-435.
---
title: "Gender Wage Gap Across Age and Occupation"
author: "Meredith Derian-Toth"
description: "DACSS 603_Final Project_MDT"
date: "05/21/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
---
```{r setup}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
```{r reading in data}
#reading in data
jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")
earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv")
employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv")
#bringing in libraries
library(dplyr)
library(vtable)
library(ggplot2)
library(readr)
library(tidyverse)
library(stringr)
library(lubridate)
```
## Introduction
This research exploration takes a close look at the gender wage gap across age group and occupation. While the wage gap has decreased overall since 1980, women still make less than 100% of every dollar men make. In 2022 women made on average, about 82 cents for every dollar men make (Aragao, 2023). This is shown in the line graph below (Female Salary Percent of Male Salary).
```{r wage gap over time}
#wage gap by year
#Would also like to add a legend and labels to the this graphs
earnings_female%>%
filter(str_detect(group,"Total, 16 years and older")) %>%
ggplot(aes(x = Year)) +
geom_line(aes(y = percent)) +
geom_point(aes(y = percent)) +
labs(title = "Female Salary Percent of Male Salary",
x="Year",
y="Percent of Salary") +
ylim(0,100)
```
### The Data set
The data used in this exploration and analysis originated from the Bureau of Labor Statistics and the Census Bureau, however the three datasets came from a Tidy Tuesday data dive. These data describe different variables about women in the workforce across time, ages, and occupations. The three datasets are: (1) earnings_female is historical data, providing the percent of earnings women make in relation to men, broken down by age group and ranging from 1979 - 2011, (2) employed_gender is another historical dataset providing the workforce information (percent of women and men working full time and part time), by year, ranging from 1968 - 2016, (3) and lastly, jobs_gender is detailed data regarding occupation, earnings for those occupations by gender, and percent of earnings women make in relation to men, ranging from 2013 - 2016.
The tables below provide the descriptive statistics for the dependent variable, female wage percent of male earnings. The first table is the dependent variable across occupations and year (from 2013 to 2016). The second table shows the dependent variable for women who are 16 years or older from 1979 to 2011.
```{r Descriptive Statistic}
#Descriptive Statistics for the dependent variable: wage percent of male
SumWage<-data_frame(jobs_gender$year, jobs_gender$wage_percent_of_male)
SumWage<- SumWage%>%
rename('Year'='jobs_gender$year',
'wage_percent_of_male' = 'jobs_gender$wage_percent_of_male')
sumtable(SumWage)
#Descriptive Statistics for the historical data of the Dependent variable: wage percent of male
earnings_female %>%
group_by(group) %>%
filter(str_detect(group,"Total, 16 years and older")) %>%
sumtable()
```
### Research Question
The purpose of the following exploration and analysis is to answer the question: *What are the contributing factors that lead to the gender wage gap to be greater or smaller?* The factors explored in this particular analysis are age and occupation. Occupation is categorized into broad and detailed categories, and will be explored by comparing female dominated and male dominated occupations.
These factors have been explored in the literature. Many have found that the wage gap is smaller for younger women than older (Aragao, 2023; Blau & Kahn, 2016; Kochhar, 2023). And Wrohlick, 2017 found that there is less of a wage gap in the public sector when compared to the private sector. This analysis goes beyond comparing occupation by private versus public and into male dominated versus female dominated. This is something that has been done considerably less and is an important piece of knowing where to target closing the wage gap.
### Hypotheses
According to the research, the wage gap for a woman widens as she gets older (Aragao, 2023; Blau & Kahn, 2016; Kochhar, 2023), and the wage gap is wider for women in the private sector compared to the public sector (Wrohlich 2017). Based on this literature, we hypothesize that younger woman and women working in female-dominated occupations will experience a smaller wage gap than older women and those working in male-dominated occupations.
#### Wage Gap by Age Group
The graph below shows the average wage gap across age groups. These data show that as women age, women make less of a percent of what men make. Therefore, we see the wage gap widen and take a particular dip around at 25 - 44 years old.
```{r wage gap by age}
#wage gap by age
earnings_female %>%
group_by(group) %>%
filter(!str_detect(group,"Total, 16 years and older")) %>%
summarize(Mean_Percent_salary = mean(percent))%>%
ggplot(aes(group,Mean_Percent_salary)) +
geom_col(aes(fill = group)) +
labs(title = "Female Salary Percent of Male Salary by Age Group",
x="Age Group",
y="Average Percent of Salary") +
geom_text(aes(y=Mean_Percent_salary,
label=sprintf("%0.2f", round(Mean_Percent_salary, digits = 2)))) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
```
One hypothesis for this is that the younger generation is experiencing a more narrow wage gap. And in that case, we would expect to the see wage gap narrow across time.
The graph and table below shows the the wage gap across time using the same data. From these visualizations we can conclude that the wage gap did not narrow. We can therefore attribute the widening of the wage gap to age group.
```{r wage by year}
#wage gap by year
SummaryYear <- jobs_gender %>%
group_by(year)%>%
drop_na(wage_percent_of_male)%>%
summarise(Average_Percent_of_Salary = mean(wage_percent_of_male))
kable(SummaryYear)
jobs_gender %>%
group_by(year)%>%
drop_na(wage_percent_of_male)%>%
summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))%>%
ggplot(aes(fill = year, y = `Mean_wage_percent_of_males`, x=year)) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = "Female Salary Percent of Male Salary by Year",
x="Year",
y="Average Percent of Salary") +
geom_text(aes(y=Mean_wage_percent_of_males,
label=sprintf("%0.2f", round(Mean_wage_percent_of_males, digits = 2)))) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
```
#### Wage Gap by Occupation
To investigate wage gap by occupation, we are comparing male-dominated vs. female dominated occupations. The list below shows the average number of total female workers versus total male workers averaged across 3 year (2013- 2016), by job category. The table is sorted in descending order by total female workers.
```{r occupation by gender workers}
#creating a table of just total workers by gender and category
JobsCategoryGen<- data_frame(jobs_gender$year, jobs_gender$minor_category, jobs_gender$workers_male, jobs_gender$workers_female)
JobsCategoryGen<- JobsCategoryGen%>%
rename('Year'='jobs_gender$year',
'Occupation Category (Minor)' = 'jobs_gender$minor_category',
'Total Male Worker' = 'jobs_gender$workers_male',
'Total Female Workers' = 'jobs_gender$workers_female')
SummaryJobsCatGen <- JobsCategoryGen%>%
group_by(`Occupation Category (Minor)`)%>%
summarise(AverageMaleWorkers = mean(`Total Male Worker`),
AverageFemaleWorkers = mean(`Total Female Workers`))
kable(SummaryJobsCatGen[order(SummaryJobsCatGen$AverageFemaleWorkers, decreasing=TRUE),])
```
From this list, we can see the following are the top 10 occupation categories that are on average more dominated by women:
\(1\) Education, Training, and Library
\(2\) Sales and Related
\(3\) Building and Grounds Cleaning and Maintenance
\(4\) Office and Administrative Support
\(5\) Healthcare Support
\(6\) Management
\(7\) Community and Social Service
\(8\) Healthcare Practitioners and Technical
\(9\) Legal
\(10\) Food Preparation and Serving Related
Next we see the same occupation categories listed in order of female percent of male earnings. Women who work in occupations that fall into the "Community and Social Services" category make, on average, 91% of every dollar a man in their field makes. Whereas a woman working in "Production" makes, on average, 77% of every dollar a man in their field makes.
```{r wage gap by occupation_minor category}
#wagegap by minor occupation category
SummaryMinorCat <- jobs_gender%>%
group_by(minor_category)%>%
drop_na(wage_percent_of_male) %>%
summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))
kable(SummaryMinorCat[order(SummaryMinorCat$Mean_wage_percent_of_males, SummaryMinorCat$minor_category, decreasing=FALSE),])
#The visualization below is in VERY draft form, and cannot provide helpful information until it is reformatted.
#jobs_gender %>%
#group_by(minor_category)%>%
#drop_na(wage_percent_of_male)%>%
#summarise(Mean_wage_percent_of_males = mean(wage_percent_of_male))%>%
#ggplot(aes(fill = minor_category, y = `Mean_wage_percent_of_males`, x=minor_category)) +
#geom_bar(position = "dodge", stat = "identity") +
#labs(title = "Female Salary Percent of Male Salary by Minor Occupation Category",
#x="Occupation Cateogry",
#y="Average Percent of Salary") +
#theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
```
When comparing the top 10 occupation categories that are more dominated by women (Table: "Occupation Categories Dominated by Women") with the list of occupation categories of the smallest wage gap (Table: "Occupation Categories with the Smallest Wage Gap"), we can see some overlap.
**Occupation Categories Dominated by Women**
(1) Education, Training, and Library
(2) *Sales and Related*
(3) *Building and Grounds Cleaning and Maintenance*
(4) Office and Administrative Support
(5) Healthcare Support
(6) *Management*
(7) Community and Social Service
(8) Healthcare Practitioners and Technical
(9) *Legal*
(10) Food Preparation and Serving Related
**Occupation Categories with the Smallest Wage Gap**
(1) Production
(2) *Legal*
(3) *Sales and Related*
(4) *Building and Grounds Cleaning and Maintenance*
(5) Farming, Fishing, and Forestry
(6) Transportation
(7) Business and Financial Operations
(8) *Management*
(9) Arts, Design, Entertainment, Sports, and Media
(10) Life, Physical, and Social Science
Further analysis is necessary to confirm if there is a statistically significant relationship between gender-dominated occupation and wage gap.
## Analysis
### Relationship Between Age Group and Wage Gap
The visualization below shows the relationship between our first independent variable of age group and our dependent variable of female wage percent of male. This visualization tells us that starting at age 25, womens' percent of mens' earnings already starts to decrease, with a continued drop at age 35. This drop is addressed in the literature and is believed to be due women beginning to have children (Aragao, 2023; Kochhar, 2023).
```{r relationship between age and wage gap}
earnings_female %>%
filter(!str_detect(group, "Total, 16 years and older"))%>%
ggplot(aes(x = group, y = percent)) +
geom_boxplot() +
geom_smooth(method = 'lm', se=F) +
labs(title = "Female Salary Percent of Male Salary by Age",
x="Age Group",
y="Percent of Salary")
```
#### Analysis of Relationship Between Age Group and Wage Gap
An Analysis of Variance (ANOVA) was run to investigate the statistical relationship between age group and female earnings percent of male, controlling for year.
```{r ANOVA}
summary(AgeANOVA<-aov(percent~group + Year, data = earnings_female))
```
The results of the ANOVA show a significant relationship between the age groups with a p value of <2e-16. A Tukey post-hoc analysis was used to breakdown where the significant differences existed.
```{r Tukey HSD}
TukeyHSD(AgeANOVA, conf.level = .95)
```
The results of the Tukey post-hoc show similar results to the boxplot visualization above. When younger women (age groups 16-19 years old) are compared to all age groups 25 years or older, there is a significant difference in percent of male earnings (p value of 0.0000). That is to say, once a women turns 25, she is already getting paid significantly less than her own gender at ages 16-24. Also shown in these results is with every age group jump, a woman makes significantly less than the age group before (p value 0.000) until a woman reaches 55 years old. There is not a significant difference between female earnings for women in the 45 - 54 and 55-64 age groups. Then women a woman turns 65+ she begins to make significantly more than the women in age group before her, yet she is still making an average of about 73% of her male peers.
### Relationship Between Occupation and Wage Gap
The second independent variable investigated was whether an occupation's majority gender has a relationship to the female percent of male earnings. We hypothesized that the wage gap is wider in male dominated occupations, therefore the female percent of male earnings would be lower in those occupations. We looked at this in two different ways.
First, a new variable was created, *gender_dominatedV*. For occupations with 50.1% or higher of females this variable was coded as "0", for occupations made up of 50% or lower of females this variable was coded as "1". In other words, "Male-domindated" is coded as "1" and "Female-Dominated" is coded as "0".
```{r new categorical variable- binary}
#Using minor_category and percent_female variables to create a new variable of "female dominated" vs. "male dominated"
#gender_dominated<-data_frame(jobs_gender$occupation, jobs_gender$percent_female)
#head(jobs_gender)
#kable(gender_dominated[order(gender_dominated$`jobs_gender$percent_female`, gender_dominated$`jobs_gender$occupation`, decreasing = TRUE),])
jobs_gender<-jobs_gender%>%
mutate(gender_dominatedV = case_when(percent_female >=50.1 ~ 0,
percent_female <=50 ~ 1))
#head(jobs_gender)
```
In the visualization below, we can see that there is a negative relationship between male dominated occupations and female percent of male earnings. This means, women make less of a percent of their male colleagues' earnings in male dominated fields, than do women working in female dominated fields. The second visualization is taking a different perspective.
```{r relationship between occupation and wage gap1}
ggplot(data = jobs_gender, aes(x = gender_dominatedV, y = wage_percent_of_male)) +
geom_point() +
geom_smooth(method = 'lm', se=F) +
labs(title = "Female Salary Percent of Male Salary by Gender Dominated Occuption",
x="Gender Dominated",
y="Percent of Salary")
```
The second way of looking at the majority gender in an occupation was with the variable "percent_female", which provided the percent of people in an occupation who identified as female. The visualization below uses this variable to compare the percent of women in an occupation with the percent of male earnings. Here we can see a positive relationship, showing that as the amount of women in an occupation increase, their percent of earnings also increases. Both visualizations show that women working in in male-dominated occupations make less of a percent men than women working in female-dominated occupations.
```{r relationship between occupation and wage gap2}
ggplot(data = jobs_gender, aes(x = percent_female, y = wage_percent_of_male)) +
geom_point() +
geom_smooth(method = 'lm', se=F) +
labs(title = "Female Salary Percent of Male Salary by Percent of Women in Occupations",
x="Percent of Women",
y="Percent of Salary")
```
#### Analysis of Relationship Between Occupation and Wage Gap
To know if this relationship is statistically significant, three regression analyses were run. Missing data were considered in this analysis. Data from the variable *wage_percent_of_male* were coded as NAs if the sample size of an occupation was too low (with a count of 846 NAs). Therefore, these missing data were found to be "missing not at random". These data were dropped when running the regressions. However is it noted since this could bias the data as it does not include the entire data-set's set of occupations.
```{r missing data}
summary(jobs_gender)
summary(earnings_female)
summary(employed_gender)
```
```{r regression occupation and wage gap}
summary(GenDomReg<-lm(wage_percent_of_male ~ gender_dominatedV, data = jobs_gender))
summary(GenDomReg2<-lm(wage_percent_of_male ~ percent_female, data = jobs_gender))
```
According to the two regressions above, we can conclude that there is a statistically significant relationship between female percent of male earnings and percent of women in the occupation. That is, the more women in a field, the smaller the wage gap.
In the first regression (*GenDomReg*), we compared our binary variable (*gender_dominatedV*) to the female percent of male earnings (*wage_percent_of_male*) to find a significant, negative relationship. This analysis resulted in a p value of 0.00153 and a coefficient of -1.6920 (adjusted R-squared of 0.00727). This means the relationship between male dominated occupations and female percent of male earnings is negative. That is, women working in male-dominated occupations make an average of 1.6% less percent of what men make compared to women working in female dominated occupations. From this we can conclude that .
In the second regression (*GenDomReg2*), we compared percent of women in an occupation with female percent of male earnings and found a significant positive relationship. The analysis resulted in a significant p value of 8.27e-05 and a coefficient of 0.04258 (adjusted R-squared of 0.01163). These results reflect a positive relationship between percent of women in the occupation and wage percent of male. This means, that as the percent of women in the occupation increases, the percent of earnings increases by .04%.
From these analyses we can conclude that women working in male dominated occupations make significantly less of a percent of what men make, compared to the women working in female dominated occupations; and as percent of women in an occupation increase, women make more percent of male earnings than women in male dominated occupations.
A third regression was then run in order to control for total earnings and total workers (results below). We decided to use the percent of women variable rather than the binary variable because that yeilded a stronger relationship to percent of earnings.
```{r occupation and wage regression 3}
summary(GenDomReg3<-lm(wage_percent_of_male ~ percent_female + total_earnings + total_workers, data = jobs_gender))
```
These regression results maintain the significant relationship between percent of women in an occupation and their percent of male earnings. However, the significance level decreases (p value of 0.00129), whereas the adjusted R-squared increases (0.02039). This means that we have a stronger model as a whole, however the relationship between our IV and DV has lost strength. The coefficients of the predictor variable is 3.561e-02. This indicates a positive relationship between percent female and female percent of male earnings. This means, as the percent of women in an occupation increases by 1%, the female percent of male earnings increases by 3.561e-02% (0.03561 or about 0.04%). These results indicate that as the proportion of women in an occuption increases, the women in that occupation are paid more relative to their male peers when compared to women working in an occupation with a lower proportion of females.
From these regressions we can conclude that the wage gap is statistically significantly greater for women in male-dominated occupations, than it is for women in female-dominated occupations.
#### Model Comparison and Diagnosis.
```{r AIC and BIC for all three models}
n<-1646
#GenDomReg
RSS1<-deviance(GenDomReg)
GenDomRegAIC<-n*log(RSS1/n) + 2*1
#print(GenDomRegAIC)
#6893.216
GenDomRegBIC<-n*log(RSS1/n) + log(n)*1
#print(GenDomRegBIC)
#6898.623
#GenDomReg2
RSS2<-deviance(GenDomReg2)
GenDomReg2AIC<-n*log(RSS2/n) + 2*1
#print(GenDomReg2AIC)
#6885.975
GenDomReg2BIC<-n*log(RSS2/n) + log(n)*1
#print(GenDomReg2BIC)
#6891.381
#GenDomReg3
RSS3<-deviance(GenDomReg3)
GenDomReg3AIC<-n*log(RSS3/n) + 2*3
#print(GenDomReg3AIC)
#6872.658
GenDomReg3BIC<-n*log(RSS3/n) + log(n)*3
#print(GenDomReg3BIC)
#6888.876
#Creating a dataframe of my model comparison indicators
Model<- c('GenDomReg', 'GenDomReg2', 'GenDomReg3')
AIC<-c(6893.22, 6885.98, 6872.66)
BIC<-c(6898.62, 6891.38, 6888.88)
p_value<-c('0.00153**', '8.27e-05***', '0.00129**')
Adjusted_R_Squared<-c(0.00727, 0.01163, 0.02039)
Coefficient<-c(-1.692, 0.04258, 3.56E-02)
Model_Comparison<-data.frame(Model, AIC, BIC, p_value, Adjusted_R_Squared, Coefficient)
#Putting the model comparison indicators in a table
kable(Model_Comparison)
```
To compare the three models we consider the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), p value, and adjusted R-squared. Across the three models, the AIC and BIC are not drastically different, however we do see that model 1 (*GenDomReg*) has the highest in both, possibly due to the binary predictor variable. The model comparison also shows that model 3 (*GenDomReg3*) has the strongest adjusted R-squared, which could be due to added control variables. And model 2 (*GenDomReg2*) has the more significant p value.
Below are the results of the regression diagnostics for all three regression analyses described above.
```{r regression diagnostic1}
par(mfrow = c(2,3))
plot(GenDomReg, which = 1:6)
```
The first regression (*GenDomReg*) analysis has passed all of the assumptions based on the diagnostic results above.
**Residuals vs Fitted**
The line in this diagnostic test is straight, this shows the analysis passing the assumption of linearity. The straight line also indicates a constant variance.
**QQ-Plot**
The results of the QQ Plot indicate the passing of the normality assumption. This is shown with the points falling along the line.
**Scale-Location**
The scale-location plot is another indicator of the constant variance assumption. The horizontal line resulting from this diagnostic test indicates that the model has passed this assumption.
**Cook's Distance**
The results of this show that cook's distance remains under 1, this indicates passing the assumption of influential observation.
**Residuals vs Leverage**
The results of this diagnostic show no points within the same area as the dashed lines. This is a further indicator that the analysis passed the assumption of influential observation. We can assume that there is not one observation that is particularly more influential than another.
**Cook's Distance vs Leverage**
The results of this diagnostic plot show that the model is within a healthy area of under 1 for cooks distance (y axis) and below o.167 for leverage (x axis).
```{r regression diagnostic2}
par(mfrow = c(2,3))
plot(GenDomReg2, which = 1:6)
```
The second regression (*GenDomReg2*) analysis has also passed all of the assumptions based on the diagnostic results above.
**Residuals vs Fitted**
The line in this diagnostic test is generally straight with a slight curve. The slight curve is not enough to violate the assumptions tested in this diagnosis. Therefore the model passed the assumption of linearity and constant variance.
**QQ-Plot**
The results of the QQ Plot above shows the points falling along the line and therefore indicating that the model has passed the assumption of normality.
**Scale-Location**
The scale-location plot is another indicator of the constant variance assumption. The horizontal line resulting from this diagnostic test indicates that the model has passed this assumption.
**Cook's Distance**
The results of this show that cook's distance remains under 1, this indicates passing the assumption of influential observation.
**Residuals vs Leverage**
The results of this diagnostic show no points outside of the area with the red line. This is a further indicator that the analysis passed the assumption of influential observation. We can again assume that there is not one observation that is particularly more influential than another.
**Cook's Distance vs Leverage**
The results of this diagnostic plot show that the model is within a healthy area of under 1 for cooks distance (y axis) and below o.167 for leverage (x axis).
```{r regression diagnostic3}
par(mfrow = c(2,3))
plot(GenDomReg3, which = 1:6)
```
The third regression analysis (*GenDomReg3*) has also passed all of the assumptions based on the diagnostic results above.
**Residuals vs Fitted**
The line in this diagnostic test is generally straight with a few slight curves. These curves are not enough to violate the assumptions tested in this diagnosis. Therefore the model passed the assumption of linearity and constant variance.
**QQ-Plot**
The results of the QQ Plot above shows the points falling along the line and therefore indicating that the model has passed the assumption of normality.
**Scale-Location**
The scale-location plot is another indicator of the constant variance assumption. The horizontal line resulting from this diagnostic test indicates that the model has passed this assumption. Although the line has a few curves, these are not enough to violate the assumption.
**Cook's Distance**
The results of this show that cook's distance remains under 1, this indicates passing the assumption of influential observation.
**Residuals vs Leverage**
The results of this diagnostic show no points outside of the area with the red line. This is a further indicator that the model passed the assumption of influential observation. We can, for a third time, assume that there is no single observation that is particularly more influential than another.
**Cook's Distance vs Leverage**
The results of this diagnostic plot show that the model is within a healthy area of under 1 for cooks distance (y axis) and below o.167 for leverage (x axis).
## Conclusions
#### Hyptheses
We hypothesized that younger woman and women working in female-dominated occupations will experience a smaller wage gap than older women and those working in male-dominated occupations.
The results of the analysis show that we reject the null hypotheses. That is, there is a significant relationship between both independent variables (*age group* and *gender_dominatedV*) and the dependent variable (*wage_percent_of_male*). The results showed that as a woman gets older, she gets paid less percent of male earnings. With every age group jump, a woman made a significantly lower percent of male earnings than the prior age group.
We can also conclude that as the percent of women in an occupation increase, the percent of male earnings also increased. This means that women in female-dominated occupations experience a smaller wage gap than those in male-dominated occupations.
## Limitations
This investigation had hoped to analyze the interaction between the two independent variables (age group and occupation), however the occupation data ranged from 2011 to 2013 and the age data ranged from 1979 to 2011. Therefore, we were unable to look at the interaction between age and occupation, which tells us only a portion of the influence of our independent variables on wage gap.
## References
```{R Citations}
#citation("readr")
#citation("ggplot2")
#citation("tidyverse")
#citation("stringr")
#citation("dplyr")
#citation("vtable")
````
Aragão, C., 2023. Gender pay gap in U.S. hasn’t changed much in two decades, Pew Research Center. United States of America. Retrieved from https://policycommons.net/artifacts/3456468/gender-pay-gap-in-us/4256843/ on 29 Apr 2023. CID: 20.500.12592/8mzrnq.
Blau, F. D., & Kahn, L. M. (2017). The gender wage gap: Extent, trends, and explanations. Journal of economic literature, 55(3), 789-865.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Huntington-Klein N (2023). _vtable: Variable Table for Variable Documentation_. R package version 1.4.2,
<https://CRAN.R-project.org/package=vtable>.
Kochhar, R. (2023). The enduring grip of the gender pay gap.
Thomas Mock (2022). Tidy Tuesday: A weekly data project aimed at the R ecosystem. https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-05
Wickham H (2022). _stringr: Simple, Consistent Wrappers for Common String Operations_. R package version 1.5.0,
<https://CRAN.R-project.org/package=stringr>.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL,
Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H
(2019). “Welcome to the tidyverse.” _Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686
<https://doi.org/10.21105/joss.01686>.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar of Data Manipulation_. R package version 1.1.0,
<https://CRAN.R-project.org/package=dplyr>.
Wickham H, Hester J, Bryan J (2023). _readr: Read Rectangular Text Data_. R package version 2.1.4,
<https://CRAN.R-project.org/package=readr>.
Wrohlich, K. (2017). Gender pay gap varies greatly by occupation. DIW Economic Bulletin, 7(43), 429-435.