Final Project: Census PSEO Analysis

Author

Moira Chiong

Published

July 13, 2023

Code

setwd("C:/Users/chion/OneDrive/Desktop/DACSS 601/DACSS_601_Summer2023_Sec1/posts")
library(readxl)
library(tidyverse)
library(dplyr)
library(ggplot2)
pseo_ma_SANDBOX <- read_excel("datafolderMoiraChiong/pseo_ma_SANDBOX.xlsx")
pseo_ma_SANDBOX

# A tibble: 12,639 x 48
   agg_level_pseo label_agg_level_pseo   inst_level label_inst_level institution
            <dbl> <chr>                  <chr>      <chr>            <chr>      
 1             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 2             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 3             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 4             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 5             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 6             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 7             38 "Degree Level *\r\nIn~ I          Institution      00216700   
 8             38 "Degree Level *\r\nIn~ I          Institution      00216700   
 9             38 "Degree Level *\r\nIn~ I          Institution      00216700   
10             38 "Degree Level *\r\nIn~ I          Institution      00216800   
# i 12,629 more rows
# i 43 more variables: label_institution <chr>,
#   `Degree\r\nAward\r\nLevel` <chr>, label_degree_level <chr>,
#   cip_level <chr>, label_cip_level <chr>, cipcode <chr>, label_cipcode <chr>,
#   grad_cohort <chr>, label_grad_cohort <chr>, grad_cohort_years <chr>,
#   label_grad_cohort_years <chr>, geo_level <chr>, label_geo_level <chr>,
#   geography <chr>, label_geography <chr>, ind_level <chr>, ...

Introduction

For my final project, I used the Census Bureau Post-Secondary Employment Outcomes (PSEO) database, filtered for Massachusetts. This database reports employment outcomes (earnings 1, 5 and 10 years after graduation) for all public Massachusetts higher education institutions. The data set covers graduates at all credential levels, from certificate to doctorate. The data is divided into three- to five-year graduation cohorts spanning 2001-2018. The data set also tabulates the number of graduates and the number of employed graduates.

One limitation of this data set is that it is not disaggregated by demographic variables such as race/ethnicity, gender, age or Pell status. This makes a truly nuanced study of the research questions difficult with the present data set. It is my hope that new iterations of this data set will include such demographic variables.

Potential Research Questions

The following questions are intended to uncover patterns in the PSEO data. While a robust analysis may require advanced statistical methods, the analysis completed in this project does reveal underlying general patterns. My research questions are as follows:

1. What is the relationship between credential level and earnings?
2. What is the relationship between time elapsed since graduation and earnings? Do earnings grow as more years after graduation elapse?
3. What is the relationship between academic field of study and earnings?

Questions for further research: What is the relationship between race/ethnicity, age, Pell status and earnings?

Notes on Code: Data Analysis and Answers to Questions 1-3

Question 1: The data show a clear positive relationship between credential level and earnings: as credentials increase toward the doctoral level, earnings increase. This is corroborated in Table 1 below. For example, Doctoral graduates earn about twice as much (on average, around $80k, 1 year after graduation) as Baccalaureate graduates (on average, around $40k, 1 year after graduation). With the exception of Certificates 1-2 years and both types of doctoral degrees, the earnings distributions are skewed right: most values cluster at the lower end, with a long tail toward higher earnings.

Question 2: The data also show that earnings increase with time elapsed since graduation, as seen in the following summary statistics:

Summary Statistics, Year 1, p50 Earnings

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  22214   34095   38097   46229   55442  120621      10

Summary Statistics, Year 5, p50 Earnings

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  31422   47280   53655   60603   70968  138314      10

Note the increase in median earnings from $38,097 in Year 1 to $53,655 by Year 5; mean earnings rise from $46,229 to $60,603.
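The size of this increase can be made concrete with a short calculation. The sketch below, in base R, uses only the Year 1 and Year 5 median and mean values reported in the summary statistics above; the variable names are illustrative and not part of the original analysis.

```r
# Median and mean p50 earnings, copied from the summary statistics above
y1 <- c(median = 38097, mean = 46229)
y5 <- c(median = 53655, mean = 60603)

# Percentage growth from Year 1 to Year 5, rounded to one decimal place
growth_pct <- round(100 * (y5 - y1) / y1, 1)
print(growth_pct)
```

With these figures, the median grows by roughly 41 percent and the mean by roughly 31 percent. Both summaries are computed over rows of the filtered table rather than over individual graduates, so this compares aggregates and does not track any one graduate's earnings.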

Question 3: My conjecture is that STEM fields are the most lucrative at any credential level. Given strong labor markets in STEM professions, I expect postsecondary employment outcomes to be strongest for graduates in STEM fields. I check this later using UMass Amherst BA earnings data by academic field.

Close Up: Earnings of BA Economics Graduates

Out of curiosity, I analyzed the earnings patterns of BA Economics graduates in the MA public system. Of all the MA public institutions granting the Economics BA, University of Massachusetts Amherst recorded the highest earnings, at $47,610 one year after graduation. It is interesting to note that Year 1 earnings in the UMass segment were generally higher than at the state universities. This pattern did not fully extend to the 5-year mark: earnings were still highest at UMass Amherst, but Bridgewater State, Framingham State and Worcester State all showed earnings of over $60,000, comparable to the UMass campuses, while UMass Dartmouth fell below every state university with reported data.

Close Up: Earnings of UMass Amherst BA Graduates

In general, graduates of UMass Amherst have the highest earnings of graduates of any public higher education institution in Massachusetts. The most lucrative fields are, in descending order, Computer Science, Engineering, Business and Health Professions. Conversely, Human Sciences, Visual and Performing Arts and History are the least lucrative fields. This is not surprising and corroborates conventional wisdom about which fields are most lucrative.

Data Wrangling

Code

# Find the number of rows and columns in the dataset
nrow(pseo_ma_SANDBOX)

[1] 12639

Code

ncol(pseo_ma_SANDBOX)

[1] 48

Code

# Convert to numeric, 1 year after graduation
pseo_ma_SANDBOX$y1_grads_earn <- as.numeric(pseo_ma_SANDBOX$y1_grads_earn)
pseo_ma_SANDBOX$y1_ipeds_count <- as.numeric(pseo_ma_SANDBOX$y1_ipeds_count)
pseo_ma_SANDBOX$y1_p50_earnings <- as.numeric(pseo_ma_SANDBOX$y1_p50_earnings)

Code

# Convert to numeric, 5 years after graduation
pseo_ma_SANDBOX$y5_grads_earn <- as.numeric(pseo_ma_SANDBOX$y5_grads_earn)
pseo_ma_SANDBOX$y5_ipeds_count <- as.numeric(pseo_ma_SANDBOX$y5_ipeds_count)
pseo_ma_SANDBOX$y5_p50_earnings <- as.numeric(pseo_ma_SANDBOX$y5_p50_earnings)
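
The six conversions above follow one pattern, so they could also be written as a single step. The following is a minimal sketch in base R; the data frame `toy` and its values are hypothetical stand-ins for pseo_ma_SANDBOX, and `earn_cols` is an illustrative name.

```r
# Hypothetical stand-in for pseo_ma_SANDBOX: numeric fields read in as text
toy <- data.frame(
  label_institution = c("Institution A", "Institution B", "Institution C"),
  y1_p50_earnings   = c("38097", "46229", NA),
  y5_p50_earnings   = c("53655", "60603", NA),
  stringsAsFactors  = FALSE
)

# Columns to convert (assumed names, mirroring the columns converted above)
earn_cols <- c("y1_p50_earnings", "y5_p50_earnings")

# Convert every listed column in one step; missing values stay NA
toy[earn_cols] <- lapply(toy[earn_cols], as.numeric)

str(toy)
```

Any cell that cannot be parsed as a number becomes NA (with a warning), which is worth checking against the NA counts reported in the summary statistics.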

Descriptive Statistics

Code

# Filter data: all instructional programs, all cohorts
pseo_filtered_two <- filter(pseo_ma_SANDBOX, label_cipcode == "All Instructional Programs" & label_grad_cohort == "All Cohorts")
View(pseo_filtered_two)

Code

# Summary Statistics, 1 year after Graduation

y1gradsearnSummary <-summary(pseo_filtered_two$y1_grads_earn, na.rm=T)
print(y1gradsearnSummary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   34.0   186.5   708.0  2403.7  2315.0 35472.0      10

Code

y1ipedscountSummary <-summary(pseo_filtered_two$y1_ipeds_count, na.rm=T)
print(y1ipedscountSummary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0   224.5   976.0  3370.2  3410.0 53560.0      10

Code

y1p50earningsSummary <-summary(pseo_filtered_two$y1_p50_earnings, na.rm=T)
print(y1p50earningsSummary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  22214   34095   38097   46229   55442  120621      10

Code

# Find summary statistics, 5 years after graduation

y5gradsearnSummary <-summary(pseo_filtered_two$y5_grads_earn, na.rm=T)
print(y5gradsearnSummary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   31.0   208.8   783.0  2039.6  2888.0 25220.0      10

Code

y5ipedscountSummary <-summary(pseo_filtered_two$y5_ipeds_count, na.rm=T)
print(y5ipedscountSummary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0   224.5   976.0  2561.2  2916.5 33916.0      10

Code

y5p50earningsSummary <-summary(pseo_filtered_two$y5_p50_earnings, na.rm=T)
print(y5p50earningsSummary)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  31422   47280   53655   60603   70968  138314      10
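
The right skew described under Question 1 can be checked from these summaries alone, at least for the filtered table as a whole (not by credential level). One option, which is not part of the original analysis, is the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 * median) / (Q3 - Q1), which is positive when the upper half of the distribution is more spread out than the lower half. The sketch below applies it to the p50 earnings quartiles printed above; the function name `bowley_skew` is illustrative.

```r
# Quartile (Bowley) skewness: positive values indicate right skew
bowley_skew <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Quartiles of p50 earnings, copied from the summary output above
skew_y1 <- bowley_skew(q1 = 34095, med = 38097, q3 = 55442)
skew_y5 <- bowley_skew(q1 = 47280, med = 53655, q3 = 70968)

print(round(c(year1 = skew_y1, year5 = skew_y5), 2))
```

Both coefficients are positive (about 0.63 for Year 1 and 0.46 for Year 5), consistent with the means exceeding the medians in both years.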

Code

# 1-year earnings by credential: mean and median
pseo_filtered_two %>%
  group_by(label_degree_level) %>%
  summarise(Mean_Earnings = mean(y1_p50_earnings, na.rm = TRUE),
            Median_Earnings = median(y1_p50_earnings, na.rm = TRUE))

Code

# Filter data: Baccalaureate degrees only, individual programs, all cohorts
pseo_filtered <- filter(pseo_ma_SANDBOX,
                        label_degree_level == "Baccalaureate",
                        label_cipcode != "All Instructional Programs",
                        label_grad_cohort == "All Cohorts")

Data Visualization Code

Code

# Bar chart: mean earnings 1 year after graduation, overall and by credential level

options(scipen = 999)  # avoid scientific notation on the axis
vis_y1_p50_earnings <-
  ggplot(pseo_filtered_two, aes(x = label_institution, y = y1_p50_earnings)) +
  # Plot the mean per institution; geom_bar(stat = "identity") would stack
  # (sum) the rows for each institution instead of averaging them
  stat_summary(fun = mean, geom = "col") +
  ggtitle("Average Earnings, 1 Year After Graduation") +
  xlab("Institution") + ylab("Average Earnings") +
  theme(axis.text.x = element_text(size = 4, angle = -90, hjust = 0)) +
  # Zoom the axis without dropping rows above the limit before averaging
  coord_cartesian(ylim = c(0, 100000))
print(vis_y1_p50_earnings)

Code

  vis_y1_p50_earnings + facet_wrap( ~ label_degree_level, ncol=1)

Code

# Bar chart: mean earnings 5 years after graduation, overall and by credential level
options(scipen = 999)  # avoid scientific notation on the axis
vis_y5_p50_earnings <-
  ggplot(pseo_filtered_two, aes(x = label_institution, y = y5_p50_earnings)) +
  # Plot the mean per institution; geom_bar(stat = "identity") would stack
  # (sum) the rows for each institution instead of averaging them
  stat_summary(fun = mean, geom = "col") +
  ggtitle("Average Earnings, 5 Years After Graduation") +
  xlab("Institution") + ylab("Average Earnings") +
  theme(axis.text.x = element_text(size = 4, angle = -90, hjust = 0)) +
  # Zoom the axis without dropping rows above the limit before averaging
  coord_cartesian(ylim = c(0, 100000))
print(vis_y5_p50_earnings)

Code

  vis_y5_p50_earnings + facet_wrap( ~ label_degree_level, ncol=1)

Code

# Bar Graph of Number of Graduates
options(scipen = 999)
pseo_filtered_two$y1_ipeds_count <- as.numeric(pseo_filtered_two$y1_ipeds_count)
vis_y1_graduates <-
  ggplot(pseo_filtered_two, aes(x=label_institution, y=y1_ipeds_count)) +
  geom_bar(stat="identity") +
  stat_summary(fun = mean, geom = "col") +
  ggtitle("Number of Graduates, 1 year After Graduation, all cohorts") +
 xlab("Institution") + ylab("Number of Graduates")  +
 theme(axis.text.x=element_text(size=4, angle = -90, hjust = 0)) +
scale_y_continuous(limits = c(0, 80000))
  print(vis_y1_graduates)

Code

  vis_y1_graduates + facet_wrap( ~ label_degree_level, ncol=1)

Code

# Filter for Baccalaureate Economics degrees
pseo_filtered_econ <- filter(pseo_ma_SANDBOX, label_degree_level == "Baccalaureate" & label_cipcode == "Economics" & label_grad_cohort == "All Cohorts")

# Scatterplot of Bachelor's in Economics, all institutions
# (the color aesthetic maps to institution, so the legend is labeled accordingly)
vis_600 <-
  ggplot(pseo_filtered_econ, aes(x = y1_grads_earn, y = y1_p50_earnings, color = label_institution)) +
  geom_point() +
  labs(y = "Earnings", x = "Number of Graduates", color = "Institution", title = "Earnings by Number of Graduates by Institution")
print(vis_600)

Code

pseo_filtered_econ %>%
group_by(pseo_filtered_econ$label_degree_level, pseo_filtered_econ$label_institution) %>%
summarise(Mean=mean(y1_p50_earnings, na.rm=T), Median=median(y1_p50_earnings, na.rm=T))

# A tibble: 10 x 4
# Groups:   pseo_filtered_econ$label_degree_level [1]
   `pseo_filtered_econ$label_degree_level` pseo_filtered_econ$lab~1  Mean Median
   <chr>                                   <chr>                    <dbl>  <dbl>
 1 Baccalaureate                           "Bridgewater\r\nState U~ 38473  38473
 2 Baccalaureate                           "Fitchburg State\r\nUni~   NaN     NA
 3 Baccalaureate                           "Framingham\r\nState Un~ 40713  40713
 4 Baccalaureate                           "Salem State\r\nUnivers~ 35121  35121
 5 Baccalaureate                           "University of\r\nMassa~ 47610  47610
 6 Baccalaureate                           "University of\r\nMassa~ 41376  41376
 7 Baccalaureate                           "University of\r\nMassa~ 42996  42996
 8 Baccalaureate                           "University of\r\nMassa~ 38528  38528
 9 Baccalaureate                           "Westfield State\r\nUni~ 40008  40008
10 Baccalaureate                           "Worcester State\r\nUni~ 36980  36980
# i abbreviated name: 1: `pseo_filtered_econ$label_institution`

Code

pseo_filtered_econ %>%
group_by(pseo_filtered_econ$label_institution) %>%
summarise(Mean=mean(y5_p50_earnings, na.rm=T), Median=median(y5_p50_earnings, na.rm=T))

# A tibble: 10 x 3
   `pseo_filtered_econ$label_institution`            Mean Median
   <chr>                                            <dbl>  <dbl>
 1 "Bridgewater\r\nState University"                68271  68271
 2 "Fitchburg State\r\nUniversity"                    NaN     NA
 3 "Framingham\r\nState University"                 64565  64565
 4 "Salem State\r\nUniversity"                      55778  55778
 5 "University of\r\nMassachusetts -\r\nAmherst"    70569  70569
 6 "University of\r\nMassachusetts -\r\nLowell"     62776  62776
 7 "University of\r\nMassachusetts at\r\nBoston"    65056  65056
 8 "University of\r\nMassachusetts at\r\nDartmouth" 51413  51413
 9 "Westfield State\r\nUniversity"                  59899  59899
10 "Worcester State\r\nUniversity"                  63618  63618
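
To compare the two segments directly, the Year 1 and Year 5 figures from the two tables above can be averaged by segment. This is a sketch in base R, not part of the original analysis. It assumes the four truncated "University of Massa~" rows in the Year 1 table appear in the same order as in the Year 5 table (Amherst, Lowell, Boston, Dartmouth), and it leaves out Fitchburg State, which has no reported values. The names `econ` and `segment` are illustrative.

```r
# Economics BA median earnings by institution, copied from the tables above
econ <- data.frame(
  institution = c("Bridgewater State", "Framingham State", "Salem State",
                  "UMass Amherst", "UMass Lowell", "UMass Boston",
                  "UMass Dartmouth", "Westfield State", "Worcester State"),
  y1 = c(38473, 40713, 35121, 47610, 41376, 42996, 38528, 40008, 36980),
  y5 = c(68271, 64565, 55778, 70569, 62776, 65056, 51413, 59899, 63618),
  stringsAsFactors = FALSE
)

# Label each institution as a UMass campus or a state university
econ$segment <- ifelse(grepl("^UMass", econ$institution),
                       "UMass", "State university")

# Unweighted mean of institutional figures within each segment
segment_means <- aggregate(cbind(y1, y5) ~ segment, data = econ, FUN = mean)
print(segment_means)
```

Under these assumptions, the UMass campuses lead by roughly $4,400 at Year 1, while the two segment averages are nearly identical at Year 5. These are unweighted averages of institutional medians, so they do not account for differences in the number of graduates.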

Code

# Bar chart of Bachelor's in Economics earnings, 1 year after graduation
options(scipen = 999)

vis_y1_econ <-
  ggplot(pseo_filtered_econ, aes(x = label_institution, y = y1_p50_earnings)) +
  geom_col() +
  ggtitle("Average Earnings, Economics, 1 Year After Graduation") +
  xlab("Institution") +
  ylab("Average Earnings") +
  theme(axis.text.x = element_text(size = 4, angle = -90, hjust = 0))
print(vis_y1_econ)

Code

# 1 year earnings for BAs by 2-digit CIP Code at UMass Amherst- Scatterplot
library(dplyr)
pseo_filtered_three <-filter(pseo_ma_SANDBOX, pseo_ma_SANDBOX$institution=="00222100" & pseo_ma_SANDBOX$label_degree_level=="Baccalaureate" & pseo_ma_SANDBOX$cip_level=="2" & pseo_ma_SANDBOX$label_grad_cohort== "All Cohorts")

vis_650 <-
ggplot(pseo_filtered_three, aes(x=y1_grads_earn, y=y1_p50_earnings, color=label_cipcode)) +
geom_point() +
labs(y= "Earnings", x= "Number of BA Graduates", color= "Academic Field (2-digit CIP)", title = "Earnings 1 Year After Graduation by Number of BA Graduates at UMass Amherst by Academic Field")
print(vis_650)

Code

library(dplyr)
pseo_filtered_three <-filter(pseo_ma_SANDBOX, pseo_ma_SANDBOX$institution=="00222100" & pseo_ma_SANDBOX$label_degree_level=="Baccalaureate" & pseo_ma_SANDBOX$cip_level=="2" & pseo_ma_SANDBOX$label_grad_cohort== "All Cohorts")

# Bachelor's Earnings by CIP Code, UMass Amherst
pseo_filtered_three %>%
group_by(pseo_filtered_three$label_cipcode)

# A tibble: 26 x 49
# Groups:   pseo_filtered_three$label_cipcode [26]
   agg_level_pseo label_agg_level_pseo   inst_level label_inst_level institution
            <dbl> <chr>                  <chr>      <chr>            <chr>      
 1             40 "Degree Level * CIP\r~ I          Institution      00222100   
 2             40 "Degree Level * CIP\r~ I          Institution      00222100   
 3             40 "Degree Level * CIP\r~ I          Institution      00222100   
 4             40 "Degree Level * CIP\r~ I          Institution      00222100   
 5             40 "Degree Level * CIP\r~ I          Institution      00222100   
 6             40 "Degree Level * CIP\r~ I          Institution      00222100   
 7             40 "Degree Level * CIP\r~ I          Institution      00222100   
 8             40 "Degree Level * CIP\r~ I          Institution      00222100   
 9             40 "Degree Level * CIP\r~ I          Institution      00222100   
10             40 "Degree Level * CIP\r~ I          Institution      00222100   
# i 16 more rows
# i 44 more variables: label_institution <chr>,
#   `Degree\r\nAward\r\nLevel` <chr>, label_degree_level <chr>,
#   cip_level <chr>, label_cip_level <chr>, cipcode <chr>, label_cipcode <chr>,
#   grad_cohort <chr>, label_grad_cohort <chr>, grad_cohort_years <chr>,
#   label_grad_cohort_years <chr>, geo_level <chr>, label_geo_level <chr>,
#   geography <chr>, label_geography <chr>, ind_level <chr>, ...

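Code

# Mean 1-year earnings by 2-digit CIP code, UMass Amherst
# (summarise() must be piped from the grouped data; called on its own it has no data to work on)
pseo_filtered_three %>%
  group_by(label_cipcode) %>%
  summarise(Mean = mean(y1_p50_earnings, na.rm = TRUE))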

Critical Reflection

What I learned from this project confirmed conventional wisdom. For example, my analyses confirmed that earnings generally increase with credential level, that earnings generally increase with time elapsed since graduation, and that earnings in STEM fields are generally higher than in other fields. A limitation of my analysis is that I did not break down the results demographically, because the data set is not disaggregated in this way. I expect that the Census Bureau will provide this disaggregated data in time. Given the recent Supreme Court decision on affirmative action, I would conjecture that employment outcomes data disaggregated by race/ethnicity would be a useful and timely addition to the dataset and to further analysis. Further statistical analysis could formally test the relationship between credential level and earnings, as well as check for a statistical relationship between earnings and race/ethnicity. In addition, unit record data would be very interesting for analysis using panel data methods.

References

Census Bureau (2023). Post-secondary Employment Outcomes: Massachusetts. [Dataset]. https://lehd.ces.census.gov/data/pseo_experimental.html

Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O'Reilly Media.