Exploratory Data Analysis
Author

Moira Chiong

Published

June 12, 2023

Read in Data

Code
setwd("C:/Users/chion/OneDrive/Desktop/DACSS 601/DACSS_601_Summer2023_Sec1/posts")
library(readxl)
Code
library(tidyverse)
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.2     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.2     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.1     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(dplyr)
library(ggplot2)
pseo_ma_SANDBOX <- read_excel("datafolderMoiraChiong/pseo_ma_SANDBOX.xlsx")
pseo_ma_SANDBOX
# A tibble: 12,639 x 48
   agg_level_pseo label_agg_level_pseo   inst_level label_inst_level institution
            <dbl> <chr>                  <chr>      <chr>            <chr>      
 1             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 2             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 3             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 4             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 5             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 6             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 7             38 "Degree Level *\r\nIn~ I          Institution      00216700   
 8             38 "Degree Level *\r\nIn~ I          Institution      00216700   
 9             38 "Degree Level *\r\nIn~ I          Institution      00216700   
10             38 "Degree Level *\r\nIn~ I          Institution      00216800   
# i 12,629 more rows
# i 43 more variables: label_institution <chr>,
#   `Degree\r\nAward\r\nLevel` <chr>, label_degree_level <chr>,
#   cip_level <chr>, label_cip_level <chr>, cipcode <chr>, label_cipcode <chr>,
#   grad_cohort <chr>, label_grad_cohort <chr>, grad_cohort_years <chr>,
#   label_grad_cohort_years <chr>, geo_level <chr>, label_geo_level <chr>,
#   geography <chr>, label_geography <chr>, ind_level <chr>, ...
Code
View(pseo_ma_SANDBOX)

Commentary

For Homework 2, I used the Census Bureau Post-Secondary Employment Outcomes (PSEO) database, filtered for Massachusetts. This database reports employment outcomes (earnings 1, 5 and 10 years after graduation) for all public Massachusetts higher education institutions, covering graduates at all credential levels, from certificate to doctorate. There were 12,639 rows and 48 columns in the dataset.

## Potential Research Questions

What is the relationship between credential level and earnings? Do earnings grow over time?

Questions for further research: What is the relationship between race/ethnicity, age, Pell status and earnings?

## Notes on Code

The summary statistics are computed on the `y1_p50_earnings` and `y5_p50_earnings` columns. One year after graduation, the mean for all graduates was 41,256 and the median was lower at 36,197, which indicates a right-skewed distribution. Five years after graduation, the mean was 57,654 and the median was 53,678, again a right-skewed distribution. For Baccalaureates, the first-year mean was 39,966 and the median was 36,050. This tells me that most observations sit toward the low end of the number line, with a long tail of higher earnings pulling the mean up (a right skew). The fifth-year Baccalaureate mean, 58,696, was again higher than the median, 54,870, showing the same right skew.

First-year Baccalaureate earnings ranged from a low minimum of 19,825 to a maximum of 110,317. Fifth-year Baccalaureate earnings were much higher, with a minimum of 29,649 and a maximum of 120,707. For the entire sample, the first-year minimum and maximum were 19,766 and 147,728, and the fifth-year minimum and maximum were 28,528 and 189,604. These figures indicate that earnings increased with time since graduation, and that they also varied by credential level.

In addition, the visualizations show patterns in how earnings relate to credential, institution and academic field. Higher credentials generally led to higher earnings, and scientific and professional fields were more lucrative. Among Economics Bachelor’s graduates, earnings ranged from around $30k to around $50k and were highest among UMass Amherst grads.
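A quick way to check the direction of skew is to compare the mean and the median directly. The sketch below uses made-up earnings values (not PSEO data) and applies the usual rule of thumb that a mean above the median points to a long right tail; the names `toy_earnings` and `skew_direction` are only for illustration.

```r
# Toy earnings values (invented for illustration, not PSEO data):
# most values are modest and a few are very high.
toy_earnings <- c(30000, 32000, 35000, 36000, 38000, 40000, 90000, 120000)

toy_mean <- mean(toy_earnings)
toy_median <- median(toy_earnings)

# Rule of thumb: mean above median suggests right skew,
# mean below median suggests left skew.
skew_direction <- if (toy_mean > toy_median) {
  "right"
} else if (toy_mean < toy_median) {
  "left"
} else {
  "roughly symmetric"
}

c(mean = toy_mean, median = toy_median)
skew_direction
```

This is only a heuristic; a histogram of the earnings column is a more direct check of the distribution's shape.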

## Data Wrangling
::: {.cell}

```{.r .cell-code}
# List the unique degree levels
unique(pseo_ma_SANDBOX$label_degree_level)

# Summary statistics for the whole data set, 1 year after graduation
# (summary() reports NAs on its own, so na.rm is not needed)
y1_num <- as.numeric(pseo_ma_SANDBOX$y1_p50_earnings)
summary(y1_num)

# Summary statistics for the whole data set, 5 years after graduation
y5_num <- as.numeric(pseo_ma_SANDBOX$y5_p50_earnings)
summary(y5_num)

# Number of rows and columns in the dataset
nrow(pseo_ma_SANDBOX)
ncol(pseo_ma_SANDBOX)

# Summary statistics for Baccalaureate degrees only, excluding the
# "All Instructional Programs" aggregate rows
pseo_filtered <- filter(pseo_ma_SANDBOX, label_degree_level == "Baccalaureate")
pseo_filtered_five <- filter(pseo_filtered, label_cipcode != "All Instructional Programs")
y1a_num <- as.numeric(pseo_filtered_five$y1_p50_earnings)
summary(y1a_num)
y5a_num <- as.numeric(pseo_filtered_five$y5_p50_earnings)
summary(y5a_num)

# Keep only the "All Instructional Programs" rows to get an overview
pseo_filtered_two <- filter(pseo_ma_SANDBOX, label_cipcode == "All Instructional Programs")

# Mean year-1 earnings by degree level. The conversion happens inside the
# pipeline so that mean() sees each group's rows, not the full y1_num vector.
pseo_ma_SANDBOX %>%
  mutate(y1_num = as.numeric(y1_p50_earnings)) %>%
  group_by(label_degree_level) %>%
  summarise(Mean = mean(y1_num, na.rm = TRUE))
```

:::

Visualization Code

Code
library(tidyverse)
library(ggplot2)
View(pseo_ma_SANDBOX)
pseo_viz <- pseo_ma_SANDBOX
#y1a_num_1<- as.numeric(pseo_viz$y1_grads_earn)
#y1_num <-as.numeric(pseo_viz$y1_p50_earnings)
# Scatterplot of Year 1 earnings by credential level and number of graduates, all institutions
vis_500 <-
ggplot(pseo_viz, aes(x=y1_grads_earn, y=y1_p50_earnings, color=label_degree_level)) +
geom_point() +
labs(y= "Earnings", x= "Number of Graduates", color= "Credential Level", title = "Earnings by Number of Graduates")
print(vis_500)
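
The `as.numeric()` calls in the data-wrangling code suggest that the earnings and graduate-count columns may be read in as text. If so, ggplot2 would treat them as discrete categories in the scatterplot above rather than as a continuous scale. The following is a minimal sketch of converting them first. It assumes the columns are character, and it uses an invented stand-in data frame (`toy_viz`, with made-up values and labels) rather than the real data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for pseo_viz: all values are invented, and the two
# measure columns are stored as text on purpose.
toy_viz <- data.frame(
  label_degree_level = c("Baccalaureate", "Baccalaureate", "Certificate", "Certificate"),
  y1_grads_earn      = c("120", "45", "30", "*"),
  y1_p50_earnings    = c("41000", "38500", "29000", NA),
  stringsAsFactors   = FALSE
)

# Convert the measure columns to numeric; entries that are not numbers become NA.
toy_viz_num <- toy_viz %>%
  mutate(across(c(y1_grads_earn, y1_p50_earnings),
                ~ suppressWarnings(as.numeric(.x))))

# With numeric columns, both axes are continuous.
toy_plot <- ggplot(toy_viz_num,
                   aes(x = y1_grads_earn, y = y1_p50_earnings,
                       color = label_degree_level)) +
  geom_point(na.rm = TRUE) +
  labs(x = "Number of Graduates", y = "Earnings", color = "Credential Level")
print(toy_plot)
```

The same `mutate(across(...))` step could be applied to `pseo_viz` before plotting, if the real columns turn out to be character.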

Code
# facet grid of scatterplot
vis_500 + facet_grid(label_degree_level ~ .)

Code
# facet grid with spacing of scatterplot
vis_500 + facet_wrap( ~ label_degree_level, ncol=2)

Code
#filter for Baccalaureate Economics degrees
library(dplyr)
pseo_filtered <- filter(pseo_ma_SANDBOX, label_degree_level == "Baccalaureate", label_cipcode == "Economics")
View(pseo_filtered)
# scatterplot of Bachelors in Economics, all institutions
vis_600 <-
ggplot(pseo_filtered, aes(x=y1_grads_earn, y=y1_p50_earnings, color=label_institution)) +
geom_point() +
labs(y= "Earnings", x= "Number of Graduates", color= "Institution", title = "Earnings by Number of Graduates by Institution")
print(vis_600)

Code
# bar graph of Economics BAs 1 year earnings  by institution
vis_900 <-ggplot(data = pseo_filtered, mapping = aes(x=y1_grads_earn, y=y1_p50_earnings, fill=label_institution)) +
geom_bar(stat="identity", position = "dodge") +
labs(x="Number of Graduates", y="Earnings", fill="Institution", title="Earnings and Number of Graduates Economics BAs")
print(vis_900)

Code
#Scatterplot of year 1 earnings, all instructional programs, by degree level
pseo_filtered_two <- filter(pseo_ma_SANDBOX, label_cipcode == "All Instructional Programs")
View(pseo_filtered_two)
vis_550 <-
ggplot(pseo_filtered_two, aes(x=y1_grads_earn, y=y1_p50_earnings, color=label_degree_level)) +
geom_point() +
labs(y= "Earnings", x= "Number of Graduates", color= "Degree Level", title = "Earnings by Number of Graduates, All Instructional Programs")
print(vis_550)

Code
# Facet wrap of all instructional programs
vis_550 + facet_wrap( ~ label_degree_level, ncol=3)

Code
# 1 year earnings for BAs by 2-digit CIP Code at UMass Amherst- Scatterplot
library(dplyr)
pseo_filtered_three <- filter(pseo_ma_SANDBOX, institution == "00222100", label_degree_level == "Baccalaureate", cip_level == "2")
View(pseo_filtered_three)
vis_650 <-
ggplot(pseo_filtered_three, aes(x=y1_grads_earn, y=y1_p50_earnings, color=label_cipcode)) +
geom_point() +
labs(y= "Earnings", x= "Number of Graduates", color= "Academic Field (2-digit CIP)", title = "Earnings 1 Year After Graduation by Number of Graduates at UMass Amherst by Academic Field")
print(vis_650)

Code
#Year 5 Earnings for UMass Amherst by CIP Code BA
vis_700 <-
ggplot(pseo_filtered_three, aes(x=y5_grads_earn, y=y5_p50_earnings, color=label_cipcode)) +
geom_point() +
labs(y= "Earnings", x= "Number of Graduates", color= "Academic Field (2-digit CIP)", title = "Earnings 5 Years After Graduation by Number of Graduates at UMass Amherst by Academic Field")
print(vis_700)
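
The year-1 and year-5 plots above sit on separate charts, which makes growth over time hard to read off. One possible approach is to reshape the two earnings columns into long form and draw both time points on a shared axis. This is a sketch under stated assumptions: `toy_fields` is an invented stand-in for `pseo_filtered_three`, with made-up field names and values, and the earnings columns are assumed to be stored as text.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for pseo_filtered_three; all values are invented.
toy_fields <- data.frame(
  label_cipcode    = c("Field A", "Field B", "Field C"),
  y1_p50_earnings  = c("35000", "42000", "51000"),
  y5_p50_earnings  = c("52000", "60000", "78000"),
  stringsAsFactors = FALSE
)

# Reshape to one row per field and time point.
toy_long <- toy_fields %>%
  mutate(across(c(y1_p50_earnings, y5_p50_earnings), as.numeric)) %>%
  pivot_longer(
    cols      = c(y1_p50_earnings, y5_p50_earnings),
    names_to  = "years_after_grad",
    values_to = "p50_earnings"
  ) %>%
  mutate(years_after_grad = if_else(years_after_grad == "y1_p50_earnings",
                                    "1 year", "5 years"))

# One line per field, connecting its year-1 and year-5 earnings.
toy_growth_plot <- ggplot(toy_long,
                          aes(x = years_after_grad, y = p50_earnings,
                              group = label_cipcode, color = label_cipcode)) +
  geom_line() +
  geom_point() +
  labs(x = "Time Since Graduation", y = "Earnings", color = "Academic Field")
print(toy_growth_plot)
```

An upward-sloping line for a field would indicate that its earnings rose between the first and fifth year after graduation.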

Code
#Dodge by 2-digit CIP UMass Amherst BA
vis_950 <-ggplot(data = pseo_filtered_three, mapping = aes(x=y1_grads_earn, y=y1_p50_earnings, fill=label_cipcode)) +
geom_bar(stat="identity", position = "dodge") +
labs(x="Number of Graduates", y="Earnings", fill="Academic Field", title="Earnings and Number of Grads by Field- UMass Amherst")
print(vis_950)
