Homework 2

Summary Statistics

Author

Moira Chiong

Published

June 12, 2023

Homework Two: Reading in Data

Sean Conway • Jun 6

100 points

For this homework, your goal is to read in a more complicated dataset. Please use the category tag “hw2” as well as a tag for the dataset you choose to use.

Read in a dataset. It’s strongly recommended that you choose a dataset you’re considering using for the final project. If you decide to use one of the datasets we have provided, please use a challenging dataset - check with us if you are not sure.

Code

setwd("C:/Users/chion/OneDrive/Desktop/DACSS 601/DACSS_601_Summer2023_Sec1/posts")
library(readxl)

Warning: package 'readxl' was built under R version 4.1.3

Code

pseo_ma_SANDBOX <- read_excel("pseo_ma_SANDBOX.xlsx")
pseo_ma_SANDBOX

# A tibble: 12,639 x 48
   agg_level_pseo label_agg_level_pseo   inst_level label_inst_level institution
            <dbl> <chr>                  <chr>      <chr>            <chr>      
 1             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 2             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 3             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 4             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 5             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 6             38 "Degree Level *\r\nIn~ I          Institution      00216100   
 7             38 "Degree Level *\r\nIn~ I          Institution      00216700   
 8             38 "Degree Level *\r\nIn~ I          Institution      00216700   
 9             38 "Degree Level *\r\nIn~ I          Institution      00216700   
10             38 "Degree Level *\r\nIn~ I          Institution      00216800   
# i 12,629 more rows
# i 43 more variables: label_institution <chr>,
#   `Degree\r\nAward\r\nLevel` <chr>, label_degree_level <chr>,
#   cip_level <chr>, label_cip_level <chr>, cipcode <chr>, label_cipcode <chr>,
#   grad_cohort <chr>, label_grad_cohort <chr>, grad_cohort_years <chr>,
#   label_grad_cohort_years <chr>, geo_level <chr>, label_geo_level <chr>,
#   geography <chr>, label_geography <chr>, ind_level <chr>, ...

Clean the data as needed using dplyr and related tidyverse packages.
Provide a narrative about the data set (look it up if you aren’t sure what you have got) and the variables in your dataset, including what type of data each variable is. The goal of this step is to communicate in a visually appealing way to non-experts - not to replicate r-code.
Identify potential research questions that your dataset can help answer.
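The printed tibble shows embedded line breaks (`\r\n`) in one column name and in some label values, which is the kind of cleaning this step asks for. A minimal base R sketch of removing them, run on a made-up two-row stand-in rather than the real file (the code values "05" and "07" are hypothetical):

```r
# Toy stand-in for two columns of the PSEO sheet; the values are illustrative only
toy <- data.frame(
  degree = c("05", "07"),
  label_degree_level = c("Baccalaureate", "Doctoral -\r\nResearch/Scholarship"),
  stringsAsFactors = FALSE
)
# Reproduce the awkward header seen in the tibble printout
names(toy)[1] <- "Degree\r\nAward\r\nLevel"

# Replace each run of line-break characters in the column names with "_", then lowercase
names(toy) <- tolower(gsub("[\r\n]+", "_", names(toy)))

# Collapse line breaks (and any surrounding spaces) inside the labels to one space
toy$label_degree_level <- gsub("\\s*[\r\n]+\\s*", " ", toy$label_degree_level)

print(names(toy))
print(toy$label_degree_level)
```

The same two `gsub()` calls should work on `pseo_ma_SANDBOX` itself, assuming the line breaks there are the same `\r\n` sequences shown in the printout.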

## Commentary

For Homework 2, I used the Census Bureau’s Post-Secondary Employment Outcomes (PSEO) data, filtered for Massachusetts. The data report employment outcomes (earnings 1, 5, and 10 years after graduation) for graduates of all public higher education institutions in Massachusetts, at every credential level from certificate to doctorate.

## Potential Research Questions:

What is the relationship between credential level and earnings? Do earnings grow over time?

Questions for further research: What is the relationship between race/ethnicity, age, Pell status and earnings?

## Notes on Code

Each row of the data reports median (50th percentile) earnings for a group of graduates, so the statistics below summarize those row-level medians rather than individual earnings.

For all credential levels, the mean one year after graduation was 41256 and the median was lower at 36197, which indicates a right-skewed distribution: most observations sit toward the low end of the number line, and a long tail of high values pulls the mean above the median. Five years after graduation the mean was 57654 and the median was 53678, again a right skew.

For Baccalaureates, the mean one year after graduation was 39966 and the median was 36050. Five years after graduation the mean was 58696 and the median was 54870. In both cases the mean exceeds the median, so these distributions are right-skewed as well.

Baccalaureate earnings one year after graduation ranged from a minimum of 19825 to a maximum of 110317. Five years after graduation both were higher, at 29649 and 120707. For the entire sample, the minimum and maximum were 19766 and 147728 one year after graduation, and 28528 and 189604 five years after graduation.

These figures show that earnings increased as more time elapsed since graduation. The summaries here do not yet compare credential levels against each other, so the relationship between credential level and earnings still needs to be tested.
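A quick way to read the direction of skew from summary statistics is to compare the mean with the median: a mean above the median points to a long tail of high values, which is a right skew. The sketch below applies that rule of thumb to the four mean and median pairs reported in this post. The data frame `stats` and its column names are introduced here for illustration only.

```r
# Means and medians copied from the summary() output in this post
stats <- data.frame(
  group  = c("All, year 1", "All, year 5",
             "Baccalaureate, year 1", "Baccalaureate, year 5"),
  mean   = c(41256, 57654, 39966, 58696),
  median = c(36197, 53678, 36050, 54870)
)

# A positive gap (mean above median) suggests a right skew
stats$gap <- stats$mean - stats$median
stats$direction <- ifelse(stats$gap > 0, "right", "left")

print(stats)
```

This is only a rule of thumb. A histogram of the numeric earnings values would show the shape of the distribution directly.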

There were 12639 rows and 48 columns in the dataset.

summary(y1a_num)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  19825   31970   36050   39966   43117  110317    2714

## Code

Code

#list the unique degree levels
unique(pseo_ma_SANDBOX$label_degree_level)

[1] "Certificate < 1 year"                "Associates"                         
[3] "Baccalaureate"                       "Masters"                            
[5] "Doctoral -\r\nResearch/Scholarship"  "Doctoral -\r\nProfessional Practice"
[7] "Certificate 1-2 years"               "Certificate 2-4 years"

Code

#convert to numeric
y1_num<- as.numeric(pseo_ma_SANDBOX$y1_p50_earnings)

Warning: NAs introduced by coercion
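The coercion warning means some entries in the character column could not be parsed as numbers. A minimal sketch, using made-up values rather than the PSEO file, of counting and inspecting those entries before relying on the summary statistics. The placeholder "(S)" is hypothetical, and the actual non-numeric entries in the data may look different.

```r
# Toy character column: three parsable values, one text placeholder, one true NA
toy <- c("36197", "41000", "(S)", NA, "52500")

# Convert to numeric; the warning is suppressed because the NAs are counted explicitly below
toy_num <- suppressWarnings(as.numeric(toy))

# Entries that were already missing versus entries that became NA during conversion
n_missing_before <- sum(is.na(toy))
n_missing_after  <- sum(is.na(toy_num))
n_coerced <- n_missing_after - n_missing_before

# The original text of the entries that failed to convert
failed <- toy[is.na(toy_num) & !is.na(toy)]

print(n_coerced)
print(failed)
```

Running the same check on `pseo_ma_SANDBOX$y1_p50_earnings` would show which text values sit behind the NAs counted by `summary()`.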

Code

#find summary statistics on whole data set, 1 year after graduation
mean(y1_num, na.rm=TRUE)

[1] 41256.48

Code

median(y1_num, na.rm=TRUE)

[1] 36197

Code

min(y1_num, na.rm=TRUE)

[1] 19766

Code

max(y1_num, na.rm=TRUE)

[1] 147728

Code

IQR(y1_num, na.rm=TRUE)

[1] 15153

Code

#find summary statistics on whole data set, 5 years after graduation
y5_num<-as.numeric(pseo_ma_SANDBOX$y5_p50_earnings)

Warning: NAs introduced by coercion

Code

mean(y5_num, na.rm=TRUE)

[1] 57653.91

Code

median(y5_num, na.rm=TRUE)

[1] 53677.5

Code

min(y5_num, na.rm=TRUE)

[1] 28528

Code

max(y5_num, na.rm=TRUE)

[1] 189604

Code

IQR(y5_num, na.rm=TRUE)

[1] 18766.25

Code

#pseo_ma_SANDBOX_num was never created, so build it from the numeric vectors first
pseo_ma_SANDBOX_num <- data.frame(y1_num, y5_num)
summary(pseo_ma_SANDBOX_num)

Code

print(typeof(pseo_ma_SANDBOX$y1_p50_earnings))

[1] "character"

Code

#install dplyr once from the console, not in the rendered document;
#install.packages() needs a CRAN mirror when run non-interactively, e.g.
#install.packages("dplyr", repos = "https://cloud.r-project.org")

Code

library(dplyr)

Warning: package 'dplyr' was built under R version 4.1.3


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

summary(y1_num)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  19766   31280   36197   41256   46433  147728    9174

Code

summary(y5_num)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  28528   46162   53678   57654   64928  189604    9495

Code

#find the number of rows and columns in the dataset
nrow(pseo_ma_SANDBOX)

[1] 12639

Code

ncol(pseo_ma_SANDBOX)

[1] 48

Code

#filter for Baccalaureate degrees
pseo_filtered <- filter(pseo_ma_SANDBOX, label_degree_level == "Baccalaureate")
#View(pseo_filtered) opens an interactive viewer; run it from the console only

#find summary statistics of filtered data-- just BA degrees
y1a_num<- as.numeric(pseo_filtered$y1_p50_earnings)

Warning: NAs introduced by coercion

Code

summary(y1a_num)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  19825   31970   36050   39966   43117  110317    2714

Code

y5a_num <- as.numeric(pseo_filtered$y5_p50_earnings)

Warning: NAs introduced by coercion

Code

summary(y5a_num)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  29649   48861   54870   58696   63958  120707    3191
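The Baccalaureate summaries above cover one degree level at a time. To address the research question about credential level and earnings, the same statistic could be computed for every level in one step. Below is a minimal base R sketch on made-up rows. The earnings values are hypothetical, and only the column names `label_degree_level` and `y1_p50_earnings` come from the dataset.

```r
# Toy rows standing in for the PSEO data; earnings are illustrative, stored as text
toy <- data.frame(
  label_degree_level = c("Associates", "Associates", "Baccalaureate",
                         "Baccalaureate", "Masters", "Masters"),
  y1_p50_earnings = c("30000", "34000", "36000", "40000", "55000", NA),
  stringsAsFactors = FALSE
)

# Convert the character earnings column to numeric
toy$y1_num <- as.numeric(toy$y1_p50_earnings)

# Median of first-year earnings per degree level; the formula interface drops NA rows
by_level <- aggregate(y1_num ~ label_degree_level, data = toy, FUN = median)

print(by_level)
```

With dplyr loaded, `group_by()` followed by `summarise()` would give an equivalent table. Base R is used here only to keep the sketch free of package dependencies.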