Summary Statistics
Author

Moira Chiong

Published

June 12, 2023

Homework Two: Reading in Data

Sean Conway

Jun 6

100 points

For this homework, your goal is to read in a more complicated dataset. Please use the category tag “hw2” as well as a tag for the dataset you choose to use.

  • Read in a dataset. It’s strongly recommended that you choose a dataset you’re considering using for the final project. If you decide to use one of the datasets we have provided, please use a challenging dataset - check with us if you are not sure.

    Code
    setwd("C:/Users/chion/OneDrive/Desktop/DACSS 601/DACSS_601_Summer2023_Sec1/posts")
    library(readxl)
    Warning: package 'readxl' was built under R version 4.1.3
    Code
    pseo_ma_SANDBOX <- read_excel("pseo_ma_SANDBOX.xlsx")
    pseo_ma_SANDBOX
    # A tibble: 12,639 x 48
       agg_level_pseo label_agg_level_pseo   inst_level label_inst_level institution
                <dbl> <chr>                  <chr>      <chr>            <chr>      
     1             38 "Degree Level *\r\nIn~ I          Institution      00216100   
     2             38 "Degree Level *\r\nIn~ I          Institution      00216100   
     3             38 "Degree Level *\r\nIn~ I          Institution      00216100   
     4             38 "Degree Level *\r\nIn~ I          Institution      00216100   
     5             38 "Degree Level *\r\nIn~ I          Institution      00216100   
     6             38 "Degree Level *\r\nIn~ I          Institution      00216100   
     7             38 "Degree Level *\r\nIn~ I          Institution      00216700   
     8             38 "Degree Level *\r\nIn~ I          Institution      00216700   
     9             38 "Degree Level *\r\nIn~ I          Institution      00216700   
    10             38 "Degree Level *\r\nIn~ I          Institution      00216800   
    # i 12,629 more rows
    # i 43 more variables: label_institution <chr>,
    #   `Degree\r\nAward\r\nLevel` <chr>, label_degree_level <chr>,
    #   cip_level <chr>, label_cip_level <chr>, cipcode <chr>, label_cipcode <chr>,
    #   grad_cohort <chr>, label_grad_cohort <chr>, grad_cohort_years <chr>,
    #   label_grad_cohort_years <chr>, geo_level <chr>, label_geo_level <chr>,
    #   geography <chr>, label_geography <chr>, ind_level <chr>, ...
  • Clean the data as needed using dplyr and related tidyverse packages.

  • Provide a narrative about the data set (look it up if you aren’t sure what you have got) and the variables in your dataset, including what type of data each variable is. The goal of this step is to communicate in a visually appealing way to non-experts - not to replicate r-code.

  • Identify potential research questions that your dataset can help answer.
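One cleaning step suggested by the printed tibble is removing the embedded line breaks: a column name appears as `Degree\r\nAward\r\nLevel`, and some label values contain `\r\n` as well. The sketch below uses base R on a small made-up data frame; the object `toy`, its values, and the helper `squash_breaks` are hypothetical and only illustrate the idea.

```r
# Hypothetical miniature of the imported sheet. The column name and one label
# value contain the "\r\n" line breaks visible in the read_excel() output.
toy <- data.frame(
  code = c("A", "B"),  # placeholder values, not real codes
  label_degree_level = c("Baccalaureate", "Doctoral -\r\nResearch/Scholarship"),
  stringsAsFactors = FALSE
)
names(toy)[1] <- "Degree\r\nAward\r\nLevel"

# Replace each run of carriage returns / newlines with a single space.
squash_breaks <- function(x) gsub("[\r\n]+", " ", x)

names(toy) <- squash_breaks(names(toy))
toy$label_degree_level <- squash_breaks(toy$label_degree_level)

names(toy)
toy$label_degree_level
```

The same helper could be applied to the real column names and to any label column, assuming the line breaks carry no meaning beyond spreadsheet formatting.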

## Commentary

For Homework 2, I used the Census Bureau Post-Secondary Employment Outcomes (PSEO) database, filtered for Massachusetts. It reports employment outcomes (earnings 1, 5 and 10 years after graduation) for all public Massachusetts higher education institutions, covering graduates at every credential level from certificate to doctorate.

## Potential Research Questions:

What is the relationship between credential level and earnings? Do earnings grow over time?
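A first pass at this question could compare typical earnings by credential level at each time point. The sketch below is a minimal base R illustration on made-up rows; the data frame `toy` and its values are hypothetical, and only the column names mirror the dataset.

```r
# Hypothetical rows. Earnings are stored as text, as in the source sheet.
toy <- data.frame(
  label_degree_level = c("Associates", "Associates", "Baccalaureate",
                         "Baccalaureate", "Masters", "Masters"),
  y1_p50_earnings = c("30000", "34000", "36000", "40000", "55000", NA),
  y5_p50_earnings = c("41000", "45000", "52000", "58000", "70000", "74000"),
  stringsAsFactors = FALSE
)

# Convert the text columns to numbers before summarizing.
toy$y1 <- as.numeric(toy$y1_p50_earnings)
toy$y5 <- as.numeric(toy$y5_p50_earnings)

# Median earnings per credential level, ignoring missing values.
y1_by_level <- tapply(toy$y1, toy$label_degree_level, median, na.rm = TRUE)
y5_by_level <- tapply(toy$y5, toy$label_degree_level, median, na.rm = TRUE)

# Change between year 1 and year 5 for each level.
growth <- y5_by_level - y1_by_level
growth
```

On the real data the same pattern would group by `label_degree_level`, although the rows mix several aggregation levels, so some filtering would likely be needed first.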

Questions for further research: What is the relationship between race/ethnicity, age, Pell status and earnings?

## Notes on Code

All figures below summarize the median-earnings columns (y1_p50_earnings and y5_p50_earnings). Across all graduates, the mean one year after graduation was 41256 and the median was lower at 36197, which indicates a right-skewed distribution. Five years after graduation the mean was 57654 and the median 53678, again a right skew. For Baccalaureates, the first-year mean was 39966 and the median 36050. A mean above the median tells me that most observations sit toward the lower end of the number line, with a long tail of higher earnings pulling the mean up. The fifth-year Baccalaureate mean of 58696 was again higher than the median of 54870, again a right skew.

Baccalaureate earnings in the first year after graduation ranged from a minimum of 19825 to a maximum of 110317. In the fifth year both were higher, at 29649 and 120707. For the entire sample, the first-year minimum and maximum were 19766 and 147728, and the fifth-year minimum and maximum were 28528 and 189604. This suggests that earnings increased with the time elapsed since graduation. The wider range in the full sample than among Baccalaureates alone also suggests that earnings vary by credential level.
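A quick way to check the direction of skew is to compare each mean with its median: when the mean is larger, the long tail is on the right (high-earnings) side. The sketch below re-enters the reported means and medians by hand and labels each gap. It is a rule-of-thumb check only, not a formal skewness statistic, and the data frame `reported` is an illustrative name.

```r
# Means and medians copied by hand from the summary output in this post.
reported <- data.frame(
  group  = c("All, year 1", "All, year 5",
             "Baccalaureate, year 1", "Baccalaureate, year 5"),
  mean   = c(41256, 57654, 39966, 58696),
  median = c(36197, 53678, 36050, 54870)
)

# Positive gap: mean above median, long tail toward high values.
reported$gap <- reported$mean - reported$median
reported$direction <- ifelse(
  reported$gap > 0, "right (positive) skew",
  ifelse(reported$gap < 0, "left (negative) skew", "roughly symmetric")
)

reported
```

A histogram of the numeric earnings vector would be a more direct check than this comparison.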

There were 12639 rows and 48 columns in the dataset.

For reference, the summary of first-year earnings for Baccalaureate graduates (y1a_num) was:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  19825   31970   36050   39966   43117  110317    2714
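The NA's entry in these summaries counts values that could not be converted to numbers: the earnings columns are read in as character, and `as.numeric()` turns anything it cannot parse into `NA`. The sketch below shows that behavior on a made-up vector. The entries in `raw` are hypothetical, since the post does not show what the non-numeric values in the real sheet look like.

```r
# Hypothetical text column. The real non-numeric entries are not shown here.
raw <- c("36050", "43117", "", "N/A", "110317")

# as.numeric() returns NA for anything it cannot parse and may emit the
# "NAs introduced by coercion" warning, which is silenced here on purpose.
num <- suppressWarnings(as.numeric(raw))

# Inspect which original entries became NA before trusting the summary.
failed <- raw[is.na(num)]
failed

# Share of missing first-year values in the full data, using the NA count
# (9174) and row count (12639) reported in this post.
share_missing_y1 <- 9174 / 12639
round(share_missing_y1, 3)

summary(num)
```

Looking at the failed entries first helps distinguish genuinely suppressed or missing values from formatting problems such as thousands separators.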

## Code

Code
#list the unique degree levels
unique(pseo_ma_SANDBOX$label_degree_level)
[1] "Certificate < 1 year"                "Associates"                         
[3] "Baccalaureate"                       "Masters"                            
[5] "Doctoral -\r\nResearch/Scholarship"  "Doctoral -\r\nProfessional Practice"
[7] "Certificate 1-2 years"               "Certificate 2-4 years"              
Code
#convert to numeric
y1_num<- as.numeric(pseo_ma_SANDBOX$y1_p50_earnings) 
Warning: NAs introduced by coercion
Code
#find summary statistics on whole data set, 1 year after graduation
mean(y1_num, na.rm=TRUE)
[1] 41256.48
Code
median(y1_num, na.rm=TRUE)
[1] 36197
Code
min(y1_num, na.rm=TRUE)
[1] 19766
Code
max(y1_num, na.rm=TRUE)
[1] 147728
Code
IQR(y1_num, na.rm=TRUE)
[1] 15153
Code
#find summary statistics on whole data set, 5 years after graduation
y5_num<-as.numeric(pseo_ma_SANDBOX$y5_p50_earnings)
Warning: NAs introduced by coercion
Code
mean(y5_num, na.rm=TRUE)
[1] 57653.91
Code
median(y5_num, na.rm=TRUE)
[1] 53677.5
Code
min(y5_num, na.rm=TRUE)
[1] 28528
Code
max(y5_num, na.rm=TRUE)
[1] 189604
Code
IQR(y5_num, na.rm=TRUE)
[1] 18766.25
Code
summary(pseo_ma_SANDBOX_num)
Error in summary(pseo_ma_SANDBOX_num): object 'pseo_ma_SANDBOX_num' not found
Code
print(typeof(pseo_ma_SANDBOX$y1_p50_earnings))
[1] "character"
Code
#install dplyr once from the console, not when rendering the post
#install.packages("dplyr")
Code
library(dplyr)
Warning: package 'dplyr' was built under R version 4.1.3

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
summary(y1_num)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  19766   31280   36197   41256   46433  147728    9174 
Code
summary(y5_num)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  28528   46162   53678   57654   64928  189604    9495 
Code
#find the number of rows and columns in the dataset
nrow(pseo_ma_SANDBOX)
[1] 12639
Code
ncol(pseo_ma_SANDBOX)
[1] 48
Code
#filter for Baccalaureate degrees
pseo_filtered <- filter(pseo_ma_SANDBOX, label_degree_level == "Baccalaureate")
View(pseo_filtered)

#find summary statistics of filtered data-- just BA degrees
y1a_num<- as.numeric(pseo_filtered$y1_p50_earnings) 
Warning: NAs introduced by coercion
Code
y1a_num %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  19825   31970   36050   39966   43117  110317    2714 
Code
y5a_num <- as.numeric(pseo_filtered$y5_p50_earnings)
Warning: NAs introduced by coercion
Code
y5a_num %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  29649   48861   54870   58696   63958  120707    3191
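The year-one and year-five summaries above can be put side by side to quantify the change. The sketch below re-enters the reported medians by hand (they are copied from the output in this post, not recomputed from the spreadsheet) and derives the absolute and percentage increase. The set of rows with non-missing values differs between the two years, as the NA counts show, so this is a rough comparison; the data frame `medians` is an illustrative name.

```r
# Medians copied by hand from the summary() output shown in this post.
medians <- data.frame(
  group = c("All graduates", "Baccalaureate"),
  year1 = c(36197, 36050),
  year5 = c(53678, 54870)
)

# Absolute and percentage change from year 1 to year 5.
medians$change <- medians$year5 - medians$year1
medians$pct_change <- round(100 * medians$change / medians$year1, 1)

medians
```

Restricting both years to rows that have a value in each column would make the comparison cleaner.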