For this homework, your goal is to read in a more complicated dataset. Please use the category tag “hw2” as well as a tag for the dataset you choose to use.
Read in a dataset. It’s strongly recommended that you choose a dataset you’re considering using for the final project. If you decide to use one of the datasets we have provided, please use a challenging dataset - check with us if you are not sure.
# A tibble: 12,639 x 48
agg_level_pseo label_agg_level_pseo inst_level label_inst_level institution
<dbl> <chr> <chr> <chr> <chr>
1 38 "Degree Level *\r\nIn~ I Institution 00216100
2 38 "Degree Level *\r\nIn~ I Institution 00216100
3 38 "Degree Level *\r\nIn~ I Institution 00216100
4 38 "Degree Level *\r\nIn~ I Institution 00216100
5 38 "Degree Level *\r\nIn~ I Institution 00216100
6 38 "Degree Level *\r\nIn~ I Institution 00216100
7 38 "Degree Level *\r\nIn~ I Institution 00216700
8 38 "Degree Level *\r\nIn~ I Institution 00216700
9 38 "Degree Level *\r\nIn~ I Institution 00216700
10 38 "Degree Level *\r\nIn~ I Institution 00216800
# i 12,629 more rows
# i 43 more variables: label_institution <chr>,
# `Degree\r\nAward\r\nLevel` <chr>, label_degree_level <chr>,
# cip_level <chr>, label_cip_level <chr>, cipcode <chr>, label_cipcode <chr>,
# grad_cohort <chr>, label_grad_cohort <chr>, grad_cohort_years <chr>,
# label_grad_cohort_years <chr>, geo_level <chr>, label_geo_level <chr>,
# geography <chr>, label_geography <chr>, ind_level <chr>, ...
Clean the data as needed using dplyr and related tidyverse packages.
Provide a narrative about the data set (look it up if you aren’t sure what you have got) and the variables in your dataset, including what type of data each variable is. The goal of this step is to communicate in a visually appealing way to non-experts - not to replicate r-code.
Identify potential research questions that your dataset can help answer.
## Commentary
For Homework 2, I used the Census Bureau Post-Secondary Employment Outcomes (PSEO) database, filtered for Massachusetts. This database reports employment outcomes (earnings 1, 5 and 10 years after graduation) for all public Massachusetts higher education institutions. The data set covers graduates at all credential levels, from certificate to doctorate.
## Potential Research Questions:
What is the relationship between credential level and earnings? Do earnings grow over time?
Questions for further research: What is the relationship between race/ethnicity, age, Pell status and earnings?
## Notes on Code
For all graduates, mean earnings one year after graduation were 41256, above the median of 36197, which indicates a right-skewed distribution. Five years after graduation the mean was 57654 and the median 53678, again a right skew. For Baccalaureates, the mean one year after graduation was 39966 and the median 36050. This reveals to me that the data were skewed to the right: most observations sit toward the lower end of the number line, with a tail of higher earners pulling the mean up. Five years after graduation, the Baccalaureate mean of 58696 was again higher than the median of 54870, again representing a right skew. Baccalaureate earnings one year after graduation ranged from a minimum of 19825 to a maximum of 110317; five years after graduation both were higher, at 29649 and 120707. For the entire sample, the minimum and maximum one year after graduation were 19766 and 147728, and for fifth-year earnings they were 28528 and 189604. This suggests that earnings increased with time elapsed since graduation. The wider range in the full sample than among Baccalaureates alone is also consistent with earnings varying by credential level.
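The mean-versus-median comparison can be checked on a small made-up vector. The values below are hypothetical and are not PSEO figures; the sketch only illustrates that when a few high values pull the mean above the median, the long tail is on the high side, which is a right (positive) skew.

```r
# Hypothetical earnings, NOT taken from the PSEO file: most values are
# clustered low, with one high earner forming a long right tail.
toy_earnings <- c(30000, 32000, 35000, 36000, 38000, 41000, 110000)

toy_mean   <- mean(toy_earnings)
toy_median <- median(toy_earnings)

# A mean above the median suggests a right (positive) skew.
skew_direction <- if (toy_mean > toy_median) "right" else "left or symmetric"

toy_mean
toy_median
skew_direction
```

A histogram of the real earnings column would be a more direct check than the mean/median rule of thumb.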
There were 12639 rows and 48 columns in the dataset.
summary(y1a_num)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
19825 31970 36050 39966 43117 110317 2714
## Code
# list the unique degree levels
unique(pseo_ma_SANDBOX$label_degree_level)
# convert to numeric
y1_num <- as.numeric(pseo_ma_SANDBOX$y1_p50_earnings)
Warning: NAs introduced by coercion
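The coercion warning appears because `as.numeric()` returns `NA` for any string it cannot parse as a number. Below is a minimal sketch of how one might inspect which entries fail to convert, using a made-up character vector. The placeholder strings are assumptions; the actual non-numeric entries in the PSEO file may differ.

```r
# Hypothetical character column; "" and "suppressed" stand in for whatever
# non-numeric entries the real file contains.
raw_earnings <- c("36197", "41256", "", "suppressed", NA, "53678")

# suppressWarnings() hides the coercion warning once it is expected.
num_earnings <- suppressWarnings(as.numeric(raw_earnings))

# Entries that were present in the raw data but failed to convert.
failed <- raw_earnings[is.na(num_earnings) & !is.na(raw_earnings)]

unique(failed)
sum(is.na(num_earnings))
```

Running the same check on `pseo_ma_SANDBOX$y1_p50_earnings` would show whether the `NA`s come from blank cells or from some other marker.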
# find summary statistics on whole data set, 1 year after graduation
mean(y1_num, na.rm=TRUE)
[1] 41256.48
median(y1_num, na.rm=TRUE)
[1] 36197
min(y1_num, na.rm=TRUE)
[1] 19766
max(y1_num, na.rm=TRUE)
[1] 147728
IQR(y1_num, na.rm=TRUE)
[1] 15153
# find summary statistics on whole data set, 5 years after graduation
y5_num <- as.numeric(pseo_ma_SANDBOX$y5_p50_earnings)
Warning: NAs introduced by coercion
mean(y5_num, na.rm=TRUE)
[1] 57653.91
median(y5_num, na.rm=TRUE)
[1] 53677.5
min(y5_num, na.rm=TRUE)
[1] 28528
max(y5_num, na.rm=TRUE)
[1] 189604
IQR(y5_num, na.rm=TRUE)
[1] 18766.25
summary(pseo_ma_SANDBOX_num)
Error in summary(pseo_ma_SANDBOX_num): object 'pseo_ma_SANDBOX_num' not found
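The error arises because no object named `pseo_ma_SANDBOX_num` was ever created. One way to build a numeric version of the data would be to convert the earnings columns in place with dplyr. The sketch below uses a small stand-in data frame with made-up values; the names `pseo_toy` and `pseo_num` are hypothetical, as is the "Masters" label.

```r
library(dplyr)

# Small stand-in for pseo_ma_SANDBOX; all values are made up.
pseo_toy <- tibble(
  label_degree_level = c("Baccalaureate", "Masters", "Baccalaureate"),
  y1_p50_earnings    = c("36050", "", "43117"),
  y5_p50_earnings    = c("54870", "63958", "")
)

# Convert every column whose name ends in "_earnings" from character
# to numeric; blank strings become NA.
pseo_num <- pseo_toy %>%
  mutate(across(ends_with("_earnings"),
                ~ suppressWarnings(as.numeric(.x))))

summary(pseo_num)
```

Keeping the converted columns inside the data frame, rather than in separate vectors, means later filters and summaries can reuse them.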
print(typeof(pseo_ma_SANDBOX$y1_p50_earnings))
[1] "character"
install.packages("dplyr")
Installing package into 'C:/Users/chion/OneDrive/Documents/R/win-library/4.1'
(as 'lib' is unspecified)
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
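This error typically occurs when `install.packages()` runs non-interactively, for example while a document is being rendered, and no CRAN mirror has been configured. A common workaround, sketched here, is to install only when the package is missing and to name a mirror explicitly. Installing once from the console, outside the document, would also avoid the problem.

```r
# Install dplyr only if it is not already available, naming a CRAN mirror
# explicitly so the call also works in a non-interactive render.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}
library(dplyr)
```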
library(dplyr)
Warning: package 'dplyr' was built under R version 4.1.3
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
summary(y1_num)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
19766 31280 36197 41256 46433 147728 9174
summary(y5_num)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
28528 46162 53678 57654 64928 189604 9495
# find the number of rows and columns in the dataset
nrow(pseo_ma_SANDBOX)
[1] 12639
ncol(pseo_ma_SANDBOX)
[1] 48
# filter for Baccalaureate degrees
pseo_filtered <- filter(pseo_ma_SANDBOX, label_degree_level == "Baccalaureate")
View(pseo_filtered)
# find summary statistics of filtered data -- just BA degrees
y1a_num <- as.numeric(pseo_filtered$y1_p50_earnings)
Warning: NAs introduced by coercion
summary(y1a_num)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
19825 31970 36050 39966 43117 110317 2714
y5a_num <- as.numeric(pseo_filtered$y5_p50_earnings)
summary(y5a_num)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
29649 48861 54870 58696 63958 120707 3191
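The filter-then-summarise steps above handle one credential level at a time. To compare levels directly, which is what the first research question asks, a grouped summary is one option. This is a sketch on made-up rows, not the author's analysis: the "Masters" label, every value, and the names `pseo_toy` and `earnings_by_level` are assumptions for illustration.

```r
library(dplyr)

# Stand-in rows; the "Masters" label and every value here are made up.
pseo_toy <- tibble(
  label_degree_level = c("Baccalaureate", "Baccalaureate", "Masters", "Masters"),
  y1_p50_earnings    = c("36000", "40000", "50000", ""),
  y5_p50_earnings    = c("54000", "58000", "70000", "74000")
)

earnings_by_level <- pseo_toy %>%
  # Convert the character earnings columns to numeric (blanks become NA).
  mutate(across(ends_with("_earnings"),
                ~ suppressWarnings(as.numeric(.x)))) %>%
  # One summary row per credential level.
  group_by(label_degree_level) %>%
  summarise(
    median_y1    = median(y1_p50_earnings, na.rm = TRUE),
    median_y5    = median(y5_p50_earnings, na.rm = TRUE),
    n_missing_y1 = sum(is.na(y1_p50_earnings)),
    .groups = "drop"
  )

earnings_by_level
```

Reporting the count of missing values alongside each median helps show how much of each group the summary actually rests on.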
Source Code
---
title: "Homework 2"
author: "Moira Chiong"
description: "Summary Statistics"
date: "6/12/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
---

# Homework Two: Reading in Data

Sean Conway • Jun 6

**100 points**

```{r}
setwd("C:/Users/chion/OneDrive/Desktop/DACSS 601/DACSS_601_Summer2023_Sec1/posts")
library(readxl)
pseo_ma_SANDBOX <- read_excel("pseo_ma_SANDBOX.xlsx")
pseo_ma_SANDBOX
```