This HW4 reads in data, computes descriptive statistics, and creates visualizations
This is K Martins’s submission for Homework Four in DACSS 601. In it I read in a baseball player biographical dataset from retrosheet.org. Note that, compared to the previous homeworks, I now look only at the 4 years prior to World War 2 (1938-1941), the 4 years of US involvement in World War 2 (1942-1945), and the 4 years after World War 2 (1946-1949).
Then I compute descriptive statistics: means, medians, and standard deviations (SDs). Primarily, I am looking at how debut ages, years of baseball play, and Hall of Fame status were affected (or not) by the Second World War. I also create 2 visualizations using the ggplot2 package and explain each visualization.
Here I read in the baseball biographical file from retrosheet.org that I saved within my set working directory.
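library(dplyr)      # select(), filter(), summarise()
library(lubridate)  # mdy()
library(ggplot2)    # plots
# Read in Baseball Biographical File dataset
biographic_baseball <- read.csv("BIOFILE_RETROSHEET_kjmcleaner.csv", header = TRUE, sep = ",")
head(biographic_baseball)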
  PLAYERID    LAST            FIRST NICKNAME  BIRTHDATE Alt_BIRTHDATE
1 aardd001 Aardsma      David Allan    David 12/27/1981    12/27/1981
2 aaroh101   Aaron      Henry Louis     Hank 02/05/1934    02/05/1934
3 aarot101   Aaron       Tommie Lee   Tommie 08/05/1939    08/05/1939
4 aased001    Aase   Donald William      Don 09/08/1954    09/08/1954
5 abada001    Abad    Fausto Andres     Andy 08/25/1972    08/25/1972
6 abadf001    Abad Fernando Antonio Fernando 12/17/1985    12/17/1985
  BIRTH.CITY BIRTH.STATE      BIRTH.COUNTRY PLAY.DEBUT Alt_DEBUTDATE
1     Denver    Colorado                USA 04/06/2004    04/06/2004
2     Mobile     Alabama                USA 04/13/1954    04/13/1954
3     Mobile     Alabama                USA 04/10/1962    04/10/1962
4     Orange  California                USA 07/26/1977    07/26/1977
5 Palm Beach     Florida                USA 09/10/2001    09/10/2001
6  La Romana   La Romana Dominican Republic 07/28/2010    07/28/2010
  DEBUT_AGE PLAY.LASTGAME Alt_LASTGAME Complete_Years_Active
1        22    08/23/2015   08/23/2015                    11
2        20    10/03/1976   10/03/1976                    22
3        22    09/26/1971   09/26/1971                     9
4        22    10/03/1990   10/03/1990                    13
5        29    04/13/2006   04/13/2006                     4
6        24    10/01/2021   10/01/2021                    11
  MGR.DEBUT MGR.LASTGAME COACH.DEBUT COACH.LASTGAME UMP.DEBUT
1
2
3                         04/06/1979     09/30/1984
4
5
6
  UMP.LASTGAME  DEATHDATE DEATH.CITY DEATH.STATE DEATH.COUNTRY BATS
1                                                                 R
2              01/22/2021    Atlanta     Georgia           USA    R
3              08/16/1984    Atlanta     Georgia           USA    R
4                                                                 R
5                                                                 L
6                                                                 L
  THROWS HEIGHT WEIGHT           CEMETERY CEME.CITY CEME.STATE
1      R   6-05    200
2      R   6-00    180 Southview Cemetery   Atlanta    Georgia
3      R   6-03    190  Catholic Cemetery    Mobile    Alabama
4      R   6-03    190
5      L   6-01    184
6      L   6-02    205
  CEME.COUNTRY CEME.NOTE BIRTH.NAME NAME.CHG BAT.CHG HOF
1                                         NA      NA NOT
2          USA                            NA      NA HOF
3          USA                            NA      NA NOT
4                                         NA      NA NOT
5                                         NA      NA NOT
6                                         NA      NA NOT
str(biographic_baseball)
'data.frame': 21574 obs. of 38 variables:
$ PLAYERID : chr "aardd001" "aaroh101" "aarot101" "aased001" ...
$ LAST : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
$ FIRST : chr "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
$ NICKNAME : chr "David" "Hank" "Tommie" "Don" ...
$ BIRTHDATE : chr "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
$ Alt_BIRTHDATE : chr "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
$ BIRTH.CITY : chr "Denver" "Mobile" "Mobile" "Orange" ...
$ BIRTH.STATE : chr "Colorado" "Alabama" "Alabama" "California" ...
$ BIRTH.COUNTRY : chr "USA" "USA" "USA" "USA" ...
$ PLAY.DEBUT : chr "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
$ Alt_DEBUTDATE : chr "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
$ DEBUT_AGE : chr "22" "20" "22" "22" ...
$ PLAY.LASTGAME : chr "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
$ Alt_LASTGAME : chr "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
$ Complete_Years_Active: int 11 22 9 13 4 11 0 13 4 4 ...
$ MGR.DEBUT : chr "" "" "" "" ...
$ MGR.LASTGAME : chr "" "" "" "" ...
$ COACH.DEBUT : chr "" "" "04/06/1979" "" ...
$ COACH.LASTGAME : chr "" "" "09/30/1984" "" ...
$ UMP.DEBUT : chr "" "" "" "" ...
$ UMP.LASTGAME : chr "" "" "" "" ...
$ DEATHDATE : chr "" "01/22/2021" "08/16/1984" "" ...
$ DEATH.CITY : chr "" "Atlanta" "Atlanta" "" ...
$ DEATH.STATE : chr "" "Georgia" "Georgia" "" ...
$ DEATH.COUNTRY : chr "" "USA" "USA" "" ...
$ BATS : chr "R" "R" "R" "R" ...
$ THROWS : chr "R" "R" "R" "R" ...
$ HEIGHT : chr "6-05" "6-00" "6-03" "6-03" ...
$ WEIGHT : int 200 180 190 190 184 205 192 170 175 169 ...
$ CEMETERY : chr "" "Southview Cemetery" "Catholic Cemetery" "" ...
$ CEME.CITY : chr "" "Atlanta" "Mobile" "" ...
$ CEME.STATE : chr "" "Georgia" "Alabama" "" ...
$ CEME.COUNTRY : chr "" "USA" "USA" "" ...
$ CEME.NOTE : chr "" "" "" "" ...
$ BIRTH.NAME : chr "" "" "" "" ...
$ NAME.CHG : logi NA NA NA NA NA NA ...
$ BAT.CHG : logi NA NA NA NA NA NA ...
$ HOF : chr "NOT" "HOF" "NOT" "NOT" ...
# Select only the variables necessary for my questions.
biographic_baseball_small <- dplyr::select(biographic_baseball, PLAYERID, LAST, FIRST,
                                           NICKNAME, BIRTHDATE, BIRTH.CITY,
                                           BIRTH.STATE, BIRTH.COUNTRY, PLAY.DEBUT,
                                           DEBUT_AGE, PLAY.LASTGAME,
                                           Complete_Years_Active, BATS, THROWS, HOF)
In this data-wrangling step I filter for debuts in the 4 years prior to World War 2 (1938-1941), the 4 years of World War 2 (1942-1945), and the 4 years after World War 2 (1946-1949).
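# Filter for Wartime Debut Dates as well as Dates 4 Seasons Prior and After
# The pattern is assembled with paste0() so that no line break ends up inside the
# regex; a break after '1943' would stop that alternative from matching any date
ww_biogrpahic_baseball <- dplyr::filter(biographic_baseball_small,
                                        grepl(paste0('1938|1939|1940|1941|1942|1943',
                                                     '|1944|1945|1946|1947|1948|1949'),
                                              PLAY.DEBUT))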
# Tidying DEBUT_AGE and Complete_Years_Active to Numeric from Character
ww_biogrpahic_baseball$DEBUT_AGE <- as.numeric(as.character(
  ww_biogrpahic_baseball$DEBUT_AGE))
ww_biogrpahic_baseball$Complete_Years_Active <- as.numeric(as.character(
  ww_biogrpahic_baseball$Complete_Years_Active))
# Parse PLAY.DEBUT from MM/DD/YYYY text into a Date
ww_biogrpahic_baseball$PLAY.DEBUT <- lubridate::mdy(ww_biogrpahic_baseball$PLAY.DEBUT)
head(ww_biogrpahic_baseball)
  PLAYERID      LAST              FIRST NICKNAME  BIRTHDATE
1 aberc101   Aberson Clifford Alexander    Cliff 08/28/1921
2 abert102 Abernathy Talmadge Lafayette      Ted 10/30/1921
3 aberw101 Abernathy     Virgil Woodrow    Woody 02/01/1915
4 abrac101    Abrams        Calvin Ross      Cal 03/02/1924
5 abrej101     Abreu    Joseph Lawrence      Joe 05/24/1913
6 adama101     Adams       Ace Townsend      Ace 03/02/1910
    BIRTH.CITY    BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT DEBUT_AGE
1      Chicago       Illinois           USA 1947-07-18        25
2        Bynum North Carolina           USA 1942-09-19        20
3  Forest City North Carolina           USA 1946-07-28        31
4 Philadelphia   Pennsylvania           USA 1949-04-19        25
5      Oakland     California           USA 1942-04-23        28
6      Willows     California           USA 1941-04-15        31
  PLAY.LASTGAME Complete_Years_Active BATS THROWS HOF
1    05/09/1949                     1    R      R NOT
2    04/29/1944                     1    R      L NOT
3    04/17/1947                     0    L      L NOT
4    05/09/1956                     7    L      L NOT
5    07/11/1942                     0    R      R NOT
6    04/24/1946                     5    R      R NOT
# Split into pre-war, post-war, and wartime debuts.
# grepl() coerces the Date column to "YYYY-MM-DD" text, so matching on the year still works.
preWW2_biogrpahic_baseball <- dplyr::filter(ww_biogrpahic_baseball,
                                            grepl('1938|1939|1940|1941', PLAY.DEBUT))
postWW2_biogrpahic_baseball <- dplyr::filter(ww_biogrpahic_baseball,
                                             grepl('1946|1947|1948|1949', PLAY.DEBUT))
duringWW2_biogrpahic_baseball <- dplyr::filter(ww_biogrpahic_baseball,
                                               grepl('1942|1943|1944|1945', PLAY.DEBUT))
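The three grepl() filters above can also be expressed as a single period label derived from the debut year, which keeps all three groups in one data frame. The following is a minimal sketch in base R, assuming PLAY.DEBUT has already been parsed to a Date. The toy data frame `debuts`, the helper `label_period()`, and the `PERIOD` column are hypothetical names that are not part of the original analysis.

```r
# Hypothetical helper: map a Date vector of debut dates to a WW2 period label.
label_period <- function(debut_date) {
  debut_year <- as.integer(format(debut_date, "%Y"))
  # Intervals are (1937,1941], (1941,1945], (1945,1949]; other years become NA
  cut(debut_year,
      breaks = c(1937, 1941, 1945, 1949),
      labels = c("pre-WW2", "during WW2", "post-WW2"))
}

# Toy data standing in for ww_biogrpahic_baseball (values are illustrative only)
debuts <- data.frame(
  PLAYERID   = c("p1", "p2", "p3", "p4"),
  PLAY.DEBUT = as.Date(c("1938-04-19", "1943-04-21", "1946-07-28", "1955-04-12"))
)
debuts$PERIOD <- label_period(debuts$PLAY.DEBUT)

# Rows outside 1938-1949 get NA and can be dropped in one step
debuts_ww <- debuts[!is.na(debuts$PERIOD), ]
print(table(debuts_ww$PERIOD))
```

Working from the numeric year rather than a text pattern avoids any dependence on how the date happens to be formatted.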
For each period, the mean debut age and mean years active were calculated.
Mean debut age rose from 23.80 in the pre-WW2 period to 24.91 in the wartime years, and stayed close to that level in the post-WW2 period (24.77).
Mean years active was lower for wartime debuts (3.11) than for pre-WW2 (5.54) and post-WW2 (4.83) debuts.
#Pre WW2 Means of Debut Age and Years Active
summarise(preWW2_biogrpahic_baseball,
          mean_preWW2_Debut_Age = mean(DEBUT_AGE),
          mean_preWW2_Complete_Years_Active = mean(Complete_Years_Active))
  mean_preWW2_Debut_Age mean_preWW2_Complete_Years_Active
1              23.79565                          5.541304
#During WW2 Means of Debut Age and Years Active
summarise(duringWW2_biogrpahic_baseball,
          mean_duringWW2_Debut_Age = mean(DEBUT_AGE),
          duringWW2_Complete_Years_Active = mean(Complete_Years_Active))
  mean_duringWW2_Debut_Age duringWW2_Complete_Years_Active
1                 24.90698                        3.108527
#Post WW2 Means of Debut Age and Years Active
summarise(postWW2_biogrpahic_baseball,
          mean_postWW2_Debut_Age = mean(DEBUT_AGE),
          postWW2_Complete_Years_Active = mean(Complete_Years_Active))
  mean_postWW2_Debut_Age postWW2_Complete_Years_Active
1               24.77129                      4.827251
For each period, the median debut age and median years active were calculated.
Median debut age was 2 years higher in the wartime years (25) than in the pre-WW2 period (23), with no change between the wartime and post-WW2 periods (both 25).
Median years active was much lower for wartime debuts (1) than for pre-WW2 (5) and post-WW2 (4) debuts.
#Pre WW2 Median of Debut Age and Years Active
summarise(preWW2_biogrpahic_baseball,
          median_preWW2_Debut_Age = median(DEBUT_AGE),
          median_preWW2_Complete_Years_Active = median(Complete_Years_Active))
  median_preWW2_Debut_Age median_preWW2_Complete_Years_Active
1                      23                                   5
#During WW2 Median of Debut Age and Years Active
summarise(duringWW2_biogrpahic_baseball,
          median_duringWW2_Debut_Age = median(DEBUT_AGE),
          duringWW2_Complete_Years_Active = median(Complete_Years_Active))
  median_duringWW2_Debut_Age duringWW2_Complete_Years_Active
1                         25                               1
#Post WW2 Median of Debut Age and Years Active
summarise(postWW2_biogrpahic_baseball,
          median_postWW2_Debut_Age = median(DEBUT_AGE),
          postWW2_Complete_Years_Active = median(Complete_Years_Active))
  median_postWW2_Debut_Age postWW2_Complete_Years_Active
1                       25                             4
For each period, the standard deviation of debut age and years active was calculated.
The standard deviation of debut age was higher for wartime debuts (4.13) than for pre-WW2 (2.91) and post-WW2 (3.30) debuts.
In contrast, the standard deviation of years active was lower for wartime debuts (4.50) than for post-WW2 debuts (4.96). The pre-WW2 years-active value printed below (5) equals the pre-WW2 median rather than a standard deviation, so it needs to be regenerated before it can be compared.
#Pre WW2 SD of Debut Age and Years Active
# sd() is used for both columns; median() on years active would only repeat the pre-WW2 median
summarise(preWW2_biogrpahic_baseball,
          sd_preWW2_Debut_Age = sd(DEBUT_AGE),
          sd_preWW2_Complete_Years_Active = sd(Complete_Years_Active))
  sd_preWW2_Debut_Age sd_preWW2_Complete_Years_Active
1            2.911472                               5
#During WW2 SD of Debut Age and Years Active
summarise(duringWW2_biogrpahic_baseball,
          sd_duringWW2_Debut_Age = sd(DEBUT_AGE),
          duringWW2_Complete_Years_Active = sd(Complete_Years_Active))
  sd_duringWW2_Debut_Age duringWW2_Complete_Years_Active
1                4.13147                        4.504299
#Post WW2 SD of Debut Age and Years Active
summarise(postWW2_biogrpahic_baseball,
          sd_postWW2_Debut_Age = sd(DEBUT_AGE),
          postWW2_Complete_Years_Active = sd(Complete_Years_Active))
  sd_postWW2_Debut_Age postWW2_Complete_Years_Active
1             3.295415                      4.961494
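The nine summarise() calls above repeat the same three statistics for each period, which makes it easy for one call to pick up the wrong function. As a hedged alternative, the sketch below computes the mean, median, and SD per period in one pass with base R's aggregate(). It assumes a period label is available as a column; the toy data frame `players`, the `PERIOD` column, and the helper `describe()` are hypothetical and the values are illustrative only.

```r
# Toy data standing in for ww_biogrpahic_baseball with a PERIOD column
# (values are illustrative only, not taken from the Retrosheet file)
players <- data.frame(
  PERIOD    = c("pre", "pre", "pre", "during", "during", "during"),
  DEBUT_AGE = c(22, 24, 26, 20, 25, 30),
  Complete_Years_Active = c(4, 6, 8, 0, 1, 5)
)

# Hypothetical helper returning the three statistics for one numeric vector
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
}

# One row per period; each result column is a matrix with mean/median/sd
age_stats   <- aggregate(DEBUT_AGE ~ PERIOD, data = players, FUN = describe)
years_stats <- aggregate(Complete_Years_Active ~ PERIOD, data = players, FUN = describe)
print(age_stats)
print(years_stats)
```

Because every period goes through the same `describe()` helper, the three statistics cannot drift apart between periods. The same idea can be written with dplyr's group_by() followed by summarise().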
I have used ggplot2 to show Debut Age and Debut Date on a scatterplot.
The variables are Debut Age and Debut Date.
I am attempting to answer whether players who debuted during the wartime period were older than those who debuted in the pre- and post-war periods, whether because of the draft or a sense of obligation to join the armed forces.
Unintentionally, I also found that the 1943 season appears to have started late, as suggested by the gap in the scatterplot. According to baseball-almanac.com, the first game was in August, whereas the baseball season typically starts in April. I have also concluded that the clusters of ages did tend to be older during the wartime seasons of 1942-1945, but the difference was not glaringly obvious.
While this does get at the age of the player, it does not speak to the quality of the players who debuted. My dataset has information on Hall of Fame status as well as active years in the league, and I hope to explore these.
It is unclear what the draft or minimum age requirements were (for example, it looks like a 15-year-old debuted: Joe Nuxhall). It may have been naive to look only at older ages; I should also have looked at ages below the draft age of 21 and the minimum volunteering age of 18.
It is unclear exactly which season each of the clusters refers to, so I should break the x-axis into one-year intervals.
#Scatter plot of Debut Age and Debut Date
ggplot(ww_biogrpahic_baseball, aes(PLAY.DEBUT, DEBUT_AGE)) + geom_point()
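The limitations noted above (no yearly intervals, no reference to the ages of 18 and 21) could be addressed with a few extra ggplot2 layers. The following is a sketch only, using a toy data frame `debuts` with illustrative values in place of ww_biogrpahic_baseball; the object name `debut_plot` and the axis labels are assumptions, not part of the original plot.

```r
library(ggplot2)

# Toy data standing in for ww_biogrpahic_baseball (illustrative values only)
debuts <- data.frame(
  PLAY.DEBUT = as.Date(c("1939-04-18", "1941-09-10", "1943-05-02",
                         "1944-06-10", "1946-04-16", "1948-07-04")),
  DEBUT_AGE  = c(23, 21, 27, 15, 25, 24)
)

debut_plot <- ggplot(debuts, aes(x = PLAY.DEBUT, y = DEBUT_AGE)) +
  geom_point(alpha = 0.6) +
  # Dashed reference lines at the ages mentioned in the text (18 and 21)
  geom_hline(yintercept = c(18, 21), linetype = "dashed") +
  # One labelled tick per year so each season's cluster can be identified
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(x = "Debut date", y = "Debut age",
       title = "Debut age by debut date, 1938-1949")

print(debut_plot)
```

With yearly ticks, each vertical cluster of points can be read as one season, and debuts below either dashed line stand out immediately.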
I have used ggplot2 to display the Hall of Famers within the dataset.
The variable I am visualizing is Hall of Fame status for the players in my dataset who debuted between 1938 and 1949.
I am attempting to answer the quality question: how good were the players who debuted in the period I am looking at? Whether a player is in the Hall of Fame is an indicator of success.
I can only conclude that there were HOF players who debuted in this time period.
I am not able to see in which years these Hall of Famers debuted, or whether the number of Hall of Famers changed in wartime compared to peacetime.
It is unclear how many total players there are, and how many are HOF and non-HOF players.
I would bucket debut years into pre-WW2, during WW2, and post-WW2 to compare the numbers of players, along with whether or not they became Hall of Famers. I would also include counts at the top of each bar.
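The improvement described above (period buckets with counts on top of each bar) might look roughly like the sketch below. It assumes a period label is available per player; the toy data frame `players`, the `PERIOD` column, and the object names `hof_counts` and `hof_plot` are hypothetical, and the counts are illustrative rather than taken from the Retrosheet data.

```r
library(ggplot2)

# Toy data standing in for ww_biogrpahic_baseball (illustrative values only)
players <- data.frame(
  PERIOD = factor(c("pre-WW2", "pre-WW2", "pre-WW2", "during WW2",
                    "during WW2", "post-WW2", "post-WW2", "post-WW2"),
                  levels = c("pre-WW2", "during WW2", "post-WW2")),
  HOF = c("HOF", "NOT", "NOT", "NOT", "NOT", "HOF", "NOT", "NOT")
)

# Count players per period and Hall of Fame status before plotting;
# combinations with no players are kept with a count of 0
hof_counts <- as.data.frame(table(PERIOD = players$PERIOD, HOF = players$HOF))

hof_plot <- ggplot(hof_counts, aes(x = PERIOD, y = Freq, fill = HOF)) +
  geom_col(position = position_dodge(width = 0.9)) +
  # Print each count just above its bar
  geom_text(aes(label = Freq),
            position = position_dodge(width = 0.9), vjust = -0.4) +
  labs(x = "Debut period", y = "Number of players",
       fill = "Hall of Fame status")

print(hof_plot)
```

Counting first and then plotting with geom_col() keeps the numbers behind the bars available as a small table, which also answers the question of how many HOF and non-HOF players fall in each period.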
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Martins (2022, Jan. 11). Data Analytics and Computational Social Science: Homework 4 Read, Stats, Visualizations. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8854623/
BibTeX citation
@misc{martins2022homework,
  author = {Martins, K},
  title = {Data Analytics and Computational Social Science: Homework 4 Read, Stats, Visualizations},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8854623/},
  year = {2022}
}