Homework 3: my dataset, reading in the data, and my research question
This is K Martins’s submission for Homework Three in DACSS 601. In it I will read in a baseball player biographical dataset from retrosheet.org.
Then I will explain the variables and select only the information necessary for my potential research questions.
For my final project I want to evaluate the effect of war on baseball. (This is NOT the darling baseball statistic “WAR”, which stands for wins above replacement.) It is relatively unusual that baseball was played throughout both world wars, with only the 1918 season having to be shortened (the Boston Red Sox won that year, with their next title coming, of course, 86 years later in 2004). Using biographical data such as birth date, birth city, debut date, and last game date, I look to evaluate how the world wars affected the types of players in the league, how old they were, and how long they lasted. Were these replacement players? Did they still make the team in the years after the war ended? Where were these players from? To answer these questions I will look at debut dates straddling the US involvement in World War 1 (1917 - 1918) and World War 2 (1941 - 1945).
Here I read in the baseball biographical file from retrosheet.org, which I saved in my working directory. The file is a relatively clean CSV dataset. Its variables are listed below.
# Read in the Baseball Biographical File dataset (comma-separated, with a header row)
biographic_baseball <- read.csv("BIOFILE_RETROSHEET.txt", header = TRUE, sep = ",")
head(biographic_baseball)
  PLAYERID    LAST            FIRST NICKNAME  BIRTHDATE BIRTH.CITY
1 aardd001 Aardsma      David Allan    David 12/27/1981     Denver
2 aaroh101   Aaron      Henry Louis     Hank 02/05/1934     Mobile
3 aarot101   Aaron       Tommie Lee   Tommie 08/05/1939     Mobile
4 aased001    Aase   Donald William      Don 09/08/1954     Orange
5 abada001    Abad    Fausto Andres     Andy 08/25/1972 Palm Beach
6 abadf001    Abad Fernando Antonio Fernando 12/17/1985  La Romana
  BIRTH.STATE      BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME MGR.DEBUT
1    Colorado                USA 04/06/2004    08/23/2015
2     Alabama                USA 04/13/1954    10/03/1976
3     Alabama                USA 04/10/1962    09/26/1971
4  California                USA 07/26/1977    10/03/1990
5     Florida                USA 09/10/2001    04/13/2006
6   La Romana Dominican Republic 07/28/2010    10/01/2021
  MGR.LASTGAME COACH.DEBUT COACH.LASTGAME UMP.DEBUT UMP.LASTGAME
1
2
3               04/06/1979     09/30/1984
4
5
6
   DEATHDATE DEATH.CITY DEATH.STATE DEATH.COUNTRY BATS THROWS HEIGHT
1                                                    R      R   6-05
2 01/22/2021    Atlanta     Georgia           USA    R      R   6-00
3 08/16/1984    Atlanta     Georgia           USA    R      R   6-03
4                                                    R      R   6-03
5                                                    L      L   6-01
6                                                    L      L   6-02
  WEIGHT           CEMETERY CEME.CITY CEME.STATE CEME.COUNTRY
1    200
2    180 Southview Cemetery   Atlanta    Georgia          USA
3    190  Catholic Cemetery    Mobile    Alabama          USA
4    190
5    184
6    205
  CEME.NOTE BIRTH.NAME NAME.CHG BAT.CHG HOF
1                            NA      NA NOT
2                            NA      NA HOF
3                            NA      NA NOT
4                            NA      NA NOT
5                            NA      NA NOT
6                            NA      NA NOT
str(biographic_baseball)
'data.frame': 21574 obs. of 33 variables:
$ PLAYERID : chr "aardd001" "aaroh101" "aarot101" "aased001" ...
$ LAST : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
$ FIRST : chr "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
$ NICKNAME : chr "David" "Hank" "Tommie" "Don" ...
$ BIRTHDATE : chr "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
$ BIRTH.CITY : chr "Denver" "Mobile" "Mobile" "Orange" ...
$ BIRTH.STATE : chr "Colorado" "Alabama" "Alabama" "California" ...
$ BIRTH.COUNTRY : chr "USA" "USA" "USA" "USA" ...
$ PLAY.DEBUT : chr "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
$ PLAY.LASTGAME : chr "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
$ MGR.DEBUT : chr "" "" "" "" ...
$ MGR.LASTGAME : chr "" "" "" "" ...
$ COACH.DEBUT : chr "" "" "04/06/1979" "" ...
$ COACH.LASTGAME: chr "" "" "09/30/1984" "" ...
$ UMP.DEBUT : chr "" "" "" "" ...
$ UMP.LASTGAME : chr "" "" "" "" ...
$ DEATHDATE : chr "" "01/22/2021" "08/16/1984" "" ...
$ DEATH.CITY : chr "" "Atlanta" "Atlanta" "" ...
$ DEATH.STATE : chr "" "Georgia" "Georgia" "" ...
$ DEATH.COUNTRY : chr "" "USA" "USA" "" ...
$ BATS : chr "R" "R" "R" "R" ...
$ THROWS : chr "R" "R" "R" "R" ...
$ HEIGHT : chr "6-05" "6-00" "6-03" "6-03" ...
$ WEIGHT : int 200 180 190 190 184 205 192 170 175 169 ...
$ CEMETERY : chr "" "Southview Cemetery" "Catholic Cemetery" "" ...
$ CEME.CITY : chr "" "Atlanta" "Mobile" "" ...
$ CEME.STATE : chr "" "Georgia" "Alabama" "" ...
$ CEME.COUNTRY : chr "" "USA" "USA" "" ...
$ CEME.NOTE : chr "" "" "" "" ...
$ BIRTH.NAME : chr "" "" "" "" ...
$ NAME.CHG : logi NA NA NA NA NA NA ...
$ BAT.CHG : logi NA NA NA NA NA NA ...
$ HOF : chr "NOT" "HOF" "NOT" "NOT" ...
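The str() output shows that BIRTHDATE, PLAY.DEBUT, and PLAY.LASTGAME are stored as character strings. Questions about player age and career length need real dates, so a minimal sketch of the conversion might look like the following. It uses a three-row toy data frame copied from the head() output above rather than the full file, and the names `toy`, `to_date`, `debut_age`, and `career_years` are my own illustrative choices, not part of the Retrosheet file.

```r
# Toy rows copied from the head() output above (not the full file).
toy <- data.frame(
  PLAYERID      = c("aardd001", "aaroh101", "aarot101"),
  BIRTHDATE     = c("12/27/1981", "02/05/1934", "08/05/1939"),
  PLAY.DEBUT    = c("04/06/2004", "04/13/1954", "04/10/1962"),
  PLAY.LASTGAME = c("08/23/2015", "10/03/1976", "09/26/1971"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse MM/DD/YYYY text into Date.
# Blank or non-matching strings become NA.
to_date <- function(x) as.Date(x, format = "%m/%d/%Y")

toy$birth <- to_date(toy$BIRTHDATE)
toy$debut <- to_date(toy$PLAY.DEBUT)
toy$last  <- to_date(toy$PLAY.LASTGAME)

# Approximate age at debut and career length in years
# (difference in days divided by 365.25).
toy$debut_age    <- as.numeric(toy$debut - toy$birth) / 365.25
toy$career_years <- as.numeric(toy$last - toy$debut) / 365.25

print(toy[, c("PLAYERID", "debut_age", "career_years")])
```

Dividing by 365.25 is only an approximation; it is probably close enough for comparing wartime and peacetime debut ages, but that is a judgment call.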
# Select only the variables necessary for my questions (select() comes from dplyr).
library(dplyr)
biographic_baseball_small <- select(biographic_baseball, PLAYERID, LAST, FIRST, NICKNAME, BIRTHDATE, BIRTH.CITY, BIRTH.STATE, BIRTH.COUNTRY, PLAY.DEBUT, PLAY.LASTGAME)
In this data-wrangling step I filter for players who debuted during the World War periods and in the seasons just before and after them (debut years 1915 to 1919 and 1940 to 1947). I do this by filtering on the PLAY.DEBUT column with grepl, which returns a logical vector indicating which elements of a character vector contain a match for the pattern.
# Filter for wartime debut years plus the seasons just before and after.
# The pattern stays on one line: a line break inside the quotes would become part of the regex.
ww_biogrpahic_baseball <- dplyr::filter(biographic_baseball_small,
  grepl('1915|1916|1917|1918|1919|1940|1941|1942|1943|1944|1945|1946|1947', PLAY.DEBUT))
head(ww_biogrpahic_baseball)
  PLAYERID      LAST              FIRST NICKNAME  BIRTHDATE
1 abert102 Abernathy Talmadge Lafayette      Ted 10/30/1921
2 aberw101 Abernathy     Virgil Woodrow    Woody 02/01/1915
3 abrej101     Abreu    Joseph Lawrence      Joe 05/24/1913
4 adama101     Adams       Ace Townsend      Ace 03/02/1910
5 adamb103     Adams       Robert Henry    Bobby 12/14/1921
6 adamr101     Adams     Charles Dwight      Red 10/07/1921
   BIRTH.CITY    BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME
1       Bynum North Carolina           USA 09/19/1942    04/29/1944
2 Forest City North Carolina           USA 07/28/1946    04/17/1947
3     Oakland     California           USA 04/23/1942    07/11/1942
4     Willows     California           USA 04/15/1941    04/24/1946
5    Tuolumne     California           USA 04/16/1946    04/22/1959
6     Parlier     California           USA 05/05/1946    07/02/1946
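As a hedged illustration of how grepl behaves in a filter like this one, the sketch below runs it on a few made-up debut strings and then shows an alternative that extracts the year as a number. Two things are worth noticing: a line break inside a quoted pattern becomes part of the regular expression, and numeric years allow a range comparison instead of a long alternation. The vector `debut` and the names `one_line`, `split_pattern`, `debut_year`, and `in_window` are illustrative assumptions, not objects from the analysis.

```r
# Made-up debut dates in the file's MM/DD/YYYY format.
debut <- c("04/15/1941", "07/28/1946", "04/16/1940", "09/10/2001", "05/01/1947")

# grepl returns one TRUE/FALSE per element of the character vector.
one_line <- grepl("1940|1941|1946|1947", debut)

# The same years with line breaks inside the pattern, written out as \n.
# The alternatives "1940\n" and "\n1947" now require a newline next to
# the year, so plain dates from 1940 and 1947 are not matched.
split_pattern <- grepl("1940\n|1941|1946|\n1947", debut)

# Alternative: pull out the 4-digit year (characters 7 to 10) and
# compare it numerically against the two windows.
debut_year <- as.integer(substr(debut, 7, 10))
in_window  <- (debut_year >= 1915 & debut_year <= 1919) |
              (debut_year >= 1940 & debut_year <= 1947)

print(data.frame(debut, one_line, split_pattern, in_window))
```

In this toy example only the one-line pattern and the numeric comparison flag the 1940 and 1947 debuts. The substr positions assume every date is a full MM/DD/YYYY string; partial or blank dates would need separate handling.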
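Here I use the dplyr function arrange to sort by debut date. Because PLAY.DEBUT is stored as a character string in MM/DD/YYYY format, the rows are ordered as text (month first, then day, then year) rather than chronologically, which is why 04/11/1915 is followed by 04/11/1917 in the output.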
# Arrange by Debut Date Ascending
arrange_ww_biogrpahic_baseball <- arrange(ww_biogrpahic_baseball, PLAY.DEBUT)
head(arrange_ww_biogrpahic_baseball)
  PLAYERID      LAST            FIRST NICKNAME  BIRTHDATE
1 huhne101      Huhn        Emil Hugo     Emil 03/10/1892
2 uphab101     Upham William Lawrence     Bill 04/04/1888
3 roggc101     Rogge  Francis Clinton    Clint 07/19/1889
4 coucj101     Couch      John Daniel   Johnny 03/31/1891
5 joneb103     Jones    Robert Walter      Bob 12/02/1889
6 nichf101 Nicholson             Fred     Fred 09/01/1894
    BIRTH.CITY BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME
1 North Vernon     Indiana           USA 04/10/1915    06/30/1917
2        Akron        Ohio           USA 04/10/1915    06/26/1918
3      Memphis    Michigan           USA 04/11/1915    06/06/1921
4       Vaughn     Montana           USA 04/11/1917    09/21/1925
5      Clayton  California           USA 04/11/1917    09/24/1925
6  Honey Grove       Texas           USA 04/11/1917    09/16/1922
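A chronological ordering needs the debut strings converted to dates before sorting. The following is a minimal base R sketch of that idea, assuming the dates parse with the format "%m/%d/%Y"; it uses a handful of debut values taken from the outputs above, and `debut_chr`, `debut_date`, `text_sorted`, and `date_sorted` are illustrative names only.

```r
# A few debut strings taken from the outputs above, in no particular order.
debut_chr <- c("09/19/1942", "04/11/1917", "07/28/1946", "04/10/1915", "04/15/1941")

# Sorting the text compares characters left to right: month, day, then year.
text_sorted <- sort(debut_chr)

# Converting to Date first gives a true chronological order.
debut_date  <- as.Date(debut_chr, format = "%m/%d/%Y")
date_sorted <- debut_chr[order(debut_date)]

print(text_sorted)
print(date_sorted)

# With dplyr, the same conversion could be passed to arrange(), e.g.:
# arrange(ww_biogrpahic_baseball, as.Date(PLAY.DEBUT, format = "%m/%d/%Y"))
```

The two printed vectors differ in their last two entries: the text sort puts the July 1946 debut ahead of the September 1942 one.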
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Martins (2022, Jan. 8). Data Analytics and Computational Social Science: Homework 3 Identify Dataset, Read in, Identify Potential Questions. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8852865/
BibTeX citation
@misc{martins2022homework, author = {Martins, K}, title = {Data Analytics and Computational Social Science: Homework 3 Identify Dataset, Read in, Identify Potential Questions}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8852865/}, year = {2022} }