Effects of War on Major League Baseball
Sport is often a metaphor for war, and the two share vocabulary such as offense, defense, leadership, weaknesses, strengths, strategies, victory, and defeat. For most of the middle 20th century, baseball was America’s favorite sport, with popularity peaking in the years immediately after the Second World War. It is notable that baseball was played throughout World War 2 (WW2) despite the mandatory draft and the duration of the conflict, though this wasn’t unprecedented, as baseball was also played during the United States’ involvement in World War 1.
For this data analysis I wanted to evaluate the effect of war on baseball. (This is NOT the darling baseball statistic “WAR”, which stands for wins above replacement.) Using biographical data such as birth date, debut date, last game date, and Hall of Fame (HOF) status, I will evaluate:
-how WW2 affects the debuting players’ age
-how long they lasted in the league (career length)
-whether or not WW2 debuting players were inducted into the HOF
To do this I will be analyzing biographical data for players who debuted between 1938 and 1949: the 4 years prior to WW2 (1938 to 1941), the 4 years of US involvement in WW2 (1942 to 1945), and the 4 years after WW2 (1946 to 1949).
The data used is a 21,575 record csv file from retrosheet.org, a website that compiles baseball statistics ranging from plays and games to player information. Each line represents a player, a non-playing manager or coach, or a non-playing umpire. Many of the variables are dates, such as the debut date (the player’s first MLB game) and the date of the player’s last game. The only binary variable used within the analysis was HOF, where “HOF” means inducted into the Hall of Fame and a blank (shown as “NOT” in the file read in below) means the player is not in the Hall of Fame.
To get the data ready, I needed to read in the file and limit the debut years to the 4 years prior to WW2, the WW2 years, and the 4 years after WW2, that is, 1938 to 1949. Of note, while the US declared war on Japan in December of 1941, the 1941 season is considered pre-WW2 because the regular season ended in September. Additionally, I needed to tidy the data to change formats, calculate the time between dates to find age and career length, and categorize player records into PreWW2, WW2, and PostWW2 debut dates.
Here I read in the baseball biographical file from retrosheet.org that I saved within my set working directory.
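# Load the packages used below, then read in the Baseball Biographical File dataset
library(dplyr)
library(ggplot2)
library(lubridate)
biographic_baseball <- read.csv("BIOFILE_RETROSHEET_kjmcleaner.csv", header = TRUE, sep = ",")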
str(biographic_baseball)
'data.frame': 21574 obs. of 38 variables:
$ PLAYERID : chr "aardd001" "aaroh101" "aarot101" "aased001" ...
$ LAST : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
$ FIRST : chr "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
$ NICKNAME : chr "David" "Hank" "Tommie" "Don" ...
$ BIRTHDATE : chr "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
$ Alt_BIRTHDATE : chr "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
$ BIRTH.CITY : chr "Denver" "Mobile" "Mobile" "Orange" ...
$ BIRTH.STATE : chr "Colorado" "Alabama" "Alabama" "California" ...
$ BIRTH.COUNTRY : chr "USA" "USA" "USA" "USA" ...
$ PLAY.DEBUT : chr "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
$ Alt_DEBUTDATE : chr "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
$ DEBUT_AGE : chr "22" "20" "22" "22" ...
$ PLAY.LASTGAME : chr "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
$ Alt_LASTGAME : chr "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
$ Complete_Years_Active: int 11 22 9 13 4 11 0 13 4 4 ...
$ MGR.DEBUT : chr "" "" "" "" ...
$ MGR.LASTGAME : chr "" "" "" "" ...
$ COACH.DEBUT : chr "" "" "04/06/1979" "" ...
$ COACH.LASTGAME : chr "" "" "09/30/1984" "" ...
$ UMP.DEBUT : chr "" "" "" "" ...
$ UMP.LASTGAME : chr "" "" "" "" ...
$ DEATHDATE : chr "" "01/22/2021" "08/16/1984" "" ...
$ DEATH.CITY : chr "" "Atlanta" "Atlanta" "" ...
$ DEATH.STATE : chr "" "Georgia" "Georgia" "" ...
$ DEATH.COUNTRY : chr "" "USA" "USA" "" ...
$ BATS : chr "R" "R" "R" "R" ...
$ THROWS : chr "R" "R" "R" "R" ...
$ HEIGHT : chr "6-05" "6-00" "6-03" "6-03" ...
$ WEIGHT : int 200 180 190 190 184 205 192 170 175 169 ...
$ CEMETERY : chr "" "Southview Cemetery" "Catholic Cemetery" "" ...
$ CEME.CITY : chr "" "Atlanta" "Mobile" "" ...
$ CEME.STATE : chr "" "Georgia" "Alabama" "" ...
$ CEME.COUNTRY : chr "" "USA" "USA" "" ...
$ CEME.NOTE : chr "" "" "" "" ...
$ BIRTH.NAME : chr "" "" "" "" ...
$ NAME.CHG : logi NA NA NA NA NA NA ...
$ BAT.CHG : logi NA NA NA NA NA NA ...
$ HOF : chr "NOT" "HOF" "NOT" "NOT" ...
Here I am tidying my data to:
-Change the date fields from chr to Date using mdy
-Limit the selection to the 4 years prior to WW2, the 4 years of WW2 (US involvement), and the 4 years after WW2
-Add columns for Debut Age, Career Length, a Debut Category (PreWW2, WW2, or PostWW2), and a debut season within each category. For example, debut season “2” within the WW2 category would be 1943.
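As a minimal sketch of the first tidying step, the snippet below shows how lubridate's `mdy()` turns month/day/year strings into `Date` values, using two debut dates from the `str()` output above and one blank. The vector name `raw_dates` is only for illustration; blanks, which are common in columns such as `MGR.DEBUT`, come back as `NA`.

```r
library(lubridate)

# Two PLAY.DEBUT values from the str() output above, plus a blank like those in MGR.DEBUT
raw_dates <- c("04/06/2004", "04/13/1954", "")

# quiet = TRUE suppresses the parse warning that the blank would otherwise trigger
parsed <- mdy(raw_dates, quiet = TRUE)

class(parsed)      # "Date"
is.na(parsed)      # FALSE FALSE TRUE
year(parsed[2])    # 1954
```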
#I have changed the Date fields to mdy from chr
biographic_baseball$PLAY.DEBUT <- mdy(biographic_baseball$PLAY.DEBUT)
biographic_baseball$BIRTHDATE <- mdy(biographic_baseball$BIRTHDATE)
biographic_baseball$PLAY.LASTGAME <- mdy(biographic_baseball$PLAY.LASTGAME)
biographic_baseball$MGR.DEBUT <- mdy(biographic_baseball$MGR.DEBUT)
biographic_baseball$MGR.LASTGAME <- mdy(biographic_baseball$MGR.LASTGAME)
biographic_baseball$DEATHDATE <- mdy(biographic_baseball$DEATHDATE)
# Here I am limiting the selection to the 4 years prior to WW2, the 4 years
# of WW2 (US involvement), and the 4 years after WW2
WW2_biographic_baseball <- biographic_baseball %>%
  dplyr::select(PLAYERID, LAST, FIRST, NICKNAME, BIRTHDATE, BIRTH.CITY,
                BIRTH.STATE, BIRTH.COUNTRY, PLAY.DEBUT, PLAY.LASTGAME,
                MGR.DEBUT, MGR.LASTGAME, DEATH.COUNTRY, BATS, THROWS, HOF) %>%
  # PLAY.DEBUT is already a Date, so compare its year directly
  dplyr::filter(year(PLAY.DEBUT) %in% 1938:1949)
# I then add columns to add a Debut Age, Career Length, and a Debut Category for
#PreWW2, WW2, and PostWW2, and debut season by category.
WW2_bio_baseball.add.yrs.debutcat <- WW2_biographic_baseball %>% mutate(
DEBUT_AGE = year(as.period(interval
(start = BIRTHDATE, end = PLAY.DEBUT))),
CAREER_LENGTH = year(as.period(interval
(start = PLAY.DEBUT,
end = PLAY.LASTGAME))),
DEBUT_CATEGORY = case_when(
year(PLAY.DEBUT) %in% 1938:1941 ~ "1938 - 1941 PreWW2",
year(PLAY.DEBUT) %in% 1942:1945 ~ "1942 - 1945 WW2",
year(PLAY.DEBUT) %in% 1946:1949 ~ "1946 - 1949 PostWW2"),
# Debut year kept as a character label
DEBUT_YEAR = as.character(year(PLAY.DEBUT)),
# Season number (1 to 4) within each 4-year category; 1938, 1942, and 1946 are season 1
DEBUT_Season_Per_Category = as.character((year(PLAY.DEBUT) - 1938) %% 4 + 1))
str(WW2_bio_baseball.add.yrs.debutcat)
'data.frame': 1404 obs. of 21 variables:
$ PLAYERID : chr "aberc101" "abert102" "aberw101" "abrac101" ...
$ LAST : chr "Aberson" "Abernathy" "Abernathy" "Abrams" ...
$ FIRST : chr "Clifford Alexander" "Talmadge Lafayette" "Virgil Woodrow" "Calvin Ross" ...
$ NICKNAME : chr "Cliff" "Ted" "Woody" "Cal" ...
$ BIRTHDATE : Date, format: "1921-08-28" ...
$ BIRTH.CITY : chr "Chicago" "Bynum" "Forest City" "Philadelphia" ...
$ BIRTH.STATE : chr "Illinois" "North Carolina" "North Carolina" "Pennsylvania" ...
$ BIRTH.COUNTRY : chr "USA" "USA" "USA" "USA" ...
$ PLAY.DEBUT : Date, format: "1947-07-18" ...
$ PLAY.LASTGAME : Date, format: "1949-05-09" ...
$ MGR.DEBUT : Date, format: NA ...
$ MGR.LASTGAME : Date, format: NA ...
$ DEATH.COUNTRY : chr "USA" "USA" "USA" "USA" ...
$ BATS : chr "R" "R" "L" "L" ...
$ THROWS : chr "R" "L" "L" "L" ...
$ HOF : chr "NOT" "NOT" "NOT" "NOT" ...
$ DEBUT_AGE : num 25 20 31 25 28 31 23 24 27 20 ...
$ CAREER_LENGTH : num 1 1 0 7 0 5 8 13 0 1 ...
$ DEBUT_CATEGORY : chr "1946 - 1949 PostWW2" "1942 - 1945 WW2" "1946 - 1949 PostWW2" "1946 - 1949 PostWW2" ...
$ DEBUT_YEAR : chr "1947" "1942" "1946" "1949" ...
$ DEBUT_Season_Per_Category: chr "2" "1" "1" "4" ...
head(WW2_bio_baseball.add.yrs.debutcat)
PLAYERID LAST FIRST NICKNAME BIRTHDATE
1 aberc101 Aberson Clifford Alexander Cliff 1921-08-28
2 abert102 Abernathy Talmadge Lafayette Ted 1921-10-30
3 aberw101 Abernathy Virgil Woodrow Woody 1915-02-01
4 abrac101 Abrams Calvin Ross Cal 1924-03-02
5 abrej101 Abreu Joseph Lawrence Joe 1913-05-24
6 adama101 Adams Ace Townsend Ace 1910-03-02
BIRTH.CITY BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME
1 Chicago Illinois USA 1947-07-18 1949-05-09
2 Bynum North Carolina USA 1942-09-19 1944-04-29
3 Forest City North Carolina USA 1946-07-28 1947-04-17
4 Philadelphia Pennsylvania USA 1949-04-19 1956-05-09
5 Oakland California USA 1942-04-23 1942-07-11
6 Willows California USA 1941-04-15 1946-04-24
MGR.DEBUT MGR.LASTGAME DEATH.COUNTRY BATS THROWS HOF DEBUT_AGE
1 <NA> <NA> USA R R NOT 25
2 <NA> <NA> USA R L NOT 20
3 <NA> <NA> USA L L NOT 31
4 <NA> <NA> USA L L NOT 25
5 <NA> <NA> USA R R NOT 28
6 <NA> <NA> USA R R NOT 31
CAREER_LENGTH DEBUT_CATEGORY DEBUT_YEAR
1 1 1946 - 1949 PostWW2 1947
2 1 1942 - 1945 WW2 1942
3 0 1946 - 1949 PostWW2 1946
4 7 1946 - 1949 PostWW2 1949
5 0 1942 - 1945 WW2 1942
6 5 1938 - 1941 PreWW2 1941
DEBUT_Season_Per_Category
1 2
2 1
3 1
4 4
5 1
6 4
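The `DEBUT_AGE` and `CAREER_LENGTH` columns above count completed years between two dates. As a hedged sketch of that idea, the snippet below applies the same interval arithmetic to Hank Aaron's dates from the first `str()` output, using integer division by `years(1)` as an alternative spelling of `year(as.period(interval(...)))`. The variable names are illustrative only.

```r
library(lubridate)

# Hank Aaron's dates as listed in the first str() output
birth <- mdy("02/05/1934")
debut <- mdy("04/13/1954")
last_game <- mdy("10/03/1976")

# Completed years between two dates: integer-divide the interval by one year
debut_age <- interval(birth, debut) %/% years(1)
career_length <- interval(debut, last_game) %/% years(1)

debut_age      # 20, matching DEBUT_AGE in the source file
career_length  # 22, matching Complete_Years_Active in the source file
```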
Within the visualization section I calculate the mean debut age and career length, produce a bar graph of the median career length, and plot all of the debuting players within the period, highlighting HOF and other key players of the time.
Many of the visualizations categorize the players into debut periods:
-PreWW2 = Seasons 1938 through 1941
-WW2 = Seasons 1942 through 1945
-PostWW2 = Seasons 1946 through 1949
For each period, the mean and standard deviation of debut age and years active (career length) were calculated. Included in this section is a table of these descriptive statistics along with a line graph faceted by debut category.
The mean debut age changed only slightly: 24.9 in the wartime years compared to 23.8 pre-WW2 and 24.8 post-WW2. This is also evident within the line graph. This aims to answer how WW2 affected the age of debuting players. In this case, it appears that WW2 didn’t make a large difference when looking at the mean.
Mean years active were higher in the pre and post WW2 periods than in wartime. Here we see that WW2 era debuting players’ careers were roughly 1.6 to 2.3 years shorter on average (3.26 years versus 5.54 pre-WW2 and 4.83 post-WW2). In the “Calculating the Median Career Length” section below I will explore this further.
#Calculating the Mean of Debut Age and Career Length
WW2_bio_baseball.add.yrs.debutcat %>%
group_by(DEBUT_CATEGORY) %>%
summarize(MEAN_DEBUT_AGE = mean(DEBUT_AGE),
MEAN_CAREER_LENGTH = mean(CAREER_LENGTH),
SD_DEBUT_AGE = sd(DEBUT_AGE), SD_CAREER_LENGTH = sd(CAREER_LENGTH))
# A tibble: 3 x 5
DEBUT_CATEGORY MEAN_DEBUT_AGE MEAN_CAREER_LENGTH SD_DEBUT_AGE
<chr> <dbl> <dbl> <dbl>
1 1938 - 1941 PreWW2 23.8 5.54 2.91
2 1942 - 1945 WW2 24.9 3.26 3.97
3 1946 - 1949 PostWW2 24.8 4.83 3.30
# ... with 1 more variable: SD_CAREER_LENGTH <dbl>
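To make the comparison between periods concrete, the sketch below re-enters the rounded means from the table above and subtracts the wartime row from each row. The data frame `period_means` is typed in by hand for illustration, so the differences carry the same rounding as the printed table.

```r
# Rounded values copied from the summary table above
period_means <- data.frame(
  DEBUT_CATEGORY     = c("1938 - 1941 PreWW2", "1942 - 1945 WW2", "1946 - 1949 PostWW2"),
  MEAN_DEBUT_AGE     = c(23.8, 24.9, 24.8),
  MEAN_CAREER_LENGTH = c(5.54, 3.26, 4.83)
)

# Row 2 is the wartime period; subtract its means from every row
ww2 <- period_means[2, ]
period_means$AGE_DIFF_VS_WW2    <- round(period_means$MEAN_DEBUT_AGE - ww2$MEAN_DEBUT_AGE, 2)
period_means$CAREER_DIFF_VS_WW2 <- round(period_means$MEAN_CAREER_LENGTH - ww2$MEAN_CAREER_LENGTH, 2)

period_means[, c("DEBUT_CATEGORY", "AGE_DIFF_VS_WW2", "CAREER_DIFF_VS_WW2")]
```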
# Calculating the mean debut age and graphing it as a line graph faceted by
# debut category. Note the x-axis is the debut season within each category rather
# than the year; for example, debut season "2" within the WW2 category is 1943.
WW2_bio_baseball.add.yrs.debutcat %>%
group_by(DEBUT_CATEGORY, DEBUT_Season_Per_Category) %>%
summarize(MEAN_DEBUT_AGE = mean(DEBUT_AGE)) %>%
ggplot(aes(`DEBUT_Season_Per_Category`, `MEAN_DEBUT_AGE`,
group = `DEBUT_CATEGORY`, color = `DEBUT_CATEGORY`)) +
geom_line() +
labs(title = "Mean Debut Age", y = "Mean Debut Age", x = "Season") +
scale_y_continuous(name="Mean Debut Age", limits=c(15, 45)) +
facet_wrap(~ DEBUT_CATEGORY)
For each period the median years active (career length) was calculated.
The median tells the same story as the mean: years active were higher in the pre and post WW2 periods than in wartime. As the bar graph shows, players who debuted during WW2 tended to have shorter careers than those who debuted in the pre and post periods.
#Calculating the Median of Career Length and Visualization
WW2_bio_baseball.add.yrs.debutcat %>%
group_by(DEBUT_CATEGORY) %>%
summarize(MEDIAN_CAREER_LENGTH = median(CAREER_LENGTH)) %>%
ggplot(aes(x = DEBUT_CATEGORY, y = MEDIAN_CAREER_LENGTH)) +
geom_bar(stat = 'identity', color="steelblue", fill="steelblue") +
ggtitle("Median Career Length") +
labs(y= "Median Career Length", x = "Debut Category")
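Career length cannot go below zero and appears to have a long right tail, which is one reason to look at the median alongside the mean. The toy vector below is made up for illustration (it is not drawn from the Retrosheet file) and shows how a few long careers pull the mean well above the median.

```r
# Hypothetical career lengths in years: many short stints and a few long careers
toy_career_length <- c(0, 0, 0, 1, 1, 2, 3, 5, 13, 20)

mean(toy_career_length)    # 4.5
median(toy_career_length)  # 1.5
```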
Using the calculated age and debut date, I created a scatterplot visualization. I have highlighted the HOF players (red) while also highlighting some interesting observations of specific players.
Some of the significant players were Ted Williams (blue), Joe Nuxhall (yellow), and Satchel Paige (green). Each of these players brings an interesting perspective on my research questions of how WW2 affected debuting players’ age and whether or not WW2 debuting players were inducted into the HOF.
-Ted Williams debuted in 1939, prior to US involvement in WW2, served in WW2, and became a Hall of Famer.
-Joe Nuxhall is the youngest player ever to debut in the MLB, at 15 years old. He did so during the WW2 period, in 1944.
-Satchel Paige is the oldest player ever to debut in the MLB, at age 42. He did so post WW2, in 1948. Paige is a Hall of Famer. Paige brings an interesting wrinkle into this study, as he is one of the best representations of the effect of race and racism during this period of Major League Baseball. His debut came after a long career in the Negro Leagues. He entered the MLB after the groundbreaking 1947 debut of 28-year-old Jackie Robinson, himself a former Negro Leagues player.
Observation of the clusters doesn’t show anything conclusive about whether average ages increased or decreased during wartime. However, it is interesting that there appear to be more scattered outliers of much older and much younger debuting players.
As for the HOF players, only 3 of the 24 HOF players (in red) who debuted within this time period did so in the wartime seasons. It appears that the quality of players debuting during WW2 tended to be lower, as there weren’t as many HOF players.
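The number of Hall of Famers per debut period can be tallied with a grouped count. The sketch below is an assumption about how that tally might be done, not necessarily how the figures quoted here were obtained. It runs on a small made-up data frame (`toy_players`) that borrows the `DEBUT_CATEGORY` and `HOF` column names and the "HOF" / "NOT" coding from the real data.

```r
library(dplyr)

# Hypothetical rows with the same column names and HOF coding as the real data
toy_players <- data.frame(
  DEBUT_CATEGORY = c("1938 - 1941 PreWW2", "1938 - 1941 PreWW2",
                     "1942 - 1945 WW2", "1942 - 1945 WW2",
                     "1946 - 1949 PostWW2", "1946 - 1949 PostWW2"),
  HOF = c("HOF", "NOT", "NOT", "NOT", "HOF", "HOF")
)

# Count inductees and their share of all debuts within each period
hof_by_period <- toy_players %>%
  group_by(DEBUT_CATEGORY) %>%
  summarize(N_DEBUTS = n(),
            N_HOF = sum(HOF == "HOF"),
            SHARE_HOF = N_HOF / N_DEBUTS)

hof_by_period
```

Applied to `WW2_bio_baseball.add.yrs.debutcat`, the same pipeline would also give the share of debuts that reached the Hall of Fame, which accounts for the different number of debuts in each period.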
# I created variables to highlight significant players within my scatterplot and highlight HOF players
Ted_Williams <- WW2_bio_baseball.add.yrs.debutcat %>%
filter(PLAYERID == "willt103")
Satchel_Paige <- WW2_bio_baseball.add.yrs.debutcat %>%
filter(PLAYERID == "paigs101")
Joe_Nuxhall <- WW2_bio_baseball.add.yrs.debutcat %>%
filter(PLAYERID == "nuxhj101")
HOF_PLAYERS <- WW2_bio_baseball.add.yrs.debutcat %>%
filter(HOF == "HOF")
#Scatter plot of Debut Age and Debut Date for each player
WW2_bio_baseball.add.yrs.debutcat %>%
ggplot(aes(x=PLAY.DEBUT,y=DEBUT_AGE)) +
xlab("Date of MLB Debut") +
ylab("Age") +
ggtitle("MLB Players Age by Debut Date") +
geom_point(alpha=0.3) +
geom_point(data=HOF_PLAYERS,
aes(x=PLAY.DEBUT,y=DEBUT_AGE, col = 'HOF Players'),
size=3) +
geom_point(data=Ted_Williams,
aes(x=PLAY.DEBUT,y=DEBUT_AGE, col = 'Ted Williams, HOF'),
size=3) +
geom_point(data=Satchel_Paige,
aes(x=PLAY.DEBUT,y=DEBUT_AGE, col = "Satchel Paige, HOF"),
size=3) +
geom_point(data=Joe_Nuxhall,
aes(x=PLAY.DEBUT,y=DEBUT_AGE, col = "Joe Nuxhall"),
size=3)
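The colors named in the text (red, blue, yellow, green) are not set in the plot code, so ggplot2 would pick its own default palette for the four legend labels. One way to pin them, sketched here on a four-row stand-in data frame rather than the real data, is `scale_color_manual()` with a named vector. The rows in `toy_points` are illustrative placeholders, apart from Nuxhall's and Paige's debut ages, which come from the text.

```r
library(ggplot2)

# Illustrative stand-in rows; the real plot uses WW2_bio_baseball.add.yrs.debutcat
toy_points <- data.frame(
  PLAY.DEBUT = as.Date(c("1946-04-16", "1939-04-20", "1944-06-10", "1948-07-09")),
  DEBUT_AGE  = c(25, 20, 15, 42),
  LABEL      = c("HOF Players", "Ted Williams, HOF", "Joe Nuxhall", "Satchel Paige, HOF")
)

# Named vector: legend label -> color described in the text
highlight_colors <- c("HOF Players"        = "red",
                      "Ted Williams, HOF"  = "blue",
                      "Joe Nuxhall"        = "yellow",
                      "Satchel Paige, HOF" = "green")

p <- ggplot(toy_points, aes(x = PLAY.DEBUT, y = DEBUT_AGE, color = LABEL)) +
  geom_point(size = 3) +
  scale_color_manual(values = highlight_colors, name = NULL) +
  labs(x = "Date of MLB Debut", y = "Age")

p
```

Adding the same `scale_color_manual(values = highlight_colors)` layer to the scatter plot above should make the legend match the colors described in the text, since the `col = '...'` strings inside `aes()` act as legend labels rather than color names.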
This was my first time working in R and my first time learning a coding language. Ever since childhood my favorite sport has been baseball. The game has evolved a lot since its inception in the late 1800s. One of the most recent changes to the game is the prevalence of statistical analysis, as the game produces a lot of data related to individual performance.
Initially, I had set out to try to statistically prove who the best clutch hitter of my lifetime is. This proved to be much more difficult than I realized, as I would have needed to look at 3 different data sets to determine the game situation of a given at bat and then develop a rating system. For example, a double that brings in 2 runs in a 13-0 game would be worth less than a lead-off single in a playoff game.
At the same time I was reading a book set in the WW2 era, and I became curious about how the war affected the baseball being played at the time.
One of the easiest parts was finding a data set. Fortunately, there is a great community and dedicated individuals who gather incredible details related to baseball - I am very indebted to retrosheet.org for this.
What I found most difficult was tidying the data. There were a lot of elements that I had to add to answer my three questions. I had to calculate ages from dates, categorize, and limit my data set to the years in question. This took some time first to conceptualize and then to execute and get the tidy code correct.
The next steps, if I were to continue, would be to look at the actual performance of these players and how many championships they won. This gets a little tricky when trying to compare pitchers to batters. That’s where some of the new baseball stats like wins above replacement (hence the “WAR” joke in the intro) try to bridge the two into one convenient stat. Another area I would have liked to expand on was the effect of race. There were many Negro Leagues players that entered the MLB in the years following WW2. I had highlighted Satchel Paige, who was 42 at his debut!
Something that may be skewing the career lengths of preWW2 debuting players is that teams could have been honoring the roster spots of players who went to war. “…lies, damned lies, and statistics…” (Mark Twain)
Something I found interesting was the gap in the scatter plot around the 1943 season, which suggests that the season started late. According to baseball-reference.com the first game was in August, whereas the baseball season typically starts in April. I would never have known that if not for this project.
In this analysis I had set out, generally, to find out how WW2 affected baseball.
I looked at debuting players’ age and found that, despite the draft, debuting players were not much older or younger than in the pre and post WW2 seasons. This was evident from calculating the mean and plotting it for visual inspection.
I also compared the career lengths of players that debuted during WW2 and found that these players tended to not stay in the league as long as pre and post WW2 debuting players. I determined this by calculating the median and displaying it in a bar chart.
Finally, I looked at the HOF status of the WW2 debuting players and determined that the quality of these players was lower, as only 3 of them were inducted into the HOF, compared to 9 players from the pre WW2 period and 12 from the post WW2 period.
As for the remaining questions, I believe that there may be more to explore within the standard deviation of the debut age. While the mean for the WW2 period was about the same as for the pre and post periods, the higher standard deviation indicates a wider spread of older and younger debut ages. Another remaining question is the effect of race on this analysis, as the MLB didn’t become integrated until 1947.
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Retrosheet. “Retrosheet.” Biographical Information, 2 Dec. 2021, https://www.retrosheet.org/.
Anapolis, Nick. “ROBINSON DEBUTS FIVE DAYS AFTER SIGNING WITH DODGERS.” Baseball Hall of Fame, 2013, https://baseballhall.org/.
“MLB Stats, Scores, History, & Records.” Baseball Reference, https://www.baseball-reference.com/.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Martins (2022, Jan. 25). Data Analytics and Computational Social Science: DACSS 601 Final: Effects of War on MLB. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8858127/
BibTeX citation
@misc{martins2022dacss,
  author = {Martins, K},
  title = {Data Analytics and Computational Social Science: DACSS 601 Final: Effects of War on MLB},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8858127/},
  year = {2022}
}