Homework 6 Working Draft of Effects of War on MLB

This HW6 is a working draft

K Martins
2022-01-19

Introduction

INTRODUCTION NEEDS MORE CONTEXT TO HIT PAGES

This is K Martins’s submission for Homework Six in DACSS 601. I will read in a baseball player biographical dataset from retrosheet.org and look at the 4 years prior to World War 2 (WW2), the 4 years of WW2 (US involvement), and the 4 years after WW2 to determine the effect of the war on the quality of the players who made their Major League Baseball (MLB) debut during the war.

I will assess the quality of the players who debuted during WW2 by looking at how old they were at their debut, how long they played in the MLB, and how many of them received the MLB's highest honor, induction into the Hall of Fame (HOF).

RESEARCH QUESTIONS SUMMARIZED

What effects did WW2 have on the quality of the players in the MLB?

How can data show us quality of players at that time?

What insights are missing in the data? (e.g., the effect of racism)

DATA

Kevin’s Set-Ups

# Setting my working directory on PC to my R folder
my_dir <- "C:/Users/MH821/Documents/R"
setwd(my_dir)

library(tidyverse)
library(dplyr)
library(lubridate)
library(janitor)

Read in the Data

Read In Baseball Biographical File and Explain the Variables

Here I read in the baseball biographical file from retrosheet.org that I saved within my set working directory.

# Read in Baseball Biographical File dataset
biographic_baseball <- read.csv("BIOFILE_RETROSHEET_kjmcleaner.csv", header = TRUE, sep = ",")

str(biographic_baseball)
'data.frame':   21574 obs. of  38 variables:
 $ PLAYERID             : chr  "aardd001" "aaroh101" "aarot101" "aased001" ...
 $ LAST                 : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
 $ FIRST                : chr  "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
 $ NICKNAME             : chr  "David" "Hank" "Tommie" "Don" ...
 $ BIRTHDATE            : chr  "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
 $ Alt_BIRTHDATE        : chr  "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
 $ BIRTH.CITY           : chr  "Denver" "Mobile" "Mobile" "Orange" ...
 $ BIRTH.STATE          : chr  "Colorado" "Alabama" "Alabama" "California" ...
 $ BIRTH.COUNTRY        : chr  "USA" "USA" "USA" "USA" ...
 $ PLAY.DEBUT           : chr  "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
 $ Alt_DEBUTDATE        : chr  "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
 $ DEBUT_AGE            : chr  "22" "20" "22" "22" ...
 $ PLAY.LASTGAME        : chr  "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
 $ Alt_LASTGAME         : chr  "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
 $ Complete_Years_Active: int  11 22 9 13 4 11 0 13 4 4 ...
 $ MGR.DEBUT            : chr  "" "" "" "" ...
 $ MGR.LASTGAME         : chr  "" "" "" "" ...
 $ COACH.DEBUT          : chr  "" "" "04/06/1979" "" ...
 $ COACH.LASTGAME       : chr  "" "" "09/30/1984" "" ...
 $ UMP.DEBUT            : chr  "" "" "" "" ...
 $ UMP.LASTGAME         : chr  "" "" "" "" ...
 $ DEATHDATE            : chr  "" "01/22/2021" "08/16/1984" "" ...
 $ DEATH.CITY           : chr  "" "Atlanta" "Atlanta" "" ...
 $ DEATH.STATE          : chr  "" "Georgia" "Georgia" "" ...
 $ DEATH.COUNTRY        : chr  "" "USA" "USA" "" ...
 $ BATS                 : chr  "R" "R" "R" "R" ...
 $ THROWS               : chr  "R" "R" "R" "R" ...
 $ HEIGHT               : chr  "6-05" "6-00" "6-03" "6-03" ...
 $ WEIGHT               : int  200 180 190 190 184 205 192 170 175 169 ...
 $ CEMETERY             : chr  "" "Southview Cemetery" "Catholic Cemetery" "" ...
 $ CEME.CITY            : chr  "" "Atlanta" "Mobile" "" ...
 $ CEME.STATE           : chr  "" "Georgia" "Alabama" "" ...
 $ CEME.COUNTRY         : chr  "" "USA" "USA" "" ...
 $ CEME.NOTE            : chr  "" "" "" "" ...
 $ BIRTH.NAME           : chr  "" "" "" "" ...
 $ NAME.CHG             : logi  NA NA NA NA NA NA ...
 $ BAT.CHG              : logi  NA NA NA NA NA NA ...
 $ HOF                  : chr  "NOT" "HOF" "NOT" "NOT" ...

Tidying the Data

DESCRIPTION NEEDS MORE DETAIL AT THIS TIME

I am tidying my data to:

-change the date fields from chr to Date (parsed with mdy)
-limit the selection to the 4 years prior to WW2, the 4 years of WW2 (US involvement), and the 4 years after WW2
-add columns for Debut Age, Career Length, and a Debut Category for PreWW2, WW2, and PostWW2
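As a minimal sketch of the first step, lubridate's `mdy()` turns month/day/year strings into `Date` values, and blank strings (such as an empty `DEATHDATE`) become `NA`. The `toy_dates` vector below is made up for illustration and is not read from the file.

```r
library(lubridate)

# Hypothetical sample values in the same month/day/year layout as the file
toy_dates <- c("04/13/1954", "", "08/16/1984")

# quiet = TRUE suppresses the warning about strings that fail to parse
parsed <- mdy(toy_dates, quiet = TRUE)

class(parsed)   # the result is a Date vector
is.na(parsed)   # the blank string is parsed as NA
year(parsed)    # year component, NA where the date is missing
```

Knowing that blanks turn into `NA` matters later, because any age or career length computed from a missing date will also be `NA`.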

# Change the date fields from character (chr) to Date using mdy()

biographic_baseball$PLAY.DEBUT <- mdy(biographic_baseball$PLAY.DEBUT)

biographic_baseball$BIRTHDATE <- mdy(biographic_baseball$BIRTHDATE)

biographic_baseball$PLAY.LASTGAME <- mdy(biographic_baseball$PLAY.LASTGAME)

biographic_baseball$MGR.DEBUT <- mdy(biographic_baseball$MGR.DEBUT)

biographic_baseball$MGR.LASTGAME <- mdy(biographic_baseball$MGR.LASTGAME)

biographic_baseball$DEATHDATE <- mdy(biographic_baseball$DEATHDATE)

# Here I am limiting the selection to the 4 years prior to WW2, the 4 years of WW2 (US involvement), and the 4 years after WW2

WW2_biographic_baseball <- select(biographic_baseball, PLAYERID, LAST, FIRST, 
                                    NICKNAME, BIRTHDATE, BIRTH.CITY, 
                                    BIRTH.STATE, BIRTH.COUNTRY, PLAY.DEBUT,
                                    PLAY.LASTGAME, MGR.DEBUT,MGR.LASTGAME, 
                                    DEATH.COUNTRY,BATS, THROWS,HOF) %>%
  
  dplyr::filter(grepl('1938|1939|1940|1941|1942|1943|1944|1945|1946|1947|1948|1949', PLAY.DEBUT))

# I then add columns for Debut Age, Career Length, and a Debut Category for PreWW2, WW2, and PostWW2

WW2_bio_baseball.add.yrs.debutcat <-  WW2_biographic_baseball %>% mutate(
                                  DEBUT_AGE = year(as.period(interval
                                        (start = BIRTHDATE, end = PLAY.DEBUT))),
                                  CAREER_LENGTH = year(as.period(interval
                                        (start = PLAY.DEBUT, 
                                          end = PLAY.LASTGAME))),
                                  DEBUT_CATEGORY = case_when(
                  grepl('1938|1939|1940|1941',PLAY.DEBUT) ~ "1938 - 1941 PreWW2",
                  grepl('1942|1943|1944|1945', PLAY.DEBUT) ~ "1942 - 1945 WW2",
                  grepl('1946|1947|1948|1949',PLAY.DEBUT) ~ "1946 - 1949 PostWW2"),
                                  DEBUT_YEAR = case_when(
                  grepl('1938', PLAY.DEBUT) ~ "1938", 
                  grepl('1939', PLAY.DEBUT) ~ "1939",
                  grepl('1940', PLAY.DEBUT) ~ "1940",
                  grepl('1941', PLAY.DEBUT) ~ "1941",
                  grepl('1942', PLAY.DEBUT) ~ "1942",  
                  grepl('1943', PLAY.DEBUT) ~ "1943",
                  grepl('1944', PLAY.DEBUT) ~ "1944",
                  grepl('1945', PLAY.DEBUT) ~ "1945",
                  grepl('1946', PLAY.DEBUT) ~ "1946",
                  grepl('1947', PLAY.DEBUT) ~ "1947",
                  grepl('1948', PLAY.DEBUT) ~ "1948",
                  grepl('1949', PLAY.DEBUT) ~ "1949"),
                                  DEBUT_Season_Per_Category = case_when(
                  grepl('1938|1942|1946', PLAY.DEBUT) ~ "1", 
                  grepl('1939|1943|1947', PLAY.DEBUT) ~ "2",
                  grepl('1940|1944|1948', PLAY.DEBUT) ~ "3",
                  grepl('1941|1945|1949', PLAY.DEBUT) ~ "4"))


str(WW2_bio_baseball.add.yrs.debutcat)
'data.frame':   1404 obs. of  21 variables:
 $ PLAYERID                 : chr  "aberc101" "abert102" "aberw101" "abrac101" ...
 $ LAST                     : chr  "Aberson" "Abernathy" "Abernathy" "Abrams" ...
 $ FIRST                    : chr  "Clifford Alexander" "Talmadge Lafayette" "Virgil Woodrow" "Calvin Ross" ...
 $ NICKNAME                 : chr  "Cliff" "Ted" "Woody" "Cal" ...
 $ BIRTHDATE                : Date, format: "1921-08-28" ...
 $ BIRTH.CITY               : chr  "Chicago" "Bynum" "Forest City" "Philadelphia" ...
 $ BIRTH.STATE              : chr  "Illinois" "North Carolina" "North Carolina" "Pennsylvania" ...
 $ BIRTH.COUNTRY            : chr  "USA" "USA" "USA" "USA" ...
 $ PLAY.DEBUT               : Date, format: "1947-07-18" ...
 $ PLAY.LASTGAME            : Date, format: "1949-05-09" ...
 $ MGR.DEBUT                : Date, format: NA ...
 $ MGR.LASTGAME             : Date, format: NA ...
 $ DEATH.COUNTRY            : chr  "USA" "USA" "USA" "USA" ...
 $ BATS                     : chr  "R" "R" "L" "L" ...
 $ THROWS                   : chr  "R" "L" "L" "L" ...
 $ HOF                      : chr  "NOT" "NOT" "NOT" "NOT" ...
 $ DEBUT_AGE                : num  25 20 31 25 28 31 23 24 27 20 ...
 $ CAREER_LENGTH            : num  1 1 0 7 0 5 8 13 0 1 ...
 $ DEBUT_CATEGORY           : chr  "1946 - 1949 PostWW2" "1942 - 1945 WW2" "1946 - 1949 PostWW2" "1946 - 1949 PostWW2" ...
 $ DEBUT_YEAR               : chr  "1947" "1942" "1946" "1949" ...
 $ DEBUT_Season_Per_Category: chr  "2" "1" "1" "4" ...
head(WW2_bio_baseball.add.yrs.debutcat)
  PLAYERID      LAST              FIRST NICKNAME  BIRTHDATE
1 aberc101   Aberson Clifford Alexander    Cliff 1921-08-28
2 abert102 Abernathy Talmadge Lafayette      Ted 1921-10-30
3 aberw101 Abernathy     Virgil Woodrow    Woody 1915-02-01
4 abrac101    Abrams        Calvin Ross      Cal 1924-03-02
5 abrej101     Abreu    Joseph Lawrence      Joe 1913-05-24
6 adama101     Adams       Ace Townsend      Ace 1910-03-02
    BIRTH.CITY    BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME
1      Chicago       Illinois           USA 1947-07-18    1949-05-09
2        Bynum North Carolina           USA 1942-09-19    1944-04-29
3  Forest City North Carolina           USA 1946-07-28    1947-04-17
4 Philadelphia   Pennsylvania           USA 1949-04-19    1956-05-09
5      Oakland     California           USA 1942-04-23    1942-07-11
6      Willows     California           USA 1941-04-15    1946-04-24
  MGR.DEBUT MGR.LASTGAME DEATH.COUNTRY BATS THROWS HOF DEBUT_AGE
1      <NA>         <NA>           USA    R      R NOT        25
2      <NA>         <NA>           USA    R      L NOT        20
3      <NA>         <NA>           USA    L      L NOT        31
4      <NA>         <NA>           USA    L      L NOT        25
5      <NA>         <NA>           USA    R      R NOT        28
6      <NA>         <NA>           USA    R      R NOT        31
  CAREER_LENGTH      DEBUT_CATEGORY DEBUT_YEAR
1             1 1946 - 1949 PostWW2       1947
2             1     1942 - 1945 WW2       1942
3             0 1946 - 1949 PostWW2       1946
4             7 1946 - 1949 PostWW2       1949
5             0     1942 - 1945 WW2       1942
6             5  1938 - 1941 PreWW2       1941
  DEBUT_Season_Per_Category
1                         2
2                         1
3                         1
4                         4
5                         1
6                         4
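
The year-based columns can also be derived from the numeric debut year rather than by pattern matching on the date text. The sketch below is one possible alternative, not the approach used above. `toy_players` and `toy_derived` are hypothetical names, and the three rows reuse dates shown in the `head()` output. Note that here `DEBUT_YEAR` and `DEBUT_Season_Per_Category` come out numeric rather than chr.

```r
library(dplyr)
library(lubridate)

# Hypothetical three-row sample using the column names from this post
toy_players <- data.frame(
  PLAYERID      = c("abert102", "adama101", "abrac101"),
  BIRTHDATE     = ymd(c("1921-10-30", "1910-03-02", "1924-03-02")),
  PLAY.DEBUT    = ymd(c("1942-09-19", "1941-04-15", "1949-04-19")),
  PLAY.LASTGAME = ymd(c("1944-04-29", "1946-04-24", "1956-05-09"))
)

toy_derived <- toy_players %>%
  mutate(
    DEBUT_YEAR = year(PLAY.DEBUT),
    # Whole years elapsed between the two dates
    DEBUT_AGE = interval(BIRTHDATE, PLAY.DEBUT) %/% years(1),
    CAREER_LENGTH = interval(PLAY.DEBUT, PLAY.LASTGAME) %/% years(1),
    DEBUT_CATEGORY = case_when(
      between(DEBUT_YEAR, 1938, 1941) ~ "1938 - 1941 PreWW2",
      between(DEBUT_YEAR, 1942, 1945) ~ "1942 - 1945 WW2",
      between(DEBUT_YEAR, 1946, 1949) ~ "1946 - 1949 PostWW2"
    ),
    # Position of the season within its four-year period (1 to 4)
    DEBUT_Season_Per_Category = (DEBUT_YEAR - 1938) %% 4 + 1
  )

toy_derived
```

Working from the numeric year keeps the period boundaries in one place, so changing the window means editing three ranges instead of several pattern strings.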

Visualization

Bar Graph of Total Player Debuts per Year?

If I am able to, I’d like to show the total number of player debuts per year.
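
A minimal sketch of how such a bar graph could be built, assuming a chr `DEBUT_YEAR` column like the one created during tidying. The `toy_debuts` data frame is made up so the example runs on its own; its counts are not results from the dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical debut years standing in for the DEBUT_YEAR column
toy_debuts <- data.frame(
  DEBUT_YEAR = c("1941", "1942", "1942", "1943", "1943", "1943", "1946")
)

# One row per year, with the number of debuts in column n
debuts_per_year <- toy_debuts %>% count(DEBUT_YEAR)

debut_plot <- ggplot(debuts_per_year, aes(x = DEBUT_YEAR, y = n)) +
  geom_col() +
  labs(title = "MLB Debuts per Season", x = "Season", y = "Players Debuting")

debut_plot
```

Replacing `toy_debuts` with the tidied data frame should give one bar per season from 1938 to 1949.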

Calculating the Mean of Debut Age and Years Active for Each Period

For each period, the mean debut age and mean years active were calculated.

Mean debut age was about one year higher for wartime debuts (24.9) than for pre-WW2 debuts (23.8), and nearly the same as for post-WW2 debuts (24.8).

Mean years active was lowest for wartime debuts (3.26), compared with 5.54 for pre-WW2 debuts and 4.83 for post-WW2 debuts.

#Calculating the Mean of Debut Age and Career Length

WW2_bio_baseball.add.yrs.debutcat %>%
  group_by(DEBUT_CATEGORY) %>%
  summarize(MEAN_DEBUT_AGE = mean(DEBUT_AGE), 
            MEAN_CAREER_LENGTH = mean(CAREER_LENGTH), 
            SD_DEBUT_AGE = sd(DEBUT_AGE), SD_CAREER_LENGTH = sd(CAREER_LENGTH))
# A tibble: 3 x 5
  DEBUT_CATEGORY      MEAN_DEBUT_AGE MEAN_CAREER_LENGTH SD_DEBUT_AGE
  <chr>                        <dbl>              <dbl>        <dbl>
1 1938 - 1941 PreWW2            23.8               5.54         2.91
2 1942 - 1945 WW2               24.9               3.26         3.97
3 1946 - 1949 PostWW2           24.8               4.83         3.30
# ... with 1 more variable: SD_CAREER_LENGTH <dbl>

Line Graph of Median

For each period, the median debut age and median years active were calculated.

Median debut age was 2 years higher in the wartime years than in the pre-WW2 period, with no change between the wartime period and the post-WW2 period.

Median years active was higher in the pre-WW2 and post-WW2 periods than in wartime.
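
A sketch of how the per-period medians could be computed, mirroring the mean calculation above. The `toy_summary_input` values are invented so the block runs on its own and should not be read as results. `na.rm = TRUE` is an assumption on my part to guard against missing dates.

```r
library(dplyr)

# Hypothetical input with the same column names as the tidied data
toy_summary_input <- data.frame(
  DEBUT_CATEGORY = c(rep("1938 - 1941 PreWW2", 3), rep("1942 - 1945 WW2", 3)),
  DEBUT_AGE      = c(22, 23, 27, 20, 25, 28),
  CAREER_LENGTH  = c(5, 7, 1, 0, 1, 4)
)

median_by_period <- toy_summary_input %>%
  group_by(DEBUT_CATEGORY) %>%
  summarize(
    MEDIAN_DEBUT_AGE     = median(DEBUT_AGE, na.rm = TRUE),
    MEDIAN_CAREER_LENGTH = median(CAREER_LENGTH, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped result
  )

median_by_period
```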

#Calculating the Median of Debut Age and Career Length and Visualization

WW2_bio_baseball.add.yrs.debutcat %>%
  group_by(DEBUT_CATEGORY, DEBUT_YEAR) %>%
  summarize(MEDIAN_CAREER_LENGTH = median(CAREER_LENGTH)) %>%
  ggplot(aes(`DEBUT_YEAR`, `MEDIAN_CAREER_LENGTH`, 
             group = `DEBUT_CATEGORY`, color = `DEBUT_CATEGORY`)) +
  geom_line() +
  labs(title = "Median of Career Length", y = "Median Career Length", x = "Season")

Plotting the Age of Each Player by the Debut Date

I created a scatterplot to show the age of each player by the debut date. I have highlighted the HOF players in red while also highlighting some interesting observations of specific players.

Some of the significant players were Ted Williams (blue), Joe Nuxhall (yellow), and Satchel Paige (green). Each brings an interesting perspective on my research question.

Ted Williams debuted in 1939, before US involvement in WW2, served in WW2, and became a Hall of Famer.

Joe Nuxhall is the youngest player ever to debut in the MLB, at 15 years old. He did so during the WW2 period, in 1944.

Satchel Paige is the oldest player ever to debut in the MLB, at age 42. He did so after WW2, in 1948. Paige is a Hall of Famer. He brings an interesting wrinkle into this study, as he is one of the best representations of the effect of race and racism during this period of Major League Baseball. His debut came after a long career in the Negro Leagues. He entered the MLB after the groundbreaking 1947 debut of 28-year-old Jackie Robinson, a former Negro Leagues player.

Another observation was that the 1943 season appears to have started late, which shows up as a gap in the scatter plot. According to baseball-almanac.com, the first game was in August, whereas the baseball season typically starts in April.

The clusters show that debut ages did tend to be older during the wartime seasons of 1942 - 1945. Additionally, only 3 HOF players (in red) debuted during these wartime seasons.

# I created variables to highlight significant players within my scatterplot and highlight HOF players

Ted_Williams <- WW2_bio_baseball.add.yrs.debutcat %>% 
             filter(PLAYERID == "willt103")

Satchel_Paige <- WW2_bio_baseball.add.yrs.debutcat %>% 
             filter(PLAYERID == "paigs101")

Joe_Nuxhall <- WW2_bio_baseball.add.yrs.debutcat %>% 
             filter(PLAYERID == "nuxhj101")

HOF_PLAYERS <- WW2_bio_baseball.add.yrs.debutcat %>% 
             filter(HOF == "HOF")

# Scatter plot of Debut Age and Debut Date for each player

WW2_bio_baseball.add.yrs.debutcat %>% 
  ggplot(aes(x=PLAY.DEBUT,y=DEBUT_AGE)) + 
  xlab("Date of MLB Debut") +
  ylab("Age") +
  ggtitle("MLB Players Age by Debut Date") +
  geom_point(alpha=0.3) +
  geom_point(data=HOF_PLAYERS, 
             aes(x=PLAY.DEBUT,y=DEBUT_AGE, col = 'HOF Players'),
             size=3) +  
  geom_point(data=Ted_Williams, 
             aes(x=PLAY.DEBUT,y=DEBUT_AGE, col = 'Ted Williams, HOF'),
             size=3) +
  geom_point(data=Satchel_Paige, 
             aes(x=PLAY.DEBUT,y=DEBUT_AGE, col = "Satchel Paige, HOF"),
             size=3) +
  geom_point(data=Joe_Nuxhall, 
             aes(x=PLAY.DEBUT,y=DEBUT_AGE, col = "Joe Nuxhall"),
             size=3)
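
When colors are mapped through `aes()` using label strings, ggplot2 assigns them from its default palette unless told otherwise. The sketch below shows one way to pin each label to a named color with `scale_color_manual()`. `toy_points` and `highlight_colors` are hypothetical names, and the points are illustrative rather than drawn from the dataset.

```r
library(ggplot2)

# Hypothetical points; debut dates and ages are illustrative only
toy_points <- data.frame(
  PLAY.DEBUT = as.Date(c("1939-04-20", "1944-06-10", "1948-07-09", "1941-09-17")),
  DEBUT_AGE  = c(20, 15, 42, 20),
  LABEL      = c("Ted Williams, HOF", "Joe Nuxhall", "Satchel Paige, HOF", "HOF Players")
)

# Named vector: legend label -> color
highlight_colors <- c(
  "HOF Players"        = "red",
  "Ted Williams, HOF"  = "blue",
  "Joe Nuxhall"        = "yellow",
  "Satchel Paige, HOF" = "green"
)

highlight_plot <- ggplot(toy_points, aes(x = PLAY.DEBUT, y = DEBUT_AGE, color = LABEL)) +
  geom_point(size = 3) +
  scale_color_manual(values = highlight_colors, name = NULL) +
  labs(title = "MLB Players Age by Debut Date", x = "Date of MLB Debut", y = "Age")

highlight_plot
```

The same `scale_color_manual()` call could be appended to the scatter plot above, since its `col` labels match the names in `highlight_colors`.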

Reflection

-Describe the process, including decisions made, what was most challenging, and what I wish I had known

-Next steps if I were to continue

Conclusion

-explain conclusions I can draw from my work and what questions remain unanswered

Bibliography

-retrosheet
-R
-Course Textbook
-insights on the specific players

This is the end of the document.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Martins (2022, Jan. 20). Data Analytics and Computational Social Science: Homework 6 Working Draft of Effects of War on MLB. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8856953/

BibTeX citation

@misc{martins2022homework,
  author = {Martins, K},
  title = {Data Analytics and Computational Social Science: Homework 6 Working Draft of Effects of War on MLB},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8856953/},
  year = {2022}
}