Homework 5 Read, Stats, Visualizations

This HW5 Defines Research Questions and Modifies Visualizations

K Martins
2022-01-18

Introduction

This is K Martins’s submission for Homework Five in DACSS 601. In this homework I read in a baseball player biographical dataset from retrosheet.org. I look at the 4 years prior to US involvement in World War 2 (WW2) (1938 - 1941), the 4 years of US involvement in WW2 (1942 - 1945), and the 4 years after WW2 (1946 - 1949) to determine the effect of the war on the quality of players who made their Major League Baseball (MLB) debut during the war.

I will assess the quality of the players who debuted during WW2 by looking at how old they were at their debut, how long they played in the MLB, and how many of them received the highest honor of the MLB, induction into the Hall of Fame (HOF).

Kevin’s Set-Ups

# Setting my working directory on PC to my R folder
my_dir <- "C:/Users/MH821/Documents/R"
setwd(my_dir)

library(tidyverse)
library(dplyr)
library(lubridate)
library(janitor)

Read In Baseball Biographical File and Explain the Variables

Here I read in the baseball biographical file from retrosheet.org that I saved within my set working directory.

# Read in Baseball Biographical File dataset
biographic_baseball <- read.csv("BIOFILE_RETROSHEET_kjmcleaner.csv",
                                header = TRUE, sep = ",")

str(biographic_baseball)
'data.frame':   21574 obs. of  38 variables:
 $ PLAYERID             : chr  "aardd001" "aaroh101" "aarot101" "aased001" ...
 $ LAST                 : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
 $ FIRST                : chr  "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
 $ NICKNAME             : chr  "David" "Hank" "Tommie" "Don" ...
 $ BIRTHDATE            : chr  "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
 $ Alt_BIRTHDATE        : chr  "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
 $ BIRTH.CITY           : chr  "Denver" "Mobile" "Mobile" "Orange" ...
 $ BIRTH.STATE          : chr  "Colorado" "Alabama" "Alabama" "California" ...
 $ BIRTH.COUNTRY        : chr  "USA" "USA" "USA" "USA" ...
 $ PLAY.DEBUT           : chr  "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
 $ Alt_DEBUTDATE        : chr  "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
 $ DEBUT_AGE            : chr  "22" "20" "22" "22" ...
 $ PLAY.LASTGAME        : chr  "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
 $ Alt_LASTGAME         : chr  "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
 $ Complete_Years_Active: int  11 22 9 13 4 11 0 13 4 4 ...
 $ MGR.DEBUT            : chr  "" "" "" "" ...
 $ MGR.LASTGAME         : chr  "" "" "" "" ...
 $ COACH.DEBUT          : chr  "" "" "04/06/1979" "" ...
 $ COACH.LASTGAME       : chr  "" "" "09/30/1984" "" ...
 $ UMP.DEBUT            : chr  "" "" "" "" ...
 $ UMP.LASTGAME         : chr  "" "" "" "" ...
 $ DEATHDATE            : chr  "" "01/22/2021" "08/16/1984" "" ...
 $ DEATH.CITY           : chr  "" "Atlanta" "Atlanta" "" ...
 $ DEATH.STATE          : chr  "" "Georgia" "Georgia" "" ...
 $ DEATH.COUNTRY        : chr  "" "USA" "USA" "" ...
 $ BATS                 : chr  "R" "R" "R" "R" ...
 $ THROWS               : chr  "R" "R" "R" "R" ...
 $ HEIGHT               : chr  "6-05" "6-00" "6-03" "6-03" ...
 $ WEIGHT               : int  200 180 190 190 184 205 192 170 175 169 ...
 $ CEMETERY             : chr  "" "Southview Cemetery" "Catholic Cemetery" "" ...
 $ CEME.CITY            : chr  "" "Atlanta" "Mobile" "" ...
 $ CEME.STATE           : chr  "" "Georgia" "Alabama" "" ...
 $ CEME.COUNTRY         : chr  "" "USA" "USA" "" ...
 $ CEME.NOTE            : chr  "" "" "" "" ...
 $ BIRTH.NAME           : chr  "" "" "" "" ...
 $ NAME.CHG             : logi  NA NA NA NA NA NA ...
 $ BAT.CHG              : logi  NA NA NA NA NA NA ...
 $ HOF                  : chr  "NOT" "HOF" "NOT" "NOT" ...

Tidying the Data

I am tidying my data to:

- change the date fields from chr to Date using mdy()
- limit the selection to the 4 years prior to WW2, the 4 years of WW2 (US involvement), and the 4 years after WW2
- add columns for Debut Age, Career Length, and a Debut Category of PreWW2, WW2, or PostWW2

# Convert the date fields from chr to Date with mdy()

biographic_baseball$PLAY.DEBUT <- mdy(biographic_baseball$PLAY.DEBUT)

biographic_baseball$BIRTHDATE <- mdy(biographic_baseball$BIRTHDATE)

biographic_baseball$PLAY.LASTGAME <- mdy(biographic_baseball$PLAY.LASTGAME)

biographic_baseball$MGR.DEBUT <- mdy(biographic_baseball$MGR.DEBUT)

biographic_baseball$MGR.LASTGAME <- mdy(biographic_baseball$MGR.LASTGAME)

biographic_baseball$DEATHDATE <- mdy(biographic_baseball$DEATHDATE)
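
The conversions above repeat the same call for each column. As a minimal sketch, the same step could be written once with `across()`. The data frame `toy` and the vector `date_cols` are illustrative names, not part of the homework; the two rows reuse values printed by `str()` earlier. Blank strings become `NA`, and `quiet = TRUE` only suppresses the parse warning for them.

```r
library(dplyr)
library(lubridate)

# Two rows in the same month/day/year text format as the biographical file
toy <- data.frame(
  PLAYERID   = c("aardd001", "aaroh101"),
  BIRTHDATE  = c("12/27/1981", "02/05/1934"),
  PLAY.DEBUT = c("04/06/2004", "04/13/1954"),
  DEATHDATE  = c("", "01/22/2021"),   # blank means no date recorded
  stringsAsFactors = FALSE
)

# Columns to convert (illustrative subset of the date fields)
date_cols <- c("BIRTHDATE", "PLAY.DEBUT", "DEATHDATE")

# Apply mdy() to every listed column in one step
toy_dates <- toy %>%
  mutate(across(all_of(date_cols), ~ mdy(.x, quiet = TRUE)))

str(toy_dates)
```

Listing the columns in one vector makes it easier to see which fields are converted and to add the remaining date fields later.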

# Here I am limiting the selection to the 4 years prior to WW2, the 4 years of WW2 (US involvement), and the 4 years after WW2

WW2_biographic_baseball <- select(biographic_baseball, PLAYERID, LAST, FIRST, 
                                    NICKNAME, BIRTHDATE, BIRTH.CITY, 
                                    BIRTH.STATE, BIRTH.COUNTRY, PLAY.DEBUT,
                                    PLAY.LASTGAME, MGR.DEBUT,MGR.LASTGAME, 
                                    DEATH.COUNTRY,BATS, THROWS,HOF) %>%
  
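  # Keep debuts from 1938 through 1949. The year pattern stays on one line:
  # a line break inside the quotes would become part of the pattern and
  # silently drop the 1949 debuts.
  dplyr::filter(grepl('1938|1939|1940|1941|1942|1943|1944|1945|1946|1947|1948|1949', PLAY.DEBUT))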
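
Selecting debut years can also be done numerically. The following is a minimal sketch on made-up dates (not rows from the file) using lubridate's `year()`. It also illustrates a pitfall of year patterns written as multi-line strings: the newline and indentation become part of the last alternative. The names `debut`, `broken_pattern`, and `in_window` are illustrative.

```r
library(lubridate)

# Made-up debut dates: one per period, plus one outside the 1938-1949 window
debut <- as.Date(c("1938-04-19", "1944-06-10", "1949-04-19", "1950-04-18"))

# A quoted pattern that continues on the next line keeps the newline and the
# leading spaces, so the last alternative is no longer just "1949"
broken_pattern <- "1948|
                   1949"
grepl(broken_pattern, debut)

# Comparing the numeric year avoids pattern matching altogether
in_window <- year(debut) >= 1938 & year(debut) <= 1949
in_window
```

In this sketch the broken pattern matches none of the four dates, while the numeric comparison keeps the first three.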

# I then add columns for Debut Age, Career Length, and a Debut Category (PreWW2, WW2, or PostWW2)

WW2_bio_baseball.add.yrs.debutcat <- WW2_biographic_baseball %>%
  mutate(
    # Completed years between birth and MLB debut
    DEBUT_AGE = year(as.period(interval(start = BIRTHDATE,
                                        end = PLAY.DEBUT))),
    # Completed years between MLB debut and last MLB game
    CAREER_LENGTH = year(as.period(interval(start = PLAY.DEBUT,
                                            end = PLAY.LASTGAME))),
    # Letter prefixes keep the periods in chronological order when sorted
    DEBUT_CATEGORY = case_when(
      grepl('1938|1939|1940|1941', PLAY.DEBUT) ~ "a.PreWW2",
      grepl('1942|1943|1944|1945', PLAY.DEBUT) ~ "b.WW2",
      grepl('1946|1947|1948|1949', PLAY.DEBUT) ~ "c.PostWW2"))


str(WW2_bio_baseball.add.yrs.debutcat)
'data.frame':   1312 obs. of  19 variables:
 $ PLAYERID      : chr  "aberc101" "abert102" "aberw101" "abrej101" ...
 $ LAST          : chr  "Aberson" "Abernathy" "Abernathy" "Abreu" ...
 $ FIRST         : chr  "Clifford Alexander" "Talmadge Lafayette" "Virgil Woodrow" "Joseph Lawrence" ...
 $ NICKNAME      : chr  "Cliff" "Ted" "Woody" "Joe" ...
 $ BIRTHDATE     : Date, format: "1921-08-28" ...
 $ BIRTH.CITY    : chr  "Chicago" "Bynum" "Forest City" "Oakland" ...
 $ BIRTH.STATE   : chr  "Illinois" "North Carolina" "North Carolina" "California" ...
 $ BIRTH.COUNTRY : chr  "USA" "USA" "USA" "USA" ...
 $ PLAY.DEBUT    : Date, format: "1947-07-18" ...
 $ PLAY.LASTGAME : Date, format: "1949-05-09" ...
 $ MGR.DEBUT     : Date, format: NA ...
 $ MGR.LASTGAME  : Date, format: NA ...
 $ DEATH.COUNTRY : chr  "USA" "USA" "USA" "USA" ...
 $ BATS          : chr  "R" "R" "L" "R" ...
 $ THROWS        : chr  "R" "L" "L" "R" ...
 $ HOF           : chr  "NOT" "NOT" "NOT" "NOT" ...
 $ DEBUT_AGE     : num  25 20 31 28 31 23 24 27 20 24 ...
 $ CAREER_LENGTH : num  1 1 0 0 5 8 13 0 1 0 ...
 $ DEBUT_CATEGORY: chr  "c.PostWW2" "b.WW2" "c.PostWW2" "b.WW2" ...
head(WW2_bio_baseball.add.yrs.debutcat)
  PLAYERID      LAST              FIRST NICKNAME  BIRTHDATE
1 aberc101   Aberson Clifford Alexander    Cliff 1921-08-28
2 abert102 Abernathy Talmadge Lafayette      Ted 1921-10-30
3 aberw101 Abernathy     Virgil Woodrow    Woody 1915-02-01
4 abrej101     Abreu    Joseph Lawrence      Joe 1913-05-24
5 adama101     Adams       Ace Townsend      Ace 1910-03-02
6 adamb101     Adams        Elvin Clark   Buster 1915-06-24
   BIRTH.CITY    BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME
1     Chicago       Illinois           USA 1947-07-18    1949-05-09
2       Bynum North Carolina           USA 1942-09-19    1944-04-29
3 Forest City North Carolina           USA 1946-07-28    1947-04-17
4     Oakland     California           USA 1942-04-23    1942-07-11
5     Willows     California           USA 1941-04-15    1946-04-24
6    Trinidad       Colorado           USA 1939-04-27    1947-09-21
  MGR.DEBUT MGR.LASTGAME DEATH.COUNTRY BATS THROWS HOF DEBUT_AGE
1      <NA>         <NA>           USA    R      R NOT        25
2      <NA>         <NA>           USA    R      L NOT        20
3      <NA>         <NA>           USA    L      L NOT        31
4      <NA>         <NA>           USA    R      R NOT        28
5      <NA>         <NA>           USA    R      R NOT        31
6      <NA>         <NA>           USA    R      R NOT        23
  CAREER_LENGTH DEBUT_CATEGORY
1             1      c.PostWW2
2             1          b.WW2
3             0      c.PostWW2
4             0          b.WW2
5             5       a.PreWW2
6             8       a.PreWW2
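
Debut age and career length are counts of completed years between two dates. Here is a minimal sketch of that calculation, written with lubridate's integer division of an interval by a one-year period (`%/% years(1)`) as an alternative to the `year(as.period(...))` form. The four players are the first four rows shown in the earlier `str()` output of the full file; `toy` and `toy_ages` are illustrative names.

```r
library(dplyr)
library(lubridate)

# First four rows of the biographical file, as printed by str() earlier
toy <- data.frame(
  PLAYERID      = c("aardd001", "aaroh101", "aarot101", "aased001"),
  BIRTHDATE     = mdy(c("12/27/1981", "02/05/1934", "08/05/1939", "09/08/1954")),
  PLAY.DEBUT    = mdy(c("04/06/2004", "04/13/1954", "04/10/1962", "07/26/1977")),
  PLAY.LASTGAME = mdy(c("08/23/2015", "10/03/1976", "09/26/1971", "10/03/1990"))
)

toy_ages <- toy %>%
  mutate(
    # Interval divided by a one-year period, rounded down = completed years
    DEBUT_AGE     = interval(BIRTHDATE, PLAY.DEBUT) %/% years(1),
    CAREER_LENGTH = interval(PLAY.DEBUT, PLAY.LASTGAME) %/% years(1)
  )

toy_ages[, c("PLAYERID", "DEBUT_AGE", "CAREER_LENGTH")]
```

For these four rows the results can be checked against the `DEBUT_AGE` (22, 20, 22, 22) and `Complete_Years_Active` (11, 22, 9, 13) columns of the source file.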

Calculating the Mean of Debut Age and Years Active for Each Period

For each period the mean debut age and mean years active were calculated.

Mean debut age was slightly higher for wartime debuts (24.9) than for pre-WW2 debuts (23.8) and nearly identical to post-WW2 debuts (24.8).

Mean years active was noticeably lower for wartime debuts (3.26) than for pre-WW2 (5.54) and post-WW2 (4.90) debuts.

#Calculating the Mean of Debut Age and Career Length

WW2_bio_baseball.add.yrs.debutcat %>%
  group_by(DEBUT_CATEGORY) %>%
  summarize(MEAN_DEBUT_AGE = mean(DEBUT_AGE), 
            MEAN_CAREER_LENGTH = mean(CAREER_LENGTH))
# A tibble: 3 x 3
  DEBUT_CATEGORY MEAN_DEBUT_AGE MEAN_CAREER_LENGTH
  <chr>                   <dbl>              <dbl>
1 a.PreWW2                 23.8               5.54
2 b.WW2                    24.9               3.26
3 c.PostWW2                24.8               4.90

Calculating the Median of Debut Age and Years Active for Each Period

For each period the median debut age and median years active were calculated.

Median debut age was 2 years higher for wartime debuts (25) than for pre-WW2 debuts (23), with no change between the wartime and post-WW2 periods (both 25).

Median years active was much lower for wartime debuts (1) than for pre-WW2 (5) and post-WW2 (4) debuts.

#Calculating the Median of Debut Age and Career Length

WW2_bio_baseball.add.yrs.debutcat %>%
  group_by(DEBUT_CATEGORY) %>%
  summarize(MEDIAN_DEBUT_AGE = median(DEBUT_AGE), 
            MEDIAN_CAREER_LENGTH = median(CAREER_LENGTH))
# A tibble: 3 x 3
  DEBUT_CATEGORY MEDIAN_DEBUT_AGE MEDIAN_CAREER_LENGTH
  <chr>                     <dbl>                <dbl>
1 a.PreWW2                     23                    5
2 b.WW2                        25                    1
3 c.PostWW2                    25                    4
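
The wartime median career length (1) sits well below the wartime mean (3.26). One possible reading is that the distribution is right-skewed: many very short careers and a few long ones. The sketch below uses made-up numbers, not values from the dataset, to show how a few long careers raise the mean while leaving the median unchanged; `career_years` is an illustrative name.

```r
# Hypothetical career lengths in years: mostly short, a few long
career_years <- c(0, 0, 0, 1, 1, 1, 2, 3, 8, 17)

mean_years   <- mean(career_years)    # pulled upward by the two long careers
median_years <- median(career_years)  # middle of the sorted values

c(mean = mean_years, median = median_years)
```

For this vector the mean is 3.3 and the median is 1.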

Calculating Standard Deviation of Debut Age and Years Active for Each Period

For each period the standard deviation of debut age and years active was calculated.

The standard deviation of debut age was higher for wartime debuts (3.97) than for pre-WW2 (2.91) and post-WW2 (3.38) debuts.

In contrast, the standard deviation of years active was lower for wartime debuts (4.44) than for pre-WW2 (5.24) and post-WW2 (4.84) debuts.

#Calculating the Standard Deviation of Debut Age and Career Length

WW2_bio_baseball.add.yrs.debutcat %>%
  group_by(DEBUT_CATEGORY) %>%
  summarize(SD_DEBUT_AGE = sd(DEBUT_AGE), 
            SD_CAREER_LENGTH = sd(CAREER_LENGTH))
# A tibble: 3 x 3
  DEBUT_CATEGORY SD_DEBUT_AGE SD_CAREER_LENGTH
  <chr>                 <dbl>            <dbl>
1 a.PreWW2               2.91             5.24
2 b.WW2                  3.97             4.44
3 c.PostWW2              3.38             4.84
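
The mean, median, and standard deviation tables can also be produced in a single call. This is a minimal sketch on made-up rows standing in for `WW2_bio_baseball.add.yrs.debutcat`, using dplyr's `across()` with a named list of functions; `toy` and `toy_summary` are illustrative names. It also reports the group size `N`, which helps when comparing periods with different numbers of debuts.

```r
library(dplyr)

# Made-up rows with the three columns the summaries need
toy <- data.frame(
  DEBUT_CATEGORY = c("a.PreWW2", "a.PreWW2", "b.WW2", "b.WW2",
                     "c.PostWW2", "c.PostWW2"),
  DEBUT_AGE      = c(22, 24, 20, 30, 25, 27),
  CAREER_LENGTH  = c(4, 8, 0, 2, 3, 7)
)

toy_summary <- toy %>%
  group_by(DEBUT_CATEGORY) %>%
  summarize(
    N = n(),
    # One output column per statistic and variable, e.g. MEAN_DEBUT_AGE
    across(c(DEBUT_AGE, CAREER_LENGTH),
           list(MEAN = mean, MEDIAN = median, SD = sd),
           .names = "{.fn}_{.col}")
  )

toy_summary
```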

Plotting the Age of Each Player by the Debut Date

I created a scatterplot to show the age of each player by the debut date. I have highlighted the HOF players in red while also highlighting some interesting observations of specific players.

Some of the significant players were Ted Williams (blue), Joe Nuxhall (yellow), and Satchel Paige (green). Each brings an interesting perspective to my research question.

Ted Williams debuted prior to WW2 US involvement in 1939, served in WW2 and became a Hall of Famer.

Joe Nuxhall is the youngest player ever to debut in the MLB, at 15 years old. He did so during the WW2 period, in 1944.

Satchel Paige is the oldest player ever to debut in the MLB, at age 42. He did so post-WW2, in 1948. Paige is a Hall of Famer. He adds an interesting wrinkle to this study, as he is one of the best representations of the effect of race and racism during this period of Major League Baseball. His debut came after a long career in the Negro Leagues. He entered the MLB after the groundbreaking 1947 debut of 28-year-old Jackie Robinson, himself a former Negro Leagues player.

Another observation was a gap in the scatterplot at the start of 1943, which suggests that the 1943 season started late. According to baseball-almanac.com, the first game was in August, whereas the baseball season typically starts in April.

Observations of the clusters show that debut ages did tend to be older during the wartime seasons of 1942 - 1945. Additionally, only 3 HOF players (in red) debuted during these wartime seasons.

# I created variables to highlight significant players within my scatterplot and highlight HOF players

Ted_Williams <- WW2_bio_baseball.add.yrs.debutcat %>% 
             filter(PLAYERID == "willt103")

Satchel_Paige <- WW2_bio_baseball.add.yrs.debutcat %>% 
             filter(PLAYERID == "paigs101")

Joe_Nuxhall <- WW2_bio_baseball.add.yrs.debutcat %>% 
             filter(PLAYERID == "nuxhj101")

HOF_PLAYERS <- WW2_bio_baseball.add.yrs.debutcat %>% 
             filter(HOF == "HOF")

#Scatter plot of Debut Age and Debut Date for each player

WW2_bio_baseball.add.yrs.debutcat %>% 
  ggplot(aes(x=PLAY.DEBUT,y=DEBUT_AGE)) + 
  xlab("Date of MLB Debut") +
  ylab("Age") +
  ggtitle("Age of MLB Debut by Date") +
  geom_point(alpha=0.3) +
  geom_point(data=HOF_PLAYERS, 
             aes(x=PLAY.DEBUT,y=DEBUT_AGE), 
             color='red',
             size=3) +  
  geom_point(data=Ted_Williams, 
             aes(x=PLAY.DEBUT,y=DEBUT_AGE), 
             color='blue',
             size=3) +
  geom_point(data=Satchel_Paige, 
             aes(x=PLAY.DEBUT,y=DEBUT_AGE), 
             color='green',
             size=3) +
  geom_point(data=Joe_Nuxhall, 
             aes(x=PLAY.DEBUT,y=DEBUT_AGE), 
             color='yellow',
             size=3)
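
The layered plot above colors the highlighted players but does not produce a legend. As a minimal sketch of one alternative, a label column can be mapped to color so that ggplot2 draws the legend itself. The rows below are made up for illustration, and `toy`, `toy_labelled`, `HIGHLIGHT`, and `debut_plot` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Made-up rows; the real data frame has one row per player
toy <- data.frame(
  PLAYERID   = c("p1", "p2", "p3", "p4"),
  PLAY.DEBUT = as.Date(c("1939-04-20", "1942-05-01", "1944-06-10", "1948-07-09")),
  DEBUT_AGE  = c(20, 27, 15, 42),
  HOF        = c("HOF", "NOT", "NOT", "HOF")
)

# One label per player, so color can be mapped and a legend drawn
toy_labelled <- toy %>%
  mutate(HIGHLIGHT = if_else(HOF == "HOF", "Hall of Fame", "Other"))

debut_plot <- ggplot(toy_labelled,
                     aes(x = PLAY.DEBUT, y = DEBUT_AGE, color = HIGHLIGHT)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Hall of Fame" = "red", "Other" = "grey40")) +
  labs(x = "Date of MLB Debut", y = "Age", color = NULL,
       title = "Age of MLB Debut by Date")

debut_plot
```

Individual players such as Ted Williams could be given their own label values in the same column, at the cost of a longer `scale_color_manual()` call.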

In Progress: Showing the % of HOF Players Among Players That Debuted

Here I attempt to show the percentage of HOF players out of the total number of players who debuted within the PreWW2, WW2, and PostWW2 categories. At this time this is still in progress and unfinished. I will attempt to display this in a barplot with facets.

# First attempt: bar chart of HOF vs. non-HOF player counts (not yet split by debut category)

ggplot(WW2_bio_baseball.add.yrs.debutcat) + 
  geom_bar(mapping = aes(x = HOF))
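
One way the percentage of HOF players per debut category might be computed is to count players and HOF players within each category and then divide. The sketch below uses made-up rows, not the real data, and `toy`, `hof_share`, `PLAYERS`, `N_HOF`, and `PCT_HOF` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Made-up rows: 4 pre-war, 5 wartime, and 2 post-war debuts
toy <- data.frame(
  DEBUT_CATEGORY = c(rep("a.PreWW2", 4), rep("b.WW2", 5), rep("c.PostWW2", 2)),
  HOF = c("HOF", "NOT", "NOT", "NOT",
          "NOT", "NOT", "NOT", "NOT", "NOT",
          "HOF", "NOT")
)

# Count all players and HOF players per category, then take the percentage
hof_share <- toy %>%
  group_by(DEBUT_CATEGORY) %>%
  summarize(PLAYERS = n(),
            N_HOF   = sum(HOF == "HOF"),
            PCT_HOF = 100 * N_HOF / PLAYERS)

hof_share

# geom_col() plots the precomputed percentage for each category
ggplot(hof_share, aes(x = DEBUT_CATEGORY, y = PCT_HOF)) +
  geom_col() +
  labs(x = "Debut period", y = "% of debuting players in the HOF")
```

Using `geom_col()` on a summary table, rather than `geom_bar()` on the raw rows, keeps the percentage calculation visible and easy to check.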

This is the end of the document.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Martins (2022, Jan. 20). Data Analytics and Computational Social Science: Homework 5 Read, Stats, Visualizations. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8856355/

BibTeX citation

@misc{martins2022homework,
  author = {Martins, K},
  title = {Data Analytics and Computational Social Science: Homework 5 Read, Stats, Visualizations},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8856355/},
  year = {2022}
}