Homework 3 Identify Dataset, Read in, Identify Potential Questions

In HW3 I identify my dataset, read in the data, and state my research question

K Martins
2022-01-05

Introduction

This is K Martins’s submission for Homework Three in DACSS 601. In it I will read in a baseball player biographical dataset from retrosheet.org.

Then I will explain the variables and select only the information necessary for my potential questions.

For my final project I want to evaluate the effect of war on baseball. (This is NOT the darling baseball statistic “WAR,” which stands for wins above replacement.) Baseball is relatively unusual in that it played through both world wars, only having to shorten the 1918 season (which the Boston Red Sox won, with their next title coming, of course, 86 years later in 2004). Using biographical data such as birth date, birth city, debut date, and last game date, I look to evaluate how the world wars affected the types of players, how old they were, and how long they lasted in the league. Were these replacement players? Did they still make the team in the years after the war ended? Where were these players from? To answer these questions I will look at debut dates straddling US involvement in World War 1 (1917 - 1918) and World War 2 (1941 - 1945).

Kevin’s Set-Ups

# Setting my working directory on PC to my R folder
my_dir <- "C:/Users/MH821/Documents/R"
setwd(my_dir)

library(tidyverse)
library(dplyr)

Read In Baseball Biographical File and Explain the Variables

Here I read in the baseball biographical file from retrosheet.org, which I saved in my working directory. The file is a relatively clean CSV dataset. The variables are listed below.

# Read in Baseball Biographical File dataset
biographic_baseball <- read.csv("BIOFILE_RETROSHEET.txt", header = TRUE, sep = ",")

head(biographic_baseball)
  PLAYERID    LAST            FIRST NICKNAME  BIRTHDATE BIRTH.CITY
1 aardd001 Aardsma      David Allan    David 12/27/1981     Denver
2 aaroh101   Aaron      Henry Louis     Hank 02/05/1934     Mobile
3 aarot101   Aaron       Tommie Lee   Tommie 08/05/1939     Mobile
4 aased001    Aase   Donald William      Don 09/08/1954     Orange
5 abada001    Abad    Fausto Andres     Andy 08/25/1972 Palm Beach
6 abadf001    Abad Fernando Antonio Fernando 12/17/1985  La Romana
  BIRTH.STATE      BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME MGR.DEBUT
1    Colorado                USA 04/06/2004    08/23/2015          
2     Alabama                USA 04/13/1954    10/03/1976          
3     Alabama                USA 04/10/1962    09/26/1971          
4  California                USA 07/26/1977    10/03/1990          
5     Florida                USA 09/10/2001    04/13/2006          
6   La Romana Dominican Republic 07/28/2010    10/01/2021          
  MGR.LASTGAME COACH.DEBUT COACH.LASTGAME UMP.DEBUT UMP.LASTGAME
1                                                               
2                                                               
3               04/06/1979     09/30/1984                       
4                                                               
5                                                               
6                                                               
   DEATHDATE DEATH.CITY DEATH.STATE DEATH.COUNTRY BATS THROWS HEIGHT
1                                                    R      R   6-05
2 01/22/2021    Atlanta     Georgia           USA    R      R   6-00
3 08/16/1984    Atlanta     Georgia           USA    R      R   6-03
4                                                    R      R   6-03
5                                                    L      L   6-01
6                                                    L      L   6-02
  WEIGHT           CEMETERY CEME.CITY CEME.STATE CEME.COUNTRY
1    200                                                     
2    180 Southview Cemetery   Atlanta    Georgia          USA
3    190  Catholic Cemetery    Mobile    Alabama          USA
4    190                                                     
5    184                                                     
6    205                                                     
  CEME.NOTE BIRTH.NAME NAME.CHG BAT.CHG HOF
1                            NA      NA NOT
2                            NA      NA HOF
3                            NA      NA NOT
4                            NA      NA NOT
5                            NA      NA NOT
6                            NA      NA NOT
str(biographic_baseball)
'data.frame':   21574 obs. of  33 variables:
 $ PLAYERID      : chr  "aardd001" "aaroh101" "aarot101" "aased001" ...
 $ LAST          : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
 $ FIRST         : chr  "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
 $ NICKNAME      : chr  "David" "Hank" "Tommie" "Don" ...
 $ BIRTHDATE     : chr  "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
 $ BIRTH.CITY    : chr  "Denver" "Mobile" "Mobile" "Orange" ...
 $ BIRTH.STATE   : chr  "Colorado" "Alabama" "Alabama" "California" ...
 $ BIRTH.COUNTRY : chr  "USA" "USA" "USA" "USA" ...
 $ PLAY.DEBUT    : chr  "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
 $ PLAY.LASTGAME : chr  "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
 $ MGR.DEBUT     : chr  "" "" "" "" ...
 $ MGR.LASTGAME  : chr  "" "" "" "" ...
 $ COACH.DEBUT   : chr  "" "" "04/06/1979" "" ...
 $ COACH.LASTGAME: chr  "" "" "09/30/1984" "" ...
 $ UMP.DEBUT     : chr  "" "" "" "" ...
 $ UMP.LASTGAME  : chr  "" "" "" "" ...
 $ DEATHDATE     : chr  "" "01/22/2021" "08/16/1984" "" ...
 $ DEATH.CITY    : chr  "" "Atlanta" "Atlanta" "" ...
 $ DEATH.STATE   : chr  "" "Georgia" "Georgia" "" ...
 $ DEATH.COUNTRY : chr  "" "USA" "USA" "" ...
 $ BATS          : chr  "R" "R" "R" "R" ...
 $ THROWS        : chr  "R" "R" "R" "R" ...
 $ HEIGHT        : chr  "6-05" "6-00" "6-03" "6-03" ...
 $ WEIGHT        : int  200 180 190 190 184 205 192 170 175 169 ...
 $ CEMETERY      : chr  "" "Southview Cemetery" "Catholic Cemetery" "" ...
 $ CEME.CITY     : chr  "" "Atlanta" "Mobile" "" ...
 $ CEME.STATE    : chr  "" "Georgia" "Alabama" "" ...
 $ CEME.COUNTRY  : chr  "" "USA" "USA" "" ...
 $ CEME.NOTE     : chr  "" "" "" "" ...
 $ BIRTH.NAME    : chr  "" "" "" "" ...
 $ NAME.CHG      : logi  NA NA NA NA NA NA ...
 $ BAT.CHG       : logi  NA NA NA NA NA NA ...
 $ HOF           : chr  "NOT" "HOF" "NOT" "NOT" ...
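
The str() output above shows that missing dates come in as empty strings ("") rather than NA. As a minimal sketch, assuming that behavior is not wanted, the na.strings argument of read.csv can convert empty fields to NA at read time. The inline csv_text and the bio_toy name are made up for illustration; the three rows are copied from the output above.

```r
# Toy CSV text with three columns of the biographical file
# (values copied from the head() output above; csv_text is hypothetical)
csv_text <- "PLAYERID,PLAY.DEBUT,DEATHDATE
aardd001,04/06/2004,
aaroh101,04/13/1954,01/22/2021
aarot101,04/10/1962,08/16/1984"

# na.strings = "" reads empty fields as NA instead of empty strings
bio_toy <- read.csv(text = csv_text, header = TRUE, sep = ",", na.strings = "")

# Count players with no recorded death date
n_missing_death <- sum(is.na(bio_toy$DEATHDATE))
print(n_missing_death)
```

With NA in place of "", functions such as is.na() and as.Date() treat the missing entries consistently.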
# I selected only the variables necessary for my questions.

biographic_baseball_small <- select(biographic_baseball, PLAYERID, LAST, FIRST, NICKNAME, BIRTHDATE, BIRTH.CITY, BIRTH.STATE, BIRTH.COUNTRY, PLAY.DEBUT, PLAY.LASTGAME)
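
The selected date columns are still character strings, so questions about how old players were and how long they lasted need real dates first. The following is a rough sketch of one way to do that, not necessarily how the final project will do it. It uses two rows copied from the head() output above; the players data frame and the debut_age_years and career_years columns are hypothetical names, and dividing by 365.25 is only an approximation of a year.

```r
# Two rows copied from the head() output above (toy data frame)
players <- data.frame(
  PLAYERID      = c("aardd001", "aaroh101"),
  BIRTHDATE     = c("12/27/1981", "02/05/1934"),
  PLAY.DEBUT    = c("04/06/2004", "04/13/1954"),
  PLAY.LASTGAME = c("08/23/2015", "10/03/1976"),
  stringsAsFactors = FALSE
)

# Convert the MM/DD/YYYY strings to Date objects
date_cols <- c("BIRTHDATE", "PLAY.DEBUT", "PLAY.LASTGAME")
players[date_cols] <- lapply(players[date_cols], as.Date, format = "%m/%d/%Y")

# Subtracting two Dates gives a difference in days; divide by 365.25
# for an approximate number of years (hypothetical column names)
players$debut_age_years <- as.numeric(players$PLAY.DEBUT - players$BIRTHDATE) / 365.25
players$career_years    <- as.numeric(players$PLAY.LASTGAME - players$PLAY.DEBUT) / 365.25

print(players)
```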

Filter for Wartime Debut Dates

In this data-wrangling step I filter for players who debuted just before, during, or just after each World War (1915 - 1919 and 1940 - 1947). I do this by applying grepl to the PLAY.DEBUT column; grepl returns a logical vector indicating which elements of a character vector contain a match for the pattern.

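# Filter for wartime debut dates, as well as the seasons just before and after

# Keep the pattern on a single line: a line break inside the quotes would
# become part of the regular expression and keep 1940 and 1947 from matching
debut_years <- '1915|1916|1917|1918|1919|1940|1941|1942|1943|1944|1945|1946|1947'

ww_biogrpahic_baseball <- dplyr::filter(biographic_baseball_small,
                                        grepl(debut_years, PLAY.DEBUT))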


head(ww_biogrpahic_baseball)
  PLAYERID      LAST              FIRST NICKNAME  BIRTHDATE
1 abert102 Abernathy Talmadge Lafayette      Ted 10/30/1921
2 aberw101 Abernathy     Virgil Woodrow    Woody 02/01/1915
3 abrej101     Abreu    Joseph Lawrence      Joe 05/24/1913
4 adama101     Adams       Ace Townsend      Ace 03/02/1910
5 adamb103     Adams       Robert Henry    Bobby 12/14/1921
6 adamr101     Adams     Charles Dwight      Red 10/07/1921
   BIRTH.CITY    BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME
1       Bynum North Carolina           USA 09/19/1942    04/29/1944
2 Forest City North Carolina           USA 07/28/1946    04/17/1947
3     Oakland     California           USA 04/23/1942    07/11/1942
4     Willows     California           USA 04/15/1941    04/24/1946
5    Tuolumne     California           USA 04/16/1946    04/22/1959
6     Parlier     California           USA 05/05/1946    07/02/1946
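
The filter above matches year strings inside the full date text. An alternative, sketched here under the assumption that every PLAY.DEBUT value follows the MM/DD/YYYY layout seen in the output, is to pull out the year as a number and compare it against a vector of years. The debuts data frame, the war_years vector, and the debut_year column are hypothetical names; the rows are copied from the output above.

```r
library(dplyr)

# Toy rows copied from the output above
debuts <- data.frame(
  PLAYERID   = c("aaroh101", "adama101", "adamb103", "abert102"),
  PLAY.DEBUT = c("04/13/1954", "04/15/1941", "04/16/1946", "09/19/1942"),
  stringsAsFactors = FALSE
)

# Same years as the pattern used in the filter above
war_years <- c(1915:1919, 1940:1947)

wartime <- debuts %>%
  # Characters 7 to 10 of "MM/DD/YYYY" hold the four-digit year
  mutate(debut_year = as.integer(substr(PLAY.DEBUT, 7, 10))) %>%
  filter(debut_year %in% war_years)

print(wartime)
```

A numeric year column also makes it easy to group players by season later on.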

Arrange by Debut Date of the Player

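Here I use the dplyr function arrange to sort by debut date. Because PLAY.DEBUT is stored as a character string in MM/DD/YYYY format, the sort is alphabetical (month, then day, then year) rather than strictly chronological, as the output below shows.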

# Arrange by Debut Date Ascending
arrange_ww_biogrpahic_baseball <- arrange(ww_biogrpahic_baseball, PLAY.DEBUT)

head(arrange_ww_biogrpahic_baseball)
  PLAYERID      LAST            FIRST NICKNAME  BIRTHDATE
1 huhne101      Huhn        Emil Hugo     Emil 03/10/1892
2 uphab101     Upham William Lawrence     Bill 04/04/1888
3 roggc101     Rogge  Francis Clinton    Clint 07/19/1889
4 coucj101     Couch      John Daniel   Johnny 03/31/1891
5 joneb103     Jones    Robert Walter      Bob 12/02/1889
6 nichf101 Nicholson             Fred     Fred 09/01/1894
    BIRTH.CITY BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT PLAY.LASTGAME
1 North Vernon     Indiana           USA 04/10/1915    06/30/1917
2        Akron        Ohio           USA 04/10/1915    06/26/1918
3      Memphis    Michigan           USA 04/11/1915    06/06/1921
4       Vaughn     Montana           USA 04/11/1917    09/21/1925
5      Clayton  California           USA 04/11/1917    09/24/1925
6  Honey Grove       Texas           USA 04/11/1917    09/16/1922
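
In the output above, the April 1917 debuts appear directly after the April 1915 debuts, because the MM/DD/YYYY strings are compared as text. A minimal sketch of a chronological sort, assuming the same date layout, converts the column with as.Date inside arrange. The debuts, by_text, and by_date names are hypothetical; the rows are copied from the outputs above.

```r
library(dplyr)

# Toy rows copied from the outputs above
debuts <- data.frame(
  PLAYERID   = c("abert102", "adama101", "adamb103", "huhne101"),
  PLAY.DEBUT = c("09/19/1942", "04/15/1941", "04/16/1946", "04/10/1915"),
  stringsAsFactors = FALSE
)

# Text sort: compares month first, then day, then year
by_text <- arrange(debuts, PLAY.DEBUT)

# Date sort: converts to Date for the comparison, so the order is chronological
by_date <- arrange(debuts, as.Date(PLAY.DEBUT, format = "%m/%d/%Y"))

print(by_text$PLAY.DEBUT)
print(by_date$PLAY.DEBUT)
```

In the text sort the 1946 debut lands ahead of the 1942 debut; in the date sort the years run in order.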

This is the end of the document.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Martins (2022, Jan. 8). Data Analytics and Computational Social Science: Homework 3 Identify Dataset, Read in, Identify Potential Questions. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8852865/

BibTeX citation

@misc{martins2022homework,
  author = {Martins, K},
  title = {Data Analytics and Computational Social Science: Homework 3 Identify Dataset, Read in, Identify Potential Questions},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8852865/},
  year = {2022}
}