Homework 4 Read, Stats, Visualizations

This HW4 reads in data, computes statistics, and creates visualizations

K Martins
2022-01-11

Introduction

This is K Martins’s submission for Homework Four in DACSS 601. In it I read in a baseball player biographical dataset from retrosheet.org. Note that, compared to the previous homeworks, I now look only at the 4 years prior to World War 2, the 4 years of World War 2 (US involvement), and the 4 years after World War 2.

Then I compute descriptive statistics: means, medians, and SDs. Primarily, I am looking at how debut ages, years of baseball play, and Hall of Fame status were affected (or not) by the Second World War. I also create 2 visualizations using the ggplot2 package and explain each visualization.

Kevin’s Set-Ups

# Setting my working directory on PC to my R folder
my_dir <- "C:/Users/MH821/Documents/R"
setwd(my_dir)

library(tidyverse)
library(dplyr)
library(lubridate)

Read In Baseball Biographical File and Explain the Variables

Here I read in the baseball biographical file from retrosheet.org that I saved within my set working directory.

# Read in Baseball Biographical File dataset
biographic_baseball <- read.csv("BIOFILE_RETROSHEET_kjmcleaner.csv",
                                header = TRUE, sep = ",")


head(biographic_baseball)
  PLAYERID    LAST            FIRST NICKNAME  BIRTHDATE Alt_BIRTHDATE
1 aardd001 Aardsma      David Allan    David 12/27/1981    12/27/1981
2 aaroh101   Aaron      Henry Louis     Hank 02/05/1934    02/05/1934
3 aarot101   Aaron       Tommie Lee   Tommie 08/05/1939    08/05/1939
4 aased001    Aase   Donald William      Don 09/08/1954    09/08/1954
5 abada001    Abad    Fausto Andres     Andy 08/25/1972    08/25/1972
6 abadf001    Abad Fernando Antonio Fernando 12/17/1985    12/17/1985
  BIRTH.CITY BIRTH.STATE      BIRTH.COUNTRY PLAY.DEBUT Alt_DEBUTDATE
1     Denver    Colorado                USA 04/06/2004    04/06/2004
2     Mobile     Alabama                USA 04/13/1954    04/13/1954
3     Mobile     Alabama                USA 04/10/1962    04/10/1962
4     Orange  California                USA 07/26/1977    07/26/1977
5 Palm Beach     Florida                USA 09/10/2001    09/10/2001
6  La Romana   La Romana Dominican Republic 07/28/2010    07/28/2010
  DEBUT_AGE PLAY.LASTGAME Alt_LASTGAME Complete_Years_Active
1        22    08/23/2015   08/23/2015                    11
2        20    10/03/1976   10/03/1976                    22
3        22    09/26/1971   09/26/1971                     9
4        22    10/03/1990   10/03/1990                    13
5        29    04/13/2006   04/13/2006                     4
6        24    10/01/2021   10/01/2021                    11
  MGR.DEBUT MGR.LASTGAME COACH.DEBUT COACH.LASTGAME UMP.DEBUT
1                                                            
2                                                            
3                         04/06/1979     09/30/1984          
4                                                            
5                                                            
6                                                            
  UMP.LASTGAME  DEATHDATE DEATH.CITY DEATH.STATE DEATH.COUNTRY BATS
1                                                                 R
2              01/22/2021    Atlanta     Georgia           USA    R
3              08/16/1984    Atlanta     Georgia           USA    R
4                                                                 R
5                                                                 L
6                                                                 L
  THROWS HEIGHT WEIGHT           CEMETERY CEME.CITY CEME.STATE
1      R   6-05    200                                        
2      R   6-00    180 Southview Cemetery   Atlanta    Georgia
3      R   6-03    190  Catholic Cemetery    Mobile    Alabama
4      R   6-03    190                                        
5      L   6-01    184                                        
6      L   6-02    205                                        
  CEME.COUNTRY CEME.NOTE BIRTH.NAME NAME.CHG BAT.CHG HOF
1                                         NA      NA NOT
2          USA                            NA      NA HOF
3          USA                            NA      NA NOT
4                                         NA      NA NOT
5                                         NA      NA NOT
6                                         NA      NA NOT
str(biographic_baseball)
'data.frame':   21574 obs. of  38 variables:
 $ PLAYERID             : chr  "aardd001" "aaroh101" "aarot101" "aased001" ...
 $ LAST                 : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
 $ FIRST                : chr  "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
 $ NICKNAME             : chr  "David" "Hank" "Tommie" "Don" ...
 $ BIRTHDATE            : chr  "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
 $ Alt_BIRTHDATE        : chr  "12/27/1981" "02/05/1934" "08/05/1939" "09/08/1954" ...
 $ BIRTH.CITY           : chr  "Denver" "Mobile" "Mobile" "Orange" ...
 $ BIRTH.STATE          : chr  "Colorado" "Alabama" "Alabama" "California" ...
 $ BIRTH.COUNTRY        : chr  "USA" "USA" "USA" "USA" ...
 $ PLAY.DEBUT           : chr  "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
 $ Alt_DEBUTDATE        : chr  "04/06/2004" "04/13/1954" "04/10/1962" "07/26/1977" ...
 $ DEBUT_AGE            : chr  "22" "20" "22" "22" ...
 $ PLAY.LASTGAME        : chr  "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
 $ Alt_LASTGAME         : chr  "08/23/2015" "10/03/1976" "09/26/1971" "10/03/1990" ...
 $ Complete_Years_Active: int  11 22 9 13 4 11 0 13 4 4 ...
 $ MGR.DEBUT            : chr  "" "" "" "" ...
 $ MGR.LASTGAME         : chr  "" "" "" "" ...
 $ COACH.DEBUT          : chr  "" "" "04/06/1979" "" ...
 $ COACH.LASTGAME       : chr  "" "" "09/30/1984" "" ...
 $ UMP.DEBUT            : chr  "" "" "" "" ...
 $ UMP.LASTGAME         : chr  "" "" "" "" ...
 $ DEATHDATE            : chr  "" "01/22/2021" "08/16/1984" "" ...
 $ DEATH.CITY           : chr  "" "Atlanta" "Atlanta" "" ...
 $ DEATH.STATE          : chr  "" "Georgia" "Georgia" "" ...
 $ DEATH.COUNTRY        : chr  "" "USA" "USA" "" ...
 $ BATS                 : chr  "R" "R" "R" "R" ...
 $ THROWS               : chr  "R" "R" "R" "R" ...
 $ HEIGHT               : chr  "6-05" "6-00" "6-03" "6-03" ...
 $ WEIGHT               : int  200 180 190 190 184 205 192 170 175 169 ...
 $ CEMETERY             : chr  "" "Southview Cemetery" "Catholic Cemetery" "" ...
 $ CEME.CITY            : chr  "" "Atlanta" "Mobile" "" ...
 $ CEME.STATE           : chr  "" "Georgia" "Alabama" "" ...
 $ CEME.COUNTRY         : chr  "" "USA" "USA" "" ...
 $ CEME.NOTE            : chr  "" "" "" "" ...
 $ BIRTH.NAME           : chr  "" "" "" "" ...
 $ NAME.CHG             : logi  NA NA NA NA NA NA ...
 $ BAT.CHG              : logi  NA NA NA NA NA NA ...
 $ HOF                  : chr  "NOT" "HOF" "NOT" "NOT" ...
# I selected only the variables necessary for my questions.

biographic_baseball_small <- select(biographic_baseball, PLAYERID, LAST, FIRST, 
                                    NICKNAME, BIRTHDATE, BIRTH.CITY, 
                                    BIRTH.STATE, BIRTH.COUNTRY, PLAY.DEBUT, 
                                    DEBUT_AGE, PLAY.LASTGAME, 
                                    Complete_Years_Active, BATS, THROWS,HOF)

Filter for Pre, Post, and Wartime Debut Dates

In this data-wrangling step I filter debut dates to the 4 years prior to World War 2, the 4 years of World War 2, and the 4 years after World War 2.

#  Filter for Wartime Debut Dates as well as Dates 4 Seasons Prior and After

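# The pattern is assembled with paste0() so that no line break ends up inside
# the regular expression; a newline inside the quoted pattern would become
# part of the "1943" alternative and keep 1943 debuts from matching.
ww_biogrpahic_baseball <- dplyr::filter(biographic_baseball_small, 
                                        grepl(paste0('1938|1939|1940|1941|1942|1943|',
                                                     '1944|1945|1946|1947|1948|1949'),
                                              PLAY.DEBUT))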

# Convert DEBUT_AGE (character) and Complete_Years_Active (integer) to numeric

ww_biogrpahic_baseball$DEBUT_AGE <- as.numeric(ww_biogrpahic_baseball$DEBUT_AGE)

ww_biogrpahic_baseball$Complete_Years_Active <- as.numeric(
  ww_biogrpahic_baseball$Complete_Years_Active)

# Parse PLAY.DEBUT from "MM/DD/YYYY" text into a Date

ww_biogrpahic_baseball$PLAY.DEBUT <- mdy(ww_biogrpahic_baseball$PLAY.DEBUT)

head(ww_biogrpahic_baseball)
  PLAYERID      LAST              FIRST NICKNAME  BIRTHDATE
1 aberc101   Aberson Clifford Alexander    Cliff 08/28/1921
2 abert102 Abernathy Talmadge Lafayette      Ted 10/30/1921
3 aberw101 Abernathy     Virgil Woodrow    Woody 02/01/1915
4 abrac101    Abrams        Calvin Ross      Cal 03/02/1924
5 abrej101     Abreu    Joseph Lawrence      Joe 05/24/1913
6 adama101     Adams       Ace Townsend      Ace 03/02/1910
    BIRTH.CITY    BIRTH.STATE BIRTH.COUNTRY PLAY.DEBUT DEBUT_AGE
1      Chicago       Illinois           USA 1947-07-18        25
2        Bynum North Carolina           USA 1942-09-19        20
3  Forest City North Carolina           USA 1946-07-28        31
4 Philadelphia   Pennsylvania           USA 1949-04-19        25
5      Oakland     California           USA 1942-04-23        28
6      Willows     California           USA 1941-04-15        31
  PLAY.LASTGAME Complete_Years_Active BATS THROWS HOF
1    05/09/1949                     1    R      R NOT
2    04/29/1944                     1    R      L NOT
3    04/17/1947                     0    L      L NOT
4    05/09/1956                     7    L      L NOT
5    07/11/1942                     0    R      R NOT
6    04/24/1946                     5    R      R NOT
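
As an alternative to matching year strings with a regular expression, the debut date can be parsed first and the numeric year compared directly. The sketch below is only an illustration: `toy_players` and its four rows are made-up stand-ins for the real file, not values from the dataset.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data standing in for biographic_baseball_small
toy_players <- data.frame(
  PLAYERID   = c("p1", "p2", "p3", "p4"),
  PLAY.DEBUT = c("04/15/1937", "04/20/1943", "09/19/1942", "04/18/1950"),
  stringsAsFactors = FALSE
)

# Parse the "MM/DD/YYYY" text into Dates, then keep debut years 1938 to 1949
toy_ww <- toy_players %>%
  mutate(PLAY.DEBUT = mdy(PLAY.DEBUT)) %>%
  filter(between(year(PLAY.DEBUT), 1938, 1949))

toy_ww$PLAYERID
```

Comparing years as numbers avoids any dependence on how the pattern string is typed or wrapped across lines.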

Creating the Three (Pre, During, Post) Periods using Filtering in dplyr

preWW2_biogrpahic_baseball <- dplyr::filter(ww_biogrpahic_baseball, 
                                        grepl('1938|1939|1940|1941',PLAY.DEBUT))


postWW2_biogrpahic_baseball <- dplyr::filter(ww_biogrpahic_baseball, 
                                        grepl('1946|1947|1948|1949',PLAY.DEBUT))

duringWW2_biogrpahic_baseball <- dplyr::filter(ww_biogrpahic_baseball, 
                                        grepl('1942|1943|1944|1945',PLAY.DEBUT))
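
The three periods could also be kept in one data frame by adding a label column instead of creating three separate objects. This is a minimal sketch, assuming the data is already restricted to 1938 through 1949; `toy_debuts`, `debut_year`, and `period` are hypothetical names introduced here for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; PLAY.DEBUT is already a Date, as after mdy()
toy_debuts <- data.frame(
  PLAYERID   = c("p1", "p2", "p3"),
  PLAY.DEBUT = as.Date(c("1940-04-16", "1944-06-10", "1948-04-19")),
  stringsAsFactors = FALSE
)

# Label each debut as pre, during, or post (assumes years 1938 to 1949 only)
toy_periods <- toy_debuts %>%
  mutate(
    debut_year = year(PLAY.DEBUT),
    period = case_when(
      debut_year <= 1941 ~ "pre",
      debut_year <= 1945 ~ "during",
      TRUE               ~ "post"
    ),
    # Ordered levels so tables and plots list the periods chronologically
    period = factor(period, levels = c("pre", "during", "post"))
  )

toy_periods
```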

Calculating the Mean of Debut Age and Years Active for Each Period

For each period the mean debut age and mean years active were calculated.

Mean debut age was about 1.1 years higher for wartime debuts (24.91) than for pre-war debuts (23.80), and only slightly higher than for post-war debuts (24.77).

Mean years active was lower for wartime debuts (3.11) than for pre-war (5.54) or post-war (4.83) debuts.

#Pre WW2 Means of Debut Age and Years Active

summarise(preWW2_biogrpahic_baseball, mean_preWW2_Debut_Age = 
            mean(`DEBUT_AGE`), mean_preWW2_Complete_Years_Active = 
              mean(`Complete_Years_Active`))
  mean_preWW2_Debut_Age mean_preWW2_Complete_Years_Active
1              23.79565                          5.541304
#During WW2 Means of Debut Age and Years Active

summarise(duringWW2_biogrpahic_baseball, mean_duringWW2_Debut_Age = 
            mean(`DEBUT_AGE`), duringWW2_Complete_Years_Active = 
              mean(`Complete_Years_Active`))
  mean_duringWW2_Debut_Age duringWW2_Complete_Years_Active
1                 24.90698                        3.108527
#Post WW2 Means of Debut Age and Years Active

summarise(postWW2_biogrpahic_baseball, mean_postWW2_Debut_Age = 
            mean(`DEBUT_AGE`), postWW2_Complete_Years_Active = 
              mean(`Complete_Years_Active`))
  mean_postWW2_Debut_Age postWW2_Complete_Years_Active
1               24.77129                      4.827251

Calculating the Median of Debut Age and Years Active for Each Period

For each period the median debut age and median years active were calculated.

Median debut age was 2 years higher for wartime debuts (25) than for pre-war debuts (23), with no change between the wartime and post-war periods (both 25).

Median years active dropped to 1 for wartime debuts, compared to 5 for pre-war and 4 for post-war debuts.

#Pre WW2 Median of Debut Age and Years Active

summarise(preWW2_biogrpahic_baseball, median_preWW2_Debut_Age = 
            median(`DEBUT_AGE`), median_preWW2_Complete_Years_Active = 
              median(`Complete_Years_Active`))
  median_preWW2_Debut_Age median_preWW2_Complete_Years_Active
1                      23                                   5
#During WW2 Median of Debut Age and Years Active

summarise(duringWW2_biogrpahic_baseball, median_duringWW2_Debut_Age = 
            median(`DEBUT_AGE`), duringWW2_Complete_Years_Active = 
              median(`Complete_Years_Active`))
  median_duringWW2_Debut_Age duringWW2_Complete_Years_Active
1                         25                               1
#Post WW2 Median of Debut Age and Years Active

summarise(postWW2_biogrpahic_baseball, median_postWW2_Debut_Age = 
            median(`DEBUT_AGE`), postWW2_Complete_Years_Active = 
              median(`Complete_Years_Active`))
  median_postWW2_Debut_Age postWW2_Complete_Years_Active
1                       25                             4

Calculating Standard Deviation of Debut Age and Years Active for Each Period

For each period the standard deviation of debut age and years active was calculated.

The standard deviation of debut age was larger for wartime debuts (4.13) than for pre-war (2.91) or post-war (3.30) debuts.

For years active, the wartime standard deviation (4.50) was smaller than the post-war one (4.96). The pre-war figure shown for years active (5) is a median rather than a standard deviation, so it cannot be compared directly.

#Pre WW2 SD of Debut Age and Years Active

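# Note: the second column should use sd(), not median(). The value 5 printed
# below came from a run that called median() here, so it is the pre-war median
# of years active; rerunning this chunk prints the standard deviation instead.
summarise(preWW2_biogrpahic_baseball, sd_preWW2_Debut_Age = 
            sd(`DEBUT_AGE`), sd_preWW2_Complete_Years_Active = 
              sd(`Complete_Years_Active`))
  sd_preWW2_Debut_Age sd_preWW2_Complete_Years_Active
1            2.911472                               5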
#During WW2 SD of Debut Age and Years Active

summarise(duringWW2_biogrpahic_baseball, sd_duringWW2_Debut_Age = 
            sd(`DEBUT_AGE`), duringWW2_Complete_Years_Active = 
              sd(`Complete_Years_Active`))
  sd_duringWW2_Debut_Age duringWW2_Complete_Years_Active
1                4.13147                        4.504299
#Post WW2 SD of Debut Age and Years Active

summarise(postWW2_biogrpahic_baseball, sd_postWW2_Debut_Age = 
            sd(`DEBUT_AGE`), postWW2_Complete_Years_Active = 
              sd(`Complete_Years_Active`))
  sd_postWW2_Debut_Age postWW2_Complete_Years_Active
1             3.295415                      4.961494
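
The nine separate summaries above could be collapsed into one grouped table, which also makes it harder to mix up function names between columns. This is a sketch under stated assumptions: `toy_stats` and every number in it are invented for illustration, and `period` is a hypothetical label column, not a variable in the original file.

```r
library(dplyr)

# Hypothetical toy values; not taken from the retrosheet data
toy_stats <- data.frame(
  period = c("pre", "pre", "during", "during", "post", "post"),
  DEBUT_AGE = c(22, 24, 25, 27, 24, 26),
  Complete_Years_Active = c(4, 6, 1, 3, 3, 7),
  stringsAsFactors = FALSE
)

# One row per period with count, mean, median, and SD of both variables
period_summary <- toy_stats %>%
  group_by(period) %>%
  summarise(
    n = n(),
    mean_debut_age = mean(DEBUT_AGE),
    median_debut_age = median(DEBUT_AGE),
    sd_debut_age = sd(DEBUT_AGE),
    mean_years_active = mean(Complete_Years_Active),
    median_years_active = median(Complete_Years_Active),
    sd_years_active = sd(Complete_Years_Active)
  )

period_summary
```

Including `n` in the table also shows how many players each statistic is based on.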

Plotting Debut Age as Compared to Debut Date

I have used ggplot2 to show Debut Age and Debut Date on a scatterplot.

What variable(s) are you visualizing?

The variables are Debut Age and Debut Date.

What question(s) are you attempting to answer with the visualization?

I am attempting to answer whether players who debuted during the wartime period were older than those who debuted in the pre and post periods, whether due to the draft or to other feelings of obligation to join the armed forces.

What conclusions can you make from the visualization?

Unexpectedly, the scatterplot shows a gap in debuts around the 1943 season. I read this as the 1943 season starting late: according to baseball-almanac.com the first game was in August, whereas the baseball season typically starts in April. This reading should be treated with caution, because a gap like this can also appear if the year filter drops 1943 debuts rather than because of the season itself. I have also concluded that the clusters of ages did tend to be older during the wartime seasons of 1942-1945, but the difference was not glaringly obvious.

What questions are left unanswered with your visualizations?

While this does get at the age of the player, it does not speak to the quality of the player who debuted. With my dataset, I have information on Hall of Fame status as well as active years in the league and I hope to explore these.

What about the visualizations may be unclear to a naive viewer?

It is unclear what the draft or minimum age requirements are (for example, it looks like a 15-year-old debuted: Joe Nuxhall). I may have been naive to look only at older ages and should also have looked at ages below the draft age of 21 and the minimum age of 18 to volunteer.

How could you improve the visualizations for the final project?

It is unclear exactly which season each cluster refers to, so I should label the x-axis at one-year intervals.

#Scatter plot of Debut Age and Debut Date

ggplot(ww_biogrpahic_baseball, aes(PLAY.DEBUT, DEBUT_AGE)) + geom_point()
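
One way to make each season identifiable is to put an axis break at every year. The sketch below assumes ggplot2's `scale_x_date()` with yearly breaks; `toy_scatter` and its four points are hypothetical, and the axis labels are my own wording.

```r
library(ggplot2)

# Hypothetical toy data with the same two columns as the real plot
toy_scatter <- data.frame(
  PLAY.DEBUT = as.Date(c("1939-04-18", "1942-09-19", "1944-06-10", "1947-07-18")),
  DEBUT_AGE  = c(23, 20, 15, 25)
)

# Same scatterplot, with one labelled tick per year on the x-axis
debut_plot <- ggplot(toy_scatter, aes(PLAY.DEBUT, DEBUT_AGE)) +
  geom_point(alpha = 0.6) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(x = "Debut date", y = "Debut age")

debut_plot
```

The partial transparency (`alpha`) is an optional touch that helps when many debuts fall on the same date and age.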

Showing Hall of Famers within the Dataset

I have used ggplot2 to display the Hall of Famers within the dataset.

What variable(s) are you visualizing?

The variable I am visualizing is Hall of Fame status for the players in my dataset who debuted between 1938 and 1949.

What question(s) are you attempting to answer with the visualization?

I am attempting to answer the quality question: how good were the players who debuted in the period I am looking at? Whether a player is in the Hall of Fame is an indicator of success.

What conclusions can you make from the visualization?

I can only conclude that there were HOF players that debuted in this time period.

What questions are left unanswered with your visualizations?

I am not able to see which years these Hall of Famers debuted in, or whether the number of Hall of Famers differs between wartime and peacetime.

What about the visualizations may be unclear to a naive viewer?

It is unclear how many total players and how many HOF and non-HOF players there are.

How could you improve the visualizations for the final project?

I would bucket debut years into pre WW2, during WW2, and post WW2 to compare the number of players in each period along with whether or not they became Hall of Famers. I would also include counts at the top of each bar.

ggplot(ww_biogrpahic_baseball) + 
  geom_bar(mapping = aes(x = HOF))
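
The improvement described above (bars per period, split by HOF status, with counts on top) might look roughly like this. It is only a sketch: `toy_hof` is invented data, `period` is a hypothetical column that does not exist in the original file, and the axis labels are my own.

```r
library(ggplot2)

# Hypothetical toy data: one row per player, with period and HOF status
toy_hof <- data.frame(
  period = factor(c("pre", "pre", "pre", "during", "during",
                    "post", "post", "post"),
                  levels = c("pre", "during", "post")),
  HOF = c("HOF", "NOT", "NOT", "NOT", "NOT", "HOF", "NOT", "NOT")
)

# Side-by-side bars per period, with the count printed above each bar
hof_plot <- ggplot(toy_hof, aes(x = period, fill = HOF)) +
  geom_bar(position = position_dodge(width = 0.9)) +
  geom_text(stat = "count", aes(label = after_stat(count)),
            position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(x = "Debut period", y = "Number of players")

hof_plot
```

Using the same dodge width for the bars and the text keeps each count centred over its own bar.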

This is the end of the document.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Martins (2022, Jan. 11). Data Analytics and Computational Social Science: Homework 4 Read, Stats, Visualizations. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8854623/

BibTeX citation

@misc{martins2022homework,
  author = {Martins, K},
  title = {Data Analytics and Computational Social Science: Homework 4 Read, Stats, Visualizations},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscommartike8854623/},
  year = {2022}
}