Final Project Check In

finalpart1

Author

Young Soo Choi

Published

March 22, 2023

Intro

Spring is the season when baseball starts. In Korea, baseball is the most popular professional sport, drawing more spectators than any other league, and I have been a big fan since I was young. Baseball is a sport full of numbers, and a great deal of data analysis is done on it. I can’t think of a more interesting way to learn and apply the statistical techniques of quantitative analysis than through baseball. Therefore, I chose baseball as my final project topic.

Setting Research Problem

Baseball is a game in which two teams of nine players compete over nine innings to score more runs than the other. The players are obviously the ones who take the field. However, a baseball team is not made up of players alone. Each team has a manager, who decides the lineup and batting order before the game and, during the game, changes pitchers, sends in pinch hitters, and calls for steals.

However, there is ongoing debate over how much the manager actually contributes to the outcome of a game. Views differ from league to league. In U.S. Major League Baseball, the manager tends to be viewed as an administrator of the organization, while in Asian leagues such as Japan and Korea, the manager is often treated like a general commanding an army in battle. In other words, in the Asian leagues many people believe that the manager can raise the team’s runs scored and lower its runs allowed through in-game tactics. In Major League Baseball, by contrast, the prevailing view is that the manager’s share in the team’s runs scored and allowed is very small.

There may be various opinions on where this difference in perception comes from. Of course, the environment in which baseball is played differs a great deal from country to country. However, while many empirical studies on this topic have been conducted in the United States using large amounts of actual data, such studies are still in their early stages in Korea. With this in mind, I would like to use actual data from Korean baseball to check empirically how the manager’s intervention affects a game. In other words, setting aside the manager’s organizational abilities, such as managing and motivating players, I simply want to find out whether how often the manager intervenes in the game affects the team’s scoring ability. (The relationship between the manager’s intervention and the team’s ability to prevent runs is also important, but it is excluded from this project. Unlike batting, defense-related variables are difficult to quantify, and such data are difficult to obtain.)

In conclusion, the research question of this project is: does more intervention from the manager improve a baseball team’s scoring ability?

Accordingly, the research hypothesis is “the more the manager intervenes, the more runs the team scores.” The null hypothesis is “the manager’s intervention in the game does not increase the team’s runs scored.”

Research Design

Key Concepts and Operational Definitions

A baseball team’s runs scored are readily available. However, it is first necessary to decide how to define the manager’s intervention. A manager can play various roles in a team’s offense, including setting the batting order, using pinch hitters, and calling for bunts. Although this is not yet final, the manager’s intervention in this project will be measured by the number of steal attempts and bunts, in consideration of the ease of obtaining data. In summary, the independent variable is the manager’s intervention in the game (on offense), and the dependent variable is the team’s runs scored.
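As a rough illustration of this operational definition, the sketch below builds a per-game intervention measure from steal attempts and sacrifice bunts. It assumes that steal attempts can be approximated as stolen bases plus caught stealing (sb + cs) and that sacrifice hits (sh) stand in for bunt attempts; failed bunts are not recorded in the data, so this undercounts actual attempts. The names steal_attempts and intervention_per_game are made up for illustration, and the six rows are the 1982 records printed by head(df) later in this post.

```r
# Six 1982 team records, copied from the head(df) output in this post.
kbo_1982 <- data.frame(
  team = c("Bears", "Giants", "Lions", "Tigers", "Twins", "Unicorns"),
  game = c(80, 80, 80, 80, 80, 80),
  sh   = c(46, 41, 36, 28, 32, 33),      # sacrifice hits (successful bunts)
  sb   = c(106, 83, 147, 155, 134, 74),  # stolen bases
  cs   = c(61, 53, 42, 52, 60, 43)       # caught stealing
)

# Assumption: steal attempts = sb + cs; bunt attempts are proxied by sh.
kbo_1982$steal_attempts <- kbo_1982$sb + kbo_1982$cs

# Divide by games played so that seasons of different length stay comparable.
kbo_1982$intervention_per_game <-
  (kbo_1982$steal_attempts + kbo_1982$sh) / kbo_1982$game

# Teams ordered from most to least intervention per game.
kbo_1982[order(-kbo_1982$intervention_per_game),
         c("team", "steal_attempts", "sh", "intervention_per_game")]
```

Whether steals and bunts should be summed into one index or kept as two separate variables is a design choice that this sketch does not settle.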

How to test hypothesis

There are several ways to do this. The simplest is to compare average runs scored between teams with a lot of manager intervention and teams with little. However, this simple comparison does not take into account differences in the teams’ offensive strength. If teams with strong offenses (i.e., teams that would score more runs even without the manager’s intervention) also happened to have more manager intervention, the conclusion would be distorted. To account for this, a regression model can be used to obtain each team’s expected runs, and the actual runs can then be compared against that expectation.
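The regression idea can be sketched as follows. This is only a minimal illustration under several assumptions: expected runs are modeled from hits, home runs, and walks per game (the choice of predictors is mine, not a settled model), intervention is approximated as (sb + cs + sh) per game, and the twelve rows are the 1982 and 2020 records printed by head(df) and tail(df) later in this post. Twelve rows are far too few to draw any conclusion; the point is just the two-step procedure.

```r
# Twelve team-seasons copied from the head(df) and tail(df) output in this post
# (six from 1982, six from 2020).
kbo <- data.frame(
  game       = rep(c(80, 144), each = 6),
  run_scored = c(399, 353, 429, 374, 419, 302, 759, 634, 699, 724, 802, 813),
  hit        = c(778, 674, 705, 696, 757, 637, 1332, 1212, 1317, 1355, 1384, 1432),
  hr         = c(57, 59, 57, 84, 65, 40, 127, 143, 129, 130, 149, 163),
  bb         = c(247, 326, 307, 235, 268, 221, 608, 511, 486, 535, 509, 554),
  sh         = c(46, 41, 36, 28, 32, 33, 43, 49, 50, 63, 49, 64),
  sb         = c(106, 83, 147, 155, 134, 74, 113, 81, 132, 47, 83, 106),
  cs         = c(61, 53, 42, 52, 60, 43, 27, 45, 49, 25, 39, 50)
)

# Per-game rates, because season length differs between years.
kbo <- transform(
  kbo,
  runs_pg         = run_scored / game,
  hit_pg          = hit / game,
  hr_pg           = hr / game,
  bb_pg           = bb / game,
  intervention_pg = (sb + cs + sh) / game  # assumed proxy for intervention
)

# Step 1: expected runs per game from batting strength alone.
fit <- lm(runs_pg ~ hit_pg + hr_pg + bb_pg, data = kbo)
kbo$expected_pg <- fitted(fit)

# Actual minus expected runs per game (identical to residuals(fit)).
kbo$runs_above_expected <- kbo$runs_pg - kbo$expected_pg

# Step 2: do teams that intervene more score more than expected?
cor(kbo$intervention_pg, kbo$runs_above_expected)
summary(lm(runs_above_expected ~ intervention_pg, data = kbo))
```

An alternative with the same goal would be to add the intervention measure directly as a predictor in a single regression and inspect its coefficient; which of the two is preferable is left open here.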

Explore the data

The Korean professional baseball league was launched in 1982 and is called the KBO (Korea Baseball Organization) League. Fortunately, a journalist has published KBO League records from 1982 to 2020, which I used here. All Korean text in the data was translated into English.

source: https://github.com/bigkini/kindeR

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

batting <- read_csv("_data/kbo_team_batting_eng.csv")

Rows: 313 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): team
dbl (21): year, g, batters, tpa, ab, h, 2b, 3b, hr, bb, ibb, hbp, so, rbi, r...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

win_lose <- read_csv("_data/kbo_win_lose_eng.csv")

Rows: 313 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): team
dbl (7): year, games, win, lose, tie, runs scored, runs allowed

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

‘batting’ holds each team’s batting (offensive) record by year, and ‘win_lose’ holds each team’s win-loss record by year.

For easier analysis, the two data sets are merged into one.

Code

df<-merge(win_lose, batting, by=c("year", "team"))
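By default, merge() keeps only rows whose keys appear in both inputs, so a team name spelled differently in the two files would be dropped without warning. Here both files have 313 rows and the merged data also ends at row 313, so nothing was lost. A small check along these lines can make that explicit; the toy data frames below are stand-ins built from three of the 1982 rows shown in this post, and their names are hypothetical.

```r
# Toy stand-ins for win_lose and batting (three 1982 rows from this post).
win_lose_toy <- data.frame(
  year = c(1982, 1982, 1982),
  team = c("Bears", "Giants", "Lions"),
  win  = c(56, 31, 54)
)
batting_toy <- data.frame(
  year = c(1982, 1982, 1982),
  team = c("Bears", "Giants", "Lions"),
  hr   = c(57, 59, 57)
)

merged_toy <- merge(win_lose_toy, batting_toy, by = c("year", "team"))

# A merge that loses no rows keeps the row count of both inputs.
stopifnot(nrow(merged_toy) == nrow(win_lose_toy),
          nrow(merged_toy) == nrow(batting_toy))

# Each year-team pair should appear once; duplicates would multiply rows.
stopifnot(!anyDuplicated(merged_toy[, c("year", "team")]))
```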

Now, let’s briefly explore the data as an introduction to the KBO League. (For reference, if a team changed its name, it is listed under its most recent name throughout.)

Code

head(df)

  year     team games win lose tie runs scored runs allowed  g batters  tpa
1 1982    Bears    80  56   24   0         399          318 80     930 3098
2 1982   Giants    80  31   49   0         353          385 80     863 3062
3 1982    Lions    80  54   26   0         429          257 80     887 3043
4 1982   Tigers    80  38   42   0         374          388 80     873 2990
5 1982    Twins    80  46   34   0         419          350 80     952 3061
6 1982 Unicorns    80  15   65   0         302          574 80     867 2954
    ab   h  2b 3b hr  bb ibb hbp  so rbi   r sh sf  sb cs gidp   e
1 2745 778 137 23 57 247  22  41 254 362 399 46 18 106 61   35  98
2 2628 674 112  8 59 326   8  40 315 325 353 41 27  83 53   61  97
3 2647 705 126 18 57 307   2  30 349 374 429 36 18 147 42   50  81
4 2665 696 110 14 84 235  12  41 296 332 374 28 21 155 52   59 102
5 2686 757 124 12 65 268  20  47 316 381 419 32 27 134 60   56 105
6 2653 637 117 20 40 221   3  29 369 272 302 33 17  74 43   44 117

Code

tail(df)

    year    team games win lose tie runs scored runs allowed   g batters  tpa
308 2020   Heros   144  80   63   1         759          692 144    1767 5721
309 2020 Landers   144  51   92   1         634          846 144    1897 5502
310 2020   Lions   144  64   75   5         699          745 144    1865 5574
311 2020  Tigers   144  73   71   0         724          795 144    1809 5642
312 2020   Twins   144  79   61   4         802          694 144    1986 5681
313 2020     Wiz   144  81   62   1         813          715 144    2002 5762
      ab    h  2b 3b  hr  bb ibb hbp   so rbi   r sh sf  sb cs gidp  e
308 4945 1332 254 25 127 608  16  73 1030 713 759 43 52 113 27  107 98
309 4839 1212 177 17 143 511  18  69  981 595 634 49 34  81 45  112 91
310 4923 1317 211 12 129 486  10  62  990 658 699 50 53 132 49  117 84
311 4937 1355 224 13 130 535  10  69  957 692 724 63 38  47 25  114 82
312 4999 1384 253 29 149 509  21  75  969 760 802 49 49  83 39  115 74
313 5047 1432 238 21 163 554  20  52 1097 767 813 64 45 106 50  104 92

Code

table(df$year)


1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 
   6    6    6    6    7    7    7    7    7    8    8    8    8    8    8    8 
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
   8    8    8    8    8    8    8    8    8    8    8    8    8    8    8    9 
2014 2015 2016 2017 2018 2019 2020 
   9   10   10   10   10   10   10

Code

table(df$team)


   Bears    Dinos   Eagles   Giants    Heros  Landers    Lions  Raiders 
      39        8       35       39       13       21       39        9 
  Tigers    Twins Unicorns      Wiz 
      39       39       26        6

These tables show the number of teams in the league in each year and the number of seasons each team has played. In the 39 seasons through 2020, the number of teams grew from 6 to 10, and 12 teams existed in total. (Two teams, the Unicorns and the Raiders, disbanded.)

Code

plot(df$year, df$games)

Looking at the number of games each team played per year, it increased from 80 games in 1982 to 144 games in 2020.

Finally, let’s briefly look at various variables related to hitting.

Code

summary(df[,10:28])

    batters          tpa             ab             h              2b       
 Min.   : 863   Min.   :2954   Min.   :2628   Min.   : 637   Min.   :110.0  
 1st Qu.:1560   1st Qu.:4710   1st Qu.:4119   1st Qu.:1040   1st Qu.:171.0  
 Median :1645   Median :4975   Median :4318   Median :1143   Median :197.0  
 Mean   :1620   Mean   :4927   Mean   :4301   Mean   :1149   Mean   :198.1  
 3rd Qu.:1729   3rd Qu.:5254   3rd Qu.:4549   3rd Qu.:1264   3rd Qu.:223.0  
 Max.   :2022   Max.   :5870   Max.   :5176   Max.   :1601   Max.   :304.0  
       3b             hr              bb             ibb            hbp        
 Min.   : 3.0   Min.   : 29.0   Min.   :221.0   Min.   : 2.0   Min.   : 23.00  
 1st Qu.:16.0   1st Qu.: 76.0   1st Qu.:397.0   1st Qu.:13.0   1st Qu.: 51.00  
 Median :20.0   Median :101.0   Median :453.0   Median :17.0   Median : 65.00  
 Mean   :21.4   Mean   :106.8   Mean   :445.8   Mean   :17.8   Mean   : 66.05  
 3rd Qu.:26.0   3rd Qu.:134.0   3rd Qu.:499.0   3rd Qu.:22.0   3rd Qu.: 80.00  
 Max.   :62.0   Max.   :234.0   Max.   :621.0   Max.   :48.0   Max.   :130.00  
       so              rbi              r               sh        
 Min.   : 254.0   Min.   :272.0   Min.   :302.0   Min.   : 21.00  
 1st Qu.: 622.0   1st Qu.:464.0   1st Qu.:499.0   1st Qu.: 57.00  
 Median : 780.0   Median :554.0   Median :590.0   Median : 75.00  
 Mean   : 755.8   Mean   :558.4   Mean   :594.9   Mean   : 76.36  
 3rd Qu.: 905.0   3rd Qu.:650.0   3rd Qu.:687.0   3rd Qu.: 91.00  
 Max.   :1208.0   Max.   :898.0   Max.   :944.0   Max.   :153.00  
       sf              sb            cs              gidp       
 Min.   :12.00   Min.   : 35   Min.   : 20.00   Min.   : 35.00  
 1st Qu.:30.00   1st Qu.: 85   1st Qu.: 45.00   1st Qu.: 86.00  
 Median :37.00   Median :105   Median : 53.00   Median : 97.00  
 Mean   :37.13   Mean   :108   Mean   : 54.61   Mean   : 97.46  
 3rd Qu.:44.00   3rd Qu.:130   3rd Qu.: 63.00   3rd Qu.:111.00  
 Max.   :83.00   Max.   :220   Max.   :101.00   Max.   :148.00  
       e         
 Min.   : 61.00  
 1st Qu.: 85.00  
 Median : 95.00  
 Mean   : 95.47  
 3rd Qu.:105.00  
 Max.   :135.00

Before finishing this task, I will rename all the columns, delete one duplicate column, and save the result as a CSV file to make the next task easier.

Code

colnames(df) <- c("year", "team", "game", "win", "lose", "tie", "run_scored", "run_allowed", "game_played",
                  "batters", "tpa", "ab", "hit", "double", "triple", "hr", "bb", "ibb", "hbp", "so", "rbi", "r", "sh", "sf", "sb", "cs", "gidp", "e")

df<-df[,-9]
head(df)

  year     team game win lose tie run_scored run_allowed batters  tpa   ab hit
1 1982    Bears   80  56   24   0        399         318     930 3098 2745 778
2 1982   Giants   80  31   49   0        353         385     863 3062 2628 674
3 1982    Lions   80  54   26   0        429         257     887 3043 2647 705
4 1982   Tigers   80  38   42   0        374         388     873 2990 2665 696
5 1982    Twins   80  46   34   0        419         350     952 3061 2686 757
6 1982 Unicorns   80  15   65   0        302         574     867 2954 2653 637
  double triple hr  bb ibb hbp  so rbi   r sh sf  sb cs gidp   e
1    137     23 57 247  22  41 254 362 399 46 18 106 61   35  98
2    112      8 59 326   8  40 315 325 353 41 27  83 53   61  97
3    126     18 57 307   2  30 349 374 429 36 18 147 42   50  81
4    110     14 84 235  12  41 296 332 374 28 21 155 52   59 102
5    124     12 65 268  20  47 316 381 419 32 27 134 60   56 105
6    117     20 40 221   3  29 369 272 302 33 17  74 43   44 117

Code

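# Save next to the input files; a relative path keeps the script portable.
# row.names = FALSE avoids writing an extra index column to the file.
write.csv(df, "_data/kbo_df.csv", row.names = FALSE)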