Final Project Part 1

finalpart1
tyler tewksbury
chess
Author

Tyler Tewksbury

Published

March 21, 2023

Code
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)

Background and Research Question

In 2020, after the release of the Netflix series The Queen’s Gambit, interest in chess was at an all-time high. This led many fans of the show to use popular websites such as Chess.com and Lichess.org to begin learning the game. These websites use a rating system that mimics that of official in-person chess leagues: a player’s rating increases with wins and decreases with losses. The rating can be used to measure one’s skill in chess and to determine eligibility for certain competitions.

When playing chess online, you face a random opponent chosen by the website, and there is no guarantee that this opponent will have an identical rating. Thus, there will typically be a difference between the two players’ ratings. Obviously the player with the higher rating would be more likely to win, right? That is where this study comes in. By quantifying the effect of rating difference on win chance, players may be able to understand more about the match they are currently in: how likely they are to win and how likely their opponent is to win. This could also lead to further research on making chess matches as fair as possible. As there do not appear to be any academic studies on the topic, it has not been established how strongly a slight discrepancy in rating indicates a player’s win chance. This poses the research question:

How strong a predictor of the victor is the difference between the players’ chess ratings?

Dataset

The dataset being used is sourced from Kaggle: https://www.kaggle.com/datasets/datasnaek/chess

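Gathered via the Lichess API, the dataset contains information from over 20,000 matches on Lichess.org. The created_at values summarized under Descriptive Statistics are millisecond Unix timestamps, which place the games between roughly 2013 and 2017. Information on the players, their opening moves, the results of the match, and more is stored in the dataset’s columns.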

Code
#reading in the dataset
chess <- read.csv("_data/chess_games.csv")

Descriptive Statistics

Code
str(chess)
'data.frame':   20058 obs. of  16 variables:
 $ id            : chr  "TZJHLljE" "l1NXvwaE" "mIICvQHh" "kWKvrqYL" ...
 $ rated         : chr  "FALSE" "TRUE" "TRUE" "TRUE" ...
 $ created_at    : num  1.5e+12 1.5e+12 1.5e+12 1.5e+12 1.5e+12 ...
 $ last_move_at  : num  1.5e+12 1.5e+12 1.5e+12 1.5e+12 1.5e+12 ...
 $ turns         : int  13 16 61 61 95 5 33 9 66 119 ...
 $ victory_status: chr  "outoftime" "resign" "mate" "mate" ...
 $ winner        : chr  "white" "black" "white" "white" ...
 $ increment_code: chr  "15+2" "5+10" "5+10" "20+0" ...
 $ white_id      : chr  "bourgris" "a-00" "ischia" "daniamurashov" ...
 $ white_rating  : int  1500 1322 1496 1439 1523 1250 1520 1413 1439 1381 ...
 $ black_id      : chr  "a-00" "skinnerua" "a-00" "adivanov2009" ...
 $ black_rating  : int  1191 1261 1500 1454 1469 1002 1423 2108 1392 1209 ...
 $ moves         : chr  "d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5 Bf4" "d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6 Qe5+ Nxe5 c4 Bb4+" "e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc6 bxc6 Ra6 Nc4 a4 c3 a3 Nxa3 Rxa3 Rxa3 c4 dxc4 d5 cxd5 Qxd5 exd5 "| __truncated__ "d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O-O O-O-O Nb5 Nb4 Rc1 Nxa2 Ra1 Nb4 Nxa7+ Kb8 Nb5 Bxc2 Bxc7+ Kc8 Qd"| __truncated__ ...
 $ opening_eco   : chr  "D10" "B00" "C20" "D02" ...
 $ opening_name  : chr  "Slav Defense: Exchange Variation" "Nimzowitsch Defense: Kennedy Variation" "King's Pawn Game: Leonardis Variation" "Queen's Pawn Game: Zukertort Variation" ...
 $ opening_ply   : int  5 4 3 3 5 4 10 5 6 4 ...

The dataset contains 20058 observations across 16 variables.

Code
summary(chess)
      id               rated             created_at         last_move_at      
 Length:20058       Length:20058       Min.   :1.377e+12   Min.   :1.377e+12  
 Class :character   Class :character   1st Qu.:1.478e+12   1st Qu.:1.478e+12  
 Mode  :character   Mode  :character   Median :1.496e+12   Median :1.496e+12  
                                       Mean   :1.484e+12   Mean   :1.484e+12  
                                       3rd Qu.:1.503e+12   3rd Qu.:1.503e+12  
                                       Max.   :1.504e+12   Max.   :1.504e+12  
     turns        victory_status        winner          increment_code    
 Min.   :  1.00   Length:20058       Length:20058       Length:20058      
 1st Qu.: 37.00   Class :character   Class :character   Class :character  
 Median : 55.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 60.47                                                           
 3rd Qu.: 79.00                                                           
 Max.   :349.00                                                           
   white_id          white_rating    black_id          black_rating 
 Length:20058       Min.   : 784   Length:20058       Min.   : 789  
 Class :character   1st Qu.:1398   Class :character   1st Qu.:1391  
 Mode  :character   Median :1567   Mode  :character   Median :1562  
                    Mean   :1597                      Mean   :1589  
                    3rd Qu.:1793                      3rd Qu.:1784  
                    Max.   :2700                      Max.   :2723  
    moves           opening_eco        opening_name        opening_ply    
 Length:20058       Length:20058       Length:20058       Min.   : 1.000  
 Class :character   Class :character   Class :character   1st Qu.: 3.000  
 Mode  :character   Mode  :character   Mode  :character   Median : 4.000  
                                                          Mean   : 4.817  
                                                          3rd Qu.: 6.000  
                                                          Max.   :28.000  

The summary makes clear which variables will be used and whether any new columns are needed. The following variables are relevant to the research question:

  • rated
  • victory_status
  • winner
  • white_id
  • white_rating
  • black_id
  • black_rating

A new column containing the difference between the two players’ ratings will be added in the next iteration for analysis. Storing the difference as its own column will simplify the analysis, as it will not need to be recomputed inside each model or summary.
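
As a minimal sketch of what that added column could look like, the snippet below builds a four-row stand-in for the data from the first games shown in the str(chess) output and subtracts the two ratings. The names chess_sample and rating_diff are assumptions made for illustration, as is the sign convention (white minus black); the final analysis may use different ones.

```r
# Stand-in for the full data frame: the first four games from str(chess),
# so that the snippet runs without the CSV file.
chess_sample <- data.frame(
  winner       = c("white", "black", "white", "white"),
  white_rating = c(1500L, 1322L, 1496L, 1439L),
  black_rating = c(1191L, 1261L, 1500L, 1454L)
)

# rating_diff is a hypothetical column name. Positive values mean white is
# the higher-rated player; negative values mean black is.
chess_sample$rating_diff <- chess_sample$white_rating - chess_sample$black_rating

chess_sample[, c("white_rating", "black_rating", "rating_diff")]
```

The same assignment applied to chess instead of chess_sample would add the column to all 20058 games at once, since the subtraction is vectorized.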

white_rating and black_rating

Code
range(chess$white_rating)
[1]  784 2700
Code
range(chess$black_rating)
[1]  789 2723

The ranges of the two sides are nearly identical, and each is large, spanning nearly 2,000 rating points (1,916 for white and 1,934 for black). This could be both good and bad for the study: the wide range provides plenty of variation in rating difference, but it may be necessary to fit the models on smaller rating ranges. That could also be interesting, for example to see whether rating differences matter more at lower levels than at higher levels, or vice versa.
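
One way the split into smaller ranges might be done is with cut(), sketched below on the first four games from the str(chess) output. The band boundaries, the use of the two players’ mean rating, and the names mean_rating and rating_band are all assumptions chosen for illustration, not decisions made in this project.

```r
# Stand-in for the full data frame: ratings of the first four games
# shown in str(chess).
chess_sample <- data.frame(
  white_rating = c(1500L, 1322L, 1496L, 1439L),
  black_rating = c(1191L, 1261L, 1500L, 1454L)
)

# Hypothetical measure of the level of a game: the mean of the two ratings.
chess_sample$mean_rating <-
  (chess_sample$white_rating + chess_sample$black_rating) / 2

# Hypothetical bands that cover the observed range (784 to 2723).
# right = FALSE makes each band include its lower bound and exclude its upper.
chess_sample$rating_band <- cut(
  chess_sample$mean_rating,
  breaks = c(700, 1200, 1700, 2200, 2800),
  labels = c("700-1199", "1200-1699", "1700-2199", "2200-2799"),
  right  = FALSE
)

# Number of games per band
table(chess_sample$rating_band)
```

A separate model could then be fit within each band, or the band could enter a single model as a factor.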

Proposed Models

The obvious model for this question is a linear probability model, since the outcome, taken from winner, can be coded as a binary variable (for example, whether white won). The initially proposed models are:

  • Linear probability model including unrated games
  • Linear probability model excluding unrated games
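
A rough sketch of how these two models could be specified with lm() follows. It uses only the four games visible in the str(chess) output, so the fitted coefficients mean nothing; the point is the shape of the calls. The names chess_sample, rating_diff, and white_win are assumptions, as is the choice to code the outcome as 1 when white wins and 0 otherwise (which would count draws as 0 unless they are dropped first).

```r
# Stand-in for the full data frame: the first four games from str(chess).
# rated is kept as character because that is how read.csv() loaded it.
chess_sample <- data.frame(
  rated        = c("FALSE", "TRUE", "TRUE", "TRUE"),
  winner       = c("white", "black", "white", "white"),
  white_rating = c(1500L, 1322L, 1496L, 1439L),
  black_rating = c(1191L, 1261L, 1500L, 1454L)
)

# Hypothetical predictor and binary outcome.
chess_sample$rating_diff <- chess_sample$white_rating - chess_sample$black_rating
chess_sample$white_win   <- as.integer(chess_sample$winner == "white")

# Model 1: all games, rated or not.
lpm_all <- lm(white_win ~ rating_diff, data = chess_sample)

# Model 2: rated games only. toupper() is a precaution in case the column
# mixes spellings such as "True" and "TRUE" (an assumption about the data).
rated_only <- chess_sample[toupper(chess_sample$rated) == "TRUE", ]
lpm_rated  <- lm(white_win ~ rating_diff, data = rated_only)

coef(lpm_all)
coef(lpm_rated)
```

In a linear probability model the rating_diff coefficient is read as the change in white’s win probability per rating point. Because predictions from such a model can fall outside 0 to 1 at extreme rating differences, a logistic regression via glm() with family = binomial would be a natural comparison.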

There will be more models, potentially differentiating between the rating ranges as stated earlier. Other possibilities include looking exclusively at drawn games, analyzing favored openings by rating, or finding other predictors if the rating difference is not significant.