Tyler Tewksbury
March 21, 2023
Output of `str(chess)`:
'data.frame': 20058 obs. of 16 variables:
$ id : chr "TZJHLljE" "l1NXvwaE" "mIICvQHh" "kWKvrqYL" ...
$ rated : chr "FALSE" "TRUE" "TRUE" "TRUE" ...
$ created_at : num 1.5e+12 1.5e+12 1.5e+12 1.5e+12 1.5e+12 ...
$ last_move_at : num 1.5e+12 1.5e+12 1.5e+12 1.5e+12 1.5e+12 ...
$ turns : int 13 16 61 61 95 5 33 9 66 119 ...
$ victory_status: chr "outoftime" "resign" "mate" "mate" ...
$ winner : chr "white" "black" "white" "white" ...
$ increment_code: chr "15+2" "5+10" "5+10" "20+0" ...
$ white_id : chr "bourgris" "a-00" "ischia" "daniamurashov" ...
$ white_rating : int 1500 1322 1496 1439 1523 1250 1520 1413 1439 1381 ...
$ black_id : chr "a-00" "skinnerua" "a-00" "adivanov2009" ...
$ black_rating : int 1191 1261 1500 1454 1469 1002 1423 2108 1392 1209 ...
$ moves : chr "d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5 Bf4" "d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6 Qe5+ Nxe5 c4 Bb4+" "e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc6 bxc6 Ra6 Nc4 a4 c3 a3 Nxa3 Rxa3 Rxa3 c4 dxc4 d5 cxd5 Qxd5 exd5 "| __truncated__ "d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O-O O-O-O Nb5 Nb4 Rc1 Nxa2 Ra1 Nb4 Nxa7+ Kb8 Nb5 Bxc2 Bxc7+ Kc8 Qd"| __truncated__ ...
$ opening_eco : chr "D10" "B00" "C20" "D02" ...
$ opening_name : chr "Slav Defense: Exchange Variation" "Nimzowitsch Defense: Kennedy Variation" "King's Pawn Game: Leonardis Variation" "Queen's Pawn Game: Zukertort Variation" ...
$ opening_ply : int 5 4 3 3 5 4 10 5 6 4 ...
Summary statistics reported by `summary(chess)` for the numeric columns:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| `created_at` | 1.377e+12 | 1.478e+12 | 1.496e+12 | 1.484e+12 | 1.503e+12 | 1.504e+12 |
| `last_move_at` | 1.377e+12 | 1.478e+12 | 1.496e+12 | 1.484e+12 | 1.503e+12 | 1.504e+12 |
| `turns` | 1.00 | 37.00 | 55.00 | 60.47 | 79.00 | 349.00 |
| `white_rating` | 784 | 1398 | 1567 | 1597 | 1793 | 2700 |
| `black_rating` | 789 | 1391 | 1562 | 1589 | 1784 | 2723 |
| `opening_ply` | 1.000 | 3.000 | 4.000 | 4.817 | 6.000 | 28.000 |

The remaining columns (`id`, `rated`, `victory_status`, `winner`, `increment_code`, `white_id`, `black_id`, `moves`, `opening_eco`, `opening_name`) are of class character with length 20058.
---
title: "Final Project Part 1"
author: "Tyler Tewksbury"
description: "First Final Project check-in"
date: "03/21/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
- finalpart1
- tyler tewksbury
- chess
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```
## Background and Research Question
In 2020, after the release of the Netflix series *The Queen's Gambit*, interest in chess was at an all-time high. This led many fans of the show to popular websites such as Chess.com and Lichess.org to begin learning the game. These websites use a rating system that mimics that of official in-person chess leagues: your rating rises as you win and falls as you lose. The rating can be used to measure a player's skill and to determine whether they can enter certain competitions.

When playing chess online, the website matches you against a random opponent, so there is no guarantee that you will face someone with an identical rating. There will typically be a difference between the two players' ratings. Presumably the player with the higher rating is more likely to win, but by how much? That is where this study comes in. By quantifying the effect of rating difference on win chance, players may be able to understand more about the match they are currently in: how likely they are to win and how likely their opponent is to win. This could also lead to further research on making chess matches as fair as possible. As there do not appear to be academic studies on the topic, there is no established evidence that a slight discrepancy in rating is an indicator of a player's win chance. This poses the research question:

**How strong a predictor of the winner is the difference between the two players' chess ratings?**
## Dataset
The dataset being used is sourced from Kaggle: https://www.kaggle.com/datasets/datasnaek/chess
Gathered in 2016, the dataset contains information from over 20,000 matches on Lichess.org, collected via the Lichess API. Its columns include information on the players, their opening moves, the result of each match, and more.
```{r}
#reading in the dataset
chess <- read.csv("_data/chess_games.csv")
```
## Descriptive Statistics
```{r}
str(chess)
```
The dataset contains 20058 observations across 16 variables.
```{r}
summary(chess)
```
The summary makes clear which variables will be used and whether any new columns are needed. The following are relevant to the research question:
* `rated`
* `victory_status`
* `winner`
* `white_id`
* `white_rating`
* `black_id`
* `black_rating`
A new column containing the difference between the two ratings will be added in the next iteration for analysis. Computing the difference once and storing it simplifies the later analysis, since the subtraction does not have to be repeated inside every model or summary.
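
A minimal sketch of how that column might be built, using the first four games shown by `str(chess)` as a stand-in for the full data. The names `toy_games`, `rating_diff`, and `higher_rated` are illustrative assumptions, not columns in the dataset, and the sign convention (white minus black) is one possible choice.

```{r}
library(dplyr)

# Toy stand-in for `chess`: the first four games from the str() output
toy_games <- tibble(
  winner       = c("white", "black", "white", "white"),
  white_rating = c(1500L, 1322L, 1496L, 1439L),
  black_rating = c(1191L, 1261L, 1500L, 1454L)
)

toy_games <- toy_games %>%
  mutate(
    # Positive when white is the higher-rated player (hypothetical column name)
    rating_diff = white_rating - black_rating,
    # Which side had the higher rating; "equal" when the ratings match
    higher_rated = case_when(
      rating_diff > 0 ~ "white",
      rating_diff < 0 ~ "black",
      TRUE            ~ "equal"
    )
  )

toy_games
```

The same `mutate()` call should carry over to the full `chess` data frame, since it only relies on the `white_rating` and `black_rating` columns.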
#### white_rating and black_rating
```{r}
range(chess$white_rating)
range(chess$black_rating)
```
The rating ranges for white (784 to 2700) and black (789 to 2723) are nearly identical, and both span close to 2000 points. This cuts both ways for the study: a wide range provides plenty of variation in rating difference, but it may be necessary to fit separate models on narrower rating bands. Doing so could also be interesting in its own right, for example to see whether a given rating difference matters more among lower-rated players than among higher-rated players, or vice versa.
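
One way the ratings might be split into bands is with `cut()`. The sketch below uses the five summary values of `white_rating` (minimum, quartiles, median, maximum) as toy input; the band edges and labels are arbitrary assumptions chosen for illustration, not a proposal from the analysis itself.

```{r}
library(dplyr)

# Toy input: min, 1st quartile, median, 3rd quartile, and max of white_rating
toy_ratings <- tibble(white_rating = c(784L, 1398L, 1567L, 1793L, 2700L))

toy_ratings <- toy_ratings %>%
  mutate(
    # Hypothetical band edges; right = FALSE makes each interval [lower, upper)
    rating_band = cut(
      white_rating,
      breaks = c(0, 1200, 1600, 2000, Inf),
      labels = c("under 1200", "1200-1599", "1600-1999", "2000+"),
      right  = FALSE
    )
  )

# Number of toy observations per band
count(toy_ratings, rating_band)
```

A separate model could then be fitted within each `rating_band`, or the band could enter a single model as an interaction term.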
## Proposed Models
The natural starting point for this question is a linear probability regression, since the outcome of interest (whether a given side wins) is binary. The initial proposed models are:

* Linear probability model including unrated games
* Linear probability model excluding unrated games

More models will follow, potentially fitted separately on the rating ranges described earlier. Other possibilities include looking exclusively at drawn games, analyzing favored openings by rating, or searching for other predictors if the rating difference turns out not to be significant.
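
The two initial models could be sketched as follows. Everything here is simulated: `sim_games`, `white_win`, the 80% share of rated games, and the logistic data-generating process are all assumptions made so the example runs on its own, and none of the numbers come from the Lichess data. Drawn games are ignored in this sketch; in the real data they would have to be dropped or coded explicitly before building a 0/1 outcome.

```{r}
set.seed(1)  # reproducible simulation

# Simulated stand-in for the games; no row here comes from the dataset
n <- 500
sim_games <- data.frame(
  rating_diff = round(rnorm(n, mean = 0, sd = 200)),  # white minus black
  rated       = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.8, 0.2))
)

# Assumed data-generating process, for illustration only:
# white's win probability rises with the rating difference
p_white <- plogis(sim_games$rating_diff / 150)
sim_games$white_win <- rbinom(n, size = 1, prob = p_white)

# Linear probability model: ordinary least squares with a 0/1 outcome
lpm_all <- lm(white_win ~ rating_diff, data = sim_games)

# Same model restricted to rated games
lpm_rated <- lm(white_win ~ rating_diff, data = subset(sim_games, rated))

coef(summary(lpm_all))
coef(summary(lpm_rated))
```

In a linear probability model the slope on `rating_diff` reads as the change in win probability per rating point. Fitted values can fall outside the 0 to 1 range for large differences, so a logistic regression via `glm(..., family = binomial)` may be worth comparing against.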