Data Analytics and Computational Social Science: ATP Tennis Analysis - Homework 6 / Final Project Draft 1

Jason O'Connell

Latest Overview

ATP Tennis Statistics Analysis - Iteration 6

For this iteration: I will begin exploring total player population, results of various sub-populations and prize money from tournament performance. I am interested to understand the equity versus pay for performance in the game. First I will develop routines for categorizing the data data for XX years but later will bringing more years data.

Also for this iteration I have tried the following new things: RMarkdown features: TOC, aside, code_folding

paged_table, kable, full_join, and arrange.

Initialize Libraries

Show code

library(tidyverse)
library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(rmarkdown)
library(kableExtra)

Read the data

First read ATP data files, here I am using only the results from 1980, 1990, 2000, 2010 from GITHUB source.

(Eventually I want to read many more years using a for loop and/or function)

The data includes detailed information about each match played on the ATP tour in a given year. Tournament Data: tourney-id (YYYY-###), tourney_name, surface, draw_size, tourney_level, tourney_date Match Data: match_num, score, best_of, minutes, round Player Data (for each winner & loser): _id, _seed, _entry (WC, Q), _name, _hand, _height, _ioc, _age, _rank, _points Match Statistics (for each w & l): _ace, _df, _svpt, _1stIn, _1stWon, _2ndWon, _svGms, _bpSaved, _bpFaced

Show code

ATP1980 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1980.csv")
ATP1990 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1990.csv")
ATP2000 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2000.csv")
ATP2010 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2010.csv")

Create Working Data Set

This will require further development but for starts I will manually append the data sets into a working version.

tourney_level “D” is for Davis Cup - I will exclude this from my study for now.

Show code

ATP_Base<-bind_rows(mutate(ATP1980,season = "1980"),mutate(ATP1990, season = "1990"), mutate(ATP2000, season = "2000"), mutate(ATP2010, season = "2010")) %>%
  filter(!str_detect("D", tourney_level))

paged_table(head(ATP_Base,100))

Show code

#rm(list = c("ATP1980","ATP1990","ATP2000","ATP2010"))

Base Statistics

Lets start with some simple statistics by year. ¹

Tournament Matches

Show code

ATP_Base %>%
  group_by(season) %>%
  summarize(season, num_matches = n()) %>%
  distinct() %>%
  kable(col.names = c("Season","Total Matches"), caption = "Number of Tournament Matches by Year")

Table 1: Number of Tournament Matches by Year
Season	Total Matches
1980	3945
1990	3362
2000	3053
2010	2705

Tournaments

Show code

ATP_Base %>%
  select(season,tourney_id) %>%
  distinct() %>%
  group_by(season) %>%
  summarize(season, num_tournaments = n()) %>%
  distinct() %>%
  kable(col.names = c("Season","Tournaments"), caption = "Number of Tournaments by Year")

Table 2: Number of Tournaments by Year
Season	Tournaments
1980	101
1990	81
2000	71
2010	66

Players

Show code

bind_rows(
  ATP_Base %>%
    select(season,winner_id) %>%
    distinct(),
  ATP_Base %>%
    select(season,loser_id) %>%
    distinct()
) %>%
  group_by(season) %>%
  summarize(season, num_players = n()) %>%
  distinct() %>%
  kable(col.names = c("Season","Players"), caption = "Number of Players by Year")

Table 3: Number of Players by Year
Season	Players
1980	901
1990	678
2000	578
2010	538

Top Winners

Show code

ATP_Base %>%
    select(season,winner_id, winner_name) %>%
    group_by(season, winner_id, winner_name) %>%
    summarize(season, winner_id, winner_name, num_wins = n()) %>%
    distinct() %>%
    arrange(desc(num_wins)) %>%
    head(20)%>%
    kable(col.names = c("Season","Player ID","Player Name","Wins"), caption = "Top Winners")%>%
    remove_column(c(2))

Table 4: Top Winners
Season	Player Name	Wins
1980	Ivan Lendl	104
1980	John McEnroe	83
1980	Jimmy Connors	73
1980	Gene Mayer	72
1980	Eliot Teltscher	72
2000	Marat Safin	72
1990	Boris Becker	71
2010	Rafael Nadal	71
1980	Jose Luis Clerc	69
1980	Brian Gottfried	68
1980	Bjorn Borg	68
1990	Stefan Edberg	68
1980	Guillermo Vilas	67
2000	Magnus Norman	67
2000	Yevgeny Kafelnikov	66
2010	Roger Federer	66
1980	Harold Solomon	65
1980	Brian Teacher	63
2000	Gustavo Kuerten	62
1980	Wojtek Fibak	59

Create Player Statistics Table

Show code

player_table<-full_join(
  ATP_Base %>%    
    select(season,winner_id, winner_name) %>%
    group_by(season, winner_id, winner_name) %>%
    summarize(season, winner_id, winner_name, num_wins = n()) %>%
    distinct() %>%
    rename(player_id = winner_id, player_name = winner_name),
  ATP_Base %>%
    select(season,loser_id, loser_name) %>%
    group_by(season, loser_id, loser_name) %>%
    summarize(season, loser_id, loser_name, num_loses = n()) %>%
    distinct() %>%
    rename(player_id = loser_id, player_name = loser_name),
  by=c("season"="season","player_id"="player_id","player_name"="player_name")
) %>%
  mutate(num_matches=num_wins+num_loses, win_percent=100*num_wins/(num_wins+num_loses))

Player Statistics

Show code

player_table%>%
  select(season, player_id, player_name, win_percent)%>%
  arrange(desc(win_percent))%>%
#  top_n(20, win_percent) %>%
  head(20)%>%
  kable(digits = 0, col.names = c("Season","Player ID","Player Name","Wins Percent"), caption = "Top Winners")%>%
  remove_column(c(2))

Table 5: Top Winners
Season	Player Name	Wins Percent
1980	Bjorn Borg	91
2010	Rafael Nadal	88
2010	Roger Federer	84
1980	John McEnroe	83
1980	Jimmy Connors	83
1990	Boris Becker	83
1990	Stefan Edberg	82
1980	Gene Mayer	82
1990	Ivan Lendl	82
1990	Andre Agassi	81
1980	Vadim Borisov	80
1980	Guillermo Vilas	79
1980	Ivan Lendl	79
2000	Gustavo Kuerten	77
2000	Lleyton Hewitt	76
2000	Pete Sampras	76
1980	Jose Luis Clerc	76
2010	Novak Djokovic	76
1980	Konstantin Pugayev	75
1990	Scott Melville	75

Player total wins histogram

This is a key chart that is beginning to paint the picture about how many ATP players are participating in tournaments but winning very few matches.

Show code

player_table%>%
  select(season, player_id, num_wins)%>%
#  group_by(season, num_wins)%>%
#  summarize(season, num_wins, count = n())%>%
#  distinct()%>%
  ggplot(aes(x=num_wins))+
  geom_histogram(binwidth = 15, fill = "SkyBlue")+
#  scale_fill_brewer(palette = "Blues")+
  facet_grid(rows = vars(season))+
  labs(title="Number of ATP players and win total (Binwidth=15)", x="Wins", y="Number of Players")

Players total wins point chart

This chart demonstrates that same point but visually it is a little more diffcult to see the trend given how many player literally win 0-1 match all year.

Show code

player_table%>%
  select(season, player_id, num_wins)%>%
  group_by(season, num_wins)%>%
  summarize(season, num_wins, count = n())%>%
  distinct()%>%
  ggplot(aes(x=num_wins, y=count))+
  geom_point(color = "Blue")+
  geom_smooth(fill = "SkyBlue") +
  facet_grid(cols = vars(season))+
  labs(title="Number of ATP players and win totals", x="Wins", y="Number of Players")

Conclusion

For this HW 6 - I’ve started working with the part of the data set to support my hypothesis that there are many, many players at the fringe of ATP who are not winning enough matches to support there career. At the same time there are a select few who win the majority of matches (and associate prize money) but it’s all made possible by those striving for the top but never making it.

Distill is a publication format for scientific and technical writing, native to the web.

Learn more about using Distill at https://rstudio.github.io/distill.

Only ATP tournament matches are include. Excludes Davis Cup, Challenge, etc.↩︎

Comment on this article Share:

ATP Tennis Analysis - Homework 6 / Final Project Draft 1

Latest Overview

Initialize Libraries

Read the data

Create Working Data Set

Base Statistics

Tournament Matches

Tournaments

Players

Top Winners

Create Player Statistics Table

Player Statistics

Player total wins histogram

Players total wins point chart

Conclusion

Reuse

Citation