ATP Tennis Analysis - Final Project (Draft)

First Draft of Final Project - Using ATP Tennis Data Files will begin some analysis of player populations and success.

Jason O’Connell
2022-04-03
Show code
ul {
  line-height: 1;
}
li {
  line-height: 1;
}
...

Latest Overview

ATP Tennis Statistics Analysis - Iteration 7

For this iteration: I will begin exploring total player population, results of various sub-populations and prize money from tournament performance. I am interested to understand the equity versus pay for performance in the game. First I will develop routines for categorizing the data data for XX years but later will bringing more years data.

Items to be developed

Initialize Libraries

Show code

Read the data

First read ATP data files, here I am using only the results from 1980, 1990, 2000, 2010 from GITHUB source.

(Eventually I want to read many more years using a for loop and/or function)

The data includes detailed information about each match played on the ATP tour in a given year.

Show code
ATP1980 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1980.csv")
ATP1990 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1990.csv")
ATP2000 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2000.csv")
ATP2010 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2010.csv")

Create Working Data Set

This will require further development but for starts I will manually append the data sets into a working version.

tourney_level “D” is for Davis Cup - I will exclude this from my study for now.

Show code
ATP_Base<-bind_rows(mutate(ATP1980,season = "1980"),mutate(ATP1990, season = "1990"), mutate(ATP2000, season = "2000"), mutate(ATP2010, season = "2010")) %>%
  filter(!str_detect("D", tourney_level))

paged_table(head(ATP_Base,100))
Show code
#rm(list = c("ATP1980","ATP1990","ATP2000","ATP2010"))

Base Statistics

Lets start with some simple statistics by year. 1

Tournament Matches

Show code
ATP_Base %>%
  group_by(season) %>%
  summarize(season, num_matches = n()) %>%
  distinct() %>%
  kable(col.names = c("Season","Total Matches"), caption = "Number of Tournament Matches by Year")
Table 1: Number of Tournament Matches by Year
Season Total Matches
1980 3945
1990 3362
2000 3053
2010 2705

Tournaments

Show code
ATP_Base %>%
  select(season,tourney_id) %>%
  distinct() %>%
  group_by(season) %>%
  summarize(season, num_tournaments = n()) %>%
  distinct() %>%
  kable(col.names = c("Season","Tournaments"), caption = "Number of Tournaments by Year")
Table 2: Number of Tournaments by Year
Season Tournaments
1980 101
1990 81
2000 71
2010 66

Players

Show code
bind_rows(
  ATP_Base %>%
    select(season,winner_id) %>%
    distinct(),
  ATP_Base %>%
    select(season,loser_id) %>%
    distinct()
) %>%
  group_by(season) %>%
  summarize(season, num_players = n()) %>%
  distinct() %>%
  kable(col.names = c("Season","Players"), caption = "Number of Players by Year")
Table 3: Number of Players by Year
Season Players
1980 901
1990 678
2000 578
2010 538

Top Winners

Show code
ATP_Base %>%
    select(season,winner_id, winner_name) %>%
    group_by(season, winner_id, winner_name) %>%
    summarize(season, winner_id, winner_name, num_wins = n()) %>%
    distinct() %>%
    arrange(desc(num_wins)) %>%
    head(20)%>%
    kable(col.names = c("Season","Player ID","Player Name","Wins"), caption = "Top Winners")%>%
    remove_column(c(2))
Table 4: Top Winners
Season Player Name Wins
1980 Ivan Lendl 104
1980 John McEnroe 83
1980 Jimmy Connors 73
1980 Gene Mayer 72
1980 Eliot Teltscher 72
2000 Marat Safin 72
1990 Boris Becker 71
2010 Rafael Nadal 71
1980 Jose Luis Clerc 69
1980 Brian Gottfried 68
1980 Bjorn Borg 68
1990 Stefan Edberg 68
1980 Guillermo Vilas 67
2000 Magnus Norman 67
2000 Yevgeny Kafelnikov 66
2010 Roger Federer 66
1980 Harold Solomon 65
1980 Brian Teacher 63
2000 Gustavo Kuerten 62
1980 Wojtek Fibak 59

Create Player Statistics Table

Show code
player_table<-full_join(
  ATP_Base %>%    
    select(season,winner_id, winner_name) %>%
    group_by(season, winner_id, winner_name) %>%
    summarize(season, winner_id, winner_name, num_wins = n()) %>%
    distinct() %>%
    rename(player_id = winner_id, player_name = winner_name),
  ATP_Base %>%
    select(season,loser_id, loser_name) %>%
    group_by(season, loser_id, loser_name) %>%
    summarize(season, loser_id, loser_name, num_loses = n()) %>%
    distinct() %>%
    rename(player_id = loser_id, player_name = loser_name),
  by=c("season"="season","player_id"="player_id","player_name"="player_name")
) %>%
  mutate(num_wins=if_else(is.na(num_wins), as.integer(0), num_wins)) %>%
  mutate(num_matches=num_wins+num_loses, win_percent=100*num_wins/(num_wins+num_loses))

Player Statistics

Show code
player_table%>%
  select(season, player_id, player_name, win_percent)%>%
  arrange(desc(win_percent))%>%
#  top_n(20, win_percent) %>%
  head(20)%>%
  kable(digits = 0, col.names = c("Season","Player ID","Player Name","Wins Percent"), caption = "Top Winners")%>%
  remove_column(c(2))
Table 5: Top Winners
Season Player Name Wins Percent
1980 Bjorn Borg 91
2010 Rafael Nadal 88
2010 Roger Federer 84
1980 John McEnroe 83
1980 Jimmy Connors 83
1990 Boris Becker 83
1990 Stefan Edberg 82
1980 Gene Mayer 82
1990 Ivan Lendl 82
1990 Andre Agassi 81
1980 Vadim Borisov 80
1980 Guillermo Vilas 79
1980 Ivan Lendl 79
2000 Gustavo Kuerten 77
2000 Lleyton Hewitt 76
2000 Pete Sampras 76
1980 Jose Luis Clerc 76
2010 Novak Djokovic 76
1980 Konstantin Pugayev 75
1990 Scott Melville 75

Pack Circles

Well this is cool looking visual and kind of shows the few players winning a ton of matches. But all players with zero wins are missing. I am not so sure how to contol this visual since right now it is copied almost directly from the example I found online.

Show code
data<-player_table%>%
  select(season, player_name, num_wins)%>%
  filter(season == "1980")

packing<-circleProgressiveLayout(data$num_wins, sizetype='area')

data<-cbind(data,packing)

dat.gg<-circleLayoutVertices(packing, npoints=50)

ggplot() + 
  
  # Make the bubbles
  geom_polygon(data = dat.gg, aes(x, y, group = id, fill=as.factor(id)), colour = "black", alpha = 0.6) +
  
  # Add text in the center of each bubble + control its size
  geom_text(data = data, aes(x, y, size=num_wins, label = player_name)) +
  scale_size_continuous(range = c(1,4)) +
  
  # General theme:
  theme_void() + 
  theme(legend.position="none") +
  coord_equal()

Player total wins histogram

This is a key chart that is beginning to paint the picture about how many ATP players are participating in tournaments but winning very few matches.

Show code
player_table%>%
  select(season, player_id, num_wins)%>%
#  group_by(season, num_wins)%>%
#  summarize(season, num_wins, count = n())%>%
#  distinct()%>%
  ggplot(aes(x=num_wins))+
  geom_histogram(binwidth = 15, fill = "SkyBlue")+
#  scale_fill_brewer(palette = "Blues")+
  facet_grid(rows = vars(season))+
  labs(title="Number of ATP players and win total (Binwidth=15)", x="Wins", y="Number of Players")

Players total wins point chart

This chart demonstrates that same point but visually it is a little more diffcult to see the trend given how many player literally win 0-1 match all year.

Show code
player_table%>%
  select(season, player_id, num_wins)%>%
  group_by(season, num_wins)%>%
  summarize(season, num_wins, count = n())%>%
  distinct()%>%
  ggplot(aes(x=num_wins, y=count))+
  geom_point(color = "Blue")+
  geom_smooth(fill = "SkyBlue") +
  facet_grid(cols = vars(season))+
  labs(title="Number of ATP players and win totals", x="Wins", y="Number of Players")

Treemap

Some treemaps

Show code
player_table%>%
  select(season, player_name, num_wins)%>%
  filter(season == "1980")%>%
  group_by(season, num_wins)%>%
  summarize(season, num_wins, count = n())%>%
  distinct()%>%
  ggplot(aes(area = count, fill = count, label = num_wins))+
    geom_treemap()+
   geom_treemap_text(colour = "white", place = "center", alpha = 0.5)+
  labs(title="ATP 1980, Label = Number of wins, Size = Number of Players")

Tournaments & Surfaces

Show code
ATP_Base %>%
  select (season, surface, tourney_id) %>%
  filter(str_detect('Clay|Hard|Carpet|Grass', surface)) %>%
  distinct()%>%
  ggplot(aes(as.integer(season), fill=surface))+
    geom_histogram()
Show code
  # group_by(season, surface)%>%
  # summarize(season, surface, count = n())%>%
  # distinct()%>%
  # ggplot(aes(x=season, y=count, fill=surface)) +
  # geom_line(aes(color=surface))
  
#    geom_area(alpha=0.6 , size=.5, colour="white")

Conclusion

Final Paper: 1. Messing around with different visuals to see how they work and if they help my project. I’ve started working with the part of the data set to support my hypothesis that there are many, many players at the fringe of ATP who are not winning enough matches to support there career. At the same time there are a select few who win the majority of matches (and associate prize money) but it’s all made possible by those striving for the top but never making it.


  1. Only ATP tournament matches are include. Excludes Davis Cup, Challenge, etc.↩︎

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

O'Connell (2022, April 3). Data Analytics and Computational Social Science: ATP Tennis Analysis - Final Project (Draft). Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsnorthonumgithubiodacss601-blogposts2022-04-01-finalproject/

BibTeX citation

@misc{o'connell2022atp,
  author = {O'Connell, Jason},
  title = {Data Analytics and Computational Social Science: ATP Tennis Analysis - Final Project (Draft)},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsnorthonumgithubiodacss601-blogposts2022-04-01-finalproject/},
  year = {2022}
}