Data Analytics and Computational Social Science: ATP Tennis Analysis - Final Project

Jason O'Connell

Show code

ul {
  line-height: 1;
}
li {
  line-height: 1;
}
...

Latest Overview

ATP Tennis Statistics Analysis - Final Project

This analysis looks at the ATP tennis statistics since the beginning of the open era in 1969. The data is sourced from GitHub ¹ excludes any Davis Cup matches including only ATP professional tennis tournaments from each year through 2019. Years 2020 and 2021 have been excluded due to interruptions in professional tournament play due to the Covid-19 pandemic.

An analysis of the general number of players participating in tournaments and the overall number of tournaments in the dominance of the top players is presented. Some commentary about the court surface and the number of players who win very few matches on tour will be discussed.

Primary topics of inquiry:

ATP Tournaments
- Decline on number of tournaments overall per year
- Surfaces being played on changing and more limited over time
Players
- Generally less players involved
- Players who win 0,1 maybe 2 matches all year (take a look at some player in this category that play many tournaments)
- Player Network Analysis - are top players only lossing against each other.
- Win stats of “the Field” vs. top ranked players

Initialize Libraries

Show code

library(tidyverse)
library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(treemapify)
library(rmarkdown)
library(viridis)
library(packcircles)
library(lubridate)
library(kableExtra)
library(igraph)
library(visNetwork)

Create Working Data Set

First read ATP data files from GITHUB source (https://github.com/JeffSackmann/tennis_atp/).

The data is collected into a single dataframe by pending each Year’s ATP tournament data file. These files include both tournament match information and match information from other competitions such as the Davis Cup; Davis Cup matches will be excluded from the analysis.

Finally I am only interested in ATP tournament data so will filter our Davis Cup matches which are also included in the data set.

Show code

atp_files <- paste("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_",1969:2019,".csv", sep = '')
atp_base<-map_dfr(atp_files, read_csv)

Data Enrichment

This r chunk will enrich the atp_base data with some variable for use later.

Show code

atp_base<-atp_base%>%  
  mutate(season=year(ymd(tourney_date)))%>%
#  mutate_at(c("winner_rank","loser_rank"), ~replace(.,is.na(.), 9999))%>%
  mutate(topten_players=case_when(
    winner_rank <= 10 & loser_rank <= 10 ~ as.integer("2"),
    winner_rank <= 10 | loser_rank <= 10 ~ as.integer("1"),
    TRUE ~ as.integer("0"),
  ))

Calculate Ranking Information

Following code collects ranking information by season and player and calculates each players average ranking by season. Players appear in the data set for each match they play in either the winner or loser column. To accurately calculate the average ranking the data must be transformed to include only one row for each player for each tournament and then average over the season.

Show code

avg_ranking<-atp_base%>%
#  filter(season == 1980)%>%
  select(season, tourney_id, winner_id, winner_rank, loser_id, loser_rank)%>%
  pivot_longer(cols = c("winner_id","loser_id"), names_to = "winner_loser_cat", values_to = "player_id")%>%
  pivot_longer(cols = ends_with("rank"), names_to = "rank_type", values_to = "rank")%>%
#the following filter is critical to eliminate duplicate rows for winner/loser since the 2 pivot longers create a 2 rows for each winner and loser. I wonly wont loser/loser and winner/winner eliminating loser/winner and winner/loser rows
  filter((winner_loser_cat == "loser_id" & rank_type == "loser_rank") | (winner_loser_cat == "winner_id" & rank_type == "winner_rank"))%>%
  select(season, tourney_id, player_id, rank)%>%
  distinct()%>%
  filter(!is.na(rank))%>%
  group_by(season, player_id)%>%
  summarize(avg_rank = mean(rank))

Base Statistics

Lets start with some simple statistics by year. ²

Tournament Matches

With the advent of the open era of professional tennis in 1969 conversion from the historically amateur status tournaments to professional tournaments was accomplished over the course of 1969 - 1971 demonstrated by the growth in the number of tournaments. Since a peak in 1975 the overall number of matches played on the ATP professional tennis tour has declined. In 1988 players took more control and took a more active role in governance of the ATP and participation increased for a few years. However in the early 90s there’s a significant increase in the prize money at each tournament requiring greater levels of commitment from sponsors in an overall decline in the number of matches going forward.

Also of Interest is the type of surface that is being played on for each tournament professional tennis match. In recent years hard courts have come to dominate the game of tennis. Early in the open era of tennis tournaments four distinct surfaces were played; clay, grass, hard courts and, when indoors, carpet. In more recent years the play on carpet has been eliminated as technology has allowed for temporary hard courts to be used in Indoor Stadium arenas. Grass Court tennis originally the surface of choice for 3 out of the 4 grand slam tournaments has gradually been reduced with the conversion of the Australian Open and the US Open to various versions of hard court. Grass courts have been used at Wimbledon since 1877, the US Open from 1881 to 1974, and the Australian Open from 1905 to 1987.³ The last several decades the number of tournaments in matches played on grass courts has remained relatively stable due to the commitment of the All England Club to host Wimbledon championships on grass and several tournaments Also played on grass for players to prepare for the major.

Show code

atp_base %>%
  group_by(season) %>%
  summarize(season, num_matches = n()) %>%
  distinct() %>%
#  kable(col.names = c("Season","Total Matches"), caption = "Number of Tournament Matches by Year")
  ggplot(aes(x=season,y=num_matches)) +
    geom_line(color="#0000FF") +
    labs(x="Season", y="ATP Matches Played", title="Total ATP Matched Played by Season\n") +
    theme_classic()+
    theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 12),
        axis.text.y = element_text(size = 10), axis.title.y = element_text(size = 12),
        plot.title = element_text(size = 14, face = "bold", color = "darkblue"))

Show code

atp_base %>%
  filter(str_detect('Clay|Hard|Carpet|Grass', surface)) %>%
  group_by(season, surface) %>%
  summarize(season, surface, num_matches = n()) %>%
  distinct() %>%
#  kable(col.names = c("Season","Total Matches"), caption = "Number of Tournament Matches by Year")
  ggplot(aes(x=season,y=num_matches)) +
    geom_line(aes(color=surface)) +
    facet_grid(cols = vars(surface)) +
    scale_color_manual(labels=c("Carpet","Clay","Grass","Hard"), values=c("orange", "#FF0000", "darkgreen", "#0000FF")) +
    guides(x = guide_axis(n.dodge = 2))+
    labs(x="Season", y="ATP Matches Played", title="Total ATP Matched Played by Season by Surface\n", color = "Surface\n") +
    theme_minimal()

Tournaments

As can be seen in the figure below the number of tournaments being played follows the same pattern as the total number of matches played by season given that each tournament is played with a fairly consistent number of entries.

Show code

atp_base %>%
  select(season,tourney_id) %>%
  distinct() %>%
  group_by(season) %>%
  summarize(season, num_tournaments = n()) %>%
  distinct() %>%
#  kable(col.names = c("Season","Tournaments"), caption = "Number of Tournaments by Year")
  ggplot(aes(x=season,y=num_tournaments)) +
    geom_line(color="blue")+
    labs(x="Season", y="ATP Tournaments Held", title="Total ATP Tournaments Held by Season\n") +
    theme_classic()+
    theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 12),
        axis.text.y = element_text(size = 10), axis.title.y = element_text(size = 12),
        plot.title = element_text(size = 14, face = "bold", color = "darkblue"))

Players

More Surprisingly however is that overall participation in the number of players participating in ATP tournaments overall has sharply decreased consistently through the decades. Historically there was a smaller contingent of players who would travel to all the tournaments in the participation from local qualifiers contributed to the overall participation rates being much higher. Since 1990 ended the escalation of prize money in sponsorships at each tournament a greater number of players have been able to support travel expenses at two tournaments throughout the globe. As a consequence each tournament is getting higher ranked players throughout the draw of the tournament. to maintain prize money in sponsorships and TV viewership getting the top ranked players to participate in Tournaments has become essential. and therefore the number of spots open to qualifiers who made live more local to the tournament geography has diminished.

Show code

bind_rows(
  atp_base %>%
    select(season,winner_id) %>%
    distinct(),
  atp_base %>%
    select(season,loser_id) %>%
    distinct()
) %>%
  group_by(season) %>%
  summarize(season, num_players = n()) %>%
  distinct() %>%
#  kable(col.names = c("Season","Players"), caption = "Number of Players by Year")
  ggplot(aes(x=season,y=num_players)) +
    geom_line(color="blue")+
    labs(x="Season", y="ATP Players Involved", title="Total ATP Professional Participation by Season\n") +
    theme_classic()+
    theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 12),
        axis.text.y = element_text(size = 10), axis.title.y = element_text(size = 12),
        plot.title = element_text(size = 14, face = "bold", color = "darkblue"))

Top Winners

If we look at the top players in terms of wins in a single season we can see that a significant number of players from the early 1970s and 1980s. This is due to the fact that the top players were playing in many more tournaments each week then today’s top ranked players. The only player since 1982 to be included in the top 20 wins in a single 80P season is Roger Federer in 2006 But we will see later Ted Rogers 2006 season what’s the second best season for a male professional tennis player in history.

Show code

atp_base %>%
    select(season,winner_id, winner_name) %>%
    group_by(season, winner_id, winner_name) %>%
    summarize(season, winner_id, winner_name, num_wins = n()) %>%
    distinct() %>%
    arrange(desc(num_wins)) %>%
    head(20)%>%
    kable(col.names = c("Season","Player ID","Player Name","Wins"), caption = "Top Winners")%>%
    remove_column(c(2))

Table 1: Top Winners
Season	Player Name	Wins
1977	Guillermo Vilas	131
1973	Ilie Nastase	120
1972	Ilie Nastase	119
1982	Ivan Lendl	109
1980	Ivan Lendl	108
1977	Brian Gottfried	107
1975	Arthur Ashe	102
1974	Bjorn Borg	99
1979	John McEnroe	99
1976	Jimmy Connors	98
1976	Raul Ramirez	97
1981	Ivan Lendl	95
1974	Jimmy Connors	94
1975	Manuel Orantes	94
1974	John Newcombe	93
1973	Tom Okker	92
1973	Jimmy Connors	92
1974	Guillermo Vilas	92
2006	Roger Federer	92
1975	Ilie Nastase	91

Create Player Statistics Table

A player statistics table is created to summarize wins and loses and other statitics for each player each season. Step 1: joins the atp data with itself to include first winner data and then loser data. Step 2: correct NA win totals, calcs total matches, and win_percent each year Step 3: include new columns to categorize the player data.

Show code

player_table<-full_join(
  atp_base %>%    
    select(season,winner_id, winner_name) %>%
    group_by(season, winner_id, winner_name) %>%
    summarize(season, winner_id, winner_name, num_wins = n()) %>%
    distinct() %>%
    rename(player_id = winner_id, player_name = winner_name),
  atp_base %>%
    select(season,loser_id, loser_name) %>%
    group_by(season, loser_id, loser_name) %>%
    summarize(season, loser_id, loser_name, num_loses = n()) %>%
    distinct() %>%
    rename(player_id = loser_id, player_name = loser_name),
  by=c("season"="season","player_id"="player_id","player_name"="player_name")
) %>%
  mutate(num_wins=if_else(is.na(num_wins), as.integer(0), num_wins)) %>%
  mutate(num_matches=num_wins+num_loses, win_percent=100*num_wins/(num_wins+num_loses))%>%
  mutate(decade=10*trunc(season/10))%>%
  merge(avg_ranking, by = c("season","player_id"), all.x = TRUE)%>%
  group_by(season)%>%
  mutate(rank_bin=ntile(avg_rank,10))%>%
  ungroup()%>%
  merge(tibble(rank_bin=c(1:10),rank_binA=c("Top 10th %ile","11-20th %ile","21-30th %ile","31-40th %ile","41-50th %ile","51-60th %ile","61-70th %ile","71-80th %ile","81-90th %ile","Bottom 10th %ile"),rank_bin_sort=c(1:10)),by = c("rank_bin"), all.x=TRUE)%>%
  replace_na(list(rank_binA = 'Unranked', rank_bin_sort = 11))

I found it useful later to look at match wins for rounds other then the first round so here that measure is added to the player table. It will be shown later that players in the lower ranks win very few match all year. When wins in only the first round of the tournment are excluded they win on average very few. Given that prize money for first or second round loses barely cover the expense to travel, compete, pay for coaching, trainers etc. These lower ranked players will struggle to make ends meet.

Show code

# exFirstRnd<-atp_base%>%
#   filter(draw_size%in%c(128,64,32))%>%
#   filter(draw_size != as.integer(str_replace(round,"R","")))


player_table<-merge(player_table, 
      atp_base%>%
        filter(draw_size%in%c(128,64,32))%>%
        filter(draw_size != as.integer(str_replace(round,"R","")))%>%
        select(season, winner_id)%>%
        group_by(season, winner_id)%>%
        summarise(season, winner_id, num_wins_2rplus=n())%>%
        distinct() %>%
        rename(player_id = winner_id),
      by = c("season","player_id"),
      all.x=TRUE
)%>%
  replace_na(list(num_wins_2rplus = 0))

Player total wins histogram

This chart has an early view about how many ATP players are participating in tournaments but winning very few matches.

Show code

player_table%>%
  select(season, player_id, num_wins)%>%
  filter(season %in% c(1980,1990,2000,2010))%>%
#  group_by(season, num_wins)%>%
#  summarize(season, num_wins, count = n())%>%
#  distinct()%>%
  ggplot(aes(x=num_wins))+
  geom_histogram(binwidth = 15, fill = "SkyBlue")+
#  scale_fill_brewer(palette = "Blues")+
  facet_grid(rows = vars(season))+
  labs(title="Number of ATP players and win total (Binwidth=15)", x="Wins", y="Number of Players")+
  theme_bw()

Players total wins point chart

This chart demonstrates that same point but visually it is a little more diffcult to see the trend given how many player literally win 0-1 match all year.

Show code

top_winner<-player_table%>%
  select(season, player_name, num_wins)%>%
  filter(season %in% c(1980,1990,2000,2010))%>%
  group_by(season)%>%
  slice(which.max(num_wins))

player_table%>%
  select(season, player_id, num_wins)%>%
  filter(season %in% c(1980,1990,2000,2010))%>%
  group_by(season, num_wins)%>%
  summarize(season, num_wins, count = n())%>%
  distinct()%>%
  ggplot(aes(x=num_wins, y=count))+
  geom_point(color = "Blue")+
  geom_smooth(fill = "SkyBlue") +
  geom_point(data = top_winner, y=1, x=top_winner$num_wins, color="red") +
  geom_text(data = top_winner, aes(y=1, x=top_winner$num_wins, label=top_winner$player_name), hjust=1, vjust=2, size=3)+
  facet_grid(cols = vars(season))+
  labs(title="Number of ATP players and win totals", subtitle = "(Highest win total labeled)", x="Wins", y="Number of Players")+
  theme_bw()

Player Statistics

John McEnroe’s 1984 season is considered by most as the most dominate Men’s tennis season ever (ranked #3 all-time behind Steffi Graff’s 1988 and Martina Navratilova’s 1983 seasons) according to the Bleacher Report ⁴

Show code

player_table%>%
  select(season, player_id, player_name, win_percent)%>%
  arrange(desc(win_percent))%>%
#  top_n(20, win_percent) %>%
  head(20)%>%
  kable(digits = 0, col.names = c("Season","Player ID","Player Name","Wins Percent"), caption = "Top Winners")%>%
  remove_column(c(2))

Table 2: Top Winners
Season	Player Name	Wins Percent
1984	John McEnroe	96
2005	Roger Federer	95
2006	Roger Federer	95
2015	Novak Djokovic	93
1974	Jimmy Connors	93
1986	Ivan Lendl	93
2004	Roger Federer	93
1977	Bjorn Borg	92
1982	Ivan Lendl	92
1978	Jimmy Connors	92
1980	Bjorn Borg	92
2013	Rafael Nadal	92
1979	Bjorn Borg	92
1989	Ivan Lendl	92
2018	Rafael Nadal	92
1976	Jimmy Connors	92
1987	Ivan Lendl	91
2017	Roger Federer	91
2011	Novak Djokovic	91
1985	Ivan Lendl	91

Network

Let’s take a closer look at the 1984 ATP season and see the matches being played amongst the most successful players that year.

Show code

# create data:
vSeason <- as.integer("1984")
player_nodes <- player_table%>% 
  filter(season==vSeason, num_matches>20)%>%
  arrange(desc(win_percent))%>%
  head(10)%>%
  select(player_id, player_name, win_percent)%>%
  rename(id=player_id, label=player_name)
  
player_links <- atp_base %>%
  filter(season==vSeason, loser_id%in%player_nodes$id, winner_id%in%player_nodes$id)%>%
  select(loser_id, winner_id)%>%
  group_by(loser_id, winner_id)%>%
  summarize(loser_id, winner_id, occurances=n())%>%
  distinct()%>%
  rename(from=loser_id,to=winner_id)%>%
  mutate(
    color=case_when(from==as.integer(100581) ~ "red", to==as.integer(100581) ~ "green", TRUE ~ "lightgray"),
    width=case_when(from==as.integer(100581) ~ occurances, to==as.integer(100581) ~ occurances, TRUE ~ as.integer(1)),
    label=case_when(from==as.integer(100581) ~ toString(occurances), to==as.integer(100581) ~ toString(occurances), TRUE ~ ""),
    font.size=11,
    font.color="black",
    smooth=TRUE,
#    arrows="middle",
    shadow=FALSE
    )

player_nodes<-player_nodes%>%
  merge(player_links%>%
          group_by(to)%>%
          summarize(to, total_wins=sum(occurances))%>%
          select(to, total_wins)%>%
          distinct()%>%
          rename(id = to), by=c("id"), all.x=TRUE)%>%
          replace_na(list(total_wins = 0))%>%
          mutate(
            shape="dot",
            shadow=TRUE, 
            title = paste(total_wins," top 10 wins"), 
            size=(total_wins+1)*1.5, 
            font.size=12,
            font.color="black",
            border_width = 2
            )

visNetwork(player_nodes, player_links, height = "500px", width = "100%",
           main="Matches between Top 10 Players (1984)",
           footer="Highlighting John McEnroe's wins (green) & losses (red)")%>%
  visLayout(randomSeed = 133,improvedLayout = TRUE)

Comparison of players grouped into 10 categories of ranking

The figure below begins to show the situation of match wins by ranking category for a sample of years, 1980, 1990, 2000, 2010. We can see here that average wins for an entire year get close to 0-2 for players outside the top third of the rankings. We can see a few outliers in the lower rankings and these are likely younger players on rapid raise to the uppper ranking.

Show code

player_table%>%
  filter(as.integer(decade/10) %in% c(198:201))%>%
  ggplot(aes(x=factor(reorder(rank_binA,rank_bin_sort)), y=num_wins))+
  geom_boxplot()+
  coord_flip()+
    facet_wrap(facets = vars(decade)) +
    labs(title = "Wins by relative ranking category by decade\n", x="", y="Toal Wins") +
    theme_bw()

The competitive landscape for less ranked players is gradually increaseing over time. As seen in the following figure, which now included all matched each decade (not just a sampling of years), the top third highest ranked players accounts for roughly 75% of the wins over the year.

Show code

player_table%>%
  filter(as.integer(decade/10) %in% c(198:201))%>%
  ggplot(aes(x=decade, y=num_wins, fill=factor(reorder(rank_binA,rank_bin_sort))))+
    geom_bar(position="fill", stat = "identity")+
    
    #scale_color_viridis(discrete = TRUE, option = "turbo")+
    scale_fill_viridis(discrete = TRUE, option = "turbo") +
    labs(title= "Matches Won by relative ranking each decade\n", x="Decade", y="Percentage of Matches Won", fill="Ranking Category")+
    theme_bw()

Pack Circles

Let’s take a look at this data in a slight different visual across the entire 4+ decade period of ATP match data. We can clearly see that the top third ranked professional tennis players are accounting for the lions share of maych wins.

Show code

data<-player_table%>%
  filter(season>1973)%>%
  group_by(rank_binA)%>%
  summarize(rank_binA,num_wins1 = sum(num_wins))%>%
  select(rank_binA, num_wins1)%>%
    distinct()%>%
#  filter(season == "1980")%>%
  arrange(desc(num_wins1))
#  head(10)

# player_table%>%
#   select(season, player_name, num_wins)%>%
#   filter(season == "1980")%>%
#   groud_by(season,num_wind)
#   arrange(desc(num_wins))%>%
#   head(10)


packing<-circleProgressiveLayout(data$num_wins1, sizetype='area')

data<-cbind(data,packing)

dat.gg<-circleLayoutVertices(packing, npoints=50)

ggplot() + 
  
  # Make the bubbles
  geom_polygon(data = dat.gg, aes(x, y, group = id, fill=as.factor(id)), colour = "black", alpha = 0.6) +
  
  # Add text in the center of each bubble + control its size
  geom_text(data = data, aes(x, y, size=num_wins1, label = paste(rank_binA,"\n",num_wins1, sep = ""))) +
  scale_size_continuous(range = c(1,4)) +
  scale_fill_viridis(discrete = TRUE, option = "turbo") +
  
  # General theme:
  theme_void() + 
  theme(legend.position="none") +
  coord_equal()+
  
  labs(title="Ranking Category by Win Totals\n(1973-2019)\n")

Treemap

The following treemap diagram provide another view of the entire match data set of match wins by ranking category. Interestingly even for the top 10% of ranked players the average number of wins for an entire season is less then 40. Also shown is the fact that lower ranked players are winning perhaps 1 or 2 matches all year. Clearly the prize money in these lower categories will not support an entire season of travel and coaching.

Show code

player_table%>%
  select(season, rank_binA, rank_bin_sort, num_wins)%>%
  filter(season > 1973)%>%
  group_by(rank_binA, rank_bin_sort)%>%
  summarize(rank_binA, rank_bin_sort, totalwins=sum(num_wins), count = n())%>%
  distinct()%>%
  ggplot(aes(area = totalwins/count, label = round(totalwins/count,0), fill=reorder(rank_binA,rank_bin_sort)))+
    geom_treemap()+
#  scale_fill_discrete(high="red", mid="white",low="blue",midpoint = 100) +
   geom_treemap_text(colour = "white", place = "center", size = 12)+
    #scale_color_viridis(discrete = TRUE, option = "turbo")+
    scale_fill_viridis(discrete = TRUE, option = "turbo") +
    labs(title="Average wins per player per season (all time 1973-2019)\n by Relative Ranking\n",
         x="", y="",
         fill = "Rank Percentile") +
    theme_bw()

Show code

#  labs(title="ATP 1980, Label = Number of wins, Size = Number of Players")

Finally, to complete the analysis, first round match wins will be excluded. It is clear that even for player in the top 10% of rankings on average they are winning 10 matches outside the first round of each tournament.

Show code

player_table%>%
  select(season, rank_binA, rank_bin_sort, num_wins_2rplus)%>%
  filter(season > 1973)%>%
  group_by(rank_binA, rank_bin_sort)%>%
  summarize(rank_binA, rank_bin_sort, totalwins=sum(num_wins_2rplus), count = n())%>%
  distinct()%>%
  ggplot(aes(area = totalwins/count, label = round(totalwins/count,0), fill=reorder(rank_binA,rank_bin_sort)))+
    geom_treemap()+
#  scale_fill_discrete(high="red", mid="white",low="blue",midpoint = 100) +
   geom_treemap_text(colour = "white", place = "center", size = 12)+
    #scale_color_viridis(discrete = TRUE, option = "turbo")+
    scale_fill_viridis(discrete = TRUE, option = "turbo") +
    labs(title="Average wins per player per season (all time 1973-2019)\nby Relative Ranking\nExcluding 1st Round",
         x="", y="",
         fill = "Rank Percentile") +
    theme_bw()

Show code

#  labs(title="ATP 1980, Label = Number of wins, Size = Number of Players")

Conclusion

After careful analysis of ATP match data since the start of the open era of tennis. Several conclusiongs are support by the data. 1. Overall matches, tournaments, and player participation is declining over time. 2. Playing surfaces have generally been trending toward more hard court matches but mainly inversely proportional to the decline of indoor carpet surfaces. 3. John McEnroe’s 1984 season was the most dominate season across all ATP competitions. 4. Finally: The data supports my hypothesis that there are many, many players at the fringe of ATP who are not winning enough matches to support their careers. Even players in the top tenth of the rankings are winning on average too few matches (and associate prize money) to support the travel and expenses to compete. The struggle of the less ranked players makes the ATP tour possible and those elite few players who are winning 50+ matches a year and earning the big prize money depend on this lesser competition’s participation.

For further analysis

Closer look at the top 10 or 20 players as compared to the rest of “the field”.
Closer look at ranking growth and decline as player develop and age.

Appendix 1 - About the data

The data includes detailed information about each match played on the ATP tour in a given year.

Tournament Data
- tourney-id (YYYY-###)
- tourney_name
- surface
- draw_size
- tourney_level
- tourney_date
Match Data
- match_num
- score
- best_of
- minutes
- round
Player Data (for each winner & loser):
- _id
- _seed
- _entry (WC, Q)
- _name
- _hand
- _height
- _ioc
- _age
- _rank
- _points
Match Statistics (for each w & l):
- _ace
- _df
- _svpt
- _1stIn
- _1stWon
- _2ndWon
- _svGms
- _bpSaved
- _bpFaced

#Footer

Distill is a publication format for scientific and technical writing, native to the web.

Learn more about using Distill at https://rstudio.github.io/distill.

Data files source: https://github.com/JeffSackmann/tennis_atp ↩︎
Only ATP tournament matches are include. Excludes Davis Cup, Challenge, etc.↩︎
https://en.wikipedia.org/wiki/Tennis_court ↩︎
https://bleacherreport.com/articles/1708407-ranking-the-10-most-dominant-seasons-in-tennis-history ↩︎

Comment on this article Share:

ATP Tennis Analysis - Final Project

Latest Overview

Initialize Libraries

Create Working Data Set

Data Enrichment

Calculate Ranking Information

Base Statistics

Tournament Matches

Tournaments

Players

Top Winners

Create Player Statistics Table

Player total wins histogram

Players total wins point chart

Player Statistics

Network

Comparison of players grouped into 10 categories of ranking

Pack Circles

Treemap

Conclusion

For further analysis

Appendix 1 - About the data

Reuse

Citation