Final Project - Using ATP Tennis Data Files: analysis of player populations and success.
ul {line-height: 1;
}
li {line-height: 1;
}...
ATP Tennis Statistics Analysis - Final Project
This analysis looks at the ATP tennis statistics since the beginning of the open era in 1969. The data is sourced from GitHub 1 excludes any Davis Cup matches including only ATP professional tennis tournaments from each year through 2019. Years 2020 and 2021 have been excluded due to interruptions in professional tournament play due to the Covid-19 pandemic.
An analysis of the general number of players participating in tournaments and the overall number of tournaments in the dominance of the top players is presented. Some commentary about the court surface and the number of players who win very few matches on tour will be discussed.
Primary topics of inquiry:
First read ATP data files from GITHUB source (https://github.com/JeffSackmann/tennis_atp/).
The data is collected into a single dataframe by pending each Year’s ATP tournament data file. These files include both tournament match information and match information from other competitions such as the Davis Cup; Davis Cup matches will be excluded from the analysis.
Finally I am only interested in ATP tournament data so will filter our Davis Cup matches which are also included in the data set.
This r chunk will enrich the atp_base data with some variable for use later.
atp_base<-atp_base%>%
mutate(season=year(ymd(tourney_date)))%>%
# mutate_at(c("winner_rank","loser_rank"), ~replace(.,is.na(.), 9999))%>%
mutate(topten_players=case_when(
winner_rank <= 10 & loser_rank <= 10 ~ as.integer("2"),
winner_rank <= 10 | loser_rank <= 10 ~ as.integer("1"),
TRUE ~ as.integer("0"),
))
Following code collects ranking information by season and player and calculates each players average ranking by season. Players appear in the data set for each match they play in either the winner or loser column. To accurately calculate the average ranking the data must be transformed to include only one row for each player for each tournament and then average over the season.
avg_ranking<-atp_base%>%
# filter(season == 1980)%>%
select(season, tourney_id, winner_id, winner_rank, loser_id, loser_rank)%>%
pivot_longer(cols = c("winner_id","loser_id"), names_to = "winner_loser_cat", values_to = "player_id")%>%
pivot_longer(cols = ends_with("rank"), names_to = "rank_type", values_to = "rank")%>%
#the following filter is critical to eliminate duplicate rows for winner/loser since the 2 pivot longers create a 2 rows for each winner and loser. I wonly wont loser/loser and winner/winner eliminating loser/winner and winner/loser rows
filter((winner_loser_cat == "loser_id" & rank_type == "loser_rank") | (winner_loser_cat == "winner_id" & rank_type == "winner_rank"))%>%
select(season, tourney_id, player_id, rank)%>%
distinct()%>%
filter(!is.na(rank))%>%
group_by(season, player_id)%>%
summarize(avg_rank = mean(rank))
Lets start with some simple statistics by year. 2
With the advent of the open era of professional tennis in 1969 conversion from the historically amateur status tournaments to professional tournaments was accomplished over the course of 1969 - 1971 demonstrated by the growth in the number of tournaments. Since a peak in 1975 the overall number of matches played on the ATP professional tennis tour has declined. In 1988 players took more control and took a more active role in governance of the ATP and participation increased for a few years. However in the early 90s there’s a significant increase in the prize money at each tournament requiring greater levels of commitment from sponsors in an overall decline in the number of matches going forward.
Also of Interest is the type of surface that is being played on for each tournament professional tennis match. In recent years hard courts have come to dominate the game of tennis. Early in the open era of tennis tournaments four distinct surfaces were played; clay, grass, hard courts and, when indoors, carpet. In more recent years the play on carpet has been eliminated as technology has allowed for temporary hard courts to be used in Indoor Stadium arenas. Grass Court tennis originally the surface of choice for 3 out of the 4 grand slam tournaments has gradually been reduced with the conversion of the Australian Open and the US Open to various versions of hard court. Grass courts have been used at Wimbledon since 1877, the US Open from 1881 to 1974, and the Australian Open from 1905 to 1987.3 The last several decades the number of tournaments in matches played on grass courts has remained relatively stable due to the commitment of the All England Club to host Wimbledon championships on grass and several tournaments Also played on grass for players to prepare for the major.
atp_base %>%
group_by(season) %>%
summarize(season, num_matches = n()) %>%
distinct() %>%
# kable(col.names = c("Season","Total Matches"), caption = "Number of Tournament Matches by Year")
ggplot(aes(x=season,y=num_matches)) +
geom_line(color="#0000FF") +
labs(x="Season", y="ATP Matches Played", title="Total ATP Matched Played by Season\n") +
theme_classic()+
theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 12),
axis.text.y = element_text(size = 10), axis.title.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold", color = "darkblue"))
atp_base %>%
filter(str_detect('Clay|Hard|Carpet|Grass', surface)) %>%
group_by(season, surface) %>%
summarize(season, surface, num_matches = n()) %>%
distinct() %>%
# kable(col.names = c("Season","Total Matches"), caption = "Number of Tournament Matches by Year")
ggplot(aes(x=season,y=num_matches)) +
geom_line(aes(color=surface)) +
facet_grid(cols = vars(surface)) +
scale_color_manual(labels=c("Carpet","Clay","Grass","Hard"), values=c("orange", "#FF0000", "darkgreen", "#0000FF")) +
guides(x = guide_axis(n.dodge = 2))+
labs(x="Season", y="ATP Matches Played", title="Total ATP Matched Played by Season by Surface\n", color = "Surface\n") +
theme_minimal()
As can be seen in the figure below the number of tournaments being played follows the same pattern as the total number of matches played by season given that each tournament is played with a fairly consistent number of entries.
atp_base %>%
select(season,tourney_id) %>%
distinct() %>%
group_by(season) %>%
summarize(season, num_tournaments = n()) %>%
distinct() %>%
# kable(col.names = c("Season","Tournaments"), caption = "Number of Tournaments by Year")
ggplot(aes(x=season,y=num_tournaments)) +
geom_line(color="blue")+
labs(x="Season", y="ATP Tournaments Held", title="Total ATP Tournaments Held by Season\n") +
theme_classic()+
theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 12),
axis.text.y = element_text(size = 10), axis.title.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold", color = "darkblue"))
More Surprisingly however is that overall participation in the number of players participating in ATP tournaments overall has sharply decreased consistently through the decades. Historically there was a smaller contingent of players who would travel to all the tournaments in the participation from local qualifiers contributed to the overall participation rates being much higher. Since 1990 ended the escalation of prize money in sponsorships at each tournament a greater number of players have been able to support travel expenses at two tournaments throughout the globe. As a consequence each tournament is getting higher ranked players throughout the draw of the tournament. to maintain prize money in sponsorships and TV viewership getting the top ranked players to participate in Tournaments has become essential. and therefore the number of spots open to qualifiers who made live more local to the tournament geography has diminished.
bind_rows(
atp_base %>%
select(season,winner_id) %>%
distinct(),
atp_base %>%
select(season,loser_id) %>%
distinct()
) %>%
group_by(season) %>%
summarize(season, num_players = n()) %>%
distinct() %>%
# kable(col.names = c("Season","Players"), caption = "Number of Players by Year")
ggplot(aes(x=season,y=num_players)) +
geom_line(color="blue")+
labs(x="Season", y="ATP Players Involved", title="Total ATP Professional Participation by Season\n") +
theme_classic()+
theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size = 12),
axis.text.y = element_text(size = 10), axis.title.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold", color = "darkblue"))
If we look at the top players in terms of wins in a single season we can see that a significant number of players from the early 1970s and 1980s. This is due to the fact that the top players were playing in many more tournaments each week then today’s top ranked players. The only player since 1982 to be included in the top 20 wins in a single 80P season is Roger Federer in 2006 But we will see later Ted Rogers 2006 season what’s the second best season for a male professional tennis player in history.
atp_base %>%
select(season,winner_id, winner_name) %>%
group_by(season, winner_id, winner_name) %>%
summarize(season, winner_id, winner_name, num_wins = n()) %>%
distinct() %>%
arrange(desc(num_wins)) %>%
head(20)%>%
kable(col.names = c("Season","Player ID","Player Name","Wins"), caption = "Top Winners")%>%
remove_column(c(2))
Season | Player Name | Wins |
---|---|---|
1977 | Guillermo Vilas | 131 |
1973 | Ilie Nastase | 120 |
1972 | Ilie Nastase | 119 |
1982 | Ivan Lendl | 109 |
1980 | Ivan Lendl | 108 |
1977 | Brian Gottfried | 107 |
1975 | Arthur Ashe | 102 |
1974 | Bjorn Borg | 99 |
1979 | John McEnroe | 99 |
1976 | Jimmy Connors | 98 |
1976 | Raul Ramirez | 97 |
1981 | Ivan Lendl | 95 |
1974 | Jimmy Connors | 94 |
1975 | Manuel Orantes | 94 |
1974 | John Newcombe | 93 |
1973 | Tom Okker | 92 |
1973 | Jimmy Connors | 92 |
1974 | Guillermo Vilas | 92 |
2006 | Roger Federer | 92 |
1975 | Ilie Nastase | 91 |
A player statistics table is created to summarize wins and loses and other statitics for each player each season. Step 1: joins the atp data with itself to include first winner data and then loser data. Step 2: correct NA win totals, calcs total matches, and win_percent each year Step 3: include new columns to categorize the player data.
player_table<-full_join(
atp_base %>%
select(season,winner_id, winner_name) %>%
group_by(season, winner_id, winner_name) %>%
summarize(season, winner_id, winner_name, num_wins = n()) %>%
distinct() %>%
rename(player_id = winner_id, player_name = winner_name),
atp_base %>%
select(season,loser_id, loser_name) %>%
group_by(season, loser_id, loser_name) %>%
summarize(season, loser_id, loser_name, num_loses = n()) %>%
distinct() %>%
rename(player_id = loser_id, player_name = loser_name),
by=c("season"="season","player_id"="player_id","player_name"="player_name")
) %>%
mutate(num_wins=if_else(is.na(num_wins), as.integer(0), num_wins)) %>%
mutate(num_matches=num_wins+num_loses, win_percent=100*num_wins/(num_wins+num_loses))%>%
mutate(decade=10*trunc(season/10))%>%
merge(avg_ranking, by = c("season","player_id"), all.x = TRUE)%>%
group_by(season)%>%
mutate(rank_bin=ntile(avg_rank,10))%>%
ungroup()%>%
merge(tibble(rank_bin=c(1:10),rank_binA=c("Top 10th %ile","11-20th %ile","21-30th %ile","31-40th %ile","41-50th %ile","51-60th %ile","61-70th %ile","71-80th %ile","81-90th %ile","Bottom 10th %ile"),rank_bin_sort=c(1:10)),by = c("rank_bin"), all.x=TRUE)%>%
replace_na(list(rank_binA = 'Unranked', rank_bin_sort = 11))
I found it useful later to look at match wins for rounds other then the first round so here that measure is added to the player table. It will be shown later that players in the lower ranks win very few match all year. When wins in only the first round of the tournment are excluded they win on average very few. Given that prize money for first or second round loses barely cover the expense to travel, compete, pay for coaching, trainers etc. These lower ranked players will struggle to make ends meet.
# exFirstRnd<-atp_base%>%
# filter(draw_size%in%c(128,64,32))%>%
# filter(draw_size != as.integer(str_replace(round,"R","")))
player_table<-merge(player_table,
atp_base%>%
filter(draw_size%in%c(128,64,32))%>%
filter(draw_size != as.integer(str_replace(round,"R","")))%>%
select(season, winner_id)%>%
group_by(season, winner_id)%>%
summarise(season, winner_id, num_wins_2rplus=n())%>%
distinct() %>%
rename(player_id = winner_id),
by = c("season","player_id"),
all.x=TRUE
)%>%
replace_na(list(num_wins_2rplus = 0))
This chart has an early view about how many ATP players are participating in tournaments but winning very few matches.
player_table%>%
select(season, player_id, num_wins)%>%
filter(season %in% c(1980,1990,2000,2010))%>%
# group_by(season, num_wins)%>%
# summarize(season, num_wins, count = n())%>%
# distinct()%>%
ggplot(aes(x=num_wins))+
geom_histogram(binwidth = 15, fill = "SkyBlue")+
# scale_fill_brewer(palette = "Blues")+
facet_grid(rows = vars(season))+
labs(title="Number of ATP players and win total (Binwidth=15)", x="Wins", y="Number of Players")+
theme_bw()
This chart demonstrates that same point but visually it is a little more diffcult to see the trend given how many player literally win 0-1 match all year.
top_winner<-player_table%>%
select(season, player_name, num_wins)%>%
filter(season %in% c(1980,1990,2000,2010))%>%
group_by(season)%>%
slice(which.max(num_wins))
player_table%>%
select(season, player_id, num_wins)%>%
filter(season %in% c(1980,1990,2000,2010))%>%
group_by(season, num_wins)%>%
summarize(season, num_wins, count = n())%>%
distinct()%>%
ggplot(aes(x=num_wins, y=count))+
geom_point(color = "Blue")+
geom_smooth(fill = "SkyBlue") +
geom_point(data = top_winner, y=1, x=top_winner$num_wins, color="red") +
geom_text(data = top_winner, aes(y=1, x=top_winner$num_wins, label=top_winner$player_name), hjust=1, vjust=2, size=3)+
facet_grid(cols = vars(season))+
labs(title="Number of ATP players and win totals", subtitle = "(Highest win total labeled)", x="Wins", y="Number of Players")+
theme_bw()
John McEnroe’s 1984 season is considered by most as the most dominate Men’s tennis season ever (ranked #3 all-time behind Steffi Graff’s 1988 and Martina Navratilova’s 1983 seasons) according to the Bleacher Report 4
Season | Player Name | Wins Percent |
---|---|---|
1984 | John McEnroe | 96 |
2005 | Roger Federer | 95 |
2006 | Roger Federer | 95 |
2015 | Novak Djokovic | 93 |
1974 | Jimmy Connors | 93 |
1986 | Ivan Lendl | 93 |
2004 | Roger Federer | 93 |
1977 | Bjorn Borg | 92 |
1982 | Ivan Lendl | 92 |
1978 | Jimmy Connors | 92 |
1980 | Bjorn Borg | 92 |
2013 | Rafael Nadal | 92 |
1979 | Bjorn Borg | 92 |
1989 | Ivan Lendl | 92 |
2018 | Rafael Nadal | 92 |
1976 | Jimmy Connors | 92 |
1987 | Ivan Lendl | 91 |
2017 | Roger Federer | 91 |
2011 | Novak Djokovic | 91 |
1985 | Ivan Lendl | 91 |
Let’s take a closer look at the 1984 ATP season and see the matches being played amongst the most successful players that year.
# create data:
vSeason <- as.integer("1984")
player_nodes <- player_table%>%
filter(season==vSeason, num_matches>20)%>%
arrange(desc(win_percent))%>%
head(10)%>%
select(player_id, player_name, win_percent)%>%
rename(id=player_id, label=player_name)
player_links <- atp_base %>%
filter(season==vSeason, loser_id%in%player_nodes$id, winner_id%in%player_nodes$id)%>%
select(loser_id, winner_id)%>%
group_by(loser_id, winner_id)%>%
summarize(loser_id, winner_id, occurances=n())%>%
distinct()%>%
rename(from=loser_id,to=winner_id)%>%
mutate(
color=case_when(from==as.integer(100581) ~ "red", to==as.integer(100581) ~ "green", TRUE ~ "lightgray"),
width=case_when(from==as.integer(100581) ~ occurances, to==as.integer(100581) ~ occurances, TRUE ~ as.integer(1)),
label=case_when(from==as.integer(100581) ~ toString(occurances), to==as.integer(100581) ~ toString(occurances), TRUE ~ ""),
font.size=11,
font.color="black",
smooth=TRUE,
# arrows="middle",
shadow=FALSE
)
player_nodes<-player_nodes%>%
merge(player_links%>%
group_by(to)%>%
summarize(to, total_wins=sum(occurances))%>%
select(to, total_wins)%>%
distinct()%>%
rename(id = to), by=c("id"), all.x=TRUE)%>%
replace_na(list(total_wins = 0))%>%
mutate(
shape="dot",
shadow=TRUE,
title = paste(total_wins," top 10 wins"),
size=(total_wins+1)*1.5,
font.size=12,
font.color="black",
border_width = 2
)
visNetwork(player_nodes, player_links, height = "500px", width = "100%",
main="Matches between Top 10 Players (1984)",
footer="Highlighting John McEnroe's wins (green) & losses (red)")%>%
visLayout(randomSeed = 133,improvedLayout = TRUE)
.
The figure below begins to show the situation of match wins by ranking category for a sample of years, 1980, 1990, 2000, 2010. We can see here that average wins for an entire year get close to 0-2 for players outside the top third of the rankings. We can see a few outliers in the lower rankings and these are likely younger players on rapid raise to the uppper ranking.
player_table%>%
filter(as.integer(decade/10) %in% c(198:201))%>%
ggplot(aes(x=factor(reorder(rank_binA,rank_bin_sort)), y=num_wins))+
geom_boxplot()+
coord_flip()+
facet_wrap(facets = vars(decade)) +
labs(title = "Wins by relative ranking category by decade\n", x="", y="Toal Wins") +
theme_bw()
The competitive landscape for less ranked players is gradually increaseing over time. As seen in the following figure, which now included all matched each decade (not just a sampling of years), the top third highest ranked players accounts for roughly 75% of the wins over the year.
player_table%>%
filter(as.integer(decade/10) %in% c(198:201))%>%
ggplot(aes(x=decade, y=num_wins, fill=factor(reorder(rank_binA,rank_bin_sort))))+
geom_bar(position="fill", stat = "identity")+
#scale_color_viridis(discrete = TRUE, option = "turbo")+
scale_fill_viridis(discrete = TRUE, option = "turbo") +
labs(title= "Matches Won by relative ranking each decade\n", x="Decade", y="Percentage of Matches Won", fill="Ranking Category")+
theme_bw()
Let’s take a look at this data in a slight different visual across the entire 4+ decade period of ATP match data. We can clearly see that the top third ranked professional tennis players are accounting for the lions share of maych wins.
data<-player_table%>%
filter(season>1973)%>%
group_by(rank_binA)%>%
summarize(rank_binA,num_wins1 = sum(num_wins))%>%
select(rank_binA, num_wins1)%>%
distinct()%>%
# filter(season == "1980")%>%
arrange(desc(num_wins1))
# head(10)
# player_table%>%
# select(season, player_name, num_wins)%>%
# filter(season == "1980")%>%
# groud_by(season,num_wind)
# arrange(desc(num_wins))%>%
# head(10)
packing<-circleProgressiveLayout(data$num_wins1, sizetype='area')
data<-cbind(data,packing)
dat.gg<-circleLayoutVertices(packing, npoints=50)
ggplot() +
# Make the bubbles
geom_polygon(data = dat.gg, aes(x, y, group = id, fill=as.factor(id)), colour = "black", alpha = 0.6) +
# Add text in the center of each bubble + control its size
geom_text(data = data, aes(x, y, size=num_wins1, label = paste(rank_binA,"\n",num_wins1, sep = ""))) +
scale_size_continuous(range = c(1,4)) +
scale_fill_viridis(discrete = TRUE, option = "turbo") +
# General theme:
theme_void() +
theme(legend.position="none") +
coord_equal()+
labs(title="Ranking Category by Win Totals\n(1973-2019)\n")
The following treemap diagram provide another view of the entire match data set of match wins by ranking category. Interestingly even for the top 10% of ranked players the average number of wins for an entire season is less then 40. Also shown is the fact that lower ranked players are winning perhaps 1 or 2 matches all year. Clearly the prize money in these lower categories will not support an entire season of travel and coaching.
player_table%>%
select(season, rank_binA, rank_bin_sort, num_wins)%>%
filter(season > 1973)%>%
group_by(rank_binA, rank_bin_sort)%>%
summarize(rank_binA, rank_bin_sort, totalwins=sum(num_wins), count = n())%>%
distinct()%>%
ggplot(aes(area = totalwins/count, label = round(totalwins/count,0), fill=reorder(rank_binA,rank_bin_sort)))+
geom_treemap()+
# scale_fill_discrete(high="red", mid="white",low="blue",midpoint = 100) +
geom_treemap_text(colour = "white", place = "center", size = 12)+
#scale_color_viridis(discrete = TRUE, option = "turbo")+
scale_fill_viridis(discrete = TRUE, option = "turbo") +
labs(title="Average wins per player per season (all time 1973-2019)\n by Relative Ranking\n",
x="", y="",
fill = "Rank Percentile") +
theme_bw()
# labs(title="ATP 1980, Label = Number of wins, Size = Number of Players")
Finally, to complete the analysis, first round match wins will be excluded. It is clear that even for player in the top 10% of rankings on average they are winning 10 matches outside the first round of each tournament.
player_table%>%
select(season, rank_binA, rank_bin_sort, num_wins_2rplus)%>%
filter(season > 1973)%>%
group_by(rank_binA, rank_bin_sort)%>%
summarize(rank_binA, rank_bin_sort, totalwins=sum(num_wins_2rplus), count = n())%>%
distinct()%>%
ggplot(aes(area = totalwins/count, label = round(totalwins/count,0), fill=reorder(rank_binA,rank_bin_sort)))+
geom_treemap()+
# scale_fill_discrete(high="red", mid="white",low="blue",midpoint = 100) +
geom_treemap_text(colour = "white", place = "center", size = 12)+
#scale_color_viridis(discrete = TRUE, option = "turbo")+
scale_fill_viridis(discrete = TRUE, option = "turbo") +
labs(title="Average wins per player per season (all time 1973-2019)\nby Relative Ranking\nExcluding 1st Round",
x="", y="",
fill = "Rank Percentile") +
theme_bw()
# labs(title="ATP 1980, Label = Number of wins, Size = Number of Players")
After careful analysis of ATP match data since the start of the open era of tennis. Several conclusiongs are support by the data. 1. Overall matches, tournaments, and player participation is declining over time. 2. Playing surfaces have generally been trending toward more hard court matches but mainly inversely proportional to the decline of indoor carpet surfaces. 3. John McEnroe’s 1984 season was the most dominate season across all ATP competitions. 4. Finally: The data supports my hypothesis that there are many, many players at the fringe of ATP who are not winning enough matches to support their careers. Even players in the top tenth of the rankings are winning on average too few matches (and associate prize money) to support the travel and expenses to compete. The struggle of the less ranked players makes the ATP tour possible and those elite few players who are winning 50+ matches a year and earning the big prize money depend on this lesser competition’s participation.
The data includes detailed information about each match played on the ATP tour in a given year.
Tournament Data
Match Data
Player Data (for each winner & loser):
Match Statistics (for each w & l):
#Footer
Distill is a publication format for scientific and technical writing, native to the web.
Learn more about using Distill at https://rstudio.github.io/distill.
Data files source: https://github.com/JeffSackmann/tennis_atp↩︎
Only ATP tournament matches are include. Excludes Davis Cup, Challenge, etc.↩︎
https://bleacherreport.com/articles/1708407-ranking-the-10-most-dominant-seasons-in-tennis-history↩︎
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
O'Connell (2022, May 11). Data Analytics and Computational Social Science: ATP Tennis Analysis - Final Project. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsnorthonumgithubiodacss601-blogposts2022-05-02-finaprojectv3/
BibTeX citation
@misc{o'connell2022atp, author = {O'Connell, Jason}, title = {Data Analytics and Computational Social Science: ATP Tennis Analysis - Final Project}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsnorthonumgithubiodacss601-blogposts2022-05-02-finaprojectv3/}, year = {2022} }