Expanded analysis of match statistics spanning multiple decades of ATP professional tennis seasons.
This is homework assignment #4 for Jason O’Connell. I found some interesting data on professional tennis on GitHub, and I think I will use it for my final project.
For this homework I will bring in a few data files from various years and try some things to compare the data between files.
First, read the ATP data files. Here I am using only the results from 1980, 1990, 2000, and 2010 from the GitHub source.
library(tidyverse)  # provides %>%, drop_na(), bind_rows(), and ggplot() used below
ATP1980 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1980.csv")
ATP1990 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1990.csv")
ATP2000 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2000.csv")
ATP2010 <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2010.csv")
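Since the four URLs differ only in the year, the same reads could also be written as a loop. This is only a sketch, not what the code above does: `years` and `atp_url` are hypothetical names introduced here, and the download line is commented out so the snippet runs without network access.

```r
# Seasons used in this homework
years <- c(1980, 1990, 2000, 2010)

# Hypothetical helper: build the raw GitHub URL for one season's match file
atp_url <- function(year) {
  sprintf(
    "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_%d.csv",
    year
  )
}

# One URL per season, named by year so the results are easy to look up
urls <- vapply(years, atp_url, character(1))
names(urls) <- years

# Uncomment to download every season into a named list of data frames:
# atp <- lapply(urls, read.csv)
```

With the last line uncommented, `atp[["1980"]]` would hold the same data frame as `ATP1980`.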
First I will improve on my visuals from the last HW and look at them again with the new data from this homework.
# ATP1980 %>%
# drop_na(winner_ht, loser_ht) %>%
# group_by(winner_ht, loser_ht) %>%
# summarise(winner_ht,loser_ht, count = n())
ggplot(ATP1980 %>%
         drop_na(winner_ht, loser_ht) %>%
         mutate(ht_diff = winner_ht - loser_ht) %>%
         group_by(surface, ht_diff) %>%
         summarise(surface, ht_diff, count = n()),
       aes(x = ht_diff, y = count)) +
  geom_point(aes(color = surface)) +
  geom_smooth(aes(color = surface)) +
  labs(title = "1980 Match Winner vs. Loser Height Difference by Surface", x = "Height Difference", y = "Number of Matches")
This graph is a huge improvement over what I was doing in the last HW. Learning geom_smooth() in tutorial 6 really helps. I can quickly see that the rows with a blank surface are noise that I should eliminate. I can also see pretty easily that the data is skewed toward the taller player winning, and that the most matches are played on clay, then hard, then carpet(?), and finally grass. I am not sure they even play on carpet anymore, so we will check in the 2010 file.
In the other files, missing court surface isn’t an issue. Let’s redo 1980, eliminating the rows with a missing surface, and include the other years with the same plot:
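The counting step in the plot code passes the grouping columns to summarise() again along with n(). A possible simplification, offered only as a sketch, is dplyr's count(), which groups and tallies in one call and returns one row per surface and height-difference pair. The `matches` data frame below is a made-up stand-in, not real ATP rows.

```r
library(dplyr)

# Toy stand-in for a season file (made-up heights in cm, not real ATP rows)
matches <- data.frame(
  surface   = c("Clay", "Clay", "Clay", "Hard", "Hard", ""),
  winner_ht = c(185, 185, 190, 180, 188, 183),
  loser_ht  = c(180, 180, 180, 185, 183, 183)
)

counts <- matches %>%
  mutate(ht_diff = winner_ht - loser_ht) %>%
  count(surface, ht_diff, name = "count")  # one row per surface/ht_diff pair

print(counts)
```

The resulting `counts` table has the same `surface`, `ht_diff`, and `count` columns that the aes() mapping in the plot expects.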
ggplot(ATP1980 %>%
         drop_na(winner_ht, loser_ht) %>%
         filter(!grepl('clay|hard|carpet|grass', surface)) %>%
         mutate(ht_diff = winner_ht - loser_ht) %>%
         group_by(surface, ht_diff) %>%
         summarise(surface, ht_diff, count = n()),
       aes(x = ht_diff, y = count)) +
  geom_point(aes(color = surface)) +
  geom_smooth(aes(color = surface)) +
  labs(title = "1980 Match Winner vs. Loser Height Difference by Surface", x = "Height Difference", y = "Number of Matches")
Well, I can’t figure out why that didn’t work. I was hoping to eliminate the null or blank surface rows using grepl.
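One likely explanation, assuming the surface column stores capitalized labels such as "Clay" and "Hard": grepl() is case-sensitive by default, so the lowercase pattern matches nothing and `!grepl(...)` keeps every row. Note also that if the pattern had matched, the leading `!` would have kept only the blank rows, which is the opposite of the goal. The sketch below shows both effects on a made-up `surface` vector.

```r
# Toy surface column; assumes the real files use capitalized labels like "Clay"
surface <- c("Clay", "Hard", "", "Grass", "Carpet", "")

# The pattern from the plot above is lowercase, and grepl() is case-sensitive
# by default, so nothing matches and !grepl() keeps every row
keep_original <- !grepl("clay|hard|carpet|grass", surface)

# Dropping the negation and ignoring case keeps only the named surfaces
keep_fixed <- grepl("clay|hard|carpet|grass", surface, ignore.case = TRUE)

# An exact-match alternative that avoids regular expressions entirely
keep_exact <- surface %in% c("Clay", "Hard", "Carpet", "Grass")

print(surface[keep_fixed])
```

Inside the pipeline, the same idea would read `filter(grepl('clay|hard|carpet|grass', surface, ignore.case = TRUE))`.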
ggplot(ATP1990 %>%
         drop_na(winner_ht, loser_ht) %>%
         mutate(ht_diff = winner_ht - loser_ht) %>%
         group_by(surface, ht_diff) %>%
         summarise(surface, ht_diff, count = n()),
       aes(x = ht_diff, y = count)) +
  geom_point(aes(color = surface)) +
  geom_smooth(aes(color = surface)) +
  labs(title = "1990 Match Winner vs. Loser Height Difference by Surface", x = "Height Difference", y = "Number of Matches")
That one looks pretty good.
ggplot(ATP2000 %>%
         drop_na(winner_ht, loser_ht) %>%
         mutate(ht_diff = winner_ht - loser_ht) %>%
         group_by(surface, ht_diff) %>%
         summarise(surface, ht_diff, count = n()),
       aes(x = ht_diff, y = count)) +
  geom_point(aes(color = surface)) +
  geom_smooth(aes(color = surface)) +
  labs(title = "2000 Match Winner vs. Loser Height Difference by Surface", x = "Height Difference", y = "Number of Matches")
Same for this one.
ggplot(ATP2010 %>%
         drop_na(winner_ht, loser_ht) %>%
         filter(!grepl('clay|hard|grass', surface)) %>%
         mutate(ht_diff = winner_ht - loser_ht) %>%
         group_by(surface, ht_diff) %>%
         summarise(surface, ht_diff, count = n()),
       aes(x = ht_diff, y = count)) +
  geom_point(aes(color = surface)) +
  geom_smooth(aes(color = surface)) +
  labs(title = "2010 Match Winner vs. Loser Height Difference by Surface", x = "Height Difference", y = "Number of Matches")
Hmmm I’ve got a weird outlier - let’s get rid of that.
ggplot(ATP2010 %>%
         drop_na(winner_ht, loser_ht) %>%
         filter(!grepl('clay|hard|grass', surface)) %>%
         mutate(ht_diff = winner_ht - loser_ht) %>%
         filter(ht_diff < 100) %>%
         group_by(surface, ht_diff) %>%
         summarise(surface, ht_diff, count = n()),
       aes(x = ht_diff, y = count)) +
  geom_point(aes(color = surface)) +
  geom_smooth(aes(color = surface)) +
  labs(title = "2010 Match Winner vs. Loser Height Difference by Surface", x = "Height Difference", y = "Number of Matches")
I eliminated the outlier, but geom_smooth still didn’t work for this one. Weird.
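The `ht_diff < 100` filter trims only large positive differences; a mistyped winner height would produce a large negative difference and slip through. The sketch below uses made-up rows (the real cause of the 2010 outlier is not known here) to show how the suspicious rows could be inspected first and then removed symmetrically with abs().

```r
# Toy rows (made-up heights in cm); the last loser height mimics a data-entry error
matches <- data.frame(
  winner_name = c("A", "B", "C", "D"),
  winner_ht   = c(185, 178, 190, 188),
  loser_ht    = c(180, 183, 185, 15)
)
matches$ht_diff <- matches$winner_ht - matches$loser_ht

# Look at the suspicious rows before dropping them
outliers <- matches[abs(matches$ht_diff) >= 100, ]
print(outliers)

# Symmetric filter: removes extreme differences in either direction
cleaned <- matches[abs(matches$ht_diff) < 100, ]
```

In the pipeline, the equivalent step would be `filter(abs(ht_diff) < 100)`.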
Now I will combine the data sets and see what it looks like over the decades. First let’s look at the violin chart.
ggplot(bind_rows(mutate(ATP1980, season = "1980"),
                 mutate(ATP1990, season = "1990"),
                 mutate(ATP2000, season = "2000"),
                 mutate(ATP2010, season = "2010")) %>%
         drop_na(winner_ht, loser_ht) %>%
         mutate(ht_diff = winner_ht - loser_ht) %>%
         filter(ht_diff < 100) %>%
         group_by(season, ht_diff) %>%
         summarise(season, ht_diff, count = n()),
       aes(x = season, y = ht_diff, fill = season)) +
  geom_violin() +
  labs(title = "Match Winner vs. Loser Height Difference by Season", x = "Season", y = "Height Difference")
ggplot(bind_rows(mutate(ATP1980, season = "1980"),
                 mutate(ATP1990, season = "1990"),
                 mutate(ATP2000, season = "2000"),
                 mutate(ATP2010, season = "2010")) %>%
         drop_na(winner_ht, loser_ht) %>%
         mutate(ht_diff = winner_ht - loser_ht) %>%
         filter(ht_diff < 100) %>%
         group_by(season, ht_diff) %>%
         summarise(season, ht_diff, count = n()),
       aes(x = ht_diff, y = count)) +
  geom_point(aes(color = season)) +
  geom_smooth(aes(color = season)) +
  labs(title = "Match Winner vs. Loser Height Difference by Season", x = "Height Difference", y = "Number of Matches")
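Both combined plots tag each file with mutate(..., season = ...) before stacking. As an alternative sketch, bind_rows() can add that label itself through its `.id` argument when given a named list. The `s1980` and `s1990` data frames below are made-up stand-ins for the season files.

```r
library(dplyr)

# Toy stand-ins for two season files (made-up values)
s1980 <- data.frame(winner_ht = c(185, 180), loser_ht = c(180, 183))
s1990 <- data.frame(winner_ht = c(190),      loser_ht = c(185))

# The list names become the values of the new "season" column
seasons <- bind_rows(list("1980" = s1980, "1990" = s1990), .id = "season")

print(seasons)
```

Building the combined data frame once and reusing it would also avoid repeating the bind_rows() call in each plot.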
New stuff used: filter, bind_rows, geom_smooth.
Unsuccessfully used: grepl, geom_smooth (2010).
Although racket technology, fitness, and several other factors have changed the game over the years, the difference in player height doesn’t seem to predict the winner of a match.
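One way to put a number on that conclusion, sketched here with made-up values rather than the ATP files, is the share of matches won by the taller player. A share close to 0.5 would be consistent with height difference not predicting the winner; a share well above 0.5 would suggest otherwise.

```r
# Toy height differences in cm (winner minus loser); made-up values
ht_diff <- c(5, -5, 10, 0, 3, -2, 8, -7)

# Ignore matches between equally tall players, then take the share of
# the remaining matches that the taller player won
decided <- ht_diff[ht_diff != 0]
taller_win_share <- mean(decided > 0)

print(taller_win_share)
```

Applied to the real data, `ht_diff` would come from `winner_ht - loser_ht` after dropping missing heights, computed separately for each season.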
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
O'Connell (2022, Feb. 23). Data Analytics and Computational Social Science: HW4 - ATP Tennis Statiistics. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomnorthonum869986/
BibTeX citation
@misc{o'connell2022hw4, author = {O'Connell, Jason}, title = {Data Analytics and Computational Social Science: HW4 - ATP Tennis Statiistics}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomnorthonum869986/}, year = {2022} }