library(tidyverse)
library(ggplot2)
library(here)
library(mosaic)
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Final Project: Kevin Martell Luya
Abstract
Swimming has become a fundamental aspect of human culture and has been practiced for thousands of years. They provide a platform for physical activity, competition, and entertainment. Swimming is one of the many sports that have gained popularity worldwide due to its many health benefits and inclusion in major international events like the Olympic Games. Competitive swimming features a variety of events, including freestyle, breaststroke, backstroke, and butterfly. The world records for each event are continually being broken by exceptional world top athletes who push the limits of human performance medley, with races ranging from 50 meters to 1500 meters. This paper will focus on the Top 200 world times data set, which contains historical records of the best 200 swimming times from 1980 to 2020, and answer the following research questions.
1. Are there any patterns or trends in the progression of swimming world records over time? For example, are records being broken more frequently during certain events or specific periods? Or Are world records more likely to be broken in shorter or longer races or specific stroke categories?
2. Is there a relationship between a swimmer’s age, gender, or nationality and their likelihood of breaking a world record? For example, do younger or older swimmers tend to break records more often than their peers?
3. Have advancements in technology or changes in swimming equipment impacted the frequency or magnitude of world record-breaking? For example, have swimsuits or other equipment innovations led to more world records being broken in recent years?
Data set
Data.world is a platform for data collaboration, analysis, and sharing designed to help teams and organizations discover, use, and publish high-quality data in different contexts, such as competitive swimming. Data has been centralized by this organization resulting in the Top 200 world times. In total, there are 5,200 instances and 14 columns. In this data set, each row represents a different swim record. Each column contains a specific piece of information such as the event names, swim times, swim date, event description, and athlete name, as well as information about the team they represent, their full name, gender, birth date, rank order, duration of their swim and last but not the least the city and country code where the event took place in the world. The data is sourced from various international swimming competitions, including the Olympic Games, the FINA World Championships, and national championships in different countries.
# read in the swim data frame
<- read_csv("./_data/swimming_database.csv")
df_swim# look at the head of the data
head(df_swim)
This data set has records from 716 swimmers and 474 events worldwide from 1980 to 2020.
# dimensions of the data set
dim(df_swim)
[1] 5200 14
# number of unique Event name
length(unique(df_swim$`Event Name`))
[1] 474
# number of unique Team name
length(unique(df_swim$`Team Name`))
[1] 58
# number of unique Athletes
length(unique(df_swim$`Athlete Full Name`))
[1] 716
# table of the 26 unique Swimming Events
table(df_swim$`Event description`)
Men 100 Backstroke LCM Male Men 100 Breaststroke LCM Male
200 200
Men 100 Butterfly LCM Male Men 100 Freestyle LCM Male
200 200
Men 1500 Freestyle LCM Male Men 200 Backstroke LCM Male
200 200
Men 200 Breaststroke LCM Male Men 200 Butterfly LCM Male
200 200
Men 200 Freestyle LCM Male Men 200 Medley LCM Male
200 200
Men 400 Freestyle LCM Male Men 400 Medley LCM Male
200 200
Men 800 Freestyle LCM Male Women 100 Backstroke LCM Female
200 200
Women 100 Breaststroke LCM Female Women 100 Butterfly LCM Female
200 200
Women 100 Freestyle LCM Female Women 1500 Freestyle LCM Female
200 200
Women 200 Backstroke LCM Female Women 200 Breaststroke LCM Female
200 200
Women 200 Butterfly LCM Female Women 200 Freestyle LCM Female
200 200
Women 200 Medley LCM Female Women 400 Freestyle LCM Female
200 200
Women 400 Medley LCM Female Women 800 Freestyle LCM Female
200 200
Before conducting a summary statistics of the data set, we need to compute and store in a dedicated column 1) the swimmers’ age when they competed at a given event in time and 2) the swimming event year when it took place. Then, we will mutate the table and factor the columns. This process will facilitate and shorten the code implementation for the following research questions.
Let us extract the year from the Swim Date column.
# extract the event year
<- parse_date_time(df_swim$`Swim date`, orders = "mdy")
date # concatenate the years as a new column
$event_year = year(date) df_swim
Because there are swimmers that were born before 1968 and Libridate has a difficult time parsing years before 1968 due to the “Year 2000 problem,” we created a function that add the century to the year: from this “dd-mm-yy” to “dd-mm-yyyy” format.
Now, let’s extract the swimmers’ age in the year of the competition using add_century function.
# extract the athlete birth date year
<- parse_date_time2(as.character(df_swim$`Athlete birth date`), "mdy", cutoff_2000 = 0)
date # concatenate the event_athle_age as a new column
$event_athle_age = df_swim$event_year - year(date) df_swim
Now, let’s factor the columns of interest and show the statistic of the variables swim_time and event_athle_age.
<- df_swim %>%
df_swim mutate(
`Event Name` = factor(`Event Name`),
`Team Code` = factor(`Team Code`),
`Team Name` = factor(`Team Name`),
`Athlete Full Name` = factor(`Athlete Full Name`),
`Gender` = factor(`Gender`),
`Athlete birth date` = factor(`Athlete birth date`),
`Rank_Order` = factor(`Rank_Order`),
`City` = factor(`City`),
`Country Code` = factor(`Country Code`)
)
summary(df_swim)
index Event Name
Min. : 0 Olympic Games Tokyo 2020 : 361
1st Qu.:1300 13th FINA World Championships 2009 : 293
Median :2600 18th FINA World Championships 2019 : 228
Mean :2600 Olympic Games Rio 2016 : 228
3rd Qu.:3899 19th FINA World Championships Budapest 2022: 226
Max. :5199 17th FINA World Championships 2017 : 223
(Other) :3641
Swim time Swim date Event description Team Code
Length:5200 Length:5200 Length:5200 USA :1360
Class :character Class :character Class :character AUS : 637
Mode :character Mode :character Mode :character JPN : 427
CHN : 377
GBR : 314
HUN : 287
(Other):1798
Team Name Athlete Full Name Gender
United States of America :1360 LEDECKY, Katie : 218 F:2600
Australia : 637 HOSSZU, Katinka : 129 M:2600
Japan : 427 PHELPS, Michael : 86
People's Republic of China: 377 SJOESTROEM, Sarah : 82
Great Britain : 314 SUN, Yang : 78
Hungary : 287 PALTRINIERI, Gregorio: 75
(Other) :1798 (Other) :4532
Athlete birth date Rank_Order City Country Code
03-17-97: 218 1 : 26 Tokyo : 611 USA : 710
05-03-89: 129 2 : 26 Budapest : 585 JPN : 671
05-24-94: 105 3 : 26 Rome : 420 HUN : 599
06-30-85: 86 4 : 26 Rio de Janeiro: 234 ITA : 497
08-17-93: 82 5 : 26 Gwangju : 230 CHN : 439
12-01-91: 78 6 : 26 (Other) :3118 AUS : 391
(Other) :4502 (Other):5044 NA's : 2 (Other):1893
Duration (hh:mm:ss:ff) event_year event_athle_age
Length:5200 Min. :1980 Min. : 14.0
Class :character 1st Qu.:2013 1st Qu.: 21.0
Mode :character Median :2017 Median : 24.0
Mean :2016 Mean : 32.4
3rd Qu.:2021 3rd Qu.: 26.0
Max. :2022 Max. :121.0
# stats of Swim time
fav_stats(df_swim$`Swim time`)
# stats of swimmer's age at time of competition
fav_stats(df_swim$event_athle_age)
Let us take a quick view of each country’s records from 1980 to 2022.
%>%
df_swim group_by(`Team Code`) %>%
summarise(record_counts = length(`Team Code`)) %>%
mutate(`Team Code` = fct_reorder(`Team Code`, record_counts)) %>%
ggplot() + geom_col(aes(record_counts, `Team Code`)) +
labs(title = "Country record (1980 - 2022) ",
x = "Number of records", y ="Country")
We can see that the dominant countries are The United States (USA), Australia(AUS), Japan(JPN), China(CHN), and Great Britain(GBR). Does this mean we must focus only on these five countries to answer the research questions? The answer is no. All countries are representative of our analysis, for hidden patterns might be found at the event/style level, which leads us to answer the first question. Are there any patterns or trends in the progression of swimming world records over time? We can inspect the records that have been broken more frequently in certain events and during what periods.
Let us create a bar graph that shows the number of records broken per year and see if the swimming historical data reveals a trend between 1980 and 2020.
# minimum and maximum years
<- min(df_swim$event_year)
min_year <- max(df_swim$event_year)
max_year
%>%
df_swim ggplot(aes(x = event_year)) +
geom_bar(stat = 'count') +
scale_x_continuous(breaks = seq(min_year, max_year, by = 2 )) +
labs(title = "Year records (1980 - 2022)",
x = "Years", y ="Number of records") +
theme(axis.text.x = element_text(angle = 45))
The data shows that over this range of years, swimmers of different ages, nationalities, and categories have continuously been setting new world records, suggesting that the number of swimming records has increased as time progressed.
There are five crucial peaks (2008, 2009, 2016, 2019, and 2021) in the bar graph that we will analyze since the number of records increased continuously yearly.
# filter data by years of interest
<- df_swim %>%
peaks_yrs filter(
$event_year == 2008 |
df_swim$event_year == 2009 |
df_swim$event_year == 2016 |
df_swim$event_year == 2019 |
df_swim$event_year == 2021 )
df_swim
<- peaks_yrs %>%
peaks_yrs mutate(event_year = factor(event_year))
Analyzing such years in the context of the length of race, we will see whether there is a certain distance in which swimmers have set more new records compared to others’ distances. For this graph, we do not consider the nationality, gender, or age of the swimmer who set a record since this is an analysis at the swimming-length level.
# extract the length of the races
<- peaks_yrs$`Event description`
event_descriptions <- regexpr("\\d+", event_descriptions)
reg $length_race <-regmatches(event_descriptions, reg)
peaks_yrs
# create a histogram of the peak years
ggplot(peaks_yrs, aes(length_race, fill = event_year)) +
geom_histogram(stat = "count",position="dodge2") +
theme(legend.position = "top") +
labs(x = "Length of races in meters (m)",
y = "Number of broken records",
title = "Records by swimming length") +
guides(fill = guide_legend(title = "Year of observation")) +
coord_flip()
Then, let us analyze such years in the context of swimming events and see whether there are certain swimming events in which swimmers have set more new records than other events. Even though the names of the swimming event involve gender, they are officially unique swimming events. For this graph, we do not consider the nationality, gender, or age of the swimmer who set a record since this is an analysis at the swimming-event level.
# Create a bar graph of the filtered data
%>%
peaks_yrs ggplot(aes(`Event description` )) +
geom_bar(aes(fill= event_year),
position = position_stack(reverse = TRUE)) +
theme(legend.position = "top") +
labs(x = "Type of swimming event",
y = "Number of broken records",
title = "Broken records by swimming event") +
guides(fill = guide_legend(title = "Year of observation")) +
coord_flip()
Thanks to this visualization, in general, swimmers are more likely to set new records more frequently in the following swimming events: Women 100 Backstroke LCM Female, Men 100 Breaststroke LCM Male, and Men 100 Freestyle LCM Male, for the proportions of new records increased year after year during this periods. Likewise, based on the graph, the swimming strokes more likely to be broken are breaststroke, butterfly, and freestyle: the proportions are increasing yearly.
Moving forward to the second question, to investigate the relationship between age, gender, or nationality and their likelihood of breaking a world record, we will use histograms for 1) swimmers; age at year competition, 2) swimmers’ gender, and 3) swimmers’ nationality.
Let us plot the age histograms by age from the whole data set.
%>%
df_swim ggplot( aes(event_athle_age, center = event_athle_age), bins = 100 ) +
geom_histogram(binwidth = 1,
fill = "skyblue",
color = "gray",
breaks = seq(5, 40, by = 1)) +
scale_x_continuous(breaks = seq(5, 40, by = 2),
labels = seq(5, 40, by = 2)) +
labs(title = "Age distribution at a year of competition (1980 - 2020)",
x = "Age",
y ="Number of swimmers")
We observe that swimmers aged between 21 and 25 years old at the year of the competition have set more records from 1980 to 2020. Physical development would be the most reasonable explanation. Younger swimmers may have a greater potential for physical development, as their bodies are still growing and developing. This can give them an advantage in terms of strength, power, and flexibility, which can help them to swim faster and break records.
Let us plot a histograms of the age distributions by country.
%>%
df_swim ggplot(aes(event_athle_age)) +
geom_histogram(binwidth = 1,
fill = "skyblue",
color = "gray",
show.legend = TRUE) +
xlim(10,40) +
facet_wrap(~df_swim$`Team Code`) +
labs(title = "Age distribution by country",
x = "Age",
y ="Number of swimmers")
We can see that The United States (USA), Australia(AUS), Japan(JPN), China(CHN), Great Britain(GBR), Italy(ITA), and Russia(RUS) have the swimmers whose ages were between 21 and 25 at a year of competition, suggesting that swimmers from these countries and between those ages are more likely to set new world records in future swimming champions. Genetics and body type could one explanation. Certain countries may have a higher prevalence of individuals with genetic traits that are beneficial for swimming, such as a tall and lean body type or a high percentage of fast-twitch muscle fibers.
Let us plot a third histogram disaggregated by gender.
%>%
df_swim ggplot( aes(event_athle_age)) +
geom_histogram(binwidth = 1,
fill = "skyblue",
color = "gray" ,
breaks = seq(5, 40, by = 1)) +
scale_x_continuous(breaks = seq(5, 40, by = 2),
labels = seq(5, 40, by = 2)) +
facet_wrap(~df_swim$Gender) +
labs(title = "Age distribution by gender",
x = "Age",
y ="Number of swimmers")
These histograms suggests that males between 21 and 25 years old have a higher likelihood of setting new records compared to females swimmer. Why is that? One factor could be physiology - men generally have more muscle mass and larger body size than women, which can give them a physical advantage in swimming. This can be particularly beneficial in events that require power and strength, such as the sprint events. However, the gender gap in world record-breaking is narrowing. In recent years, female swimmers have been breaking records with increasing frequency, and the gap between male and female records is slowly closing.
Have advancements in technology or changes in swimming equipment had an impact on the frequency or magnitude of world record breaking? For example, have swimsuits or other equipment innovations led to more world records being broken in recent years?
To answer the last question regarding the advancements in technology or changes in swimming equipment that had an impact on the frequency or magnitude of world record-breaking, we will 1) use all the data set, 2) create eight year categories (10-15, 16-20, 21-25, 26-30, 31-35, 36-40, 41-45, and 46+), and 3) create a new column with the swimming style (freestyle, butterfly, backstroke, breaststroke, and medley). This will help us abstract the number of swimmers based on their ages for each swimming style and see which of them they have broken more records for a given year. We want to visualize jumps in the number of records per year.
# divide categories according to age
<- df_swim %>%
best_performace mutate(age_category = case_when(
> 10 & event_athle_age <= 15 ~ "10-15",
event_athle_age > 15 & event_athle_age <= 20 ~ "16-20",
event_athle_age > 20 & event_athle_age <= 25 ~ "21-25",
event_athle_age > 25 & event_athle_age <= 30 ~ "26-30",
event_athle_age > 30 & event_athle_age <= 35 ~ "31-35",
event_athle_age > 35 & event_athle_age <= 40 ~ "36-40",
event_athle_age > 40 & event_athle_age <= 45 ~ "41-45",
event_athle_age TRUE ~ "+46" ))
# extract swimming style from Event description
for (i in 1:dim(peaks_yrs)[1]) {
$swim_style_category[i] <- strsplit(best_performace$`Event description`, " ")[[i]][3]
best_performace }
Let us plot the graph.
# Create a bar graph of the filtered data
%>%
best_performace ggplot(aes(x = event_year,
y = age_category,
color = swim_style_category,
alpha = 0.3)) +
geom_count(aes(size = after_stat(prop), group = 1)) +
facet_wrap(~swim_style_category) +
scale_size_area(max_size = 8) +
labs(title = "Records by year and swimimg category",
x = "Year",
y ="Age category")
Advancements in technology and changes in swimming equipment have had a significant impact on the frequency and magnitude of world record breaking in recent years. High-tech swimsuits were introduce in the early 2000s had a major impact on the number of world records being broken. we can observe that the proportions of world records started to increase from 2000’s. These swimsuits were designed to reduce drag and improve buoyancy, which allowed swimmers to swim faster and more efficiently. The suits were so effective that they were eventually banned by the International Swimming Federation (FINA) in 2010.
Conclusion and Discussion
This data analysis suggests that there are patterns or trends in the progression of swimming, such swimmers between 21 and 25 have a higher likelihood of breaking world records in Women 100 Backstroke LCM Female, Men 100 Breaststroke LCM Male, and Men 100 Freestyle LCM Male. We also explored the relationship between a swimmer’s age, gender, or nationality and their likelihood of breaking a world record. Finally, high-tech swimsuits have impacted the frequency or magnitude of world record-breaking starting in 2010. The top 200 world records allowed us to explore historical data, suggesting the findings we presented in this paper.
Bibliography
- For reference, you can check out [the source of the dataset](https://data.world/romanian-data/swimming-dataset-top-200-world-times)
- R as a programming language