Analysis on Olympics from 1896 to 2016
Author

Srujan Kagitala

Published

July 13, 2023

Code
library(tidyverse)
library(ggplot2)
library(treemap)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Overview

For my final project, I would do exploratory analysis on the Olympics data right from the year it started in 1896 in Athena till 2016 in Rio de Janeiro.

Olympics

Olympics is the most decorated sporting event in an athlete’s career. The Olympics are the leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world’s foremost sports competition with more than 200 teams, representing sovereign states and territories, participating. The Olympic Games are normally held every four years, and since 1994, have alternated between the Summer and Winter Olympics every two years during the four-year period. The first, second, and third place finishers in each event receive Olympic medals: gold, silver, and bronze, respectively.

Dataset

The dataset I would use for this analysis is downloaded from Kaggle. The original data has been scraped from https://www.sports-reference.com/ in May 2018 and cleaned by Samruddhi Mhatre.

This dataset consists of Olympics data of over a century, from the year 1896 to 2016. Studying this dataset will help to understand the patterns followed in the games of Olympics, patterns of the most successful athletes and countries in their Olympics journey and much more!

Code
olympics_data <- read_csv("_data/athlete_events_olympics.csv")
olympics_data
ABCDEFGHIJ0123456789
ID
<dbl>
Name
<chr>
Sex
<chr>
Age
<dbl>
Height
<dbl>
Weight
<dbl>
Team
<chr>
NOC
<chr>
Games
<chr>
Year
<dbl>
1A DijiangM2418080.0ChinaCHN1992 Summer1992
2A LamusiM2317060.0ChinaCHN2012 Summer2012
3Gunnar Nielsen AabyM24NANADenmarkDEN1920 Summer1920
4Edgar Lindenau AabyeM34NANADenmark/SwedenDEN1900 Summer1900
5Christine Jacoba AaftinkF2118582.0NetherlandsNED1988 Winter1988
5Christine Jacoba AaftinkF2118582.0NetherlandsNED1988 Winter1988
5Christine Jacoba AaftinkF2518582.0NetherlandsNED1992 Winter1992
5Christine Jacoba AaftinkF2518582.0NetherlandsNED1992 Winter1992
5Christine Jacoba AaftinkF2718582.0NetherlandsNED1994 Winter1994
5Christine Jacoba AaftinkF2718582.0NetherlandsNED1994 Winter1994

The data set has 271116 observations and 15 data points recorded per observation. Each observation records variables like ID, Name, Sex, Age, Height, Weight, Team, NOC, Games, Year, Season, City, Sport, Event, Medal.

The youngest and the oldest athlete that ever participated are 10 and 97 years of age respectively. Athletes from across the world compete in 765 events that happen across 66 sports. These happen in 2 Olympic seasons i.e Summer, Winter.

Data Cleaning

The dataset is mostly clean and we don’t have to work around much. Although, for doing some analysis, we might need to create various subsets of data like medals, teams, etc, drop na values for a few variables and mutate new variables(Most of these can be done during the analysis). We need to convert categorical variables into factor to get overall summary across various variables. We don’t need the “ID” column for our analysis as it only serves the purpose of a key.

Code
olympics_data <- olympics_data %>%
  mutate_if(is_character, as.factor) %>%
  select(-ID)

medals <- olympics_data %>%
  filter(!is.na(Medal))

summary(olympics_data)
                       Name        Sex             Age            Height     
 Robert Tait McKenzie    :    58   F: 74522   Min.   :10.00   Min.   :127.0  
 Heikki Ilmari Savolainen:    39   M:196594   1st Qu.:21.00   1st Qu.:168.0  
 Joseph "Josy" Stoffel   :    38              Median :24.00   Median :175.0  
 Ioannis Theofilakis     :    36              Mean   :25.56   Mean   :175.3  
 Takashi Ono             :    33              3rd Qu.:28.00   3rd Qu.:183.0  
 Alexandros Theofilakis  :    32              Max.   :97.00   Max.   :226.0  
 (Other)                 :270880              NA's   :9474    NA's   :60171  
     Weight                 Team             NOC                 Games       
 Min.   : 25.0   United States: 17847   USA    : 18853   2000 Summer: 13821  
 1st Qu.: 60.0   France       : 11988   FRA    : 12758   1996 Summer: 13780  
 Median : 70.0   Great Britain: 11404   GBR    : 12256   2016 Summer: 13688  
 Mean   : 70.7   Italy        : 10260   ITA    : 10715   2008 Summer: 13602  
 3rd Qu.: 79.0   Germany      :  9326   GER    :  9830   2004 Summer: 13443  
 Max.   :214.0   Canada       :  9279   CAN    :  9733   1992 Summer: 12977  
 NA's   :62875   (Other)      :201012   (Other):196971   (Other)    :189805  
      Year         Season                   City               Sport       
 Min.   :1896   Summer:222552   London        : 22426   Athletics : 38624  
 1st Qu.:1960   Winter: 48564   Athina        : 15556   Gymnastics: 26707  
 Median :1988                   Sydney        : 13821   Swimming  : 23195  
 Mean   :1978                   Atlanta       : 13780   Shooting  : 11448  
 3rd Qu.:2002                   Rio de Janeiro: 13688   Cycling   : 10859  
 Max.   :2016                   Beijing       : 13602   Fencing   : 10735  
                                (Other)       :178243   (Other)   :149548  
                                 Event           Medal       
 Football Men's Football            :  5733   Bronze: 13295  
 Ice Hockey Men's Ice Hockey        :  4762   Gold  : 13372  
 Hockey Men's Hockey                :  3958   Silver: 13116  
 Water Polo Men's Water Polo        :  3358   NA's  :231333  
 Basketball Men's Basketball        :  3280                  
 Cycling Men's Road Race, Individual:  2947                  
 (Other)                            :247078                  
Code
olympics_data
ABCDEFGHIJ0123456789
Name
<fct>
Sex
<fct>
Age
<dbl>
Height
<dbl>
Weight
<dbl>
Team
<fct>
NOC
<fct>
Games
<fct>
Year
<dbl>
Season
<fct>
A DijiangM2418080.0ChinaCHN1992 Summer1992Summer
A LamusiM2317060.0ChinaCHN2012 Summer2012Summer
Gunnar Nielsen AabyM24NANADenmarkDEN1920 Summer1920Summer
Edgar Lindenau AabyeM34NANADenmark/SwedenDEN1900 Summer1900Summer
Christine Jacoba AaftinkF2118582.0NetherlandsNED1988 Winter1988Winter
Christine Jacoba AaftinkF2118582.0NetherlandsNED1988 Winter1988Winter
Christine Jacoba AaftinkF2518582.0NetherlandsNED1992 Winter1992Winter
Christine Jacoba AaftinkF2518582.0NetherlandsNED1992 Winter1992Winter
Christine Jacoba AaftinkF2718582.0NetherlandsNED1994 Winter1994Winter
Christine Jacoba AaftinkF2718582.0NetherlandsNED1994 Winter1994Winter
Code
medals
ABCDEFGHIJ0123456789
Name
<fct>
Sex
<fct>
Age
<dbl>
Height
<dbl>
Weight
<dbl>
Team
<fct>
NOC
<fct>
Games
<fct>
Year
<dbl>
Season
<fct>
Edgar Lindenau AabyeM34NANADenmark/SwedenDEN1900 Summer1900Summer
Arvo Ossian AaltonenM30NANAFinlandFIN1920 Summer1920Summer
Arvo Ossian AaltonenM30NANAFinlandFIN1920 Summer1920Summer
Juhamatti Tapio AaltonenM2818485.00000FinlandFIN2014 Winter2014Winter
Paavo Johannes AaltonenM2817564.00000FinlandFIN1948 Summer1948Summer
Paavo Johannes AaltonenM2817564.00000FinlandFIN1948 Summer1948Summer
Paavo Johannes AaltonenM2817564.00000FinlandFIN1948 Summer1948Summer
Paavo Johannes AaltonenM2817564.00000FinlandFIN1948 Summer1948Summer
Paavo Johannes AaltonenM3217564.00000FinlandFIN1952 Summer1952Summer
Kjetil Andr AamodtM2017685.00000NorwayNOR1992 Winter1992Winter

Insights

Code
#Type of medals won by teams.
teams_medals_type <- medals %>%
  group_by(NOC, Medal) %>%
  summarise(count = n())

#Total medals won by teams
teams_medals_total <- medals %>%
  group_by(NOC) %>%
  summarise(total_medals = n()) %>%
  arrange(desc(total_medals))

#Top 50 countries(by total medals won) medal_type tally.
(teams_medals_tally <- inner_join(teams_medals_type,
                              teams_medals_total[1:50, ],
                              by = "NOC") %>%
    arrange(desc(total_medals)))
ABCDEFGHIJ0123456789
NOC
<fct>
Medal
<fct>
count
<int>
total_medals
<int>
USABronze13585637
USAGold26385637
USASilver16415637
URSBronze6892503
URSGold10822503
URSSilver7322503
GERBronze7462165
GERGold7452165
GERSilver6742165
GBRBronze6512068
Code
#Bar plot of medals won by top 50 countries.
ggplot(data = teams_medals_tally , aes(x= reorder(NOC, total_medals), y = count)) +
  geom_bar(stat = "identity",
           mapping = aes(fill = Medal),
           position = "stack") +
  scale_fill_manual(values=c('brown', 'gold', 'darkgrey')) +
  theme_classic() +
  labs(title ="Top 50 countries that won most medals",
       y = "No of medals",
       x = "Nation of Origin",
       fill = "Medal")+
  coord_flip()

We can see that USA has won the highest number of medals, more than double the number of medals won by Soviet Union.

Code
#Sport with events count
(sport_events <- olympics_data %>%
  distinct(Sport, Event) %>%
  group_by(Sport) %>%
  summarise(no_events = n()) %>%
  arrange(desc(no_events)))
ABCDEFGHIJ0123456789
Sport
<fct>
no_events
<int>
Athletics83
Shooting83
Swimming55
Cycling44
Sailing38
Wrestling30
Archery29
Art Competitions29
Canoeing27
Gymnastics27
Code
#Treemap representing each Sport proportional to number of events in olympics
treemap(sport_events,
        index = "Sport",
        vSize = "no_events",
        type = "index",
        fontsize.labels = 25,
        fontcolor.labels = "black",
        align.labels=list(
          c("center", "center")),
        inflate.labels=F,
        palette = "Set1",
        title="Treemap representing each Sport proportional to number of events in olympics",
        fontsize.title=12,
        bg.labels = 0)

Athletics and shooting are the sports with highest number of events (83 events each) followed by swimming with 55 events.

Code
#Age distribution among females and males in olympics.
ggplot(olympics_data, aes(x = Age, na.rm = TRUE)) +
  geom_histogram(bins = 85, fill="darkred", alpha=0.7) +
  theme_linedraw() +
  facet_wrap(vars(Sex), nrow = 2, ncol = 1) +
  labs(title = "Age distribution of atheletes in Olympics",
       x = "Age",
       y = "Counts")

We can see that the mean age of men athletes is higher than the mean age of women athletes. Most frequent age in men athletes is approximately equal to the most frequent age in women athletes.

Code
#Times series for how top 10 nations performed over years.
teams_medals_yearly <- medals %>%
  filter(NOC %in% teams_medals_total[1:9,]$NOC) %>%
  group_by(NOC, Year) %>%
  summarise(total_medals = n()) %>%
  arrange(NOC,Year)

ggplot(teams_medals_yearly,
       aes(x= Year, y = total_medals)) + 
  geom_line(color = "darkgreen") +
  facet_wrap(~NOC, scales = "free") +
  #scale_x_continuous(breaks = c(2015, 2016, 2017),
                     #labels = c("2015", "2016", "2017")) +
  theme_linedraw()+
  labs(title = "Medals won over years by top 10 countries since 1896",
       x = "Year",
       y = "Number of medals")

Research Questions

  1. Is there an unbiased ranking system to determine the rankings of nations in Olympics? Can we rank different nations on varied ranking systems (different weightage for gold, silver and bronze) and observe how their ranks differ based on weightage given to gold, silver and bronze medals? - Ranking System section
  2. Can we identify the most decorated athlete of all time, most decorated men and women athlete? Can we also identify impactful performances of athletes? - Decorated Athletes section
  3. Can we identify the age of men and women athletes where their performance is maximized? Does this differ for countries? - Peak Performant Age section
  4. Can we identify if there is a correlation of features like height/weight/age to specific sports? (For example if height positively impacts basketball, age helps in shooting or athletics, etc.) Does it equally hold for both the genders? Correlation of features to Sport section

Ranking System

Is there an unbiased ranking system to determine the rankings of nations in Olympics? Can we rank different nations on varied ranking systems (different weightage for gold, silver and bronze) and observe how their ranks differ based on weightage given to gold, silver and bronze medals?

Code
(countries_participated_year <- olympics_data %>%
  distinct(NOC, Year)%>%
  group_by(Year) %>%
  summarise(no_countries_participated = n()) %>%
  arrange(desc(no_countries_participated)))
ABCDEFGHIJ0123456789
Year
<dbl>
no_countries_participated
<int>
2016207
2012205
2008204
2004201
2000200
1996197
1992170
1988160
1984148
1972121

230 countries participated in Olympics since the start till the event in 2016. So, we will analyse how varied ranking systems for different years(approx two or three years) in which the number of countries participated is greater than 200 to ensure that we find an unbiased ranking system. We will use that system to rank nations for one random year and also over the years.

Design.

The total weighted medal values for each country determine the country’s rank in the Olympics. The weighted value of the medals won by a country is found by multiplying the number of gold, silver and bronze medals by their respective weight and then summing them. Bronze medals are always worth one point. Gold medals can’t be worth less than silver and silver can’t be worth less than bronze.

A ranking system is defined by silver multiplier and gold multiplier (bronze is always worth 1 point).The weighted value for silver in a ranking system is calculated by multiplying the number of silver medals by weight multiplier of silver. The weight for gold in a ranking system is calculated by multiplying the number of gold medals by weight multiplier of gold. Each country’s weighted medal values are summed for each medal. These totals are ranked in such a fashion that the lowest rank is allotted to the country with the highest weighted value.

Description of Ranking Systems used in the analysis. (Each system is defined by silver and gold multiplier)

  1. System 1 - Silver multiplier = Gold multiplier = 1. All medals are weighed equally. So, the nations are ranked based on the total medals won.
  2. System 2 - Silver multiplier = 2, Gold multiplier = 5.
  3. System 3 - Silver multiplier = 5, Gold multiplier = 5.
  4. System 4 - Silver multiplier = 2, Gold multiplier = 10.
  5. System 5 - Silver multiplier = 5, Gold multiplier = 10.
  6. System 6 - Silver multiplier = 5, Gold multiplier = 20.
  7. System 7 - Bronze mutliplier = Silver multiplier = 0, Gold multiplier = 1. Only gold medals are important.

I will verify the ranking systems for years 2016 and 2008.

Code
(medals_2016 <- medals %>%
  filter(Year == 2016) %>%
  group_by(NOC, Medal) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from=Medal, values_from=count, values_fill=0) %>%
  ungroup %>%
  mutate(system_1 = Gold + Silver + Bronze,
         rank_1 = min_rank(-system_1),
         system_2 = 5*Gold + 2*Silver + Bronze,
         rank_2 = min_rank(-system_2),
         system_3 = 5*Gold + 5*Silver + Bronze,
         rank_3 = min_rank(-system_3),
         system_4 = 10*Gold + 2*Silver + Bronze,
         rank_4 = min_rank(-system_4),
         system_5 = 10*Gold + 5*Silver + Bronze,
         rank_5 = min_rank(-system_5),
         system_6 = 20*Gold + 5*Silver + Bronze,
         rank_6 = min_rank(-system_6),
         system_7 = Gold,
         rank_7 = min_rank(-system_7)) %>%
   arrange(rank_1))
ABCDEFGHIJ0123456789
NOC
<fct>
Silver
<int>
Gold
<int>
Bronze
<int>
system_1
<int>
rank_1
<int>
system_2
<dbl>
rank_2
<int>
system_3
<dbl>
rank_3
<int>
USA54139712641874110361
GER434967159239835273
GBR556426145345626212
RUS285235115435144354
CHN304637113532754175
FRA55202196623163966
AUS34232582720873107
ITA40824728144102648
CAN4461699891710120
JPN1317346410145918411
Code
ranks_2016 <- medals_2016 %>%
  select(NOC, rank_1, rank_2, rank_3, rank_4, rank_5, rank_6, rank_7) %>%
  arrange(rank_1) %>%
  slice(1:50) %>%
  pivot_longer(c("rank_1", "rank_2", "rank_3", "rank_4", "rank_5", "rank_6", "rank_7"),
               names_to = "type",
               values_to = "rank")

ggplot(ranks_2016, aes(x = type, y=NOC, label=rank, fill=rank)) +
  geom_tile() +
  geom_text(color = "white", size = 2)+
  scale_fill_continuous(low = "darkorange",
                        high = "darkblue",
                        name = "rank") +
  scale_x_discrete(labels=c('1:1:1',
                            '5:2:1',
                            '5:5:1',
                            '10:2:1',
                            '10:5:1',
                            '20:5:1',
                            '1:0:0')) +
  theme(axis.text.x = element_text(angle = 90),
        axis.text.y = element_text(size = 6))+
  labs(title = "Heatmap of rankings for top 50 total medal winning countries in 2016",
       subtitle = "X-label represent weight multiplier for each medal type",
       y = "Nation",
       x = "Ranking System (Gold:Silver:Bronze)")

From the above heatmap of rankings for year 2016, we can see that, rank_5 has the least deviation from rankings by other systems for most of the countries. Also, rank_2 is fairly close to most of the other rankings. Hence, either of them can be used as the best estimator of country rankings for the year 2016. Likewise, let’s verify for year 2008 and see if there is a common best system for both the years.

Code
(medals_2008 <- medals %>%
  filter(Year == 2008) %>%
  group_by(NOC, Medal) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from=Medal, values_from=count, values_fill=0) %>%
  ungroup %>%
  mutate(system_1 = Gold + Silver + Bronze,
         rank_1 = min_rank(-system_1),
         system_2 = 5*Gold + 2*Silver + Bronze,
         rank_2 = min_rank(-system_2),
         system_3 = 5*Gold + 5*Silver + Bronze,
         rank_3 = min_rank(-system_3),
         system_4 = 10*Gold + 2*Silver + Bronze,
         rank_4 = min_rank(-system_4),
         system_5 = 10*Gold + 5*Silver + Bronze,
         rank_5 = min_rank(-system_5),
         system_6 = 20*Gold + 5*Silver + Bronze,
         rank_6 = min_rank(-system_6),
         system_7 = Gold,
         rank_7 = min_rank(-system_7)) %>%
   arrange(rank_1))
ABCDEFGHIJ0123456789
NOC
<fct>
Bronze
<int>
Silver
<int>
Gold
<int>
system_1
<int>
rank_1
<int>
system_2
<dbl>
rank_2
<int>
system_3
<dbl>
rank_3
<int>
USA801101273171935112651
CHN575374184253326922
AUS764231149331544414
RUS534643142436034983
GER41164299528353315
GBR25253181623083056
BRA3034147871681027011
KOR26114178725362868
FRA282425779201927310
ESP164777010145122868
Code
ranks_2008 <- medals_2008 %>%
  select(NOC, rank_1, rank_2, rank_3, rank_4, rank_5, rank_6, rank_7) %>%
  arrange(rank_1) %>%
  slice(1:50) %>%
  pivot_longer(c("rank_1", "rank_2", "rank_3", "rank_4", "rank_5", "rank_6", "rank_7"),
               names_to = "type",
               values_to = "rank")

ggplot(ranks_2008, aes(x = type, y=NOC, label=rank, fill=rank)) +
  geom_tile() +
  geom_text(color = "white", size = 2)+
  scale_fill_continuous(low = "darkorange",
                        high = "darkblue",
                        name = "rank") +
  scale_x_discrete(labels=c('1:1:1',
                            '5:2:1',
                            '5:5:1',
                            '10:2:1',
                            '10:5:1',
                            '20:5:1',
                            '1:0:0')) +
  theme(axis.text.x = element_text(angle = 90),
        axis.text.y = element_text(size = 6))+
  labs(title = "Heatmap of rankings for top 50 total medal winning countries in 2008",
       subtitle = "X-label represent weight multiplier for each medal type",
       y = "Nation",
       x = "Ranking System (Gold:Silver:Bronze)")

From the above heatmap of rankings for year 2008, we can see that, rank_5 has the least deviation from rankings by other systems for most of the countries. Also, rank_3 is fairly close to most of the other rankings. Hence, either of them can be used as the best estimators of country rankings for the year 2008.

Analysis on 2008 and 2016 showed that ranking system 5 has the least deviation from other rankings. So, we can use this as the best estimator for ranking nations. This might not be completely unbiased but it fairly ranks the countries according to the medals won.

Below table shows rankings of countries for the year 1912 using system 5.

Code
(medals_1912 <- medals %>%
  filter(Year == 1912) %>%
  group_by(NOC, Medal) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from=Medal, values_from=count, values_fill=0) %>%
  ungroup %>%
  mutate(points = 10*Gold + 5*Silver + Bronze,
         rank = min_rank(-points)) %>%
  arrange(rank))
ABCDEFGHIJ0123456789
NOC
<fct>
Bronze
<int>
Gold
<int>
Silver
<int>
points
<dbl>
rank
<int>
SWE251036213651
GBR5947648492
USA3646256213
NOR344064644
DEN335493285
FIN229403126
GER2311192287
ITA22212278
HUN310171889
FRA5101116010

Let’s see the overall rankings from 1896 to 2016 using ranking system 5. We will rank based on average_points per appearance. Total points is divided by number of appearances to eliminate bias (to some extent) that occurs from the fact that a few nations might take part more number of times compared to others. Only nations above 10th percentile of number of appearances are ranked because there could be outliers on the lower end (especially countries who participated only once).

Code
medals_all_years <- medals %>%
  group_by(NOC, Medal) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from=Medal, values_from=count, values_fill=0) %>%
  ungroup %>%
  mutate(points = 10*Gold + 5*Silver + Bronze)

appearances <- olympics_data %>%
  distinct(NOC, Year) %>%
  group_by(NOC) %>%
  summarise(no_appearances = n())

no_appearances_10p <- quantile(appearances$no_appearances, probs = 0.1)

appearances <- appearances %>%
  filter(no_appearances > no_appearances_10p )

medals_all_years <- inner_join(medals_all_years,
                               appearances,
                               by = "NOC") %>%
  mutate(avg_points_per_appearance = points/no_appearances,
         rank = min_rank(-avg_points_per_appearance)) %>%
  arrange(rank)

medals_all_years
ABCDEFGHIJ0123456789
NOC
<fct>
Bronze
<int>
Silver
<int>
Gold
<int>
points
<dbl>
no_appearances
<int>
avg_points_per_appearance
<dbl>
rank
<int>
URS689732108215169101.516900e+031
USA13581641263835943351.026943e+032
GDR281327397588669.810000e+023
FRG233194159279364.655000e+024
GER74667474511566264.448462e+025
RUS4083673906143163.839375e+026
GBR65173967811126353.178857e+027
CHN2923473505527202.763500e+028
ITA5315315758936352.553143e+029
FRA6666105018726352.493143e+0210

Decorated Athletes

Can we identify the most decorated athlete of all time, most decorated men and women athlete?

First we compute the weighted value of medals (points) won by athletes across years using ranking system 5 (the best ranking system as shown previously). The most decorated athlete is definitely based on the number of points earned.

Another points system to rank athletes performance can be designed based on number of events the athlete has participated in years and the number of appearances over years. This ranking can be defined as impact rankings. This system tries to give weightage to closely comparable points based on time frame and participation in number of events(either lesser appearances or lesser events are weighted higher). Impact ranking is based on extrapolated impact points. Impact points are calculated by using points, normalized number of events (normal_events) and normalized number of appearances(normal_years). Only athletes with number of events and number of appearances above 10th percentile are ranked because there could be outliers on the lower end (especially for those who either participated in a single event or only appeared once). impact_points=(pointsnormal_events)/normal_years

Code
athletes_gender <- olympics_data %>%
  distinct(Name, Sex) %>%
  select(Name, Sex)

athlete_medals_all_years <- medals %>%
  group_by(Name, Medal) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  pivot_wider(names_from=Medal, values_from=count, values_fill=0) %>%
  ungroup %>%
  mutate(points = 10*Gold + 5*Silver + Bronze,
         rank_decorated = min_rank(-points))

decorated_rankings <- inner_join(athlete_medals_all_years,
                                 athletes_gender,
                                 by = "Name") %>%
  arrange(rank_decorated)

appearances_years <- olympics_data %>%
  distinct(Name, Year) %>%
  group_by(Name) %>%
  summarise(no_years = n()) %>%
  mutate(normal_years = (no_years - min(no_years))/(max(no_years) - min(no_years)))

no_years_10p <- quantile(appearances_years$no_years, probs = 0.1)

appearances_years <- appearances_years %>%
  filter(no_years > no_years_10p )

appearances_events <- olympics_data %>%
  group_by(Name) %>%
  summarise(no_events = n()) %>%
  mutate(normal_events = (no_events - min(no_events))/(max(no_events) - min(no_events)))

no_events_10p <- quantile(appearances_events$no_events, probs = 0.1)

appearances_events <- appearances_events %>%
  filter(no_events > no_events_10p )

athlete_medals_events_all_years <- inner_join(athlete_medals_all_years,
                               appearances_events,
                               by = "Name")

athlete_medals_events_years <- inner_join(athlete_medals_events_all_years,
                               appearances_years,
                               by = "Name") %>%
  mutate(impact_points = (points * normal_events)/normal_years,
         rank_impact = min_rank(-impact_points)) %>%
  select(-rank_decorated, -normal_events, -normal_years)

impact_rankings <- inner_join(athlete_medals_events_years,
                                 athletes_gender,
                                 by = "Name") %>%
  arrange(rank_impact)

decorated_rankings
ABCDEFGHIJ0123456789
Name
<fct>
Gold
<int>
Bronze
<int>
Silver
<int>
points
<dbl>
rank_decorated
<int>
Sex
<fct>
Michael Fred Phelps, II23232471M
Larysa Semenivna Latynina (Diriy-)9451192F
Paavo Johannes Nurmi9031053M
Ole Einar Bjrndalen8141014M
Birgit Fischer-Schmidt8041005F
Raymond Clarence "Ray" Ewry10001005M
Nikolay Yefimovich Andrianov735987M
Jennifer Elisabeth "Jenny" Thompson (-Cumpelik)813968F
Mark Andrew Spitz911968M
Sawao Kato813968M
Code
impact_rankings
ABCDEFGHIJ0123456789
Name
<fct>
Gold
<int>
Bronze
<int>
Silver
<int>
points
<dbl>
no_events
<int>
no_years
<int>
impact_points
<dbl>
rank_impact
<int>
Sex
<fct>
Michael Fred Phelps, II2323247305282.75000001M
Viktor Ivanovych Chukarin71386162203.68421052M
Nikolay Yefimovich Andrianov73598243177.94736843M
Sawao Kato81396243174.31578954M
Akinori Nakayama62272162170.52631585M
Larysa Semenivna Latynina (Diriy-)945119193169.10526326F
Borys Anfiyanovych Shakhlin72492243167.05263167M
Mark Andrew Spitz91196122166.73684218M
Vitaly Venediktovich Shcherbo64064162151.57894749M
Aleksandr Nikolayevich Dityatin31661162144.473684210M

From the above rankings, we can conclude that Michael Fred Phelps, II is the most decorated and impactful player ever in the history of the Olympics. He is also the the most decorated and impactful athlete in men. He won a total of 23 gold, 3 silver and 2 bronze medals in 5 appearances across 30 events. Whereas, Larysa Semenivna Latynina (Diriy-) is the most decorated and impactful women athlete. She won 9 gold, 5 Silver and 4 bronze medals in 3 appearances across 19 events.

Code
ggplot(decorated_rankings, aes(points)) +
  geom_histogram( bins = 60, fill="darkblue") +
  theme_linedraw() +
  facet_wrap(~Sex, scales = "free") +
  labs(title = "Athletes points distribution")

Code
ggplot(impact_rankings, aes(impact_points)) +
  geom_histogram(bins = 60, fill="darkorange") +
  theme_linedraw() +
  facet_wrap(~Sex, scales = "free") +
  labs(title = "Athletes impact points distribution",
       x = "points")

Peak Performant Age

Can we identify the age of men and women athletes where their performance is maximized? Does this differ for countries?

Here we try to identify the age where most of the athletes won medals. This is a direct reflection about the peak performance of an athlete.

Code
olympics_medals_encode <- olympics_data %>%
  mutate(is_medal_won = case_when(
         Medal == "Gold" | Medal == "Silver" | Medal == "Bronze" ~ "medal",
         TRUE ~ "no_medal"))

age_wise_medals <-  olympics_medals_encode%>%
  filter(!is.na(Age)) %>%
  group_by(Sex, Age, is_medal_won) %>%
  summarise(no_medals = n())

ggplot(age_wise_medals, aes(x=Age, y=no_medals, size = no_medals,color=is_medal_won)) +
  geom_point(alpha=0.6) +
  scale_size(range = c(.1, 10), name="Medals")+
  theme_linedraw()+
  facet_wrap(vars(Sex),nrow=2, ncol=1) +
  labs(title = "Scatter plot for medals won vs age",
       subtitle = "Size of bubble represents number of medals won",
       x = "age",
       y = "Number of Medals",
       color = "Position")+
  theme(legend.position = "top")

From the above visualizations, we can conclude that peak performant age for male athletes is between 22 to 24 . While for females, it is between 23 to 26. This prediction is also evident from the raw data shown in the following tables.

Code
(age_wise_medals %>%
  filter(Sex == "M",
         is_medal_won == "medal") %>%
  arrange(desc(no_medals)) %>%
  select(-is_medal_won))
ABCDEFGHIJ0123456789
Sex
<fct>
Age
<dbl>
no_medals
<int>
M232512
M242392
M222281
M252212
M262146
M271911
M211897
M281732
M291431
M201395
Code
(age_wise_medals %>%
  filter(Sex == "F",
         is_medal_won == "medal") %>%
  arrange(desc(no_medals)) %>%
  select(-is_medal_won))
ABCDEFGHIJ0123456789
Sex
<fct>
Age
<dbl>
no_medals
<int>
F25912
F24897
F23883
F22878
F21787
F27779
F26778
F20609
F28603
F19523
Code
age_top10_country_wise_medals <- olympics_data %>%
  filter(NOC %in% teams_medals_total[1:10,]$NOC) %>%
  mutate(is_medal_won = case_when(
         Medal == "Gold" | Medal == "Silver" | Medal == "Bronze" ~ "medal",
         TRUE ~ "no_medal")
  ) %>%
  filter(!is.na(Age), is_medal_won == "medal") %>%
  group_by(Sex, NOC, Age) %>%
  summarise(no_medals = n())

ggplot(age_top10_country_wise_medals, aes(x=reorder(NOC, Age, mean), y=Age, fill=Sex)) +
  geom_jitter(color="black",size=0.2) +
  geom_boxplot(varwidth = TRUE, alpha = 0.5) +
  scale_size(range = c(.1, 10), name="Medals")+
  theme_linedraw()+
  facet_wrap(~Sex, ncol=1, scales = "free_y") +
  labs(title = "Ages of athletes - Top 10 medal winning nations",
       x = "NOC",
       y = "Age")+
  theme(legend.position = "none")

Code
(age_top10_country_wise_medals %>%
  group_by(Sex, NOC) %>%
  summarise(mean_age = mean(Age),
            median_age = median(Age)) %>%
  arrange(Sex,mean_age))
ABCDEFGHIJ0123456789
Sex
<fct>
NOC
<fct>
mean_age
<dbl>
median_age
<dbl>
FITA26.0645226.0
FRUS26.5833326.5
FURS26.6153826.5
FFRA27.9166727.5
FAUS28.0000027.5
FSWE28.2963028.0
FGER30.0625029.5
FCAN30.0882429.5
FGBR31.5757631.0
FUSA31.6052630.5

From the above graphs and table we can see that the mean medal winning age is different for different countries. Italy has the least mean for medal winning age in females and Soviet Union(URS) has the least in males. We can say that this difference might be present because of differences in a few influential attributes like body type, difference in nutrition habits across countries and the way they train.

Correlation of features to Sport

Can we identify if there is a correlation of features like height/weight/age to specific sports? Does it equally hold for both the genders?

I will be analyzing height for basketball, age for athletics and shooting. We can’t verify traits like height, weight for athletics because it has variety of events where same trait might not be significant. Whereas in shooting, I believe age matters the most because we can make an assessment on concentration levels which is key to this sport.

Code
basketball_medals <-  olympics_medals_encode%>%
  filter(Sport == "Basketball", !is.na(Height)) %>%
  group_by(Sex, Height, is_medal_won) %>%
  summarise(no_medals = n())

ggplot(basketball_medals,
       aes(x=Height, y=no_medals, size = no_medals, color=is_medal_won)) +
  geom_point(alpha=0.6) +
  scale_size(range = c(.1, 10), name="Medals")+
  theme_linedraw()+
  facet_wrap(vars(Sex), nrow = 2, ncol = 1) +
  labs(title = "Scatter plot for medals won vs height in basketball",
       subtitle = "Size of bubble represents number of medals won",
       x = "height",
       y = "Number of Medals",
       color = "Position")+
  theme(legend.position = "top")

For basketball, men with height ranging from 190 to 205 cm have won most of the medals. For females, it is between 180 to 190 cm. So, we can say that height has a positive affect on winning in basketball because majority of the medal winning athletes are tall.

Code
athletics_medals <-  olympics_medals_encode%>%
  filter(Sport == "Athletics", !is.na(Age)) %>%
  group_by(Sex, Age, is_medal_won) %>%
  summarise(no_medals = n())

ggplot(athletics_medals,
       aes(x=Age, y=no_medals, size = no_medals, color=is_medal_won)) +
  geom_point(alpha=0.6) +
  scale_size(range = c(.1, 10), name="Medals")+
  theme_linedraw()+
  facet_wrap(vars(Sex), nrow = 2, ncol = 1) +
  labs(title = "Scatter plot for medals won vs age in athletics",
       subtitle = "Size of bubble represents number of medals won",
       x = "age",
       y = "Number of Medals",
       color = "Position")+
  theme(legend.position = "top")

In athletics, for both the genders most of the medals winners are between ages 20 and 30. Also, we can observe that most of the participants age is also in this bracket. We can say that since this is the age with maximum potential of human body, age factor is dominant in athletics.

Code
shooting_medals <-  olympics_medals_encode%>%
  filter(Sport == "Shooting", !is.na(Age)) %>%
  group_by(Sex, Age, is_medal_won) %>%
  summarise(no_medals = n())

ggplot(shooting_medals,
       aes(x=Age, y=no_medals, size = no_medals, color=is_medal_won)) +
  geom_point(alpha=0.6) +
  scale_size(range = c(.1, 10), name="Medals")+
  theme_linedraw()+
  facet_wrap(vars(Sex), nrow = 2, ncol = 1) +
  labs(title = "Scatter plot for medals won vs age in shooting",
       subtitle = "Size of bubble represents number of medals won",
       x = "age",
       y = "Number of Medals",
       color = "Position")+
  theme(legend.position = "top")

In shooting, for both the genders most of the medals winners are between 20 and 40. We can claim that this is the age with maximum potential of human body and the concentrations levels are higher in the younger age. So, age factor is dominant in shooting.

Conclusion

  1. The number of athletes, events, and participating nations has grown dramatically since 1896.

  2. We have identified that ranking system 5 is the closest to an unbiased system. In this system we use 10, 5 and 1 weights for gold, silver and bronze respectively.

  3. Michael Fred Phelps, II is the most decorated male athlete and Larysa Semenivna Latynina (Diriy-) is the most decorated female athlete. Phelps won 23 gold, 3 silver and 2 bronze in 5 appearances. Larysa Semenivna Latynina (Diriy-) won 9 gold, 5 Silver and 4 bronze medals in 3 appearances. They are also the most impactful athletes ever.

  4. Peak performance age for male athletes is between 22 to 24 and for females it is between 23 to 26. Also the mean age of medal winners is different across nations.

  5. We observed significance of traits like height/age in sports and the data supports it. So, height matters in basketball and age matters in both athletics and shooting.

Limitiations

  1. It is quite tough to identify decorated/impactful athletes just from visualizations.
  2. Identifying the best ranking systems from heatmap could be challenging for general audience. But can be done.
  3. In the treemap showing “Sport size proportional to number of events in in Olympics” the labels in the right bottom corner couldn’t be shown because of the sizes of the boxes and this is a bug in treemap library.

Bibliography

Samruddhi Mhatre. “Olympics Dataset (1896 to 2016)” Kaggle, https://www.kaggle.com/datasets/samruddhim/olympics-althlete-events-analysis

Wikipedia - https://en.wikipedia.org/wiki/Olympic_Games

R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Wickham, Hadley, et al. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed., O’Reilly Media, Inc, 2023. http://r4ds.hadley.nz/