hw2
olympics
srujan_kagitala
Homework 2
Author

Srujan Kagitala

Published

July 1, 2023

Code
library(tidyverse)
library(ggplot2)
library(treemap)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Overview

For my final project, I would do exploratory analysis on the Olympics data right from the year it started in 1896 in Athena till 2016 in Rio de Janeiro.

Olympics

Olympics is the most decorated sporting event in an athlete’s career. The Olympics are the leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world’s foremost sports competition with more than 200 teams, representing sovereign states and territories, participating. The Olympic Games are normally held every four years, and since 1994, have alternated between the Summer and Winter Olympics every two years during the four-year period. The first, second, and third place finishers in each event receive Olympic medals: gold, silver, and bronze, respectively.

Dataset

The dataset I would use for this analysis is downloaded from Kaggle.

This dataset consists of Olympics data of over a century, from the year 1896 to 2016. Studying this dataset will help to understand the patterns followed in the games of Olympics, patterns of the most successful athletes and countries in their Olympics journey and much more!

Code
olympics_data <- read_csv("_data/athlete_events_olympics.csv")
head(olympics_data)
# A tibble: 6 × 15
     ID Name      Sex     Age Height Weight Team  NOC   Games  Year Season City 
  <dbl> <chr>     <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1     1 A Dijiang M        24    180     80 China CHN   1992…  1992 Summer Barc…
2     2 A Lamusi  M        23    170     60 China CHN   2012…  2012 Summer Lond…
3     3 Gunnar N… M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
4     4 Edgar Li… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
5     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
6     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
# ℹ 3 more variables: Sport <chr>, Event <chr>, Medal <chr>

The data set has 271116 observations and 15 data points recorded per observation. Each observation records variables like ID, Name, Sex, Age, Height, Weight, Team, NOC, Games, Year, Season, City, Sport, Event, Medal.

The youngest and the oldest athlete that ever participated are 10 and 97 years of age respectively. Athletes from across the world compete in 765 events that happen across 66 sports. These happen in 2 Olympic seasons i.e Summer, Winter.

Data Cleaning

The dataset is mostly clean and we don’t have to work around much. Although, for doing some analysis, we might need to create various subsets of data like medals, teams, etc, drop na values for a few variables and mutate new variables(Most of these can be done during the analysis). We need to convert categorical variables into factor to get overall summary across various variables. We don’t need the “ID” column for our analysis as it only serves the purpose of a key.

Code
olympics_data <- olympics_data %>%
  mutate_if(is_character, as.factor) %>%
  select(-ID)

medals <- olympics_data %>%
  filter(!is.na(Medal))

summary(olympics_data)
                       Name        Sex             Age            Height     
 Robert Tait McKenzie    :    58   F: 74522   Min.   :10.00   Min.   :127.0  
 Heikki Ilmari Savolainen:    39   M:196594   1st Qu.:21.00   1st Qu.:168.0  
 Joseph "Josy" Stoffel   :    38              Median :24.00   Median :175.0  
 Ioannis Theofilakis     :    36              Mean   :25.56   Mean   :175.3  
 Takashi Ono             :    33              3rd Qu.:28.00   3rd Qu.:183.0  
 Alexandros Theofilakis  :    32              Max.   :97.00   Max.   :226.0  
 (Other)                 :270880              NA's   :9474    NA's   :60171  
     Weight                 Team             NOC                 Games       
 Min.   : 25.0   United States: 17847   USA    : 18853   2000 Summer: 13821  
 1st Qu.: 60.0   France       : 11988   FRA    : 12758   1996 Summer: 13780  
 Median : 70.0   Great Britain: 11404   GBR    : 12256   2016 Summer: 13688  
 Mean   : 70.7   Italy        : 10260   ITA    : 10715   2008 Summer: 13602  
 3rd Qu.: 79.0   Germany      :  9326   GER    :  9830   2004 Summer: 13443  
 Max.   :214.0   Canada       :  9279   CAN    :  9733   1992 Summer: 12977  
 NA's   :62875   (Other)      :201012   (Other):196971   (Other)    :189805  
      Year         Season                   City               Sport       
 Min.   :1896   Summer:222552   London        : 22426   Athletics : 38624  
 1st Qu.:1960   Winter: 48564   Athina        : 15556   Gymnastics: 26707  
 Median :1988                   Sydney        : 13821   Swimming  : 23195  
 Mean   :1978                   Atlanta       : 13780   Shooting  : 11448  
 3rd Qu.:2002                   Rio de Janeiro: 13688   Cycling   : 10859  
 Max.   :2016                   Beijing       : 13602   Fencing   : 10735  
                                (Other)       :178243   (Other)   :149548  
                                 Event           Medal       
 Football Men's Football            :  5733   Bronze: 13295  
 Ice Hockey Men's Ice Hockey        :  4762   Gold  : 13372  
 Hockey Men's Hockey                :  3958   Silver: 13116  
 Water Polo Men's Water Polo        :  3358   NA's  :231333  
 Basketball Men's Basketball        :  3280                  
 Cycling Men's Road Race, Individual:  2947                  
 (Other)                            :247078                  
Code
head(olympics_data)
# A tibble: 6 × 14
  Name      Sex     Age Height Weight Team  NOC   Games  Year Season City  Sport
  <fct>     <fct> <dbl>  <dbl>  <dbl> <fct> <fct> <fct> <dbl> <fct>  <fct> <fct>
1 A Dijiang M        24    180     80 China CHN   1992…  1992 Summer Barc… Bask…
2 A Lamusi  M        23    170     60 China CHN   2012…  2012 Summer Lond… Judo 
3 Gunnar N… M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw… Foot…
4 Edgar Li… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris Tug-…
5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg… Spee…
6 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg… Spee…
# ℹ 2 more variables: Event <fct>, Medal <fct>
Code
head(medals)
# A tibble: 6 × 14
  Name      Sex     Age Height Weight Team  NOC   Games  Year Season City  Sport
  <fct>     <fct> <dbl>  <dbl>  <dbl> <fct> <fct> <fct> <dbl> <fct>  <fct> <fct>
1 Edgar Li… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris Tug-…
2 Arvo Oss… M        30     NA     NA Finl… FIN   1920…  1920 Summer Antw… Swim…
3 Arvo Oss… M        30     NA     NA Finl… FIN   1920…  1920 Summer Antw… Swim…
4 Juhamatt… M        28    184     85 Finl… FIN   2014…  2014 Winter Sochi Ice …
5 Paavo Jo… M        28    175     64 Finl… FIN   1948…  1948 Summer Lond… Gymn…
6 Paavo Jo… M        28    175     64 Finl… FIN   1948…  1948 Summer Lond… Gymn…
# ℹ 2 more variables: Event <fct>, Medal <fct>

Insights

Code
#Type of medals won by teams.
teams_medals_type <- medals %>%
  group_by(NOC, Medal) %>%
  summarise(count = n())

#Total medals won by teams
teams_medals_total <- medals %>%
  group_by(NOC) %>%
  summarise(total_medals = n()) %>%
  arrange(desc(total_medals))

#Top 50 countries(by total medals won) medal_type tally.
(teams_medals_tally <- inner_join(teams_medals_type,
                              teams_medals_total[1:50, ],
                              by = "NOC") %>%
    arrange(desc(total_medals)))
# A tibble: 150 × 4
# Groups:   NOC [50]
   NOC   Medal  count total_medals
   <fct> <fct>  <int>        <int>
 1 USA   Bronze  1358         5637
 2 USA   Gold    2638         5637
 3 USA   Silver  1641         5637
 4 URS   Bronze   689         2503
 5 URS   Gold    1082         2503
 6 URS   Silver   732         2503
 7 GER   Bronze   746         2165
 8 GER   Gold     745         2165
 9 GER   Silver   674         2165
10 GBR   Bronze   651         2068
# ℹ 140 more rows
Code
#Bar plot of medals won by top 50 countries.
ggplot(data = teams_medals_tally , aes(x= reorder(NOC, total_medals), y = count)) +
  geom_bar(stat = "identity",
           mapping = aes(fill = Medal),
           position = "stack") +
  labs(title ="Top 50 countries that won most medals",
       y = "No of medals",
       x = "Nation of Origin",
       fill = "Medal")+
  coord_flip()

We can see that USA has won the highest number of medals, more than double the number of medals won by Soviet Union.

Code
#Sport with events count
(sport_events <- olympics_data %>%
  distinct(Sport, Event) %>%
  group_by(Sport) %>%
  summarise(no_events = n()) %>%
  arrange(desc(no_events)))
# A tibble: 66 × 2
   Sport            no_events
   <fct>                <int>
 1 Athletics               83
 2 Shooting                83
 3 Swimming                55
 4 Cycling                 44
 5 Sailing                 38
 6 Wrestling               30
 7 Archery                 29
 8 Art Competitions        29
 9 Canoeing                27
10 Gymnastics              27
# ℹ 56 more rows
Code
#Treemap representing each Sport proportional to number of events in olympics
treemap(sport_events,
        index = "Sport",
        vSize = "no_events",
        type = "index",
        fontsize.labels = 10,
        fontcolor.labels = "black",
        align.labels=list(
          c("center", "center")),
        inflate.labels=F,
        palette = "Set1",
        title="Treemap representing each Sport proportional to number of events in olympics",
        fontsize.title=12)

Athletics and shooting are the sports with highest number of events (83 events each) followed by swimming with 55 events.

Code
#Age distribution among females and males in olympics.
ggplot(olympics_data, aes(x = Age, fill = Sex, na.rm = TRUE)) +
  geom_density(adjust=2, alpha = 0.5) +
  theme_linedraw() +
  labs(title = "Age distribution of atheletes in Olympics",
       x = "Age",
       y = "proportion")

We can see that the mean age of men athletes is higher than the mean age of women athletes. Most frequent age in men athletes is approximately equal to the most frequent age in women athletes.

Code
#Times series for how top 10 nations performed over years.
teams_medals_yearly <- medals %>%
  filter(NOC %in% teams_medals_total[1:9,]$NOC) %>%
  group_by(NOC, Year) %>%
  summarise(total_medals = n()) %>%
  arrange(NOC,Year)

ggplot(teams_medals_yearly,
       aes(x= Year, y = total_medals)) + 
  geom_line() +
  facet_wrap(~NOC, scales = "free") +
  #scale_x_continuous(breaks = c(2015, 2016, 2017),
                     #labels = c("2015", "2016", "2017")) +
  theme_linedraw()+
  labs(title = "Medals won over years by top 10 countries since 1896",
       x = "Year",
       y = "Number of medals")

Research Questions

  1. Is there an unbiased ranking system to determine the rankings of nations in Olympics? Can we rank different nations on varied ranking systems (different weightage for gold, silver and bronze) and observe how their ranks differ based on weightage given to gold, silver and bronze medals?
  2. Can we identify the most decorated athlete of all time, most decorated men and women athlete?
  3. Can we identify the age of men and women athletes where their performance is maximized? Does this differ for countries?
  4. Can we identify if there is a correlation of features like height/weight/age to a specific sports? (For example if height positively impacts basketball, age helps in shooting or athletics, etc.) Does it equally hold for both the genders?