DACSS601 Final Project

Final_project
Olympics
Analysis of Olympics data from 1896-present
Author

Mani Shanker Kamarapu

Published

September 3, 2022

Setting up the libraries

Code
library(tidyverse)
library(ggplot2)
library(summarytools)
library(lubridate)
library(scales)
library(plotly)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Intoduction

The Olympic Games are the world’s only truly global, multi-sport, celebratory athletics competition. With more than 200 countries participating in over 400 events across the Summer and Winter Games, the Olympics are where the world comes to compete, feel inspired, and be together. The evolution of the Olympic Movement during the 20th and 21st centuries has resulted in several changes to the Olympic Games. Some of these adjustments include the creation of the Winter Olympic Games for snow and ice sports, the Paralympic Games for athletes with disabilities, the Youth Olympic Games for athletes aged 14 to 18, the five Continental games (Pan American, African, Asian, European, and Pacific), and the World Games for sports that are not contested in the Olympic Games.

Specifically, in this project I have multiple data sets on Olympic games and their results and on the athletes. They contain the detailed information on the athlete including their sex, DOB, height, weight and their country. And also the detailed results on each game and medals won by each country and info on every event and competitions.

This data set contains:

154,902 unique athletes and their biological information i.e. height, weight, date of birth All Winter / Summer Olympic games from 1896 to 2022 7326 unique results 314,726 rows of athlete to result data which includes both team sports and individual sports each row includes position - which is how well the athlete performed for the specific event. Note: not all position is integer - contains strings which contains information on which round / heat they achieved 235 distinct countries (some existing from the past)

Research questions
  1. How is the athlete biological info factors in the number of medals won by athletes?
  2. How is the medals won trend is setting on by the changing year?
  3. How is the athlete result change as he participants in more seasons and is it dependent factor and is the experience proves anything?
  4. Are more medals are won by playing in teams or individually?

Reading the data

There are a total of five data sets which combined contains the complete information on athlete, results and medals. First of all, we need to read the data sets into R and then join them to get a complete data set of the Olympics data. So then we can analysis the data and find the insights.

Code
athlete <- read_csv("_data/Olympics/Olympic_Athlete_Bio.csv") %>%
  na_if("na")
head(athlete)

While rendering it is taking a lot of time and consuming more space. So I have commented the below lines to submit my file

Code
# print(dfSummary(athlete, 
#                 varnumbers = FALSE, 
#                 plain.ascii = FALSE, 
#                 graph.magnif = 0.70, 
#                 style = "grid", 
#                 valid.col = FALSE), 
#       method = 'render', 
#       table.classes = 'table-condensed')
Code
events <- read_csv("_data/Olympics/Olympic_Athlete_Event_Results.csv") %>%
  na_if("na")
head(events)

While rendering it is taking a lot of time and consuming more space. So I have commented the below lines to submit my file

Code
# print(dfSummary(events, 
#                 varnumbers = FALSE, 
#                 plain.ascii = FALSE, 
#                 graph.magnif = 0.70, 
#                 style = "grid", 
#                 valid.col = FALSE), 
#       method = 'render', 
#       table.classes = 'table-condensed')
Code
medals <- read_csv("_data/Olympics/Olympic_Games_Medal_Tally.csv") %>%
  select(!c("year", "total")) %>%
  na_if("na")
head(medals)

While rendering it is taking a lot of time and consuming more space. So I have commented the below lines to submit my file

Code
# print(dfSummary(medals, 
#                 varnumbers = FALSE, 
#                 plain.ascii = FALSE, 
#                 graph.magnif = 0.70, 
#                 style = "grid", 
#                 valid.col = FALSE), 
#       method = 'render', 
#       table.classes = 'table-condensed')
Code
results <- read_csv("_data/Olympics/Olympic_Results.csv") %>%
  select(!"result_detail") %>%
  na_if("na")
head(results)

While rendering it is taking a lot of time and consuming more space. So I have commented the below lines to submit my file

Code
# print(dfSummary(results, 
#                 varnumbers = FALSE, 
#                 plain.ascii = FALSE, 
#                 graph.magnif = 0.70, 
#                 style = "grid", 
#                 valid.col = FALSE), 
#       method = 'render', 
#       table.classes = 'table-condensed')
Code
games <- read_csv("_data/Olympics/Olympics_Games.csv") %>%
  select(c(edition, edition_id, city, isHeld, competition_start_date, competition_end_date)) %>%
  na_if("na")
games
Code
# print(dfSummary(games, 
#                 varnumbers = FALSE, 
#                 plain.ascii = FALSE, 
#                 graph.magnif = 0.70, 
#                 style = "grid", 
#                 valid.col = FALSE), 
#       method = 'render', 
#       table.classes = 'table-condensed')

Joining the data

The reading of data is done and to join the data and get a complete set, it is preferable to join the individual data sets with a full_join() and then tidy up the data as we need and remove the irregularities so we can approach the tidy data so we can get a perfect analysis. It is usually better to not to hard-code the data and have gentle approach by tidying the data individually and joining them and tidy the remaining untidy data after we join, it gives us a better understanding of the data.

Code
athlete_events <- full_join(athlete, events, by = c("athlete_id", "country_noc")) %>%
  select(!"name")
head(athlete_events)

While rendering it is taking a lot of time and consuming more space. So I have commented the below lines to submit my file

Code
# print(dfSummary(athlete_events, 
#                 varnumbers = FALSE, 
#                 plain.ascii = FALSE, 
#                 graph.magnif = 0.70, 
#                 style = "grid", 
#                 valid.col = FALSE), 
#       method = 'render', 
#       table.classes = 'table-condensed')
Code
olympics_competition <- full_join(athlete_events, results, by = c("edition", "edition_id", "result_id", "sport"))
head(olympics_competition)

While rendering it is taking a lot of time and consuming more space. So I have commented the below lines to submit my file

Code
# print(dfSummary(olympics_competition, 
#                 varnumbers = FALSE, 
#                 plain.ascii = FALSE, 
#                 graph.magnif = 0.70, 
#                 style = "grid", 
#                 valid.col = FALSE), 
#       method = 'render', 
#       table.classes = 'table-condensed')
Code
olympics <- full_join(olympics_competition, games, by =c("edition", "edition_id")) %>%
  filter(is.na(isHeld)) %>%
  select(!c("event", "isHeld")) 
head(olympics)

While rendering it is taking a lot of time and consuming more space. So I have commented the below lines to submit my file

Code
# print(dfSummary(olympics, 
#                 varnumbers = FALSE, 
#                 plain.ascii = FALSE, 
#                 graph.magnif = 0.70, 
#                 style = "grid", 
#                 valid.col = FALSE), 
#       method = 'render', 
#       table.classes = 'table-condensed')

Tidying the data

Now as the data is joined and it is tidy, we need to make changes to classes of the variables to better suit the values assigned to the variables. And also it is a good pratice to rename and reorder the variables to get a good description and grouping of the variables.

Code
olympics <- olympics %>%
  mutate(height = as.numeric(height), 
         weight = as.numeric(weight), 
         born = as.Date(born, "%d-%m-%Y"), 
         start_date = as.Date(start_date, "%d-%m-%Y"), 
         end_date = as.Date(end_date, "%d-%m-%Y"), 
         competition_start_date = as.Date(competition_start_date, "%d-%m-%Y"), 
         competition_end_date = as.Date(competition_end_date, "%d-%m-%Y"))
head(olympics)
Code
medals <- medals %>%
  separate("edition", into = c("Year", "Season", "del"), sep = " ") %>%
  select(!del) %>%
  pivot_longer(c("gold", "silver", "bronze"), names_to = "type_of_medal", values_to = "medal_count")
head(medals)

Renaming the columns

Code
colnames(olympics) <- c("Athlete_id", "Sex", "DOB", "Height", "Weight", "Country", "NOC", "Edition", "Edition_id", "Sport", "Result_id", "Athlete", "Position", "Medal", "TeamSport", "Event", "Result_location", "No_of_participants", "Result_format", "Result_description", "Event_start_date", "Event_end_date", "City", "Competition_start_date", "Competition_end_date")
head(olympics)

Calculating BMI and Age of Participation for athletes

Now I am calculating BMI from height and weight, so we can get an better view of the biological information of the athlete and also calculating the age of participation so it can be used to further analysis the athlete and know how much is age matters.

Code
olympics <- olympics %>%
  mutate(BMI = Weight/(Height/100)^2, .keep = "unused") %>%
  separate(Edition, into = c("Year", "Season", "del"), sep = " ") %>%
  select(!del) %>%
  mutate(Age = as.integer((Event_start_date-DOB)/365.25)) 
head(olympics)

Rearranging the columns

Code
olympics <- olympics %>%
  relocate(c("Athlete", "BMI", "Age"), .after = "Athlete_id") 
head(olympics)
Code
print(dfSummary(olympics,
                varnumbers = FALSE,
                plain.ascii = FALSE,
                graph.magnif = 0.70,
                style = "grid",
                valid.col = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

olympics

Dimensions: 315043 x 26
Duplicates: 1217
Variable Stats / Values Freqs (% of Valid) Graph Missing
Athlete_id [numeric]
Mean (sd) : 123800.7 (252523.4)
min ≤ med ≤ max:
1 ≤ 76327 ≤ 2.2e+07
IQR (CV) : 75007 (2)
155031 distinct values 4 (0.0%)
Athlete [character]
1. Tait McKenzie
2. Heikki Savolainen
3. Josy Stoffel
4. Ioannis Theofilakis
5. Takashi Ono
6. Andreas Wecker
7. Jean Jacoby
8. Al Jochim
9. Alfred Munnings
10. Hans Sauter
[ 153396 others ]
58(0.0%)
39(0.0%)
38(0.0%)
33(0.0%)
33(0.0%)
32(0.0%)
32(0.0%)
31(0.0%)
31(0.0%)
31(0.0%)
314549(99.9%)
136 (0.0%)
BMI [numeric]
Mean (sd) : 22.7 (2.9)
min ≤ med ≤ max:
8.4 ≤ 22.5 ≤ 56.5
IQR (CV) : 3.2 (0.1)
3537 distinct values 93272 (29.6%)
Age [integer]
Mean (sd) : 25.1 (5.5)
min ≤ med ≤ max:
11 ≤ 24 ≤ 71
IQR (CV) : 7 (0.2)
59 distinct values 36801 (11.7%)
Sex [character]
1. Female
2. Male
87974(28.3%)
222656(71.7%)
4413 (1.4%)
DOB [Date]
min : 1900-01-01
med : 1967-04-04
max : 2009-01-01
range : 109y 0m 0d
34839 distinct values 36517 (11.6%)
Country [character]
1. United States
2. France
3. Great Britain
4. Italy
5. Canada
6. Germany
7. Japan
8. Sweden
9. Australia
10. Hungary
[ 687 others ]
23358(7.5%)
15221(4.9%)
13632(4.4%)
12339(4.0%)
11601(3.7%)
10431(3.4%)
10161(3.3%)
9405(3.0%)
8719(2.8%)
7465(2.4%)
188298(60.6%)
4413 (1.4%)
NOC [character]
1. USA
2. FRA
3. GBR
4. ITA
5. CAN
6. GER
7. JPN
8. SWE
9. AUS
10. HUN
[ 222 others ]
23551(7.5%)
15340(4.9%)
13717(4.4%)
12397(3.9%)
11722(3.7%)
11521(3.7%)
10173(3.2%)
9438(3.0%)
8851(2.8%)
7542(2.4%)
190787(60.6%)
4 (0.0%)
Year [character]
1. 1992
2. 1988
3. 2020
4. 2000
5. 1996
6. 2008
7. 2016
8. 2004
9. 2012
10. 1972
[ 32 others ]
17198(5.5%)
15946(5.1%)
14786(4.7%)
14032(4.5%)
13994(4.4%)
13975(4.4%)
13941(4.4%)
13756(4.4%)
13196(4.2%)
12586(4.0%)
171501(54.5%)
132 (0.0%)
Season [character]
1. (Empty string)
2. Equestrian
3. Summer
4. Winter
23(0.0%)
300(0.1%)
252862(80.3%)
61726(19.6%)
132 (0.0%)
Edition_id [numeric]
Mean (sd) : 30.2 (18.4)
min ≤ med ≤ max:
1 ≤ 24 ≤ 372
IQR (CV) : 33 (0.6)
59 distinct values 132 (0.0%)
Sport [character]
1. Athletics
2. Artistic Gymnastics
3. Swimming
4. Fencing
5. Shooting
6. Rowing
7. Cross Country Skiing
8. Alpine Skiing
9. Football
10. Wrestling
[ 98 others ]
47481(15.1%)
27580(8.8%)
26246(8.3%)
11957(3.8%)
11709(3.7%)
11321(3.6%)
11096(3.5%)
10486(3.3%)
9384(3.0%)
7409(2.4%)
140238(44.5%)
136 (0.0%)
Result_id [numeric]
Mean (sd) : 1502015 (4783510)
min ≤ med ≤ max:
1 ≤ 65810 ≤ 90016770
IQR (CV) : 228495 (3.2)
7332 distinct values 136 (0.0%)
Position [character]
1. 5
2. 1
3. 9
4. 2
5. 3
6. 4
7. 7
8. 6
9. 8
10. DNS
[ 2264 others ]
17664(5.6%)
16273(5.2%)
15865(5.0%)
15741(5.0%)
15674(5.0%)
13683(4.3%)
11860(3.8%)
11626(3.7%)
10266(3.3%)
8330(2.6%)
177925(56.5%)
136 (0.0%)
Medal [character]
1. Bronze
2. Gold
3. Silver
14904(33.4%)
15027(33.7%)
14640(32.8%)
270472 (85.9%)
TeamSport [logical]
1. FALSE
2. TRUE
194095(61.6%)
120812(38.4%)
136 (0.0%)
Event [character]
1. Football, Men
2. Ice Hockey, Men
3. Hockey, Men
4. Individual, Men
5. Basketball, Men
6. Water Polo, Men
7. Road Race, Individual, Me
8. Singles, Men
9. 4 x 100 metres Relay, Men
10. Singles, Women
[ 1544 others ]
7834(2.5%)
5066(1.6%)
4574(1.5%)
3930(1.2%)
3577(1.1%)
3487(1.1%)
3068(1.0%)
3047(1.0%)
2695(0.9%)
2670(0.8%)
274855(87.3%)
240 (0.1%)
Result_location [character]
1. Los Angeles Memorial Coli
2. White City Stadium, Londo
3. Stockholms Olympiastadion
4. Olympic Stadium, Olympic
5. Stade Olympique de Colomb
6. Centennial Olympic Stadiu
7. Beijing Guojia Tiyuchang,
8. Olympisch Stadion, Amster
9. Ol-lim-pik Ju-gyeong-gi-j
10. Francis Field, Washington
[ 936 others ]
2885(0.9%)
2841(0.9%)
2793(0.9%)
2510(0.8%)
2481(0.8%)
2436(0.8%)
2360(0.8%)
2333(0.7%)
2252(0.7%)
2210(0.7%)
289415(92.0%)
527 (0.2%)
No_of_participants [character]
1. 143 from 12 countries
2. 19 from 19 countries
3. 64 from 16 countries
4. 16 from 16 countries
5. 60 from 15 countries
6. 185 from 29 countries
7. 28 from 28 countries
8. 52 from 13 countries
9. 20 from 20 countries
10. 27 from 27 countries
[ 2036 others ]
1871(0.6%)
1609(0.5%)
1502(0.5%)
1380(0.4%)
1351(0.4%)
1295(0.4%)
1266(0.4%)
1265(0.4%)
1165(0.4%)
1144(0.4%)
300955(95.6%)
240 (0.1%)
Result_format [character]
1. Single-elimination tourna
2. Single elimination tourna
3. Round-robin pools advance
4. Two runs, total time dete
5. Round-robin pools advance
6. Round-robin pools advance
7. Round-robin pools advance
8. Placement determined by b
9. Four runs, total time det
10. Points awarded for placem
[ 632 others ]
10747(6.8%)
6911(4.4%)
6302(4.0%)
5938(3.8%)
5643(3.6%)
5165(3.3%)
4918(3.1%)
4652(2.9%)
3548(2.2%)
2942(1.9%)
101446(64.1%)
156831 (49.8%)
Result_description [character]
1. NA
2. NA
3. NA
4. NA
5. NA
6. NA
7. NA
8. NA
9. NA
10. NA
[ 6953 others ]
477(0.2%)
473(0.2%)
352(0.1%)
351(0.1%)
349(0.1%)
349(0.1%)
346(0.1%)
345(0.1%)
344(0.1%)
343(0.1%)
308340(98.8%)
2974 (0.9%)
Event_start_date [Date]
min : 1900-05-14
med : 1988-09-21
max : 2022-02-20
range : 121y 9m 6d
921 distinct values 990 (0.3%)
Event_end_date [Date]
min : 1900-05-21
med : 1988-09-23
max : 2022-02-20
range : 121y 8m 30d
693 distinct values 104375 (33.1%)
City [character]
1. London
2. Tokyo
3. Beijing
4. Athina
5. Sydney
6. Atlanta
7. Rio de Janeiro
8. Barcelona
9. Los Angeles
10. Seoul
[ 35 others ]
24345(7.7%)
22988(7.3%)
19363(6.2%)
14375(4.6%)
14032(4.5%)
13994(4.4%)
13941(4.4%)
13485(4.3%)
13405(4.3%)
12780(4.1%)
151880(48.3%)
455 (0.1%)
Competition_start_date [Date]
min : 1900-05-14
med : 1988-02-13
max : 2022-02-02
range : 121y 8m 19d
51 distinct values 15864 (5.0%)
Competition_end_date [Date]
min : 1900-10-28
med : 1988-02-28
max : 2022-02-20
range : 121y 3m 23d
51 distinct values 15864 (5.0%)

Generated by summarytools 1.0.1 (R version 4.1.3)
2022-09-04

Generalized analysis

Code
p <- olympics %>%
  mutate(Year = as.numeric(Year)) %>%
  filter(NOC == "USA") %>% 
  arrange(desc(Year)) %>%
  group_by(Year) %>%
  count() %>%
  rename(No_of_medals = n) %>%
  ggplot(aes(x = Year, y = No_of_medals)) +
  geom_line() +
  geom_point() +
  theme_classic() + 
  labs(title = "Medal rate of recent years in USA", x = "Year", y = "Number of medals")
ggplotly(p)

The above figure gives us the fluctuations in the medal won in recent years in USA, they are basically from 300-850. But there is enormous irregularity in the findings as the medals won alternately increase and drops down.

Code
plot <- medals %>%
  mutate(Year = as.numeric(Year)) %>%
  group_by(Year, type_of_medal) %>%
  summarise(sum_medal_count = sum(medal_count), .groups = 'drop') %>%
  mutate(type_of_medal=as.factor(type_of_medal)) %>% 
  ggplot(aes(x = Year, y = sum_medal_count, color = type_of_medal)) +
  geom_line() +
  geom_point() +
  theme_classic() + 
  labs(title = "Medal distribution on basis of Year and type of medal", x = "Year", y = "Number of medals") +
  theme(axis.text.x=element_text(angle=90)) +
  guides(color = guide_legend(title = "Type of medal"))
ggplotly(plot)

As per the above graph, we get the medal distribution on basis of year and type of medal, It gives us the variations in the Number of medals as by the changing year in different types.

In-depth Analysis

Code
p <- olympics %>%
  mutate(Year = as.numeric(Year)) %>%
  group_by(Year) %>%
  count(Sex) %>%
  drop_na(Sex) %>%
  rename(No_of_participant = n) %>%
  ggplot(aes(x = Year, y = No_of_participant, fill = Sex)) +
  geom_area() +
  geom_point() +
  theme_classic() + 
  facet_grid(vars(Sex)) +
  labs(title = "Gender Analysis", x = "Year", y = "No of participants") +
  theme(axis.text.x=element_text(angle=90)) 
ggplotly(p)

As per the above graph, we get that there is a huge difference between the male and female participants in the Olympics and there is clearly less female participation when compare to male.

To further analysis I have taken an athlete “Heikki Savolainen” who is top player in gymnastics and played in largest number of events and we will try to approach different factors like age, position, medals, etc.

Code
plot <- olympics %>%
  mutate(Year = as.numeric(Year)) %>%       
  filter(Athlete == "Heikki Savolainen") %>%
  mutate(Position = as.numeric(Position)) %>%
  group_by(Year) %>%
  summarise(mean_position = mean(Position)) %>%
  ggplot(aes(x = Year, y = mean_position, fill = "red")) +
  geom_density(stat = "identity") +
  theme_classic() + 
  labs(title = "Mean Position of Heikki Savolainen over years", x = "Year", y = "Position")
ggplotly(plot)

According to the above figure, the athlete position was good but it gradually increased as he is end of his career, so his performance is gradually dropping as he is aging.

Code
p <- olympics %>%
  mutate(Year = as.numeric(Year)) %>%
  drop_na(Medal) %>%
  group_by(Year, TeamSport) %>%
  count() %>%
  rename(No_of_medals = n) %>%
  ggplot(aes(x = Year, y = No_of_medals, fill = TeamSport)) +
  geom_bar(stat = "identity") +
  theme_minimal() + 
  labs(title = "Medals earned in Team Sport vs Individual Sport", x = "Year", y = "No of medals") +
  theme(axis.text.x=element_text(angle=90)) +
  guides(color = guide_legend(title = "Team Sport"))
ggplotly(p)

As per the data, it is clear that the medals earned due to individual sports are greater than team sports ad this can be due to many reasons and the major reason is due to having more number of individual sports in Olympics than team sports.

Conclusion

The further analysis is required for more in depth analysis, but as by our analysis so far the career of an Olympic athlete starts with the passion to win gold for their country and as he tries many times in every event and in every season and wins medals based on his hard work but as the age increases and year passes by the position gradually decreases, for example as we saw in the above in-depth analysis, position of the athlete depends on many factors but it biological factors matter and from this analysis we can also say that an athlete win doesn’t individually depend on him and individuality is also important, but further exploratory analysis is required and data available is not so conclusive.

Bibliography

  1. OlyMADMen - The Incredible Community of dedicated Olympic historians and statisticians who worked with Olympians and gathered mass amount of data for olympedia.org.

  2. Randi H Griffin - Author of 120 years of Olympic history: athletes and results which this dataset is inspired from.

  3. Richard Carlier - Providing Constructive feedback and data schema diagram to help make this dataset better.

  4. Gwanhee Lee - Provided an iniatial web scrapping for the olympedia.org website

  5. David Mo - Mentorship for software engineering practices