Olympic Analysis
Data Description
I choose to investigate a dataset containing information about the Olympic Games between 1896 and 2016. The data set is read in and displayed using the code below.
# Read in the data set
athletes <- read_csv('./Data/athlete_events.csv')
noc <- read_csv('./Data/noc_regions.csv')
# Display the athlete data set
athletes
# A tibble: 271,116 × 15
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1 A Dijia… M 24 180 80 China CHN 1992… 1992 Summer Barc…
2 2 A Lamusi M 23 170 60 China CHN 2012… 2012 Summer Lond…
3 3 Gunnar … M 24 NA NA Denm… DEN 1920… 1920 Summer Antw…
4 4 Edgar L… M 34 NA NA Denm… DEN 1900… 1900 Summer Paris
5 5 Christi… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
6 5 Christi… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
7 5 Christi… F 25 185 82 Neth… NED 1992… 1992 Winter Albe…
8 5 Christi… F 25 185 82 Neth… NED 1992… 1992 Winter Albe…
9 5 Christi… F 27 185 82 Neth… NED 1994… 1994 Winter Lill…
10 5 Christi… F 27 185 82 Neth… NED 1994… 1994 Winter Lill…
# … with 271,106 more rows, and 3 more variables: Sport <chr>, Event <chr>,
# Medal <chr>
# A tibble: 230 × 3
NOC region notes
<chr> <chr> <chr>
1 AFG Afghanistan <NA>
2 AHO Curacao Netherlands Antilles
3 ALB Albania <NA>
4 ALG Algeria <NA>
5 AND Andorra <NA>
6 ANG Angola <NA>
7 ANT Antigua Antigua and Barbuda
8 ANZ Australia Australasia
9 ARG Argentina <NA>
10 ARM Armenia <NA>
# … with 220 more rows
The dataset comes as a set of two csv files: ‘athlete_events’ and ‘noc_regions’ where each file represents a table in a relational database. The ‘athlete_events’ table contains most of the information that I am interested in. Each row in this table contains information about an athlete who competed in an Olympic event. Note that athletes can occur in the table more than once if they competed in multiple events or games. A good example of this is Michael Phelps who appears in the athletes table many times.
# Find all the rows containing information about Michael Phelps
filter(athletes, str_detect(Name, "Michael") & str_detect(Name, "Phelps")) %>%
select(-c(ID, NOC, Sex, Height, Weight, Season))
# A tibble: 30 × 9
Name Age Team Games Year City Sport Event Medal
<chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 Michael Fred Phelps, II 15 United Sta… 2000… 2000 Sydn… Swim… Swim… <NA>
2 Michael Fred Phelps, II 19 United Sta… 2004… 2004 Athi… Swim… Swim… Bron…
3 Michael Fred Phelps, II 19 United Sta… 2004… 2004 Athi… Swim… Swim… Bron…
4 Michael Fred Phelps, II 19 United Sta… 2004… 2004 Athi… Swim… Swim… Gold
5 Michael Fred Phelps, II 19 United Sta… 2004… 2004 Athi… Swim… Swim… Gold
6 Michael Fred Phelps, II 19 United Sta… 2004… 2004 Athi… Swim… Swim… Gold
7 Michael Fred Phelps, II 19 United Sta… 2004… 2004 Athi… Swim… Swim… Gold
8 Michael Fred Phelps, II 19 United Sta… 2004… 2004 Athi… Swim… Swim… Gold
9 Michael Fred Phelps, II 19 United Sta… 2004… 2004 Athi… Swim… Swim… Gold
10 Michael Fred Phelps, II 23 United Sta… 2008… 2008 Beij… Swim… Swim… Gold
# … with 20 more rows
Each row in the athletes table contains the National Olympic Committee that they are competing for which is stored as a three letter code in the ‘NOC’ column. The ‘NOC’ column serves as a foreign key to the ‘noc_regions’ table. The ‘noc_regions’ table maps the National Olympic Committee code to its corresponding country and provides additional notes about the committee. For example, consider this row in ‘noc_regions’: (HKG, China, Hong Kong). This tells us that the code HKG corresponds Hong Kong Olympic Committee in China.
In the majority of the cases, the ‘notes’ column of the ‘noc_regions’ table is empty. The case I gave in the previous example is an outlier case in which it is present. In cases where it is present, it often provides another way besides the NOC code to distinguish between two committees from the same region. The code chunk below display all the entries in the ‘noc_regions’ table that have notes associated with it. Notice that many of the rows in the table have the same region.
# List the countries have non-null values in 'notes' column
noc %>%
filter(!is.na(notes)) %>%
arrange(region)
# A tibble: 21 × 3
NOC region notes
<chr> <chr> <chr>
1 ANT Antigua Antigua and Barbuda
2 ANZ Australia Australasia
3 NFL Canada Newfoundland
4 HKG China Hong Kong
5 AHO Curacao Netherlands Antilles
6 BOH Czech Republic Bohemia
7 CRT Greece Crete
8 IOA Individual Olympic Athletes Individual Olympic Athletes
9 NBO Malaysia North Borneo
10 SKN Saint Kitts Turks and Caicos Islands
# … with 11 more rows
Cleaning The Data
Initial Transformations
The data was collected in csv format, which made it easy to read into a tibble. However, the data is not yet tidy. We will need to perform cleaning operations to make the data neater for analysis.
We can start by dropping the ‘Games’ column. We don’t need the ‘Games’ column because each entry in the column is a combination of the ‘Year’ and ‘Season’ columns which are already present in the data set.
Next, we can add the region from the ‘noc_region’ file to the athletes tibble so that we only have one tibble to work with. I choose to drop the ‘notes’ column in the noc file because most of the values are not available and the values which are available don’t provide much useful information.
We can clean up the ‘Medal’ column by replacing the values that are not available with the string “None”. Based on the context of the data set, we can assume that a NA in this column means that the athlete did not win a Medal in that event. Therefore, replacing these values with “None” seems logical.
Handling Missing Values
There are four columns in the data set which contain missing values: ‘Region’, ‘Age’, ‘Height’, and ‘Weight’. I started by handling the missing values in the ‘region’ column first because it is the simplist. We can find the ‘NOC’ associated with unknown regions by querying the ‘noc_regions’ data set.
# A tibble: 3 × 3
NOC region notes
<chr> <chr> <chr>
1 ROT <NA> Refugee Olympic Team
2 TUV <NA> Tuvalu
3 UNK <NA> Unknown
We know that any row in athletes that has an ‘NOC’ value of ‘ROT’, ‘TUV’, or ‘UNK’ will have a value in the ‘Region’ column that is NA. I think it would be sensible to replace the NA values according to the table below.
NOC | Region | Notes |
---|---|---|
ROT | None | Regugee Olympic Team |
TUV | Tuvala | Tuvala |
UNK | Unkown | Unknown |
The Refugee Olympic Team does not have a region, so I think it is sensible to replace those missing values with “None”. I did a google search and learned that Tuvala is a small indpenent island nation so it makes sense to replace those missing values with “Tuvala”. Lastly, I think it is fair to replace the missing values associated with the UNK NOC as “Unknown”. The code below changes the missing values according to the table above.
Let’s handle the missing values in the ‘Age’, ‘Height’, and ‘Weight’ columns next. Suppose we find a missing value in the ‘Age’ column. I think a reasonable idea would be to replace the missing value with the average age during the Olympic games that occurred in that year. We know that ‘Year’ column of the data set does not contain any missing values. Therefore, if we find a missing value in the ‘Age’ column, we can extract the value of the ‘Year’ column in that row and we know it will not be missing. Then we can calculate the average age during that year and replace the missing value with the average value. Using this approach, we can replace all the missing values in these columns.
First, let’s calculate the average age, height, and weight for each year in the data set. This is calculated and stored in the tibble using the code chunk below.
# Compute average age, height, and weight by years
AverageValues <- athletes %>%
group_by(Year) %>%
summarise(AverageAge = round(mean(Age, na.rm = TRUE), digits = 0),
AverageHeight = round(mean(Height, na.rm = TRUE), digits = 0),
AverageWeight = round(mean(Weight, na.rm = TRUE), digits = 1))
# Display the results
AverageValues
# A tibble: 35 × 4
Year AverageAge AverageHeight AverageWeight
<dbl> <dbl> <dbl> <dbl>
1 1896 24 173 71.4
2 1900 29 177 74.6
3 1904 27 176 72.2
4 1906 27 178 75.9
5 1908 27 178 75.4
6 1912 28 177 73.1
7 1920 29 176 73.1
8 1924 28 175 71.7
9 1928 29 175 71
10 1932 33 174 70.5
# … with 25 more rows
Now that we have computed the averages, it is time to replace the missing values. The ‘cleanColumn’ function cleans all of the columns mentioned above.
# Function to clean age, height, and weight
cleanColumn <- function(column, indx) {
# Iterate along the age column
for (i in seq_along(column)) {
# Extract the current age
cVal <- column[[i]]
# Check for NA value
if (is.na(cVal)) {
# Extract year from the row
cYear <- slice(athletes, i)$Year[[1]]
# Replace the NA value with the average age from year
column[[i]] <- filter(AverageValues, Year == cYear)[[indx]][[1]]
}
}
# Return cleaned column
return(column)
}
# Set age, height, and weight to cleaned version
athletes$Age <- cleanColumn(athletes$Age, 2)
athletes$Height <- cleanColumn(athletes$Height, 3)
athletes$Weight <- cleanColumn(athletes$Weight, 4)
Cleaning Sport & Event
There is another transformation I would like to perform to make the data set a bit cleaner. Take a look at the first few values in the ‘Sport’ and ‘Event’ column. They are presented below.
# A tibble: 6 × 2
Sport Event
<chr> <chr>
1 Basketball Basketball Men's Basketball
2 Judo Judo Men's Extra-Lightweight
3 Football Football Men's Football
4 Tug-Of-War Tug-Of-War Men's Tug-Of-War
5 Speed Skating Speed Skating Women's 500 metres
6 Speed Skating Speed Skating Women's 1,000 metres
Notice how the first word or set of words in the ‘Event’ column is the same exact string in the ‘Sport’ column. The code chunk below confirms that this pattern is true throughout the entire data set. Note, I did not evaluate this code to speed up the rendering process. However, the output can be verified by running this qmd file.
testCols <- function() {
# Iterate along sport column
for (i in seq_along(athletes$Sport)) {
# Get the sport and event string at row i
# Split each string on space character
sport <- str_split(athletes$Sport[[i]], " ")[[1]]
event <- str_split(athletes$Event[[i]], " ")[[1]]
# Iterate over items in sport list
for (j in seq_along(sport)) {
# Check to see if each item in sport is equivalent to corresponding item in event
if (sport[[j]] != event[[j]]) {
# If it is not equivalent -> stop and display sport and event which breaks pattern
cat("BREAK\n", sport, " ", event, "\nPattern does not hold")
return(FALSE)
}
}
}
# If we get here, then the property holds for the entire data set
cat("Pattern holds for entire data set\n")
return(TRUE)
}
testCols()
I would like to remove the first few words that in the ‘Event’ column that are also present in the ‘Sport’ column. This will clean up the data set even more. Furthermore, we are not losing any information because we have confirmed that everything we are throwing away appears in the ‘Sport’ column. The following chunk of code cleans the ‘Event’ column.
# Function to clean event column
cleanEvent <- function() {
# Iterate along and event columns
for (i in seq_along(athletes$Sport)) {
# Split sport and event string at row i on space character
sport <- str_split(athletes$Sport[[i]], " ")[[1]]
event <- str_split(athletes$Event[[i]], " ")[[1]]
# Empty string to store new event
newEvent <- ""
# Iterate along event list
for (j in seq_along(event)) {
# Once we get past the length of sport, start piecing together newEvent
if (j > length(sport)) {
newEvent = str_c(newEvent, event[[j]], " ")
}
}
# Reset newEvent string at position i
athletes$Event[[i]] <- newEvent
}
# Return cleaned column to save changes
return(athletes$Event)
}
athletes$Event <- cleanEvent()
Reordering & Arranging
Lastly, let’s finish the cleaning process by reordering the columns and sorting them to make the data set easier to look at.
Saving Cleaned Data
Some of the operations that were required to clean the data set took some time to complete. To save time later, let’s save the clean data set in a new csv file so we don’t have to repeat all of the cleaning operations.
Finally, let’s read in the new csv file and present the cleaned data.
# A tibble: 270,767 × 15
ID Name Team NOC Region Sex Age Height Weight Year Season City
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1724 "Arist… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
2 1724 "Arist… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
3 1725 "Konst… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
4 1725 "Konst… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
5 4113 "Anast… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
6 4116 "Ioann… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
7 4189 "Nikol… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
8 4431 "Georg… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
9 4493 "Antel… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
10 5660 "Georg… Gree… GRE Greece M 24 173 71.4 1896 Summer Athi…
# … with 270,757 more rows, and 3 more variables: Sport <chr>, Event <chr>,
# Medal <chr>
Potential Research Questions
I am interested if there are relationships between success in the Olympics (whether you got a medal or not) and variables such as height, weight, age, and country. I believe we can use this data set to see what correlates with winning medals.