Olympic Analysis

hw2

Olympic_Data

Author

Matthew Norberg

Published

October 11, 2022

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Data Description

I choose to investigate a dataset containing information about the Olympic Games between 1896 and 2016. The data set is read in and displayed using the code below.

# Read in the data set
athletes <- read_csv('./Data/athlete_events.csv')
noc <- read_csv('./Data/noc_regions.csv')

# Display the athlete data set
athletes

# A tibble: 271,116 × 15
      ID Name     Sex     Age Height Weight Team  NOC   Games  Year Season City 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     1 A Dijia… M        24    180     80 China CHN   1992…  1992 Summer Barc…
 2     2 A Lamusi M        23    170     60 China CHN   2012…  2012 Summer Lond…
 3     3 Gunnar … M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
 4     4 Edgar L… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
 5     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 6     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 7     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 8     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 9     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
10     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
# … with 271,106 more rows, and 3 more variables: Sport <chr>, Event <chr>,
#   Medal <chr>

# Display the noc data set
noc

# A tibble: 230 × 3
   NOC   region      notes               
   <chr> <chr>       <chr>               
 1 AFG   Afghanistan <NA>                
 2 AHO   Curacao     Netherlands Antilles
 3 ALB   Albania     <NA>                
 4 ALG   Algeria     <NA>                
 5 AND   Andorra     <NA>                
 6 ANG   Angola      <NA>                
 7 ANT   Antigua     Antigua and Barbuda 
 8 ANZ   Australia   Australasia         
 9 ARG   Argentina   <NA>                
10 ARM   Armenia     <NA>                
# … with 220 more rows

The dataset comes as a set of two csv files: ‘athlete_events’ and ‘noc_regions’ where each file represents a table in a relational database. The ‘athlete_events’ table contains most of the information that I am interested in. Each row in this table contains information about an athlete who competed in an Olympic event. Note that athletes can occur in the table more than once if they competed in multiple events or games. A good example of this is Michael Phelps who appears in the athletes table many times.

# Find all the rows containing information about Michael Phelps
filter(athletes, str_detect(Name, "Michael") & str_detect(Name, "Phelps")) %>%
  select(-c(ID, NOC, Sex, Height, Weight, Season))

# A tibble: 30 × 9
   Name                      Age Team        Games  Year City  Sport Event Medal
   <chr>                   <dbl> <chr>       <chr> <dbl> <chr> <chr> <chr> <chr>
 1 Michael Fred Phelps, II    15 United Sta… 2000…  2000 Sydn… Swim… Swim… <NA> 
 2 Michael Fred Phelps, II    19 United Sta… 2004…  2004 Athi… Swim… Swim… Bron…
 3 Michael Fred Phelps, II    19 United Sta… 2004…  2004 Athi… Swim… Swim… Bron…
 4 Michael Fred Phelps, II    19 United Sta… 2004…  2004 Athi… Swim… Swim… Gold 
 5 Michael Fred Phelps, II    19 United Sta… 2004…  2004 Athi… Swim… Swim… Gold 
 6 Michael Fred Phelps, II    19 United Sta… 2004…  2004 Athi… Swim… Swim… Gold 
 7 Michael Fred Phelps, II    19 United Sta… 2004…  2004 Athi… Swim… Swim… Gold 
 8 Michael Fred Phelps, II    19 United Sta… 2004…  2004 Athi… Swim… Swim… Gold 
 9 Michael Fred Phelps, II    19 United Sta… 2004…  2004 Athi… Swim… Swim… Gold 
10 Michael Fred Phelps, II    23 United Sta… 2008…  2008 Beij… Swim… Swim… Gold 
# … with 20 more rows

Each row in the athletes table contains the National Olympic Committee that they are competing for which is stored as a three letter code in the ‘NOC’ column. The ‘NOC’ column serves as a foreign key to the ‘noc_regions’ table. The ‘noc_regions’ table maps the National Olympic Committee code to its corresponding country and provides additional notes about the committee. For example, consider this row in ‘noc_regions’: (HKG, China, Hong Kong). This tells us that the code HKG corresponds Hong Kong Olympic Committee in China.

In the majority of the cases, the ‘notes’ column of the ‘noc_regions’ table is empty. The case I gave in the previous example is an outlier case in which it is present. In cases where it is present, it often provides another way besides the NOC code to distinguish between two committees from the same region. The code chunk below display all the entries in the ‘noc_regions’ table that have notes associated with it. Notice that many of the rows in the table have the same region.

# List the countries have non-null values in 'notes' column
noc %>%
  filter(!is.na(notes)) %>%
  arrange(region)

# A tibble: 21 × 3
   NOC   region                      notes                      
   <chr> <chr>                       <chr>                      
 1 ANT   Antigua                     Antigua and Barbuda        
 2 ANZ   Australia                   Australasia                
 3 NFL   Canada                      Newfoundland               
 4 HKG   China                       Hong Kong                  
 5 AHO   Curacao                     Netherlands Antilles       
 6 BOH   Czech Republic              Bohemia                    
 7 CRT   Greece                      Crete                      
 8 IOA   Individual Olympic Athletes Individual Olympic Athletes
 9 NBO   Malaysia                    North Borneo               
10 SKN   Saint Kitts                 Turks and Caicos Islands   
# … with 11 more rows

Cleaning The Data

Initial Transformations

The data was collected in csv format, which made it easy to read into a tibble. However, the data is not yet tidy. We will need to perform cleaning operations to make the data neater for analysis.

We can start by dropping the ‘Games’ column. We don’t need the ‘Games’ column because each entry in the column is a combination of the ‘Year’ and ‘Season’ columns which are already present in the data set.

# Get rid of id & games column
athletes <- select(athletes, -c(Games))

Next, we can add the region from the ‘noc_region’ file to the athletes tibble so that we only have one tibble to work with. I choose to drop the ‘notes’ column in the noc file because most of the values are not available and the values which are available don’t provide much useful information.

# Combine information from noc table
athletes <- inner_join(athletes, noc) %>% select(-(notes))

We can clean up the ‘Medal’ column by replacing the values that are not available with the string “None”. Based on the context of the data set, we can assume that a NA in this column means that the athlete did not win a Medal in that event. Therefore, replacing these values with “None” seems logical.

# Switch NA in Medal to None
athletes <- athletes %>% replace_na(list(Medal = "None"))

Handling Missing Values

There are four columns in the data set which contain missing values: ‘Region’, ‘Age’, ‘Height’, and ‘Weight’. I started by handling the missing values in the ‘region’ column first because it is the simplist. We can find the ‘NOC’ associated with unknown regions by querying the ‘noc_regions’ data set.

# Display NA in region
filter(noc, is.na(region))

# A tibble: 3 × 3
  NOC   region notes               
  <chr> <chr>  <chr>               
1 ROT   <NA>   Refugee Olympic Team
2 TUV   <NA>   Tuvalu              
3 UNK   <NA>   Unknown

We know that any row in athletes that has an ‘NOC’ value of ‘ROT’, ‘TUV’, or ‘UNK’ will have a value in the ‘Region’ column that is NA. I think it would be sensible to replace the NA values according to the table below.

NOC	Region	Notes
ROT	None	Regugee Olympic Team
TUV	Tuvala	Tuvala
UNK	Unkown	Unknown

The Refugee Olympic Team does not have a region, so I think it is sensible to replace those missing values with “None”. I did a google search and learned that Tuvala is a small indpenent island nation so it makes sense to replace those missing values with “Tuvala”. Lastly, I think it is fair to replace the missing values associated with the UNK NOC as “Unknown”. The code below changes the missing values according to the table above.

# Fill in missing values in Region column
athletes <- athletes %>%
            mutate(Temp = case_when(NOC == "ROT" ~ "None",
                                      NOC == "TUV" ~ "Tuvala",
                                      NOC == "UNK" ~ "Unknown")) %>%
            mutate(Region = case_when(is.na(region) ~ Temp,
                                      !is.na(region) ~ region)) %>%
            select(-c(region, Temp))

Let’s handle the missing values in the ‘Age’, ‘Height’, and ‘Weight’ columns next. Suppose we find a missing value in the ‘Age’ column. I think a reasonable idea would be to replace the missing value with the average age during the Olympic games that occurred in that year. We know that ‘Year’ column of the data set does not contain any missing values. Therefore, if we find a missing value in the ‘Age’ column, we can extract the value of the ‘Year’ column in that row and we know it will not be missing. Then we can calculate the average age during that year and replace the missing value with the average value. Using this approach, we can replace all the missing values in these columns.

First, let’s calculate the average age, height, and weight for each year in the data set. This is calculated and stored in the tibble using the code chunk below.

# Compute average age, height, and weight by years
AverageValues <- athletes %>%
                 group_by(Year) %>%
                 summarise(AverageAge = round(mean(Age, na.rm = TRUE), digits = 0),
                           AverageHeight = round(mean(Height, na.rm = TRUE), digits = 0),
                           AverageWeight = round(mean(Weight, na.rm = TRUE), digits = 1))
# Display the results
AverageValues

# A tibble: 35 × 4
    Year AverageAge AverageHeight AverageWeight
   <dbl>      <dbl>         <dbl>         <dbl>
 1  1896         24           173          71.4
 2  1900         29           177          74.6
 3  1904         27           176          72.2
 4  1906         27           178          75.9
 5  1908         27           178          75.4
 6  1912         28           177          73.1
 7  1920         29           176          73.1
 8  1924         28           175          71.7
 9  1928         29           175          71  
10  1932         33           174          70.5
# … with 25 more rows

Now that we have computed the averages, it is time to replace the missing values. The ‘cleanColumn’ function cleans all of the columns mentioned above.

# Function to clean age, height, and weight
cleanColumn <- function(column, indx) {
  # Iterate along the age column
  for (i in seq_along(column)) {
    # Extract the current age
    cVal <- column[[i]]
    # Check for NA value
    if (is.na(cVal)) {
      # Extract year from the row
      cYear <- slice(athletes, i)$Year[[1]]
      # Replace the NA value with the average age from year
      column[[i]] <- filter(AverageValues, Year == cYear)[[indx]][[1]]
    }
  }
  # Return cleaned column
  return(column)
}
# Set age, height, and weight to cleaned version
athletes$Age <- cleanColumn(athletes$Age, 2)
athletes$Height <- cleanColumn(athletes$Height, 3)
athletes$Weight <- cleanColumn(athletes$Weight, 4)

Cleaning Sport & Event

There is another transformation I would like to perform to make the data set a bit cleaner. Take a look at the first few values in the ‘Sport’ and ‘Event’ column. They are presented below.

# Display first few rows of Sport and Event columns
athletes %>% 
  select(Sport, Event) %>%
  head()

# A tibble: 6 × 2
  Sport         Event                             
  <chr>         <chr>                             
1 Basketball    Basketball Men's Basketball       
2 Judo          Judo Men's Extra-Lightweight      
3 Football      Football Men's Football           
4 Tug-Of-War    Tug-Of-War Men's Tug-Of-War       
5 Speed Skating Speed Skating Women's 500 metres  
6 Speed Skating Speed Skating Women's 1,000 metres

Notice how the first word or set of words in the ‘Event’ column is the same exact string in the ‘Sport’ column. The code chunk below confirms that this pattern is true throughout the entire data set. Note, I did not evaluate this code to speed up the rendering process. However, the output can be verified by running this qmd file.

testCols <- function() {
  # Iterate along sport column
  for (i in seq_along(athletes$Sport)) {
    # Get the sport and event string at row i
    # Split each string on space character
    sport <- str_split(athletes$Sport[[i]], " ")[[1]]
    event <- str_split(athletes$Event[[i]], " ")[[1]]
    # Iterate over items in sport list
    for (j in seq_along(sport)) {
      # Check to see if each item in sport is equivalent to corresponding item in event
      if (sport[[j]] != event[[j]]) {
        # If it is not equivalent -> stop and display sport and event which breaks pattern
        cat("BREAK\n", sport, " ", event, "\nPattern does not hold")
        return(FALSE)
      }
    }
  }
  # If we get here, then the property holds for the entire data set
  cat("Pattern holds for entire data set\n")
  return(TRUE)
}
testCols()

I would like to remove the first few words that in the ‘Event’ column that are also present in the ‘Sport’ column. This will clean up the data set even more. Furthermore, we are not losing any information because we have confirmed that everything we are throwing away appears in the ‘Sport’ column. The following chunk of code cleans the ‘Event’ column.

# Function to clean event column
cleanEvent <- function() {
  # Iterate along and event columns
  for (i in seq_along(athletes$Sport)) {
    # Split sport and event string at row i on space character
    sport <- str_split(athletes$Sport[[i]], " ")[[1]]
    event <- str_split(athletes$Event[[i]], " ")[[1]]
    # Empty string to store new event
    newEvent <- ""

    # Iterate along event list
    for (j in seq_along(event)) {
      # Once we get past the length of sport, start piecing together newEvent
      if (j > length(sport)) {
        newEvent = str_c(newEvent, event[[j]], " ")
      }
    }
    
    # Reset newEvent string at position i
    athletes$Event[[i]] <- newEvent
  }
  
  # Return cleaned column to save changes
  return(athletes$Event)
}
athletes$Event <- cleanEvent()

Reordering & Arranging

Lastly, let’s finish the cleaning process by reordering the columns and sorting them to make the data set easier to look at.

# Change column order and sort by year
athletes <- athletes %>% 
  relocate(Team, .after = Name) %>%
  relocate(NOC, .after = Team) %>%
  relocate(Region, .after = NOC) %>%
  arrange(Year)

Saving Cleaned Data

Some of the operations that were required to clean the data set took some time to complete. To save time later, let’s save the clean data set in a new csv file so we don’t have to repeat all of the cleaning operations.

# Save cleaned data set
write_csv(athletes, './Data/athletes_clean.csv')

Finally, let’s read in the new csv file and present the cleaned data.

# Read in saved data set and display
athletes <- read_csv('./Data/athletes_clean.csv')
athletes

# A tibble: 270,767 × 15
      ID Name    Team  NOC   Region Sex     Age Height Weight  Year Season City 
   <dbl> <chr>   <chr> <chr> <chr>  <chr> <dbl>  <dbl>  <dbl> <dbl> <chr>  <chr>
 1  1724 "Arist… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
 2  1724 "Arist… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
 3  1725 "Konst… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
 4  1725 "Konst… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
 5  4113 "Anast… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
 6  4116 "Ioann… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
 7  4189 "Nikol… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
 8  4431 "Georg… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
 9  4493 "Antel… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
10  5660 "Georg… Gree… GRE   Greece M        24    173   71.4  1896 Summer Athi…
# … with 270,757 more rows, and 3 more variables: Sport <chr>, Event <chr>,
#   Medal <chr>

Potential Research Questions

I am interested if there are relationships between success in the Olympics (whether you got a medal or not) and variables such as height, weight, age, and country. I believe we can use this data set to see what correlates with winning medals.