---
title: "Homework 2"
author: "Vinitha Maheswaran"
date: "12/10/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - Olympics
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
For this homework I will be working with the "120 years of Olympic history: athletes and results" dataset. The Olympics data has two CSV files - "athlete_events.csv" and "noc_regions.csv". This historical dataset covers the modern Olympic Games, from Athens 1896 to Rio 2016. The data was scraped from www.sports-reference.com in May 2018 and is available on Kaggle.
```{r}
# Reading the "athlete_events.csv" and "noc_regions.csv" files
athlete_data <- read_csv("_data/athlete_events.csv")
noc_data <- read_csv("_data/noc_regions.csv")
```
```{r}
# Displaying athlete_data dataset
athlete_data
```
```{r}
# Displaying noc_data dataset
noc_data
```
```{r}
# Finding dimension of both datasets
dim(athlete_data)
dim(noc_data)
```
```{r}
# Structure of athlete_data dataset
str(athlete_data)
```
```{r}
# Structure of noc_data dataset
str(noc_data)
```
```{r}
#Summary of athlete_data
library(summarytools)
print(summarytools::dfSummary(athlete_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```
```{r}
#Summary of noc_data
library(summarytools)
print(summarytools::dfSummary(noc_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```
### Briefly describe the data
The dataset contains information on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016 (the past 120 years). The Winter and Summer Games were held in the same year until 1992; after that they were staggered, with the Winter Games occurring every 4 years starting in 1994 and the Summer Games every 4 years starting in 1996. The "athlete_events.csv" file has 271,116 observations and 15 variables. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events) and records the athlete's name, sex, age, height (in cm), weight (in kg), the team/country represented, the National Olympic Committee (a 3-letter code), the year and season of participation, the host city for that Games, and the sport, event, and medal won. An athlete can have multiple observations, one for each event entered, possibly across several Games. This file has 1,385 duplicate rows, which I investigate in the next steps. The "noc_regions.csv" file has 230 observations and 3 variables: the 'NOC' National Olympic Committee 3-letter code, the corresponding region, and notes. The file has 230 unique NOC codes; a few regions are associated with more than one NOC code, and in some of those cases the notes help distinguish them. The notes variable is missing for 209 observations. The NOC variable is present in both files and can be used as a key to join them into a single dataset.
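Since the NOC variable will later serve as the join key, a quick coverage check can be run before joining; the sketch below lists any NOC codes that appear in athlete_data but have no match in noc_data (an empty result would mean every code can be matched).
```{r}
# Sketch: NOC codes present in athlete_data but absent from noc_data
athlete_data %>%
  distinct(NOC) %>%
  anti_join(noc_data, by = "NOC")
```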
### Looking into duplicate data
```{r}
# Displaying the duplicate observations in "athlete_events.csv" file
duplicate_athlete_data <- athlete_data[duplicated(athlete_data),]
duplicate_athlete_data
```
```{r}
table(duplicate_athlete_data$Sport)
```
The "athlete_events.csv" file has 1385 duplicates as shown above. The table() shows that more than 90% of the duplicate observations are for the Sport 'Art Competitions'. These duplicates could have been introduced during the data collection while performing scraping. The duplicates can be removed from the athlete_data during the data cleaning process and before joining the datasets.
## Tidy Data and Mutate Variables (as needed)
The noc_data has 212 missing values in total, so I start by cleaning it.
```{r}
#Check for missing/null data in the noc_data
sum(is.na(noc_data))
sum(is.null(noc_data))
```
```{r}
# Checking which columns have NA values in noc_data
col <- colnames(noc_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(noc_data[,c]))))
}
```
The 'region' variable in noc_data has missing values for 3 observations, whose NOC codes are ROT, TUV, and UNK. The 3 observations are displayed below.
```{r}
# Displaying the observations with missing value in 'region' variable
noc_data%>%filter(is.na(region))
```
```{r}
# Displaying the observations with same value for both 'region' and 'notes' variables
noc_data%>%filter(region==notes)
``` 
Although the 'region' value is missing for these observations, we have the 'notes' for them. From the notes it is evident that ROT stands for Refugee Olympic Team, TUV stands for Tuvalu, and UNK stands for Unknown. I also checked whether any observations in noc_data have the same value for both 'region' and 'notes' and found 1 such observation: for the NOC code 'IOA', both region and notes are 'Individual Olympic Athletes'. Hence, for the NOC codes 'ROT', 'TUV', and 'UNK' I decided to impute the missing 'region' values with the corresponding 'notes' values.
```{r}
# Imputing the missing 'region' values with the corresponding 'notes' values in noc_data
noc_data <- noc_data%>%
  mutate(region = coalesce(region,notes))
# Sanity Check: Checking that the 3 observations no longer have missing 'region' values
noc_data%>%filter(is.na(region))
```
The 'notes' variable in noc_data is missing for 209 observations. Since that is more than 90% of the observations, I decided to drop the 'notes' variable. After dropping it, noc_data is left with 230 observations and 2 variables.
```{r}
# Dropping the 'notes' variable from noc_data
noc_data <- noc_data%>%
  select(-notes)
# Displaying the noc_data after tidying
noc_data
```
Next, I cleaned the athlete_data. As the first step, I dropped the 1,385 duplicate observations identified earlier while exploring the data. After dropping them, athlete_data has 269,731 observations and 15 variables.
```{r}
# Dropping the 1385 duplicate observations from athlete_data
athlete_data <- athlete_data%>%
  distinct()
```
The athlete_data has 359,615 missing values in total.
```{r}
#Check for missing/null data in the athlete_data
sum(is.na(athlete_data))
sum(is.null(athlete_data))
```
The variables 'Age', 'Height', 'Weight' and 'Medal' have missing values in the athlete_data.
```{r}
# Checking which columns have NA values in athlete_data
col <- colnames(athlete_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(athlete_data[,c]))))
}
```
The 'Medal' variable has 13,295 observations with value Bronze, 13,108 with Silver, and 13,369 with Gold. The remaining values are missing, indicating that the athlete did not win a medal in that event during that year and season.
```{r}
table(athlete_data$Medal)
```
I handled the missing data in 'Medal' variable by imputing the missing values with 'No Medal' as the athlete did not win a medal.
```{r}
# Handling missing data in 'Medal' variable
athlete_data <- athlete_data%>%
  mutate(Medal = replace(Medal, is.na(Medal), "No Medal"))
#Sanity Check: Checking that the 'Medal' variable has no missing values after data imputation
sum(is.na(athlete_data$Medal))
table(athlete_data$Medal)
```
The variables 'Age', 'Height', and 'Weight' have 9,315, 58,814, and 61,527 missing values respectively, which is roughly 3.5%, 21.8%, and 22.8% of the observations. These proportions are too large to simply drop the affected rows, so I performed data imputation for these variables. I imputed the missing values with the average Age, Height, and Weight of the athletes grouped by Sex, Season, Year, and Event. I grouped on those variables because athletes competing in the same event tend to fall within a similar age, height, and weight range; for example, male athletes in a given wrestling weight class (such as the 55 kg or 60 kg category) all fall within a narrow weight band.
```{r}
# Finding the percentage of missing values for the variables 'Age', 'Height', and 'Weight'
athlete_data %>% summarize(across(everything(), ~ sum(is.na(.x)) / length(.x)))
```
```{r}
# Handling the missing data in 'Age', 'Height', and 'Weight' variables using data imputation
# Storing the average age, height, and weight for each group
average_athlete_data <- athlete_data%>%
  group_by(Sex, Season, Year, Event)%>%
  summarise(average_Age = mean(Age, na.rm = TRUE),
            average_Height = mean(Height, na.rm = TRUE),
            average_Weight = mean(Weight, na.rm = TRUE))
# Joining the athlete_data and average_athlete_data using Sex, Season, Year and Event as the key
cleaned_athlete_data = merge(x=athlete_data, y=average_athlete_data, by=c("Sex", "Season", "Year", "Event"), all.x=TRUE)
cleaned_athlete_data <- tibble(cleaned_athlete_data)
# Replacing the missing values in 'Age', 'Height', and 'Weight' variables with the corresponding values in 'Average_Age', 'Average_Height', and 'Average_Weight' variables
cleaned_athlete_data <- cleaned_athlete_data%>%
    mutate(Age = coalesce(Age, average_Age),
           Height = coalesce(Height, average_Height),
           Weight = coalesce(Weight, average_Weight))
# Dropping the variables 'average_Age', 'average_Height', and 'average_Weight' from cleaned_athlete_data as they are no longer needed
cleaned_athlete_data <- cleaned_athlete_data%>%
  select(-c(average_Age, average_Height, average_Weight))
# Rounding 'Age' and 'Height' to the nearest integer and 'Weight' to one decimal place
cleaned_athlete_data <- cleaned_athlete_data%>%
  mutate(Age = round(Age, digits = 0),
         Height = round(Height, digits = 0),
         Weight = round(Weight, digits = 1))
# Finding the percentage of missing values for the variables 'Age', 'Height', and 'Weight' to check whether the percentage of missing values has decreased
cleaned_athlete_data %>% summarize(across(everything(), ~ sum(is.na(.x)) / length(.x)))
```
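The same imputation could also be written as a single grouped mutate, avoiding the merge and the later column drop. The sketch below (using the hypothetical name cleaned_athlete_data_alt, which is not used elsewhere) should give an equivalent result, up to how groups with no non-missing values are handled.
```{r}
# Alternative sketch: fill each NA with its group mean via a grouped mutate
cleaned_athlete_data_alt <- athlete_data%>%
  group_by(Sex, Season, Year, Event)%>%
  mutate(Age    = coalesce(Age, mean(Age, na.rm = TRUE)),
         Height = coalesce(Height, mean(Height, na.rm = TRUE)),
         Weight = coalesce(Weight, mean(Weight, na.rm = TRUE)))%>%
  ungroup()
```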
```{r}
# Displaying the count of missing values in cleaned_athlete_data for each variable
col <- colnames(cleaned_athlete_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(cleaned_athlete_data[,c]))))
}
```
The proportion of missing values for 'Age', 'Height', and 'Weight' has dropped from roughly 3.5%, 21.8%, and 22.8% to about 0.06%, 2.1%, and 4.5% respectively, which is a significant improvement. The remaining missing values could not be imputed because every observation in their group (grouped by Sex, Season, Year, and Event) was missing 'Age'/'Height'/'Weight', so there was no group mean to impute from. One possible solution is to remove all the observations that still have missing values; this would drop 12,792 observations, about 5% of the data. For now, I am keeping those observations, but I could store the complete cases in another tibble for visualization later, as in the sketch below.
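A minimal sketch of that option, assuming the complete cases go into a separate tibble (hypothetical name complete_athlete_data) so that cleaned_athlete_data itself is left untouched:
```{r}
# Sketch: storing the rows with no missing Age/Height/Weight in a separate tibble
complete_athlete_data <- cleaned_athlete_data%>%
  drop_na(Age, Height, Weight)
dim(complete_athlete_data)
```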
The 'Games' variable is redundant: it simply combines the information already present in the 'Year' and 'Season' variables. Hence, I dropped the 'Games' variable.
```{r}
# Dropping the 'Games' variable from cleaned_athlete_data
cleaned_athlete_data <- cleaned_athlete_data%>%
  select(-Games)
```
The cleaned_athlete_data is left with 269731 observations and 14 variables after cleaning.
## Join Data
After tidying both datasets, I joined cleaned_athlete_data and noc_data into a single dataset using 'NOC' as the key. The joined dataset has 269,731 observations and 15 variables, which makes sense: cleaned_athlete_data contributed 269,731 observations and 14 variables, noc_data contributed 2 variables, and the shared 'NOC' attribute is counted only once.
```{r}
# Performing a left join of cleaned_athlete_data and noc_data on 'NOC'
olympic_data = merge(x=cleaned_athlete_data, y=noc_data, by="NOC", all.x=TRUE)
olympic_data <- tibble(olympic_data)
olympic_data
```
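Because a left join keeps every athlete-event row even when its NOC code has no match in noc_data, a quick sanity check like the sketch below can reveal any codes whose region ended up missing after the join.
```{r}
# Sanity-check sketch: NOC codes left without a region after the join
olympic_data %>%
  filter(is.na(region)) %>%
  count(NOC)
```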
I rearranged the order of variables in olympic_data to make the data easier to read and analyze. I also sorted olympic_data in ascending order of 'Season' and 'Year'.
```{r}
# Rearranging the columns in olympic_data
olympic_data <- olympic_data%>%
  select(c("Season", "Year", "ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC", "region", "City", "Sport", "Event", "Medal"))
# Sorting the olympic_data in ascending order based on 'Season' and 'Year'
olympic_data <- olympic_data%>%
  arrange(Season, Year)
```
The olympic_data is cleaned and can be used for answering various research questions related to the Olympic games.
## Potential Research Questions
I have arrived at a few potential research questions that can be answered or analyzed using the cleaned olympic_data; a minimal sketch for the first question follows the list.
1) What is the ratio of male to female athletes participating in the Olympic games and has gender equality of athletes participating increased over the past 120 years?
2) Has the performance of female athletes improved over the years across teams?
3) Does the host city have any advantage in Olympic games in terms of winning more medals?
4) Is it possible to identify which sport or event wins the most medals for each team?
5) Has there been a significant change in the age/height/weight of athletes participating in the various events over the years?
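As an illustration, here is a minimal sketch for the first question, assuming each athlete should be counted once per Games via their ID:
```{r}
# Sketch for question 1: unique athletes by sex per Olympic year, split by season
olympic_data %>%
  group_by(Season, Year, Sex) %>%
  summarise(athletes = n_distinct(ID), .groups = "drop") %>%
  ggplot(aes(x = Year, y = athletes, color = Sex)) +
  geom_line() +
  facet_wrap(~Season) +
  labs(title = "Unique athletes per Games by sex",
       x = "Year", y = "Number of athletes")
```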