library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Vinitha Maheswaran
December 10, 2022
For this homework I will be working with the "120 years of Olympic history: athletes and results" dataset. The Olympics data has two CSV files: "athlete_events.csv" and "noc_regions.csv". This historical dataset contains information on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. The data was scraped from www.sports-reference.com in May 2018 and is available on Kaggle.
# A tibble: 271,116 × 15
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1 A Dijia… M 24 180 80 China CHN 1992… 1992 Summer Barc…
2 2 A Lamusi M 23 170 60 China CHN 2012… 2012 Summer Lond…
3 3 Gunnar … M 24 NA NA Denm… DEN 1920… 1920 Summer Antw…
4 4 Edgar L… M 34 NA NA Denm… DEN 1900… 1900 Summer Paris
5 5 Christi… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
6 5 Christi… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
7 5 Christi… F 25 185 82 Neth… NED 1992… 1992 Winter Albe…
8 5 Christi… F 25 185 82 Neth… NED 1992… 1992 Winter Albe…
9 5 Christi… F 27 185 82 Neth… NED 1994… 1994 Winter Lill…
10 5 Christi… F 27 185 82 Neth… NED 1994… 1994 Winter Lill…
# … with 271,106 more rows, and 3 more variables: Sport <chr>, Event <chr>,
# Medal <chr>
# A tibble: 230 × 3
NOC region notes
<chr> <chr> <chr>
1 AFG Afghanistan <NA>
2 AHO Curacao Netherlands Antilles
3 ALB Albania <NA>
4 ALG Algeria <NA>
5 AND Andorra <NA>
6 ANG Angola <NA>
7 ANT Antigua Antigua and Barbuda
8 ANZ Australia Australasia
9 ARG Argentina <NA>
10 ARM Armenia <NA>
# … with 220 more rows
[1] 271116 15
[1] 230 3
spc_tbl_ [271,116 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : num [1:271116] 1 2 3 4 5 5 5 5 5 5 ...
$ Name : chr [1:271116] "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
$ Sex : chr [1:271116] "M" "M" "M" "M" ...
$ Age : num [1:271116] 24 23 24 34 21 21 25 25 27 27 ...
$ Height: num [1:271116] 180 170 NA NA 185 185 185 185 185 185 ...
$ Weight: num [1:271116] 80 60 NA NA 82 82 82 82 82 82 ...
$ Team : chr [1:271116] "China" "China" "Denmark" "Denmark/Sweden" ...
$ NOC : chr [1:271116] "CHN" "CHN" "DEN" "DEN" ...
$ Games : chr [1:271116] "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
$ Year : num [1:271116] 1992 2012 1920 1900 1988 ...
$ Season: chr [1:271116] "Summer" "Summer" "Summer" "Summer" ...
$ City : chr [1:271116] "Barcelona" "London" "Antwerpen" "Paris" ...
$ Sport : chr [1:271116] "Basketball" "Judo" "Football" "Tug-Of-War" ...
$ Event : chr [1:271116] "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
$ Medal : chr [1:271116] NA NA NA "Gold" ...
- attr(*, "spec")=
.. cols(
.. ID = col_double(),
.. Name = col_character(),
.. Sex = col_character(),
.. Age = col_double(),
.. Height = col_double(),
.. Weight = col_double(),
.. Team = col_character(),
.. NOC = col_character(),
.. Games = col_character(),
.. Year = col_double(),
.. Season = col_character(),
.. City = col_character(),
.. Sport = col_character(),
.. Event = col_character(),
.. Medal = col_character()
.. )
- attr(*, "problems")=<externalptr>
spc_tbl_ [230 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ NOC : chr [1:230] "AFG" "AHO" "ALB" "ALG" ...
$ region: chr [1:230] "Afghanistan" "Curacao" "Albania" "Algeria" ...
$ notes : chr [1:230] NA "Netherlands Antilles" NA NA ...
- attr(*, "spec")=
.. cols(
.. NOC = col_character(),
.. region = col_character(),
.. notes = col_character()
.. )
- attr(*, "problems")=<externalptr>
dfSummary of athlete_data (frequency and graph columns omitted):

Variable | Type | Distinct values | Missing
---|---|---|---
ID | numeric | 135571 | 0 (0.0%)
Name | character | | 0 (0.0%)
Sex | character | | 0 (0.0%)
Age | numeric | 74 | 9474 (3.5%)
Height | numeric | 95 | 60171 (22.2%)
Weight | numeric | 220 | 62875 (23.2%)
Team | character | | 0 (0.0%)
NOC | character | | 0 (0.0%)
Games | character | | 0 (0.0%)
Year | numeric | 35 | 0 (0.0%)
Season | character | | 0 (0.0%)
City | character | | 0 (0.0%)
Sport | character | | 0 (0.0%)
Event | character | | 0 (0.0%)
Medal | character | | 231333 (85.3%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-24
dfSummary of noc_data (frequency and graph columns omitted):

Variable | Type | Missing
---|---|---
NOC | character | 0 (0.0%)
region | character | 3 (1.3%)
notes | character | 209 (90.9%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-24
The dataset contains information on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016 (the past 120 years). The Winter and Summer Games were held in the same year until 1992. After that, they were staggered such that the Winter Games occur every 4 years starting in 1994, and the Summer Games every 4 years starting in 1996. The "athlete_events.csv" file has 271,116 observations and 15 variables/attributes. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). It includes the athlete's name, gender, age, height (in cm), weight (in kg), the team/country they represent, the National Olympic Committee (3-letter code) they represent, the year and season of participation, the host city for that year and season, the sport, the event, and the medal won. An athlete can have multiple observations in the data, one for each event they entered across different Games. This file has 1385 duplicates, which I investigate in the next steps. The "noc_regions.csv" file has 230 observations and 3 variables/attributes: the 'NOC' National Olympic Committee 3-letter code, the corresponding region, and notes. The file has 230 unique NOC codes. A few NOC codes map to the same region, which in some cases is distinguished using the notes. The 'notes' variable is missing for 209 observations. The NOC variable is present in both files and can be used as a key to join them into a single dataset.
# A tibble: 1,385 × 15
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 704 Dsir An… M 27 NA NA Belg… BEL 1932… 1932 Summer Los …
2 2449 William… M 48 NA NA Unit… USA 1928… 1928 Summer Amst…
3 2449 William… M 48 NA NA Unit… USA 1928… 1928 Summer Amst…
4 2777 Hermann… M 43 NA NA Germ… GER 1928… 1928 Summer Amst…
5 2777 Hermann… M 43 NA NA Germ… GER 1928… 1928 Summer Amst…
6 2777 Hermann… M 51 NA NA Germ… GER 1936… 1936 Summer Berl…
7 2903 Lucien … M 46 NA NA Fran… FRA 1924… 1924 Summer Paris
8 2903 Lucien … M 46 NA NA Fran… FRA 1924… 1924 Summer Paris
9 4319 Ludwig … M 41 NA NA Germ… GER 1932… 1932 Summer Los …
10 4319 Ludwig … M 41 NA NA Germ… GER 1932… 1932 Summer Los …
# … with 1,375 more rows, and 3 more variables: Sport <chr>, Event <chr>,
# Medal <chr>
Art Competitions Cycling Equestrianism Sailing
1315 32 1 37
The "athlete_events.csv" file has 1385 duplicate observations, as shown above. The table() output shows that more than 90% of the duplicates belong to the sport 'Art Competitions'. These duplicates were likely introduced during scraping. I will remove them from the athlete_data during the data cleaning process, before joining the datasets.
The noc_data has 212 missing values in total, so I start by cleaning it.
[1] 212
[1] 0
[1] "NA values in NOC: 0"
[1] "NA values in region: 3"
[1] "NA values in notes: 209"
The 'region' variable in noc_data has missing values for 3 observations. The corresponding NOC codes for these 3 observations are ROT, TUV, and UNK. The 3 observations are displayed below.
# A tibble: 3 × 3
NOC region notes
<chr> <chr> <chr>
1 ROT <NA> Refugee Olympic Team
2 TUV <NA> Tuvalu
3 UNK <NA> Unknown
# A tibble: 1 × 3
NOC region notes
<chr> <chr> <chr>
1 IOA Individual Olympic Athletes Individual Olympic Athletes
Although the 'region' value is missing for these observations, we have the 'notes' for them. From the notes it is evident that ROT stands for Refugee Olympic Team, TUV stands for Tuvalu, and UNK stands for Unknown. I also checked whether any observations in noc_data have the same value for both 'region' and 'notes', and found one: for the NOC code 'IOA', both are 'Individual Olympic Athletes'. Hence, for the NOC codes 'ROT', 'TUV' and 'UNK' I decided to impute the missing 'region' values with the corresponding 'notes' values.
# A tibble: 0 × 3
# … with 3 variables: NOC <chr>, region <chr>, notes <chr>
The 'notes' variable in noc_data is missing for 209 observations (over 90%), so I decided to drop it. After dropping 'notes', the noc_data is left with 230 observations and 2 variables.
# A tibble: 230 × 2
NOC region
<chr> <chr>
1 AFG Afghanistan
2 AHO Curacao
3 ALB Albania
4 ALG Algeria
5 AND Andorra
6 ANG Angola
7 ANT Antigua
8 ANZ Australia
9 ARG Argentina
10 ARM Armenia
# … with 220 more rows
Next, I cleaned the athlete_data. As the first step, I dropped the 1385 duplicate observations identified earlier while exploring the data. After dropping them, the athlete_data has 269,731 observations and 15 variables.
The athlete_data has 359,615 missing values in total.
[1] 359615
[1] 0
The variables ‘Age’, ‘Height’, ‘Weight’ and ‘Medal’ have missing values in the athlete_data.
[1] "NA values in ID: 0"
[1] "NA values in Name: 0"
[1] "NA values in Sex: 0"
[1] "NA values in Age: 9315"
[1] "NA values in Height: 58814"
[1] "NA values in Weight: 61527"
[1] "NA values in Team: 0"
[1] "NA values in NOC: 0"
[1] "NA values in Games: 0"
[1] "NA values in Year: 0"
[1] "NA values in Season: 0"
[1] "NA values in City: 0"
[1] "NA values in Sport: 0"
[1] "NA values in Event: 0"
[1] "NA values in Medal: 229959"
The 'Medal' variable has 13,295 observations with value Bronze, 13,108 with Silver, and 13,369 with Gold; the remaining values are missing. A missing value indicates that the athlete did not win a medal in that event at those Games.
I handled the missing data in the 'Medal' variable by imputing the missing values with 'No Medal', since a missing value means the athlete did not win a medal.
[1] 0
Bronze Gold No Medal Silver
13295 13369 229959 13108
The variables 'Age', 'Height', and 'Weight' have 9315, 58814, and 61527 missing values respectively, which corresponds to roughly 3.5%, 21.8% and 22.8% of the observations. These proportions are too large to drop outright, so I performed data imputation for these variables. I imputed the missing values with the average Age, Height and Weight of athletes grouped by Sex, Season, Year, and Event, because the athletes competing in a given event are usually in a similar age, height and weight range. For example, male wrestlers compete in weight categories such as 55 kg or 60 kg.
# A tibble: 1 × 15
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0.0345 0.218 0.228 0 0 0 0 0 0
# … with 3 more variables: Sport <dbl>, Event <dbl>, Medal <dbl>
# Handling the missing data in 'Age', 'Height', and 'Weight' variables using data imputation
# Storing the average age, height, and weight for each group
average_athlete_data <- athlete_data%>%
group_by(Sex, Season, Year, Event)%>%
summarise(average_Age = mean(Age, na.rm = TRUE),
average_Height = mean(Height, na.rm = TRUE),
average_Weight = mean(Weight, na.rm = TRUE))
# Joining the athlete_data and average_athlete_data using Sex, Season, Year and Event as the key
cleaned_athlete_data = merge(x=athlete_data, y=average_athlete_data, by=c("Sex", "Season", "Year", "Event"), all.x=TRUE)
cleaned_athlete_data <- tibble(cleaned_athlete_data)
# Replacing the missing values in 'Age', 'Height', and 'Weight' variables with the corresponding values in 'Average_Age', 'Average_Height', and 'Average_Weight' variables
cleaned_athlete_data <- cleaned_athlete_data%>%
mutate(Age = coalesce(Age, average_Age),
Height = coalesce(Height, average_Height),
Weight = coalesce(Weight, average_Weight))
# Dropping the variables 'average_Age', 'average_Height', and 'average_Weight' from cleaned_athlete_data as they are no longer needed
cleaned_athlete_data <- cleaned_athlete_data%>%
select(-c(average_Age, average_Height, average_Weight))
# Rounding 'Age' and 'Height' to the nearest integer and 'Weight' to one decimal place
cleaned_athlete_data <- cleaned_athlete_data%>%
mutate(Age = round(Age, digits = 0),
Height = round(Height, digits = 0),
Weight = round(Weight, digits = 1))
# Finding the percentage of missing values for 'Age', 'Height', and 'Weight' to check whether it has decreased
cleaned_athlete_data %>% summarize_all(~ sum(is.na(.)) / length(.))
# A tibble: 1 × 15
Sex Season Year Event ID Name Age Height Weight Team NOC Games
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0 0 0 0.000552 0.0207 0.0452 0 0 0
# … with 3 more variables: City <dbl>, Sport <dbl>, Medal <dbl>
[1] "NA values in Sex: 0"
[1] "NA values in Season: 0"
[1] "NA values in Year: 0"
[1] "NA values in Event: 0"
[1] "NA values in ID: 0"
[1] "NA values in Name: 0"
[1] "NA values in Age: 149"
[1] "NA values in Height: 5586"
[1] "NA values in Weight: 12185"
[1] "NA values in Team: 0"
[1] "NA values in NOC: 0"
[1] "NA values in Games: 0"
[1] "NA values in City: 0"
[1] "NA values in Sport: 0"
[1] "NA values in Medal: 0"
The proportion of missing values for 'Age', 'Height', and 'Weight' has dropped from roughly 3.5%, 21.8% and 22.8% to 0.06%, 2.1% and 4.5% respectively, which is a significant improvement. The remaining missing values could not be imputed because, in their groups (grouped by Sex, Season, Year and Event), every observation was missing 'Age'/'Height'/'Weight', so no group mean could be computed. One possible solution is to remove all observations that still have missing values; this would drop 12,792 observations, about 5% of the total data. For now, I am keeping these observations. However, I can remove the 12,792 observations and store them in another tibble for visualization in the future.
The ‘Games’ variable is redundant as it contains information about the year and season of the Olympic games which is already present in the ‘Year’ and ‘Season’ variables. Hence, I dropped the ‘Games’ variable.
The cleaned_athlete_data is left with 269731 observations and 14 variables after cleaning.
As the next step after tidying the datasets, I joined the cleaned_athlete_data and noc_data into a single dataset using 'NOC' as the key. The joined dataset has 269,731 observations and 15 variables, which makes sense: cleaned_athlete_data had 269,731 observations, and the two datasets had 14 and 2 attributes respectively, with the shared 'NOC' attribute counted only once.
# A tibble: 269,731 × 15
NOC Sex Season Year Event ID Name Age Height Weight Team City
<chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 AFG M Summer 1988 Wrestl… 84618 Ahma… 24 161 53.6 Afgh… Seoul
2 AFG F Summer 2012 Athlet… 61961 Tahm… 23 160 52 Afgh… Lond…
3 AFG M Summer 1964 Wrestl… 86829 Nour… 20 162 58.8 Afgh… Tokyo
4 AFG M Summer 1936 Hockey… 5841 Saye… 22 174 82 Afgh… Berl…
5 AFG M Summer 1960 Wrestl… 116125 Niza… 34 168 111 Afgh… Roma
6 AFG M Summer 1948 Hockey… 5844 Moha… 28 176 74.1 Afgh… Lond…
7 AFG M Summer 1956 Hockey… 87372 Din … 27 177 72.6 Afgh… Melb…
8 AFG M Summer 1960 Wrestl… 58364 Moha… 20 166 62 Afgh… Roma
9 AFG M Summer 2012 Judo M… 33817 Ajma… 25 173 66 Afgh… Lond…
10 AFG M Summer 1972 Wrestl… 28855 Ahma… 24 165 52 Afgh… Muni…
# … with 269,721 more rows, and 3 more variables: Sport <chr>, Medal <chr>,
# region <chr>
I rearranged the order of variables in olympic_data to make the data easier to understand and analyze. I also sorted olympic_data in ascending order by 'Season' and 'Year'.
# Rearranging the columns in olympic_data
olympic_data <- olympic_data%>%
select(c("Season", "Year", "ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC", "region", "City", "Sport", "Event", "Medal"))
# Sorting the olympic_data in ascending order based on 'Season' and 'Year'
olympic_data <- olympic_data%>%
arrange(Season, Year)
The olympic_data is cleaned and can be used for answering various research questions related to the Olympic games.
I have arrived at a few potential research questions that can be answered or explored using the cleaned olympic_data.
---
title: "Homework 2"
author: "Vinitha Maheswaran"
date: "12/10/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw2
- Olympics
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in data
For this homework I will be working with the "120 years of Olympic history: athletes and results" dataset. The Olympics data has two CSV files: "athlete_events.csv" and "noc_regions.csv". This historical dataset contains information on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. The data was scraped from www.sports-reference.com in May 2018 and is available on Kaggle.
```{r}
# Reading the "athlete_events.csv" and "noc_regions.csv" files
athlete_data <- read_csv("_data/athlete_events.csv")
noc_data <- read_csv("_data/noc_regions.csv")
```
```{r}
# Displaying athlete_data dataset
athlete_data
```
```{r}
# Displaying noc_data dataset
noc_data
```
```{r}
# Finding dimension of both datasets
dim(athlete_data)
dim(noc_data)
```
```{r}
# Structure of athlete_data dataset
str(athlete_data)
```
```{r}
# Structure of noc_data dataset
str(noc_data)
```
```{r}
#Summary of athlete_data
library(summarytools)
print(summarytools::dfSummary(athlete_data,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.60,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
```{r}
#Summary of noc_data
library(summarytools)
print(summarytools::dfSummary(noc_data,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.60,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
### Briefly describe the data
The dataset contains information on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016 (the past 120 years). The Winter and Summer Games were held in the same year until 1992. After that, they were staggered such that the Winter Games occur every 4 years starting in 1994, and the Summer Games every 4 years starting in 1996. The "athlete_events.csv" file has 271,116 observations and 15 variables/attributes. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). It includes the athlete's name, gender, age, height (in cm), weight (in kg), the team/country they represent, the National Olympic Committee (3-letter code) they represent, the year and season of participation, the host city for that year and season, the sport, the event, and the medal won. An athlete can have multiple observations in the data, one for each event they entered across different Games. This file has 1385 duplicates, which I investigate in the next steps. The "noc_regions.csv" file has 230 observations and 3 variables/attributes: the 'NOC' National Olympic Committee 3-letter code, the corresponding region, and notes. The file has 230 unique NOC codes. A few NOC codes map to the same region, which in some cases is distinguished using the notes. The 'notes' variable is missing for 209 observations. The NOC variable is present in both files and can be used as a key to join them into a single dataset.
### Looking into duplicate data
```{r}
# Displaying the duplicate observations in "athlete_events.csv" file
duplicate_athlete_data <- athlete_data[duplicated(athlete_data),]
duplicate_athlete_data
```
```{r}
table(duplicate_athlete_data$Sport)
```
The "athlete_events.csv" file has 1385 duplicate observations, as shown above. The table() output shows that more than 90% of the duplicates belong to the sport 'Art Competitions'. These duplicates were likely introduced during scraping. I will remove them from the athlete_data during the data cleaning process, before joining the datasets.
## Tidy Data and Mutate Variables (as needed)
The noc_data has 212 missing values in total, so I start by cleaning it.
```{r}
#Check for missing/null data in the noc_data
sum(is.na(noc_data))
sum(is.null(noc_data))
```
```{r}
# Checking which columns have NA values in noc_data
col <- colnames(noc_data)
for (c in col){
print(paste0("NA values in ", c, ": ", sum(is.na(noc_data[,c]))))
}
```
The 'region' variable in noc_data has missing values for 3 observations. The corresponding NOC codes for these 3 observations are ROT, TUV, and UNK. The 3 observations are displayed below.
```{r}
# Displaying the observations with missing value in 'region' variable
noc_data%>%filter(is.na(region))
```
```{r}
# Displaying the observations with same value for both 'region' and 'notes' variables
noc_data%>%filter(region==notes)
```
Although the 'region' value is missing for these observations, we have the 'notes' for them. From the notes it is evident that ROT stands for Refugee Olympic Team, TUV stands for Tuvalu, and UNK stands for Unknown. I also checked whether any observations in noc_data have the same value for both 'region' and 'notes', and found one: for the NOC code 'IOA', both are 'Individual Olympic Athletes'. Hence, for the NOC codes 'ROT', 'TUV' and 'UNK' I decided to impute the missing 'region' values with the corresponding 'notes' values.
```{r}
# Imputing the missing 'region' values with the corresponding 'notes' values in noc_data
noc_data <- noc_data%>%
mutate(region = coalesce(region,notes))
# Sanity Check: Checking that the 3 observations no longer have missing 'region' values
noc_data%>%filter(is.na(region))
```
The 'notes' variable in noc_data is missing for 209 observations (over 90%), so I decided to drop it. After dropping 'notes', the noc_data is left with 230 observations and 2 variables.
```{r}
# Dropping the 'notes' variable from noc_data
noc_data <- noc_data%>%
select(-notes)
# Displaying the noc_data after tidying
noc_data
```
Next, I cleaned the athlete_data. As the first step, I dropped the 1385 duplicate observations identified earlier while exploring the data. After dropping them, the athlete_data has 269,731 observations and 15 variables.
```{r}
# Dropping the 1385 duplicate observations from athlete_data
athlete_data <- athlete_data%>%
distinct()
```
The athlete_data has 359,615 missing values in total.
```{r}
#Check for missing/null data in the athlete_data
sum(is.na(athlete_data))
sum(is.null(athlete_data))
```
The variables 'Age', 'Height', 'Weight' and 'Medal' have missing values in the athlete_data.
```{r}
# Checking which columns have NA values in athlete_data
col <- colnames(athlete_data)
for (c in col){
print(paste0("NA values in ", c, ": ", sum(is.na(athlete_data[,c]))))
}
```
The 'Medal' variable has 13,295 observations with value Bronze, 13,108 with Silver, and 13,369 with Gold; the remaining values are missing. A missing value indicates that the athlete did not win a medal in that event at those Games.
```{r}
table(athlete_data$Medal)
```
I handled the missing data in the 'Medal' variable by imputing the missing values with 'No Medal', since a missing value means the athlete did not win a medal.
```{r}
# Handling missing data in 'Medal' variable
athlete_data <- athlete_data%>%
mutate(Medal = replace(Medal, is.na(Medal), "No Medal"))
#Sanity Check: Checking that the 'Medal' variable has no missing values after data imputation
sum(is.na(athlete_data$Medal))
table(athlete_data$Medal)
```
The variables 'Age', 'Height', and 'Weight' have 9315, 58814, and 61527 missing values respectively, which corresponds to roughly 3.5%, 21.8% and 22.8% of the observations. These proportions are too large to drop outright, so I performed data imputation for these variables. I imputed the missing values with the average Age, Height and Weight of athletes grouped by Sex, Season, Year, and Event, because the athletes competing in a given event are usually in a similar age, height and weight range. For example, male wrestlers compete in weight categories such as 55 kg or 60 kg.
```{r}
# Finding the percentage of missing values for the variables 'Age', 'Height', and 'Weight'
athlete_data %>% summarize_all(~ sum(is.na(.)) / length(.))
```
```{r}
# Handling the missing data in 'Age', 'Height', and 'Weight' variables using data imputation
# Storing the average age, height, and weight for each group
average_athlete_data <- athlete_data%>%
group_by(Sex, Season, Year, Event)%>%
summarise(average_Age = mean(Age, na.rm = TRUE),
average_Height = mean(Height, na.rm = TRUE),
average_Weight = mean(Weight, na.rm = TRUE))
# Joining the athlete_data and average_athlete_data using Sex, Season, Year and Event as the key
cleaned_athlete_data = merge(x=athlete_data, y=average_athlete_data, by=c("Sex", "Season", "Year", "Event"), all.x=TRUE)
cleaned_athlete_data <- tibble(cleaned_athlete_data)
# Replacing the missing values in 'Age', 'Height', and 'Weight' variables with the corresponding values in 'Average_Age', 'Average_Height', and 'Average_Weight' variables
cleaned_athlete_data <- cleaned_athlete_data%>%
mutate(Age = coalesce(Age, average_Age),
Height = coalesce(Height, average_Height),
Weight = coalesce(Weight, average_Weight))
# Dropping the variables 'average_Age', 'average_Height', and 'average_Weight' from cleaned_athlete_data as they are no longer needed
cleaned_athlete_data <- cleaned_athlete_data%>%
select(-c(average_Age, average_Height, average_Weight))
# Rounding 'Age' and 'Height' to the nearest integer and 'Weight' to one decimal place
cleaned_athlete_data <- cleaned_athlete_data%>%
mutate(Age = round(Age, digits = 0),
Height = round(Height, digits = 0),
Weight = round(Weight, digits = 1))
# Finding the percentage of missing values for 'Age', 'Height', and 'Weight' to check whether it has decreased
cleaned_athlete_data %>% summarize_all(~ sum(is.na(.)) / length(.))
```
```{r}
# Displaying the count of missing values in cleaned_athlete_data for each variable
col <- colnames(cleaned_athlete_data)
for (c in col){
print(paste0("NA values in ", c, ": ", sum(is.na(cleaned_athlete_data[,c]))))
}
```
The proportion of missing values for 'Age', 'Height', and 'Weight' has dropped from roughly 3.5%, 21.8% and 22.8% to 0.06%, 2.1% and 4.5% respectively, which is a significant improvement. The remaining missing values could not be imputed because, in their groups (grouped by Sex, Season, Year and Event), every observation was missing 'Age'/'Height'/'Weight', so no group mean could be computed. One possible solution is to remove all observations that still have missing values; this would drop 12,792 observations, about 5% of the total data. For now, I am keeping these observations. However, I can remove the 12,792 observations and store them in another tibble for visualization in the future.
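The "remove and store separately" option described above can be sketched with tidyr's drop_na(); the name complete_athlete_data is hypothetical, introduced here only for illustration.

```{r}
# Sketch only: keep the fully observed rows in a separate tibble for
# future visualization, leaving cleaned_athlete_data itself unchanged.
# 'complete_athlete_data' is a hypothetical name not used elsewhere in this post.
complete_athlete_data <- cleaned_athlete_data %>%
  drop_na(Age, Height, Weight)

# Sanity check: the difference in row counts should equal the number of rows
# that still had a missing Age, Height, or Weight (about 12,792).
nrow(cleaned_athlete_data) - nrow(complete_athlete_data)
```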
The 'Games' variable is redundant as it contains information about the year and season of the Olympic games which is already present in the 'Year' and 'Season' variables. Hence, I dropped the 'Games' variable.
```{r}
# Dropping the 'Games' variable from cleaned_athlete_data
cleaned_athlete_data <- cleaned_athlete_data%>%
select(-Games)
```
The cleaned_athlete_data is left with 269731 observations and 14 variables after cleaning.
## Join Data
As the next step after tidying the datasets, I joined the cleaned_athlete_data and noc_data into a single dataset using 'NOC' as the key. The joined dataset has 269,731 observations and 15 variables, which makes sense: cleaned_athlete_data had 269,731 observations, and the two datasets had 14 and 2 attributes respectively, with the shared 'NOC' attribute counted only once.
```{r}
# Performing a left join of cleaned_athlete_data and noc_data using 'NOC' as the key
olympic_data = merge(x=cleaned_athlete_data, y=noc_data, by="NOC", all.x=TRUE)
olympic_data <- tibble(olympic_data)
olympic_data
```
I rearranged the order of variables in olympic_data to make the data easier to understand and analyze. I also sorted olympic_data in ascending order by 'Season' and 'Year'.
```{r}
# Rearranging the columns in olympic_data
olympic_data <- olympic_data%>%
select(c("Season", "Year", "ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC", "region", "City", "Sport", "Event", "Medal"))
# Sorting the olympic_data in ascending order based on 'Season' and 'Year'
olympic_data <- olympic_data%>%
arrange(Season, Year)
```
The olympic_data is cleaned and can be used for answering various research questions related to the Olympic games.
## Potential Research Questions
I have arrived at a few potential research questions that can be answered or explored using the cleaned olympic_data.
1) What is the ratio of male to female athletes participating in the Olympic games and has gender equality of athletes participating increased over the past 120 years?
2) Has the performance of female athletes improved over the years for teams?
3) Does the host city have any advantage in Olympic games in terms of winning more medals?
4) Is it possible to identify which sport event wins more medals for each team?
5) Has there been a significant change in the age/height/weight of athletes participating in the various events over the years?
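As a sketch of how the first question could be approached with the cleaned data, the code below computes the share of female athletes at each Games; the column names follow olympic_data, while the name female_share and the plot styling are only suggestions, not part of the cleaning pipeline above.

```{r}
# Sketch for research question 1: share of female athletes per Games.
# Counting distinct athletes (ID) per Year/Season/Sex avoids double-counting
# athletes who entered multiple events at the same Games.
female_share <- olympic_data %>%
  distinct(Year, Season, Sex, ID) %>%
  count(Year, Season, Sex) %>%
  group_by(Year, Season) %>%
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  filter(Sex == "F")

ggplot(female_share, aes(x = Year, y = share, color = Season)) +
  geom_line() +
  labs(y = "Share of female athletes",
       title = "Gender balance at the Olympics, 1896-2016")
```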