DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Homework 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read in data
    • Briefly describe the data
    • Looking into duplicate data
  • Tidy Data and Mutate Variables (as needed)
  • Join Data
  • Potential Research Questions

Homework 2

  • Show All Code
  • Hide All Code

  • View Source
hw2
Olympics
Author

Vinitha Maheswaran

Published

December 10, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

For this homework I will be working with the 120 years of Olympic history: athletes and results dataset. The Olympics data has two csv files - “athlete_events.csv” and “noc_regions.csv”. This historical dataset contains information on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. This data was scraped from www.sports-reference.com in May 2018 and is available on Kaggle.

Code
# Reading the "athlete_events.csv" and "noc_regions.csv" files

athlete_data <- read_csv("_data/athlete_events.csv")
noc_data <- read_csv("_data/noc_regions.csv")
Code
# Displaying athlete_data dataset

athlete_data
# A tibble: 271,116 × 15
      ID Name     Sex     Age Height Weight Team  NOC   Games  Year Season City 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1     1 A Dijia… M        24    180     80 China CHN   1992…  1992 Summer Barc…
 2     2 A Lamusi M        23    170     60 China CHN   2012…  2012 Summer Lond…
 3     3 Gunnar … M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
 4     4 Edgar L… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
 5     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 6     5 Christi… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
 7     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 8     5 Christi… F        25    185     82 Neth… NED   1992…  1992 Winter Albe…
 9     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
10     5 Christi… F        27    185     82 Neth… NED   1994…  1994 Winter Lill…
# … with 271,106 more rows, and 3 more variables: Sport <chr>, Event <chr>,
#   Medal <chr>
Code
# Displaying noc_data dataset

noc_data
# A tibble: 230 × 3
   NOC   region      notes               
   <chr> <chr>       <chr>               
 1 AFG   Afghanistan <NA>                
 2 AHO   Curacao     Netherlands Antilles
 3 ALB   Albania     <NA>                
 4 ALG   Algeria     <NA>                
 5 AND   Andorra     <NA>                
 6 ANG   Angola      <NA>                
 7 ANT   Antigua     Antigua and Barbuda 
 8 ANZ   Australia   Australasia         
 9 ARG   Argentina   <NA>                
10 ARM   Armenia     <NA>                
# … with 220 more rows
Code
# Finding dimension of both datasets

dim(athlete_data)
[1] 271116     15
Code
dim(noc_data)
[1] 230   3
Code
# Structure of athlete_data dataset

str(athlete_data)
spc_tbl_ [271,116 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID    : num [1:271116] 1 2 3 4 5 5 5 5 5 5 ...
 $ Name  : chr [1:271116] "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
 $ Sex   : chr [1:271116] "M" "M" "M" "M" ...
 $ Age   : num [1:271116] 24 23 24 34 21 21 25 25 27 27 ...
 $ Height: num [1:271116] 180 170 NA NA 185 185 185 185 185 185 ...
 $ Weight: num [1:271116] 80 60 NA NA 82 82 82 82 82 82 ...
 $ Team  : chr [1:271116] "China" "China" "Denmark" "Denmark/Sweden" ...
 $ NOC   : chr [1:271116] "CHN" "CHN" "DEN" "DEN" ...
 $ Games : chr [1:271116] "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
 $ Year  : num [1:271116] 1992 2012 1920 1900 1988 ...
 $ Season: chr [1:271116] "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr [1:271116] "Barcelona" "London" "Antwerpen" "Paris" ...
 $ Sport : chr [1:271116] "Basketball" "Judo" "Football" "Tug-Of-War" ...
 $ Event : chr [1:271116] "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
 $ Medal : chr [1:271116] NA NA NA "Gold" ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   Name = col_character(),
  ..   Sex = col_character(),
  ..   Age = col_double(),
  ..   Height = col_double(),
  ..   Weight = col_double(),
  ..   Team = col_character(),
  ..   NOC = col_character(),
  ..   Games = col_character(),
  ..   Year = col_double(),
  ..   Season = col_character(),
  ..   City = col_character(),
  ..   Sport = col_character(),
  ..   Event = col_character(),
  ..   Medal = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
# Structure of noc_data dataset

str(noc_data)
spc_tbl_ [230 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ NOC   : chr [1:230] "AFG" "AHO" "ALB" "ALG" ...
 $ region: chr [1:230] "Afghanistan" "Curacao" "Albania" "Algeria" ...
 $ notes : chr [1:230] NA "Netherlands Antilles" NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   NOC = col_character(),
  ..   region = col_character(),
  ..   notes = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
#Summary of athlete_data

library(summarytools)
print(summarytools::dfSummary(athlete_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

athlete_data

Dimensions: 271116 x 15
Duplicates: 1385
Variable Stats / Values Freqs (% of Valid) Graph Missing
ID [numeric]
Mean (sd) : 68249 (39022.3)
min ≤ med ≤ max:
1 ≤ 68205 ≤ 135571
IQR (CV) : 67454.2 (0.6)
135571 distinct values 0 (0.0%)
Name [character]
1. Robert Tait McKenzie
2. Heikki Ilmari Savolainen
3. Joseph "Josy" Stoffel
4. Ioannis Theofilakis
5. Takashi Ono
6. Alexandros Theofilakis
7. Alfrd (Arnold-) Hajs (Gut
8. Andreas Wecker
9. Jean Lucien Nicolas Jacob
10. Alfred August "Al" Jochim
[ 134721 others ]
58(0.0%)
39(0.0%)
38(0.0%)
36(0.0%)
33(0.0%)
32(0.0%)
32(0.0%)
32(0.0%)
32(0.0%)
31(0.0%)
270753(99.9%)
0 (0.0%)
Sex [character]
1. F
2. M
74522(27.5%)
196594(72.5%)
0 (0.0%)
Age [numeric]
Mean (sd) : 25.6 (6.4)
min ≤ med ≤ max:
10 ≤ 24 ≤ 97
IQR (CV) : 7 (0.3)
74 distinct values 9474 (3.5%)
Height [numeric]
Mean (sd) : 175.3 (10.5)
min ≤ med ≤ max:
127 ≤ 175 ≤ 226
IQR (CV) : 15 (0.1)
95 distinct values 60171 (22.2%)
Weight [numeric]
Mean (sd) : 70.7 (14.3)
min ≤ med ≤ max:
25 ≤ 70 ≤ 214
IQR (CV) : 19 (0.2)
220 distinct values 62875 (23.2%)
Team [character]
1. United States
2. France
3. Great Britain
4. Italy
5. Germany
6. Canada
7. Japan
8. Sweden
9. Australia
10. Hungary
[ 1174 others ]
17847(6.6%)
11988(4.4%)
11404(4.2%)
10260(3.8%)
9326(3.4%)
9279(3.4%)
8289(3.1%)
8052(3.0%)
7513(2.8%)
6547(2.4%)
170611(62.9%)
0 (0.0%)
NOC [character]
1. USA
2. FRA
3. GBR
4. ITA
5. GER
6. CAN
7. JPN
8. SWE
9. AUS
10. HUN
[ 220 others ]
18853(7.0%)
12758(4.7%)
12256(4.5%)
10715(4.0%)
9830(3.6%)
9733(3.6%)
8444(3.1%)
8339(3.1%)
7638(2.8%)
6607(2.4%)
165943(61.2%)
0 (0.0%)
Games [character]
1. 2000 Summer
2. 1996 Summer
3. 2016 Summer
4. 2008 Summer
5. 2004 Summer
6. 1992 Summer
7. 2012 Summer
8. 1988 Summer
9. 1972 Summer
10. 1984 Summer
[ 41 others ]
13821(5.1%)
13780(5.1%)
13688(5.0%)
13602(5.0%)
13443(5.0%)
12977(4.8%)
12920(4.8%)
12037(4.4%)
10304(3.8%)
9454(3.5%)
145090(53.5%)
0 (0.0%)
Year [numeric]
Mean (sd) : 1978.4 (29.9)
min ≤ med ≤ max:
1896 ≤ 1988 ≤ 2016
IQR (CV) : 42 (0)
35 distinct values 0 (0.0%)
Season [character]
1. Summer
2. Winter
222552(82.1%)
48564(17.9%)
0 (0.0%)
City [character]
1. London
2. Athina
3. Sydney
4. Atlanta
5. Rio de Janeiro
6. Beijing
7. Barcelona
8. Los Angeles
9. Seoul
10. Munich
[ 32 others ]
22426(8.3%)
15556(5.7%)
13821(5.1%)
13780(5.1%)
13688(5.0%)
13602(5.0%)
12977(4.8%)
12423(4.6%)
12037(4.4%)
10304(3.8%)
130502(48.1%)
0 (0.0%)
Sport [character]
1. Athletics
2. Gymnastics
3. Swimming
4. Shooting
5. Cycling
6. Fencing
7. Rowing
8. Cross Country Skiing
9. Alpine Skiing
10. Wrestling
[ 56 others ]
38624(14.2%)
26707(9.9%)
23195(8.6%)
11448(4.2%)
10859(4.0%)
10735(4.0%)
10595(3.9%)
9133(3.4%)
8829(3.3%)
7154(2.6%)
113837(42.0%)
0 (0.0%)
Event [character]
1. Football Men's Football
2. Ice Hockey Men's Ice Hock
3. Hockey Men's Hockey
4. Water Polo Men's Water Po
5. Basketball Men's Basketba
6. Cycling Men's Road Race,
7. Gymnastics Men's Individu
8. Rowing Men's Coxed Eights
9. Gymnastics Men's Team All
10. Handball Men's Handball
[ 755 others ]
5733(2.1%)
4762(1.8%)
3958(1.5%)
3358(1.2%)
3280(1.2%)
2947(1.1%)
2500(0.9%)
2423(0.9%)
2411(0.9%)
2264(0.8%)
237480(87.6%)
0 (0.0%)
Medal [character]
1. Bronze
2. Gold
3. Silver
13295(33.4%)
13372(33.6%)
13116(33.0%)
231333 (85.3%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-24

Code
#Summary of noc_data

library(summarytools)
print(summarytools::dfSummary(noc_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

noc_data

Dimensions: 230 x 3
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
NOC [character]
1. AFG
2. AHO
3. ALB
4. ALG
5. AND
6. ANG
7. ANT
8. ANZ
9. ARG
10. ARM
[ 220 others ]
1(0.4%)
1(0.4%)
1(0.4%)
1(0.4%)
1(0.4%)
1(0.4%)
1(0.4%)
1(0.4%)
1(0.4%)
1(0.4%)
220(95.7%)
0 (0.0%)
region [character]
1. Germany
2. Czech Republic
3. Malaysia
4. Russia
5. Serbia
6. Yemen
7. Australia
8. Canada
9. China
10. Greece
[ 196 others ]
4(1.8%)
3(1.3%)
3(1.3%)
3(1.3%)
3(1.3%)
3(1.3%)
2(0.9%)
2(0.9%)
2(0.9%)
2(0.9%)
200(88.1%)
3 (1.3%)
notes [character]
1. Antigua and Barbuda
2. Australasia
3. Bohemia
4. Crete
5. Hong Kong
6. Individual Olympic Athlet
7. Netherlands Antilles
8. Newfoundland
9. North Borneo
10. North Yemen
[ 11 others ]
1(4.8%)
1(4.8%)
1(4.8%)
1(4.8%)
1(4.8%)
1(4.8%)
1(4.8%)
1(4.8%)
1(4.8%)
1(4.8%)
11(52.4%)
209 (90.9%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-24

Briefly describe the data

The dataset contains information on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016 (the past 120 years). The Winter and Summer Games were held in the same year until 1992. After that, they were staggered such that the Winter Games occur once every 4 years starting with 1994, and the Summer Games occur once every 4 years starting with 1996. The “athlete_events.csv” file has 271,116 observations and 15 variables/attributes. Each row in this csv file corresponds to an individual athlete competing in an individual Olympic event (athlete-events). It includes information about the athlete’s name, gender, age, height (in cm), weight (in kg), team/country they represent, National Olympic Committee (3-letter code) they are representing, year and season participated, Olympic games host city for that year and season, sport, athlete event and medal won. Each athlete will have multiple observations in the data as they would have participated in multiple events and during different seasons. This csv file has 1385 duplicates which I will be investigating in the next steps. The “noc_regions.csv” file has 230 observations and 3 variables/attributes. This file contains information about the ‘NOC’ National Olympic Committee which is a 3-letter code, the corresponding region and notes. The file has 230 unique codes for the NOC variable. Few of the regions have same NOC code which in some cases is distinguished using the notes. The notes has missing value for 209 observations. The NOC variable is present in both the files and can be used as a key to join both the files into a single dataset.

Looking into duplicate data

Code
# Displaying the duplicate observations in "athlete_events.csv" file

duplicate_athlete_data <- athlete_data[duplicated(athlete_data),]
duplicate_athlete_data
# A tibble: 1,385 × 15
      ID Name     Sex     Age Height Weight Team  NOC   Games  Year Season City 
   <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
 1   704 Dsir An… M        27     NA     NA Belg… BEL   1932…  1932 Summer Los …
 2  2449 William… M        48     NA     NA Unit… USA   1928…  1928 Summer Amst…
 3  2449 William… M        48     NA     NA Unit… USA   1928…  1928 Summer Amst…
 4  2777 Hermann… M        43     NA     NA Germ… GER   1928…  1928 Summer Amst…
 5  2777 Hermann… M        43     NA     NA Germ… GER   1928…  1928 Summer Amst…
 6  2777 Hermann… M        51     NA     NA Germ… GER   1936…  1936 Summer Berl…
 7  2903 Lucien … M        46     NA     NA Fran… FRA   1924…  1924 Summer Paris
 8  2903 Lucien … M        46     NA     NA Fran… FRA   1924…  1924 Summer Paris
 9  4319 Ludwig … M        41     NA     NA Germ… GER   1932…  1932 Summer Los …
10  4319 Ludwig … M        41     NA     NA Germ… GER   1932…  1932 Summer Los …
# … with 1,375 more rows, and 3 more variables: Sport <chr>, Event <chr>,
#   Medal <chr>
Code
table(duplicate_athlete_data$Sport)

Art Competitions          Cycling    Equestrianism          Sailing 
            1315               32                1               37 

The “athlete_events.csv” file has 1385 duplicates as shown above. The table() shows that more than 90% of the duplicate observations are for the Sport ‘Art Competitions’. These duplicates could have been introduced during the data collection while performing scraping. The duplicates can be removed from the athlete_data during the data cleaning process and before joining the datasets.

Tidy Data and Mutate Variables (as needed)

The noc_data has some missing value in 212 observations. Hence, I start by cleaning the noc_data.

Code
#Check for missing/null data in the noc_data

sum(is.na(noc_data))
[1] 212
Code
sum(is.null(noc_data))
[1] 0
Code
# Checking which columns have NA values in noc_data

col <- colnames(noc_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(noc_data[,c]))))
}
[1] "NA values in NOC: 0"
[1] "NA values in region: 3"
[1] "NA values in notes: 209"

The ‘region’ variable in noc_data has missing values for 3 observations. The corresponding NOC code for these 3 observations are ROT, TUV, and UNK. I have displayed the 3 observations below.

Code
# Displaying the observations with missing value in 'region' variable

noc_data%>%filter(is.na(region))
# A tibble: 3 × 3
  NOC   region notes               
  <chr> <chr>  <chr>               
1 ROT   <NA>   Refugee Olympic Team
2 TUV   <NA>   Tuvalu              
3 UNK   <NA>   Unknown             
Code
# Displaying the observations with same value for both 'region' and 'notes' variables

noc_data%>%filter(region==notes)
# A tibble: 1 × 3
  NOC   region                      notes                      
  <chr> <chr>                       <chr>                      
1 IOA   Individual Olympic Athletes Individual Olympic Athletes

Although the ‘region’ value is missing for these observations, we have the ‘notes’ for them. From the notes it is evident that ROT stands for Refugee Olympic Team, TUV stands for Tuvalu and UNK stands for Unknown. I further analyzed whether there are any observations in noc_data with the same value for both ‘region’ and ‘notes’ variables and found 1 observation. For the NOC code ‘IOA’, the region and notes is given the value ‘Individual Olympic Athletes’. Hence, for the NOC codes ‘ROT’, ‘TUV’ and ‘UNK’ I decided to impute the missing ‘region’ values with the corresponding ‘notes’ values.

Code
# Imputing the missing 'region' values with the corresponding 'notes' values in noc_data

noc_data <- noc_data%>%
  mutate(region = coalesce(region,notes))

# Sanity Check: Checking that the 3 observations no longer have missing 'region' values

noc_data%>%filter(is.na(region))
# A tibble: 0 × 3
# … with 3 variables: NOC <chr>, region <chr>, notes <chr>

The ‘notes’ variable in noc_data has missing values for 209 observations. Since, this is more than 90% I decided to drop the ‘notes’ variable. After dropping the ‘notes’ variable from the noc_data, it is left with 230 observations and 2 variables.

Code
# Dropping the 'notes' variable from noc_data
noc_data <- noc_data%>%
  select(-c(3))

# Displaying the noc_data after tidying

noc_data
# A tibble: 230 × 2
   NOC   region     
   <chr> <chr>      
 1 AFG   Afghanistan
 2 AHO   Curacao    
 3 ALB   Albania    
 4 ALG   Algeria    
 5 AND   Andorra    
 6 ANG   Angola     
 7 ANT   Antigua    
 8 ANZ   Australia  
 9 ARG   Argentina  
10 ARM   Armenia    
# … with 220 more rows

Next, I cleaned the athlete_data. As the first step of cleaning the athlete_data, I dropped the 1385 duplicate observations which I had identified earlier while exploring the data. After dropping the duplicate observations, the athlete_data has 269,731 observations and 15 variables.

Code
# Dropping the 1385 duplicate observations from athlete_data

athlete_data <- athlete_data%>%
  distinct()

The athlete_data has 359615 instances of missing values.

Code
#Check for missing/null data in the athlete_data

sum(is.na(athlete_data))
[1] 359615
Code
sum(is.null(athlete_data))
[1] 0

The variables ‘Age’, ‘Height’, ‘Weight’ and ‘Medal’ have missing values in the athlete_data.

Code
# Checking which columns have NA values in athlete_data

col <- colnames(athlete_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(athlete_data[,c]))))
}
[1] "NA values in ID: 0"
[1] "NA values in Name: 0"
[1] "NA values in Sex: 0"
[1] "NA values in Age: 9315"
[1] "NA values in Height: 58814"
[1] "NA values in Weight: 61527"
[1] "NA values in Team: 0"
[1] "NA values in NOC: 0"
[1] "NA values in Games: 0"
[1] "NA values in Year: 0"
[1] "NA values in Season: 0"
[1] "NA values in City: 0"
[1] "NA values in Sport: 0"
[1] "NA values in Event: 0"
[1] "NA values in Medal: 229959"

The ‘Medal’ variable has 13295 observations with value Bronze, 13108 observations with value Silver, and 13369 observations with value Gold. The remaining values are missing for ‘Medal’ variable. The missing values indicate that the athlete did not win a medal for that sport event during that year and season.

Code
table(athlete_data$Medal)

Bronze   Gold Silver 
 13295  13369  13108 

I handled the missing data in ‘Medal’ variable by imputing the missing values with ‘No Medal’ as the athlete did not win a medal.

Code
# Handling missing data in 'Medal' variable

athlete_data <- athlete_data%>%
  mutate(Medal = replace(Medal, is.na(Medal), "No Medal"))

#Sanity Check: Checking that the 'Medal' variable has no missing values after data imputation

sum(is.na(athlete_data$Medal))
[1] 0
Code
table(athlete_data$Medal)

  Bronze     Gold No Medal   Silver 
   13295    13369   229959    13108 

The variables ‘Age’, ‘Height’, and ‘Weight’ have 9315, 58814, and 61527 missing values respectively. This is equivalent to 0.03%, 0.22% and 0.23% of missing values. This is a significantly large number and I performed data imputation for these variables. I imputed the missing values with the average Age, Height and Weight of the athletes grouped by Sex, Season, Year, and Event. I grouped based on those variables as the athletes participating in the various events are usually in the same age, height and weight range. For example, the male athletes participating in the heavy weight wrestling belong to weight categories like 55kg/60kg/etc.

Code
# Finding the percentage of missing values for the variables 'Age', 'Height', and 'Weight'

athlete_data %>% summarize_all(funs(sum(is.na(.)) / length(.)))
# A tibble: 1 × 15
     ID  Name   Sex    Age Height Weight  Team   NOC Games  Year Season  City
  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
1     0     0     0 0.0345  0.218  0.228     0     0     0     0      0     0
# … with 3 more variables: Sport <dbl>, Event <dbl>, Medal <dbl>
Code
# Handling the missing data in 'Age', 'Height', and 'Weight' variables using data imputation

# Storing the average age, height, and weight for each group
average_athlete_data <- athlete_data%>%
  group_by(Sex, Season, Year, Event)%>%
  summarise(average_Age = mean(Age, na.rm = TRUE),
            average_Height = mean(Height, na.rm = TRUE),
            average_Weight = mean(Weight, na.rm = TRUE))

# Joining the athlete_data and average_athlete_data using Sex, Season, Year and Event as the key
cleaned_athlete_data = merge(x=athlete_data, y=average_athlete_data, by=c("Sex", "Season", "Year", "Event"), all.x=TRUE)
cleaned_athlete_data <- tibble(cleaned_athlete_data)

# Replacing the missing values in 'Age', 'Height', and 'Weight' variables with the corresponding values in 'Average_Age', 'Average_Height', and 'Average_Weight' variables
cleaned_athlete_data <- cleaned_athlete_data%>%
    mutate(Age = coalesce(Age, average_Age),
           Height = coalesce(Height, average_Height),
           Weight = coalesce(Weight, average_Weight))

# Dropping the variables 'Average_Age', 'Average_Height', and 'Average_Weight' from cleaned_athlete_data as they are no longer needed

cleaned_athlete_data <- cleaned_athlete_data%>%
  select(-c(16,17,18))

# Rounded off the Age', 'Height', and 'Weight' variables to the nearest integer

cleaned_athlete_data <- cleaned_athlete_data%>%
  mutate(Age = round(Age, digits = 0),
         Height = round(Height, digits = 0),
         Weight = round(Weight, digits = 1))

# Finding the percentage of missing values for the variables 'Age', 'Height', and 'Weight' to check whether the percentage of missing values has decreased

cleaned_athlete_data %>% summarize_all(funs(sum(is.na(.)) / length(.)))
# A tibble: 1 × 15
    Sex Season  Year Event    ID  Name      Age Height Weight  Team   NOC Games
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1     0      0     0     0     0     0 0.000552 0.0207 0.0452     0     0     0
# … with 3 more variables: City <dbl>, Sport <dbl>, Medal <dbl>
Code
# Displaying the count of missing values in cleaned_athlete_data for each variable

col <- colnames(cleaned_athlete_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(cleaned_athlete_data[,c]))))
}
[1] "NA values in Sex: 0"
[1] "NA values in Season: 0"
[1] "NA values in Year: 0"
[1] "NA values in Event: 0"
[1] "NA values in ID: 0"
[1] "NA values in Name: 0"
[1] "NA values in Age: 149"
[1] "NA values in Height: 5586"
[1] "NA values in Weight: 12185"
[1] "NA values in Team: 0"
[1] "NA values in NOC: 0"
[1] "NA values in Games: 0"
[1] "NA values in City: 0"
[1] "NA values in Sport: 0"
[1] "NA values in Medal: 0"

The percentage of missing values for the variables ‘Age’, ‘Height’, and ‘Weight’ has reduced from 0.03%, 0.22% and 0.23% to 0.00056%, 0.02% and 0.046 % respectively which is a significant improvement. The remaining missing values could not be imputed as all the observations in the groups (grouped by Sex, Season, Year and Event) had missing values for ‘Age’/‘Height’/‘Weight’ which makes it impossible to get the mean values. One possible solution is to remove all the observations with missing values in any of the variables. This would result in 12,792 observations being dropped which is about 5% of the total data. For now, I am keeping the observations with missing values. However, I can remove the 12,792 observations and store it in another tibble for performing visualization in the future.

The ‘Games’ variable is redundant as it contains information about the year and season of the Olympic games which is already present in the ‘Year’ and ‘Season’ variables. Hence, I dropped the ‘Games’ variable.

Code
# Dropping the 'Games' variable from cleaned_athlete_data

cleaned_athlete_data <- cleaned_athlete_data%>%
  select(-c(12))

The cleaned_athlete_data is left with 269731 observations and 14 variables after cleaning.

Join Data

As the next step after tidying the datasets, I joined the cleaned_athlete_data and noc_data using ‘NOC’ as the key, into a single dataset. The joined dataset has 269731 observations and 15 variables which makes sense as the cleaned_athlete_data had 269731 observations and cleaned_athlete_data and noc_data datasets had 14 and 2 attributes respectively. Since, the “NOC” attribute is common in both datasets we count it only once.

Code
# performed left join for cleaned_athlete_data and noc_data datasets.
olympic_data = merge(x=cleaned_athlete_data, y=noc_data, by="NOC", all.x=TRUE)
olympic_data <- tibble(olympic_data)
olympic_data
# A tibble: 269,731 × 15
   NOC   Sex   Season  Year Event       ID Name    Age Height Weight Team  City 
   <chr> <chr> <chr>  <dbl> <chr>    <dbl> <chr> <dbl>  <dbl>  <dbl> <chr> <chr>
 1 AFG   M     Summer  1988 Wrestl…  84618 Ahma…    24    161   53.6 Afgh… Seoul
 2 AFG   F     Summer  2012 Athlet…  61961 Tahm…    23    160   52   Afgh… Lond…
 3 AFG   M     Summer  1964 Wrestl…  86829 Nour…    20    162   58.8 Afgh… Tokyo
 4 AFG   M     Summer  1936 Hockey…   5841 Saye…    22    174   82   Afgh… Berl…
 5 AFG   M     Summer  1960 Wrestl… 116125 Niza…    34    168  111   Afgh… Roma 
 6 AFG   M     Summer  1948 Hockey…   5844 Moha…    28    176   74.1 Afgh… Lond…
 7 AFG   M     Summer  1956 Hockey…  87372 Din …    27    177   72.6 Afgh… Melb…
 8 AFG   M     Summer  1960 Wrestl…  58364 Moha…    20    166   62   Afgh… Roma 
 9 AFG   M     Summer  2012 Judo M…  33817 Ajma…    25    173   66   Afgh… Lond…
10 AFG   M     Summer  1972 Wrestl…  28855 Ahma…    24    165   52   Afgh… Muni…
# … with 269,721 more rows, and 3 more variables: Sport <chr>, Medal <chr>,
#   region <chr>

I rearranged the order of variables in olympic_data to make the data more understandable and easier for analyzing. I also sorted the olympic_data in ascending order based on ‘Season’ and ‘Year’.

Code
# Rearranging the columns in olympic_data

olympic_data <- olympic_data%>%
  select(c("Season", "Year", "ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC", "region", "City", "Sport", "Event", "Medal"))

# Sorting the olympic_data in ascending order based on 'Season' and 'Year'

olympic_data <- olympic_data%>%
  arrange(Season, Year)

The olympic_data is cleaned and can be used for answering various research questions related to the Olympic games.

Potential Research Questions

I have arrived at few potential research questions that can be answered or analyzed using the cleaned olympic_data.

  1. What is the ratio of male to female athletes participating in the Olympic games and has gender equality of athletes participating increased over the past 120 years?
  2. Has the performance of female athletes improved over the years for teams?
  3. Does the host city have any advantage in Olympic games in terms of winning more medals?
  4. Is it possible to identify which sport event wins more medals for each team?
  5. Has there been a significant change in the age/height/weight of athletes participating in the various events over the years?
Source Code
---
title: "Homework 2"
author: "Vinitha Maheswaran"
date: "12/10/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - Olympics
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```


## Read in data


For this homework I will be working with the 120 years of Olympic history: athletes and results dataset. The Olympics data has two csv files - "athlete_events.csv" and "noc_regions.csv". This historical dataset contains information on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. This data was scraped from www.sports-reference.com in May 2018 and is available on Kaggle. 

```{r}
# Reading the "athlete_events.csv" and "noc_regions.csv" files

athlete_data <- read_csv("_data/athlete_events.csv")
noc_data <- read_csv("_data/noc_regions.csv")
```

```{r}
# Displaying athlete_data dataset

athlete_data
```

```{r}
# Displaying noc_data dataset

noc_data
```

```{r}
# Finding dimension of both datasets

dim(athlete_data)
dim(noc_data)
```

```{r}
# Structure of athlete_data dataset

str(athlete_data)
```

```{r}
# Structure of noc_data dataset

str(noc_data)
```

```{r}
#Summary of athlete_data

library(summarytools)
print(summarytools::dfSummary(athlete_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```

```{r}
#Summary of noc_data

library(summarytools)
print(summarytools::dfSummary(noc_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```


### Briefly describe the data

The dataset contains information on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016 (the past 120 years). The Winter and Summer Games were held in the same year until 1992. After that, they were staggered such that the Winter Games occur once every 4 years starting with 1994, and the Summer Games occur once every 4 years starting with 1996. The "athlete_events.csv" file has 271,116 observations and 15 variables/attributes. Each row in this csv file corresponds to an individual athlete competing in an individual Olympic event (athlete-events). It includes information about the athlete's name, gender, age, height (in cm), weight (in kg), team/country they represent, National Olympic Committee (3-letter code) they are representing, year and season participated, Olympic games host city for that year and season, sport, athlete event and medal won. Each athlete will have multiple observations in the data as they would have participated in multiple events and during different seasons. This csv file has 1385 duplicates which I will be investigating in the next steps. The "noc_regions.csv" file has 230 observations and 3 variables/attributes. This file contains information about the 'NOC' National Olympic Committee which is a 3-letter code, the corresponding region and notes. The file has 230 unique codes for the NOC variable. Few of the regions have same NOC code which in some cases is distinguished using the notes. The notes has missing value for 209 observations. The NOC variable is present in both the files and can be used as a key to join both the files into a single dataset.

### Looking into duplicate data

```{r}
# Displaying the duplicate observations in "athlete_events.csv" file

duplicate_athlete_data <- athlete_data[duplicated(athlete_data),]
duplicate_athlete_data
```

```{r}
table(duplicate_athlete_data$Sport)
```

The "athlete_events.csv" file has 1385 duplicates as shown above. The table() shows that more than 90% of the duplicate observations are for the Sport 'Art Competitions'. These duplicates could have been introduced during the data collection while performing scraping. The duplicates can be removed from the athlete_data during the data cleaning process and before joining the datasets.


## Tidy Data and Mutate Variables (as needed)

The noc_data has some missing value in 212 observations. Hence, I start by cleaning the noc_data.

```{r}
#Check for missing/null data in the noc_data

sum(is.na(noc_data))
sum(is.null(noc_data))
```

```{r}
# Checking which columns have NA values in noc_data

col <- colnames(noc_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(noc_data[,c]))))
}
```

The 'region' variable in noc_data has missing values for 3 observations. The corresponding NOC code for these 3 observations are ROT, TUV, and UNK. I have displayed the 3 observations below.

```{r}
# Displaying the observations with missing value in 'region' variable

noc_data%>%filter(is.na(region))
```

```{r}
# Displaying the observations with same value for both 'region' and 'notes' variables

noc_data%>%filter(region==notes)
``` 

Although the 'region' value is missing for these observations, we have the 'notes' for them. From the notes it is evident that ROT stands for Refugee Olympic Team, TUV stands for Tuvalu and UNK stands for Unknown. I further analyzed whether there are any observations in noc_data with the same value for both 'region' and 'notes' variables and found 1 observation. For the NOC code 'IOA', the region and notes is given the value 'Individual Olympic Athletes'. Hence, for the NOC codes 'ROT', 'TUV' and 'UNK' I decided to impute the missing 'region' values with the corresponding 'notes' values.

```{r}
# Imputing the missing 'region' values with the corresponding 'notes' values in noc_data

noc_data <- noc_data%>%
  mutate(region = coalesce(region,notes))

# Sanity Check: Checking that the 3 observations no longer have missing 'region' values

noc_data%>%filter(is.na(region))
```


The 'notes' variable in noc_data has missing values for 209 observations. Since, this is more than 90% I decided to drop the 'notes' variable. After dropping the 'notes' variable from the noc_data, it is left with 230 observations and 2 variables.

```{r}
# Dropping the 'notes' variable from noc_data
noc_data <- noc_data%>%
  select(-c(3))

# Displaying the noc_data after tidying

noc_data
```

Next, I cleaned the athlete_data. As the first step of cleaning the athlete_data, I dropped the 1385 duplicate observations which I had identified earlier while exploring the data. After dropping the duplicate observations, the athlete_data has 269,731 observations and 15 variables.

```{r}
# Dropping the 1385 duplicate observations from athlete_data

athlete_data <- athlete_data%>%
  distinct()
```

The athlete_data has 359615 instances of missing values.

```{r}
#Check for missing/null data in the athlete_data

sum(is.na(athlete_data))
sum(is.null(athlete_data))
```

The variables 'Age', 'Height', 'Weight' and 'Medal' have missing values in the athlete_data.

```{r}
# Checking which columns have NA values in athlete_data

col <- colnames(athlete_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(athlete_data[,c]))))
}
```

The 'Medal' variable has 13295 observations with value Bronze, 13108 observations with value Silver, and 13369 observations with value Gold. The remaining values are missing for 'Medal' variable. The missing values indicate that the athlete did not win a medal for that sport event during that year and season.

```{r}
table(athlete_data$Medal)
```

I handled the missing data in 'Medal' variable by imputing the missing values with 'No Medal' as the athlete did not win a medal.

```{r}
# Handling missing data in 'Medal' variable

athlete_data <- athlete_data%>%
  mutate(Medal = replace(Medal, is.na(Medal), "No Medal"))

#Sanity Check: Checking that the 'Medal' variable has no missing values after data imputation

sum(is.na(athlete_data$Medal))
table(athlete_data$Medal)
```

The variables 'Age', 'Height', and 'Weight' have 9315, 58814, and 61527 missing values respectively. This is equivalent to 0.03%, 0.22% and 0.23% of missing values. This is a significantly large number and I performed data imputation for these variables. I imputed the missing values with the average Age, Height and Weight of the athletes grouped by Sex, Season, Year, and Event. I grouped based on those variables as the athletes participating in the various events are usually in the same age, height and weight range. For example, the male athletes participating in the heavy weight wrestling belong to weight categories like 55kg/60kg/etc.

```{r}
# Finding the percentage of missing values for the variables 'Age', 'Height', and 'Weight'

athlete_data %>% summarize_all(funs(sum(is.na(.)) / length(.)))
```


```{r}
# Handling the missing data in 'Age', 'Height', and 'Weight' variables using data imputation

# Storing the average age, height, and weight for each group
average_athlete_data <- athlete_data%>%
  group_by(Sex, Season, Year, Event)%>%
  summarise(average_Age = mean(Age, na.rm = TRUE),
            average_Height = mean(Height, na.rm = TRUE),
            average_Weight = mean(Weight, na.rm = TRUE))

# Joining the athlete_data and average_athlete_data using Sex, Season, Year and Event as the key
cleaned_athlete_data = merge(x=athlete_data, y=average_athlete_data, by=c("Sex", "Season", "Year", "Event"), all.x=TRUE)
cleaned_athlete_data <- tibble(cleaned_athlete_data)

# Replacing the missing values in 'Age', 'Height', and 'Weight' variables with the corresponding values in 'Average_Age', 'Average_Height', and 'Average_Weight' variables
cleaned_athlete_data <- cleaned_athlete_data%>%
    mutate(Age = coalesce(Age, average_Age),
           Height = coalesce(Height, average_Height),
           Weight = coalesce(Weight, average_Weight))

# Dropping the variables 'Average_Age', 'Average_Height', and 'Average_Weight' from cleaned_athlete_data as they are no longer needed

cleaned_athlete_data <- cleaned_athlete_data%>%
  select(-c(16,17,18))

# Rounded off the Age', 'Height', and 'Weight' variables to the nearest integer

cleaned_athlete_data <- cleaned_athlete_data%>%
  mutate(Age = round(Age, digits = 0),
         Height = round(Height, digits = 0),
         Weight = round(Weight, digits = 1))

# Finding the percentage of missing values for the variables 'Age', 'Height', and 'Weight' to check whether the percentage of missing values has decreased

cleaned_athlete_data %>% summarize_all(funs(sum(is.na(.)) / length(.)))
```

```{r}
# Displaying the count of missing values in cleaned_athlete_data for each variable

col <- colnames(cleaned_athlete_data)
for (c in col){
  print(paste0("NA values in ", c, ": ", sum(is.na(cleaned_athlete_data[,c]))))
}
```

The percentage of missing values for the variables 'Age', 'Height', and 'Weight' has reduced from 0.03%, 0.22% and 0.23% to 0.00056%, 0.02% and 0.046 % respectively which is a significant improvement. The remaining missing values could not be imputed as all the observations in the groups (grouped by Sex, Season, Year and Event) had missing values for 'Age'/'Height'/'Weight' which makes it impossible to get the mean values. One possible solution is to remove all the observations with missing values in any of the variables. This would result in 12,792 observations being dropped which is about 5% of the total data. For now, I am keeping the observations with missing values. However, I can remove the 12,792 observations and store it in another tibble for performing visualization in the future.


The 'Games' variable is redundant as it contains information about the year and season of the Olympic games which is already present in the 'Year' and 'Season' variables. Hence, I dropped the 'Games' variable.

```{r}
# Dropping the 'Games' variable from cleaned_athlete_data

cleaned_athlete_data <- cleaned_athlete_data%>%
  select(-c(12))
```

The cleaned_athlete_data is left with 269731 observations and 14 variables after cleaning.


## Join Data

As the next step after tidying the datasets, I joined the cleaned_athlete_data and noc_data using 'NOC' as the key, into a single dataset. The joined dataset has 269731 observations and 15 variables which makes sense as the cleaned_athlete_data had 269731 observations and cleaned_athlete_data and noc_data datasets had 14 and 2 attributes respectively. Since, the "NOC" attribute is common in both datasets we count it only once.

```{r}
# performed left join for cleaned_athlete_data and noc_data datasets.
olympic_data = merge(x=cleaned_athlete_data, y=noc_data, by="NOC", all.x=TRUE)
olympic_data <- tibble(olympic_data)
olympic_data
```

I rearranged the order of variables in olympic_data to make the data more understandable and easier for analyzing. I also sorted the olympic_data in ascending order based on 'Season' and 'Year'.

```{r}
# Rearranging the columns in olympic_data

olympic_data <- olympic_data%>%
  select(c("Season", "Year", "ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC", "region", "City", "Sport", "Event", "Medal"))

# Sorting the olympic_data in ascending order based on 'Season' and 'Year'

olympic_data <- olympic_data%>%
  arrange(Season, Year)
```

The olympic_data is cleaned and can be used for answering various research questions related to the Olympic games.


## Potential Research Questions

I have arrived at few potential research questions that can be answered or analyzed using the cleaned olympic_data.

1) What is the ratio of male to female athletes participating in the Olympic games and has gender equality of athletes participating increased over the past 120 years?
2) Has the performance of female athletes improved over the years for teams?
3) Does the host city have any advantage in Olympic games in terms of winning more medals?
4) Is it possible to identify which sport event wins more medals for each team?
5) Has there been a significant change in the age/height/weight of athletes participating in the various events over the years?