final
Author

Tyler Tewksbury

Published

September 3, 2022

Code
library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Project Overview

This project applies some of the skills learned in DACSS 601 (Data Science Fundamentals) to a dataset of my choosing. The goal is to pose a research question that can be investigated with tools from the course, together with prior knowledge of statistics. The dataset I chose covers Formula 1 drivers, races, circuits, results, and more from every championship since 1950. The first task is to think of a question the dataset can answer, given the kind of data it contains.

Research Question

I am not especially knowledgeable about Formula 1, having only gotten into the sport in the past year, and the first few months of learning about all the drivers, circuits, and so on were incredibly overwhelming. I am deeply interested in visualizations that present a lot of data in a simple and digestible way. So my primary goal for this project is to show that I can create beginner visualizations that can be valuable to people who may not know anything about Formula 1.

For a formal research question, I want to learn more about the background of drivers, as I know becoming a Formula 1 driver is an incredibly strenuous and expensive journey. Because the dataset includes nationality data, a simple question to ask is whether there is a dominant nationality among drivers throughout Formula 1’s history. From there, conclusions can be drawn as to whether the sport is diverse, and whether people from most backgrounds have an equal opportunity to become a driver.

Read in data

The dataset consists of multiple tables. Based on the naming conventions of the CSV files, it is easy to read in the ones needed. The tables being loaded are the results of each Grand Prix, the information on each driver, the information on each race, data on every single pit stop, and lastly a status code dictionary that indicates what the result of a race was for a driver (e.g., finished, disqualified).

Code
f1_results<-read_csv("_data/f1_data/results.csv",
                        show_col_types = FALSE)
f1_drivers<-read_csv("_data/f1_data/drivers.csv",
                        show_col_types = FALSE)
status_codes<-read_csv("_data/f1_data/status.csv",
                        show_col_types = FALSE) 
races<-read_csv("_data/f1_data/races.csv",
                        show_col_types = FALSE)
pit_stops<-read_csv("_data/f1_data/pit_stops.csv",
                        show_col_types = FALSE)
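
The raw files appear to encode missing values as the literal string `\N` (it is the most frequent value in several columns of the summaries that follow), which is likely why fields such as `milliseconds` are read in as character. A minimal sketch, assuming `\N` always means "missing": pass it to the `na` argument of `read_csv()`. The two-row file below is made up purely for illustration.

```r
library(readr)

# Made-up stand-in for results.csv; "\N" marks a missing value in the file
tmp <- tempfile(fileext = ".csv")
writeLines(c("resultId,position,milliseconds",
             "1,1,5690616",
             "2,\\N,\\N"), tmp)

# Treat "\N" (plus the usual defaults) as NA so numeric columns parse as numbers
toy_results <- read_csv(tmp, na = c("", "NA", "\\N"), show_col_types = FALSE)

str(toy_results$milliseconds)  # numeric, with NA where the file had \N
```

With the columns parsed as numbers, later steps no longer need `as.numeric()` conversions on `position` or `milliseconds`.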

Table Summaries

Code
print(dfSummary(f1_results, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", valid.col = FALSE),
method = 'render', table.classes = 'table-condensed')

Data Frame Summary

f1_results

Dimensions: 25660 x 18
Duplicates: 0
| Variable | Stats / Values, with Freqs (% of Valid) | Missing |
|---|---|---|
| resultId [numeric] | Mean (sd): 12831.3 (7408.7); min ≤ med ≤ max: 1 ≤ 12830.5 ≤ 25665; IQR (CV): 12829.5 (0.6); 25660 distinct values | 0 (0.0%) |
| raceId [numeric] | Mean (sd): 527.5 (296.8); min ≤ med ≤ max: 1 ≤ 511 ≤ 1086; IQR (CV): 486 (0.6); 1070 distinct values | 0 (0.0%) |
| driverId [numeric] | Mean (sd): 258.6 (265.7); min ≤ med ≤ max: 1 ≤ 160 ≤ 855; IQR (CV): 302 (1); 854 distinct values | 0 (0.0%) |
| constructorId [numeric] | Mean (sd): 48.3 (59.4); min ≤ med ≤ max: 1 ≤ 25 ≤ 214; IQR (CV): 52 (1.2); 210 distinct values | 0 (0.0%) |
| number [numeric] | Mean (sd): 17.7 (15); min ≤ med ≤ max: 0 ≤ 15 ≤ 208; IQR (CV): 17 (0.8); 129 distinct values | 6 (0.0%) |
| grid [numeric] | Mean (sd): 11.2 (7.3); min ≤ med ≤ max: 0 ≤ 11 ≤ 34; IQR (CV): 12 (0.6); 35 distinct values | 0 (0.0%) |
| position [character] | \N = 10827 (42.2%); 3 = 1080 (4.2%); 4 = 1080 (4.2%); 2 = 1078 (4.2%); 5 = 1076 (4.2%); 1 = 1073 (4.2%); 6 = 1068 (4.2%); 7 = 1049 (4.1%); 8 = 1021 (4.0%); 9 = 983 (3.8%); [24 others] = 5325 (20.8%) | 0 (0.0%) |
| positionText [character] | R = 8781 (34.2%); F = 1368 (5.3%); 3 = 1080 (4.2%); 4 = 1080 (4.2%); 2 = 1078 (4.2%); 5 = 1076 (4.2%); 1 = 1073 (4.2%); 6 = 1068 (4.2%); 7 = 1049 (4.1%); 8 = 1021 (4.0%); [29 others] = 6986 (27.2%) | 0 (0.0%) |
| positionOrder [numeric] | Mean (sd): 12.9 (7.7); min ≤ med ≤ max: 1 ≤ 12 ≤ 39; IQR (CV): 12 (0.6); 39 distinct values | 0 (0.0%) |
| points [numeric] | Mean (sd): 1.9 (4.1); min ≤ med ≤ max: 0 ≤ 0 ≤ 50; IQR (CV): 2 (2.2); 39 distinct values | 0 (0.0%) |
| laps [numeric] | Mean (sd): 45.9 (29.9); min ≤ med ≤ max: 0 ≤ 52 ≤ 200; IQR (CV): 44 (0.7); 172 distinct values | 0 (0.0%) |
| time [character] | \N = 18696 (72.9%); +8:22.19 = 5 (0.0%); +0.7 = 4 (0.0%); +1:29.6 = 4 (0.0%); +46.2 = 4 (0.0%); +5.7 = 4 (0.0%); +1.1 = 3 (0.0%); +1.3 = 3 (0.0%); +1:15.9 = 3 (0.0%); +11.061 = 3 (0.0%); [6707 others] = 6931 (27.0%) | 0 (0.0%) |
| milliseconds [character] | \N = 18697 (72.9%); 14259460 = 5 (0.0%); 10928200 = 3 (0.0%); 10803700 = 2 (0.0%); 10839000 = 2 (0.0%); 11197800 = 2 (0.0%); 12131000 = 2 (0.0%); 12189200 = 2 (0.0%); 13642300 = 2 (0.0%); 13929950 = 2 (0.0%); [6917 others] = 6941 (27.0%) | 0 (0.0%) |
| fastestLap [character] | \N = 18454 (71.9%); 50 = 264 (1.0%); 52 = 255 (1.0%); 53 = 255 (1.0%); 51 = 234 (0.9%); 55 = 189 (0.7%); 44 = 188 (0.7%); 49 = 183 (0.7%); 48 = 181 (0.7%); 54 = 180 (0.7%); [70 others] = 5277 (20.6%) | 0 (0.0%) |
| rank [character] | \N = 18249 (71.1%); 1 = 356 (1.4%); 2 = 356 (1.4%); 3 = 356 (1.4%); 4 = 356 (1.4%); 5 = 356 (1.4%); 6 = 356 (1.4%); 10 = 355 (1.4%); 11 = 355 (1.4%); 12 = 355 (1.4%); [16 others] = 4210 (16.4%) | 0 (0.0%) |
| fastestLapTime [character] | \N = 18454 (71.9%); 1:17.495 = 4 (0.0%); 1:18.262 = 4 (0.0%); 1:43.026 = 4 (0.0%); 1:14.117 = 3 (0.0%); 1:16.802 = 3 (0.0%); 1:17.841 = 3 (0.0%); 1:18.023 = 3 (0.0%); 1:18.462 = 3 (0.0%); 1:18.811 = 3 (0.0%); [6617 others] = 7176 (28.0%) | 0 (0.0%) |
| fastestLapSpeed [character] | \N = 18454 (71.9%); 189.423 = 3 (0.0%); 194.706 = 3 (0.0%); 195.933 = 3 (0.0%); 196.785 = 3 (0.0%); 200.091 = 3 (0.0%); 200.363 = 3 (0.0%); 200.642 = 3 (0.0%); 201.330 = 3 (0.0%); 201.478 = 3 (0.0%); [6776 others] = 7179 (28.0%) | 0 (0.0%) |
| statusId [numeric] | Mean (sd): 17.6 (26.2); min ≤ med ≤ max: 1 ≤ 11 ≤ 141; IQR (CV): 13 (1.5); 137 distinct values | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-04

Code
print(dfSummary(f1_drivers, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", valid.col = FALSE), method = 'render', table.classes = 'table-condensed')

Data Frame Summary

f1_drivers

Dimensions: 854 x 9
Duplicates: 0
| Variable | Stats / Values, with Freqs (% of Valid) | Missing |
|---|---|---|
| driverId [numeric] | Mean (sd): 427.6 (246.8); min ≤ med ≤ max: 1 ≤ 427.5 ≤ 855; IQR (CV): 426.5 (0.6); 854 distinct values | 0 (0.0%) |
| driverRef [character] | abate = 1 (0.1%); abecassis = 1 (0.1%); acheson = 1 (0.1%); adamich = 1 (0.1%); adams = 1 (0.1%); ader = 1 (0.1%); adolff = 1 (0.1%); agabashian = 1 (0.1%); ahrens = 1 (0.1%); aitken = 1 (0.1%); [844 others] = 844 (98.8%) | 0 (0.0%) |
| number [character] | \N = 803 (94.0%); 10 = 2 (0.2%); 22 = 2 (0.2%); 28 = 2 (0.2%); 4 = 2 (0.2%); 6 = 2 (0.2%); 88 = 2 (0.2%); 9 = 2 (0.2%); 99 = 2 (0.2%); 11 = 1 (0.1%); [34 others] = 34 (4.0%) | 0 (0.0%) |
| code [character] | \N = 757 (88.6%); ALB = 2 (0.2%); BIA = 2 (0.2%); HAR = 2 (0.2%); MAG = 2 (0.2%); MSC = 2 (0.2%); VER = 2 (0.2%); AIT = 1 (0.1%); ALG = 1 (0.1%); ALO = 1 (0.1%); [82 others] = 82 (9.6%) | 0 (0.0%) |
| forename [character] | John = 14 (1.6%); Mike = 14 (1.6%); Peter = 13 (1.5%); Bill = 11 (1.3%); Tony = 11 (1.3%); Bob = 10 (1.2%); David = 10 (1.2%); Johnny = 9 (1.1%); Paul = 9 (1.1%); George = 8 (0.9%); [464 others] = 745 (87.2%) | 0 (0.0%) |
| surname [character] | Taylor = 5 (0.6%); Fittipaldi = 4 (0.5%); Wilson = 4 (0.5%); Brabham = 3 (0.4%); Brown = 3 (0.4%); Hill = 3 (0.4%); Russo = 3 (0.4%); Schumacher = 3 (0.4%); Stewart = 3 (0.4%); Winkelhock = 3 (0.4%); [785 others] = 820 (96.0%) | 0 (0.0%) |
| dob [Date] | min: 1896-12-28; med: 1936-12-28; max: 2000-05-11; range: 103y 4m 13d; 836 distinct values | 0 (0.0%) |
| nationality [character] | British = 165 (19.3%); American = 157 (18.4%); Italian = 99 (11.6%); French = 73 (8.5%); German = 50 (5.9%); Brazilian = 32 (3.7%); Argentine = 24 (2.8%); Belgian = 23 (2.7%); South African = 23 (2.7%); Swiss = 23 (2.7%); [32 others] = 185 (21.7%) | 0 (0.0%) |
| url [character] | 10 values, each displayed as http://en.wikipedia.org/w = 1 (0.1%) each; [844 others] = 844 (98.8%) | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-04

Code
print(dfSummary(races, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", valid.col = FALSE), method = 'render', table.classes = 'table-condensed')

Data Frame Summary

races

Dimensions: 1079 x 18
Duplicates: 0
| Variable | Stats / Values, with Freqs (% of Valid) | Missing |
|---|---|---|
| raceId [numeric] | Mean (sd): 542 (314.6); min ≤ med ≤ max: 1 ≤ 540 ≤ 1096; IQR (CV): 539 (0.6); 1079 distinct values | 0 (0.0%) |
| year [numeric] | Mean (sd): 1991.4 (20); min ≤ med ≤ max: 1950 ≤ 1993 ≤ 2022; IQR (CV): 33 (0); 73 distinct values | 0 (0.0%) |
| round [numeric] | Mean (sd): 8.4 (5); min ≤ med ≤ max: 1 ≤ 8 ≤ 22; IQR (CV): 8 (0.6); 22 distinct values | 0 (0.0%) |
| circuitId [numeric] | Mean (sd): 23.5 (19); min ≤ med ≤ max: 1 ≤ 18 ≤ 79; IQR (CV): 25 (0.8); 76 distinct values | 0 (0.0%) |
| name [character] | British Grand Prix = 73 (6.8%); Italian Grand Prix = 73 (6.8%); Monaco Grand Prix = 68 (6.3%); Belgian Grand Prix = 67 (6.2%); German Grand Prix = 64 (5.9%); French Grand Prix = 62 (5.7%); Spanish Grand Prix = 52 (4.8%); Canadian Grand Prix = 51 (4.7%); Brazilian Grand Prix = 48 (4.4%); United States Grand Prix = 43 (4.0%); [43 others] = 478 (44.3%) | 0 (0.0%) |
| date [Date] | min: 1950-05-13; med: 1993-07-04; max: 2022-11-20; range: 72y 6m 7d; 1079 distinct values | 0 (0.0%) |
| time [character] | \N = 731 (67.7%); 12:00:00 = 111 (10.3%); 13:00:00 = 33 (3.1%); 14:00:00 = 32 (3.0%); 13:10:00 = 30 (2.8%); 06:00:00 = 19 (1.8%); 07:00:00 = 11 (1.0%); 16:00:00 = 11 (1.0%); 19:00:00 = 11 (1.0%); 05:00:00 = 10 (0.9%); [24 others] = 80 (7.4%) | 0 (0.0%) |
| url [character] | 10 values, each displayed as http://en.wikipedia.org/w = 1 (0.1%) each; [1069 others] = 1069 (99.1%) | 0 (0.0%) |
| fp1_date [character] | \N = 1035 (95.9%); 2021-03-26 = 1 (0.1%); 2021-04-16 = 1 (0.1%); 2021-04-30 = 1 (0.1%); 2021-05-07 = 1 (0.1%); 2021-05-21 = 1 (0.1%); 2021-06-04 = 1 (0.1%); 2021-06-18 = 1 (0.1%); 2021-06-25 = 1 (0.1%); 2021-07-02 = 1 (0.1%); [35 others] = 35 (3.2%) | 0 (0.0%) |
| fp1_time [character] | \N = 1057 (98.0%); 12:00:00 = 9 (0.8%); 11:30:00 = 2 (0.2%); 18:00:00 = 2 (0.2%); 03:00:00 = 1 (0.1%); 04:00:00 = 1 (0.1%); 09:00:00 = 1 (0.1%); 10:00:00 = 1 (0.1%); 11:00:00 = 1 (0.1%); 14:00:00 = 1 (0.1%); [3 others] = 3 (0.3%) | 0 (0.0%) |
| fp2_date [character] | \N = 1035 (95.9%); 2021-03-26 = 1 (0.1%); 2021-04-16 = 1 (0.1%); 2021-04-30 = 1 (0.1%); 2021-05-07 = 1 (0.1%); 2021-05-21 = 1 (0.1%); 2021-06-04 = 1 (0.1%); 2021-06-18 = 1 (0.1%); 2021-06-25 = 1 (0.1%); 2021-07-02 = 1 (0.1%); [35 others] = 35 (3.2%) | 0 (0.0%) |
| fp2_time [character] | \N = 1057 (98.0%); 15:00:00 = 9 (0.8%); 10:30:00 = 2 (0.2%); 21:00:00 = 2 (0.2%); 06:00:00 = 1 (0.1%); 08:00:00 = 1 (0.1%); 12:00:00 = 1 (0.1%); 13:30:00 = 1 (0.1%); 14:00:00 = 1 (0.1%); 15:30:00 = 1 (0.1%); [3 others] = 3 (0.3%) | 0 (0.0%) |
| fp3_date [character] | \N = 1041 (96.5%); 2021-03-27 = 1 (0.1%); 2021-04-17 = 1 (0.1%); 2021-05-01 = 1 (0.1%); 2021-05-08 = 1 (0.1%); 2021-05-22 = 1 (0.1%); 2021-06-05 = 1 (0.1%); 2021-06-19 = 1 (0.1%); 2021-06-26 = 1 (0.1%); 2021-07-03 = 1 (0.1%); [29 others] = 29 (2.7%) | 0 (0.0%) |
| fp3_time [character] | \N = 1060 (98.2%); 03:00:00 = 1 (0.1%); 04:00:00 = 1 (0.1%); 10:00:00 = 2 (0.2%); 11:00:00 = 9 (0.8%); 12:00:00 = 1 (0.1%); 14:00:00 = 1 (0.1%); 17:00:00 = 3 (0.3%); 19:00:00 = 1 (0.1%) | 0 (0.0%) |
| quali_date [character] | \N = 1035 (95.9%); 2021-03-27 = 1 (0.1%); 2021-04-17 = 1 (0.1%); 2021-05-01 = 1 (0.1%); 2021-05-08 = 1 (0.1%); 2021-05-22 = 1 (0.1%); 2021-06-05 = 1 (0.1%); 2021-06-19 = 1 (0.1%); 2021-06-26 = 1 (0.1%); 2021-07-03 = 1 (0.1%); [35 others] = 35 (3.2%) | 0 (0.0%) |
| quali_time [character] | \N = 1057 (98.0%); 06:00:00 = 1 (0.1%); 07:00:00 = 1 (0.1%); 13:00:00 = 2 (0.2%); 14:00:00 = 9 (0.8%); 15:00:00 = 3 (0.3%); 17:00:00 = 1 (0.1%); 19:00:00 = 1 (0.1%); 20:00:00 = 3 (0.3%); 22:00:00 = 1 (0.1%) | 0 (0.0%) |
| sprint_date [character] | \N = 1073 (99.4%); 2021-07-17 = 1 (0.1%); 2021-09-11 = 1 (0.1%); 2021-11-13 = 1 (0.1%); 2022-04-23 = 1 (0.1%); 2022-07-09 = 1 (0.1%); 2022-11-12 = 1 (0.1%) | 0 (0.0%) |
| sprint_time [character] | \N = 1076 (99.7%); 14:30:00 = 2 (0.2%); 19:30:00 = 1 (0.1%) | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-04

Code
print(dfSummary(pit_stops, varnumbers = FALSE, plain.ascii = FALSE, style = "grid", valid.col = FALSE), method = 'render', table.classes = 'table-condensed')

Data Frame Summary

pit_stops

Dimensions: 9299 x 7
Duplicates: 0
| Variable | Stats / Values, with Freqs (% of Valid) | Missing |
|---|---|---|
| raceId [numeric] | Mean (sd): 951.2 (73.7); min ≤ med ≤ max: 841 ≤ 949 ≤ 1086; IQR (CV): 132 (0.1); 230 distinct values | 0 (0.0%) |
| driverId [numeric] | Mean (sd): 505.1 (392.9); min ≤ med ≤ max: 1 ≤ 815 ≤ 855; IQR (CV): 812 (0.8); 69 distinct values | 0 (0.0%) |
| stop [numeric] | Mean (sd): 1.8 (0.9); min ≤ med ≤ max: 1 ≤ 2 ≤ 6; IQR (CV): 1 (0.5); 1 = 4533 (48.7%); 2 = 3034 (32.6%); 3 = 1272 (13.7%); 4 = 346 (3.7%); 5 = 96 (1.0%); 6 = 18 (0.2%) | 0 (0.0%) |
| lap [numeric] | Mean (sd): 25.2 (14.5); min ≤ med ≤ max: 1 ≤ 25 ≤ 78; IQR (CV): 23 (0.6); 74 distinct values | 0 (0.0%) |
| time [hms, difftime] | min: 47071; med: 55410; max: 81134; units: secs; 7030 distinct values | 0 (0.0%) |
| duration [character] | 22.745 = 7 (0.1%); 22.105 = 6 (0.1%); 22.303 = 6 (0.1%); 22.399 = 6 (0.1%); 22.534 = 6 (0.1%); 22.684 = 6 (0.1%); 22.838 = 6 (0.1%); 23.477 = 6 (0.1%); 23.732 = 6 (0.1%); 24.083 = 6 (0.1%); [6563 others] = 9238 (99.3%) | 0 (0.0%) |
| milliseconds [numeric] | Mean (sd): 73422.6 (278200.5); min ≤ med ≤ max: 12897 ≤ 23546 ≤ 3069017; IQR (CV): 4331 (3.8); 6573 distinct values | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-04

Tidy Data

The dataset as a whole is already quite tidy in my eyes. However, there are two workability issues that can easily be amended. The creator of the dataset used IDs for both the drivers and the statuses to make cross-referencing and merging tables easier. For visualizing information for someone unfamiliar with the dataset, though, this is an issue, because a driver or status ID means nothing on a chart. The results table therefore needs some work before it can be used for visualization. A simple full join, followed by removing the unneeded columns via subset, should do the trick.

Code
f1_results <- f1_results %>% full_join(f1_drivers, by = "driverId")

f1_results = subset(f1_results, select = -c(number.x, number.y, url, nationality, dob))

f1_results <- merge(f1_results, status_codes, by = "statusId")

head(f1_results)
  statusId resultId raceId driverId constructorId grid position positionText
1        1        1     18        1             1    1        1            1
2        1        2     18        2             2    5        2            2
3        1        3     18        3             3    7        3            3
4        1        4     18        4             4   11        4            4
5        1        5     18        5             1    3        5            5
6        1     3286    174       35            16    6        5            5
  positionOrder points laps        time milliseconds fastestLap rank
1             1     10   58 1:34:50.616      5690616         39    2
2             2      8   58      +5.478      5696094         41    3
3             3      6   58      +8.163      5698779         41    5
4             4      5   58     +17.181      5707797         58    7
5             5      4   58     +18.014      5708630         43    1
6             5      2   56   +1:10.692      5824927        \\N  \\N
  fastestLapTime fastestLapSpeed  driverRef code forename    surname   status
1       1:27.452         218.300   hamilton  HAM    Lewis   Hamilton Finished
2       1:27.739         217.586   heidfeld  HEI     Nick   Heidfeld Finished
3       1:28.090         216.719    rosberg  ROS     Nico    Rosberg Finished
4       1:28.603         215.464     alonso  ALO Fernando     Alonso Finished
5       1:27.418         218.385 kovalainen  KOV   Heikki Kovalainen Finished
6            \\N             \\N villeneuve  VIL  Jacques Villeneuve Finished
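
The join-then-drop step can be illustrated on a toy example. This is a minimal sketch with made-up rows, not the project's data. It swaps in `left_join()` for `full_join()`, on the assumption that only drivers who actually appear in the results are wanted, and uses `anti_join()` to list any IDs that fail to match.

```r
library(dplyr)

# Made-up miniature tables mirroring the key columns
toy_results <- data.frame(resultId = 1:3, driverId = c(1, 2, 9), statusId = c(1, 1, 4))
toy_drivers <- data.frame(driverId = c(1, 2, 3), code = c("HAM", "HEI", "ROS"))
toy_status  <- data.frame(statusId = c(1, 4), status = c("Finished", "Collision"))

labelled <- toy_results %>%
  left_join(toy_drivers, by = "driverId") %>%  # keeps every result row
  left_join(toy_status, by = "statusId") %>%   # adds the readable status text
  select(resultId, code, status)               # drop the ID columns

# Result rows whose driverId has no entry in the drivers table
unmatched <- anti_join(toy_results, toy_drivers, by = "driverId")
```

A full join on the same toy tables would also add a row for driver 3, who has no result, with `NA` in every results column.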

Another table that could be modified is the pit stop data. It is quite long, because it contains every single pit stop per driver, per race. I wanted to make it easier to work with by dropping the timing data, as I was only interested in the number of stops per race. By pivoting wider, each race becomes its own column and each row holds a driver and a lap, so a cell shows which stop number the driver made on that lap of that race. I also attempted to drop every row below each driver’s maximum stop number per race, which would leave a simple view of the number of stops in a race. There is little practical need for this, since I can pass the maximum directly to any visualization or calculation. Still, it would make the table cleaner and line up with my original goal (simply detailing the number of stops), so I will continue to look for a solution. In the meantime, keeping the lap data allows me to create different visualizations, such as showing who stopped during a specific lap, as well as which stop number it was for that driver.

Code
pit_stops <- merge(pit_stops, f1_drivers[,c( 1:4)])

pit_stops = subset(pit_stops, select = -c(time, duration, milliseconds, driverRef, number))

pit_stops %>% group_by(raceId) %>% group_by(max(stop))
# A tibble: 9,299 × 6
# Groups:   max(stop) [1]
   driverId raceId  stop   lap code  `max(stop)`
      <dbl>  <dbl> <dbl> <dbl> <chr>       <dbl>
 1        1    982     4    29 HAM             6
 2        1   1083     3    39 HAM             6
 3        1   1024     1    26 HAM             6
 4        1    850     3    51 HAM             6
 5        1   1061     2    27 HAM             6
 6        1    846     2    43 HAM             6
 7        1    956     3    29 HAM             6
 8        1   1035     2    41 HAM             6
 9        1    887     2    36 HAM             6
10        1    846     3    49 HAM             6
# … with 9,289 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
pit_stops <- pit_stops %>% 
pivot_wider(names_from = "raceId",
              values_from = "stop")
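
For the original goal, the number of stops per driver per race, one possible approach is to group by both keys and take the highest stop number instead of pivoting. This is a sketch on made-up rows, and it assumes `stop` counts up from 1 within each race and driver, as the summary table suggests.

```r
library(dplyr)

# Made-up long-format pit stop rows: one row per stop
toy_stops <- data.frame(
  raceId   = c(841, 841, 841, 842, 842),
  driverId = c(1, 1, 2, 1, 2),
  stop     = c(1, 2, 1, 1, 1),
  lap      = c(14, 36, 20, 18, 22)
)

stops_per_race <- toy_stops %>%
  group_by(raceId, driverId) %>%                    # one group per driver per race
  summarise(n_stops = max(stop), .groups = "drop")  # highest stop number = total stops
```

The result stays in long format with one row per driver per race, which avoids the many `NA` cells a wide table produces.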

Build Workable Tables

One issue with the dataset, as it pertains to the research question, is that the “nationality” column holds demonyms (e.g., “British”) rather than country names. To analyze the drivers by country, as well as by continent, it is necessary to import one dictionary that converts nationality to country, and then another that adds the continent. One of my goals is to show the analysis not just at the country level but at the continent level as well, and adding these two dictionaries allows me to do so with an initially limited dataset.

Code
demonyms <- read.csv(url("https://raw.githubusercontent.com/knowitall/chunkedextractor/master/src/main/resources/edu/knowitall/chunkedextractor/demonyms.csv"))

continents <- read.csv(url("https://raw.githubusercontent.com/dbouquin/IS_608/master/NanosatDB_munging/Countries-Continents.csv"))

colnames(demonyms) <- c("nationality", "country")
colnames(continents) <- c("continent", "country")
continents$country[168] <- "United States"


f1_drivers <- merge(f1_drivers, demonyms, by = "nationality", all.x = TRUE)
f1_drivers <- merge(f1_drivers, continents, by = "country", all.x = TRUE)

As a result of merging by “nationality” and then by “country” (using a left outer join, `all.x = TRUE`, so that no drivers are lost), we can now see each driver’s country of origin as well as their continent. This will simplify the creation of visualizations.
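
Because `all.x = TRUE` keeps drivers even when the lookup has no match, it can be worth checking which nationalities ended up without a country. A minimal sketch with made-up lookup rows follows; the table contents are illustrative assumptions, not taken from the actual dictionary files.

```r
# Made-up drivers and a deliberately incomplete demonym lookup
toy_drivers  <- data.frame(surname = c("A", "B", "C"),
                           nationality = c("British", "Spanish", "Argentine"))
toy_demonyms <- data.frame(nationality = c("British", "Spanish"),
                           country = c("United Kingdom", "Spain"))

# Left outer join: every driver is kept, unmatched ones get NA for country
merged <- merge(toy_drivers, toy_demonyms, by = "nationality", all.x = TRUE)

# Nationalities that found no country and would show up as NA in a country-level chart
missing_country <- unique(as.character(merged$nationality[is.na(merged$country)]))
```

Any value listed in `missing_country` points to a spelling mismatch between the two tables, which can then be fixed in the lookup by name rather than by row index.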

Visualizations

The first visualization to create for the research question is a simple chart that shows the distribution of the racers’ countries of origin. Because country is a categorical variable, an easy way to present this is a bar chart.

Code
ggplot(f1_drivers, aes(y = country, color = continent, fill = continent)) + geom_bar(width =.5 ) + labs(title = "Formula 1 Driver Countries of Origin")

Code
pie <-ggplot(f1_drivers, aes(x = "", fill = factor(continent))) + 
  geom_bar(width = .1) +
  theme(axis.line = element_blank(), 
        plot.title = element_text(hjust=0.5)) + 
  labs(fill="continent", 
       x=NULL, 
       y=NULL, 
       title="Pie Chart of F1 Continents") 
pie + coord_polar(theta = "y", start=0)

With just two simple charts, it is immediately apparent how dominant Europe is in the sport. The US also has an incredibly large presence, especially when compared to other countries of similar size.
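
The visual impression can be backed with a quick share calculation. The sketch below uses a made-up continent vector in place of `f1_drivers$continent`, so the numbers it produces are illustrative only.

```r
# Made-up continent labels, including one driver with no match (NA)
toy_continent <- c("Europe", "Europe", "Europe", "North America",
                   "South America", "Europe", NA, "Asia")

# Counts per continent; useNA = "ifany" keeps unmatched drivers visible
counts <- table(toy_continent, useNA = "ifany")

# Share of all drivers in percent, largest first
shares <- sort(round(100 * prop.table(counts), 1), decreasing = TRUE)
```

Keeping the `NA` group in the table shows how many drivers fell through the nationality-to-continent lookup before any percentage is quoted.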

Next, let’s look into the pit stop and results tables that were tidied earlier.

Code
ggplot(pit_stops, aes(lap))+ geom_bar() + labs(title = "Count of Pit Stops by Lap")

Code
race_1086 <- f1_results %>% filter(raceId == 1086)

race_1086$grid_diff <- as.numeric(race_1086$grid) - as.numeric(race_1086$position)

ggplot(race_1086, aes(x=code, y=grid_diff)) +
  geom_segment( aes(x=code, xend=code, y=0, yend=grid_diff, color = code), size=2, alpha=0.9) +
  theme_light() +
  theme(
    legend.position = "none",
    panel.border = element_blank(),
  ) +
  xlab("") +
  ylab("Grid Difference") + labs(title = "Driver Grid vs Finishing Place Difference at 2022 Hungarian Grand Prix")
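
The grid-versus-finish difference relies on `position` being numeric, but the results summary shows that column holding `\N` for many rows. A minimal sketch on made-up rows, assuming `\N` marks a driver without a classified finish and that such drivers should be left out of the chart (the driver codes are hypothetical):

```r
# Made-up rows in the shape of race_1086; "\\N" stands for a missing finishing position
toy_race <- data.frame(code = c("AAA", "BBB", "CCC"),
                       grid = c(10, 7, 8),
                       position = c("1", "2", "\\N"))

# Convert explicitly so the missing value is deliberate rather than a coercion warning
toy_race$position_num <- as.numeric(ifelse(toy_race$position == "\\N", NA, toy_race$position))

# Positive values mean places gained between the start and the finish
toy_race$grid_diff <- toy_race$grid - toy_race$position_num

# Keep only drivers with a finishing position for plotting
classified <- toy_race[!is.na(toy_race$grid_diff), ]
```

Filtering before plotting keeps `ggplot()` from silently dropping rows, and makes it clear which drivers are absent from the chart.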

Reflection and Conclusion

The results of the nationality analysis are to be expected. For those unaware, it is incredibly expensive to become a Formula 1 driver, meaning likely only those with privileged backgrounds have the opportunity to compete in the sport. Thus, seeing mostly first world countries, with some of the wealthiest ones having the most presence, is to be expected. However, seeing the visualizations can shed some light on how much (or little, depending on your perspective) diversity there is within the sport. With this dataset, much more interesting nationality analysis can be done. Reflecting now, my next goal would likely be categorizing racers by the seasons they were active, and then visualizing the trends of which countries have more or fewer active drivers as the seasons go on, likely as a simple line plot. A line of best fit could be used to determine whether the overall presence of a specific country is increasing or decreasing as time goes on.
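
Once active drivers have been counted per nationality and season, the line-of-best-fit idea could look roughly like this. It is a minimal sketch on made-up counts, assuming a straight-line trend is an adequate first summary; the column names `year` and `n_drivers` are hypothetical.

```r
# Made-up yearly counts of active drivers for one nationality
toy_counts <- data.frame(year = 2015:2022,
                         n_drivers = c(1, 1, 2, 2, 3, 3, 4, 4))

# Line of best fit: the slope estimates the change in active drivers per season
fit <- lm(n_drivers ~ year, data = toy_counts)
slope <- unname(coef(fit)["year"])

# A positive slope suggests a growing presence, a negative one a shrinking presence
trend <- if (slope > 0) "increasing" else "decreasing"
```

In a plot, the same fit can be drawn with `geom_smooth(method = "lm")` on top of the yearly points.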

For the pitstop data, I initially believed pivoting wider would be a simple way to work with the data. While it did properly make the data wider, it actually was more difficult to work with, having many NA values and not initially doing what I intended (that is, simply showing the number of pit stops per race per driver). In the future, I would simplify this by grouping the data, and then slicing if necessary.

Lastly, there is the results data. I decided to look at only one race to test the waters. Creating the table and beginning the analysis proved that it can be incredibly powerful to work with. The visualization I created shows the difference between a driver’s starting position (known as grid) and their finishing place. I did this because a few racers did incredibly well given their starting position, while others did poorly. The lollipop chart I created shows this difference well. I anticipate going back to this project and continuing the analysis with that table as I become more comfortable not just in R, but also with my understanding of Formula 1.

All in all, this project was a blast to work on. I feel it encapsulated my learning in 601 well, showing a lot of fundamental skills, from data tidying and editing tables to making simple visualizations. This project also pushed me to do a lot of research on what can be done with R, looking through documentation for many packages and getting ideas on how to progress further. I see this specific assignment as a living document, something that I can go back to and work on as I further my education in R. The techniques and theories as of now (9/02/22) are very introductory, very fundamental, but immensely valuable. I am incredibly excited to see what else I can do with the same dataset as I become more and more comfortable and knowledgeable in R.

Bibliography

Continent Data: https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv

Demonym Data: https://github.com/knowitall/chunkedextractor/blob/master/src/main/resources/edu/knowitall/chunkedextractor/demonyms.csv

Formula 1 Data: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020

R Programming Language: R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Textbook: Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O’Reilly Media.