Code
library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)
::opts_chunk$set(echo = TRUE) knitr
Vishnupriya Varadharaju
December 16, 2022
I’m using the 2015 Flights Delay and Cancellation Dataset. This dataset has three different files. One is flights.csv which has details of flight names and their ID. The second file is airports.csv, which has details of the airport name, location, city, airlines that fly there and the latitude and longitude. The third file is flights.csv which has details about the date of a trip, the airline name, number, origin and destination airport, scheduled departure time and much more.
#reading in airlines csv
airlines <- read_csv('_data/2015_flight_cancellation/airlines.csv',show_col_types = FALSE)
#reading in airports csv
airports <- read_csv('_data/2015_flight_cancellation/airports.csv', show_col_types = FALSE)
#reading in flights csv
flights <- read_csv('_data/2015_flight_cancellation/flights.csv', show_col_types = FALSE)
[1] "IATA_CODE" "AIRLINE"
[1] "IATA_CODE" "AIRPORT" "CITY" "STATE" "COUNTRY" "LATITUDE"
[7] "LONGITUDE"
[1] "YEAR" "MONTH" "DAY"
[4] "DAY_OF_WEEK" "AIRLINE" "FLIGHT_NUMBER"
[7] "TAIL_NUMBER" "ORIGIN_AIRPORT" "DESTINATION_AIRPORT"
[10] "SCHEDULED_DEPARTURE" "DEPARTURE_TIME" "DEPARTURE_DELAY"
[13] "TAXI_OUT" "WHEELS_OFF" "SCHEDULED_TIME"
[16] "ELAPSED_TIME" "AIR_TIME" "DISTANCE"
[19] "WHEELS_ON" "TAXI_IN" "SCHEDULED_ARRIVAL"
[22] "ARRIVAL_TIME" "ARRIVAL_DELAY" "DIVERTED"
[25] "CANCELLED" "CANCELLATION_REASON" "AIR_SYSTEM_DELAY"
[28] "SECURITY_DELAY" "AIRLINE_DELAY" "LATE_AIRCRAFT_DELAY"
[31] "WEATHER_DELAY"
As stated previously the main table ‘flights’ has entries that correspond to a flight. The details of year, month, day and day_of_week tell the date of the flight. The airline name, flight number, tail number are to do with the aircraft details. The scheduled departure and scheduled arrival, where the actual times at which the flight should take off or land. The departure time and the arrival time are the actual time at which the flight took off or landed. The time difference between the scheduled and actual time is also given. The taxi-in time is the time take between touch down and the aircraft arriving at the gate. The taxi-out time is the time taken from the gate to wheels taking off. The time of travel, the distance covered is also given. Details about aircraft diversion, cancellation, cancellation reasons are also present.
A new point that I learnt here is the difference between flight number and tail number. A tail number is a number given to each physical aircraft. Whereas, flight number is a number given to an aircraft on a particular route. So flights with different tail numbers can have the same flight number.
# A tibble: 6 × 31
YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT…¹ TAIL_…² ORIGI…³ DESTI…⁴ SCHED…⁵
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 2015 12 1 2 AA 1230 N3KSAA SEA DFW 0005
2 2015 12 1 2 DL 1426 N820DN SFO MSP 0005
3 2015 12 1 2 AA 2406 N850AA LAX DFW 0010
4 2015 12 1 2 UA 680 N38473 SFO ORD 0014
5 2015 12 1 2 AA 883 N199UW SFO CLT 0015
6 2015 12 1 2 AA 2020 N977UY PHX CLT 0015
# … with 21 more variables: DEPARTURE_TIME <chr>, DEPARTURE_DELAY <dbl>,
# TAXI_OUT <dbl>, WHEELS_OFF <chr>, SCHEDULED_TIME <dbl>, ELAPSED_TIME <dbl>,
# AIR_TIME <dbl>, DISTANCE <dbl>, WHEELS_ON <chr>, TAXI_IN <dbl>,
# SCHEDULED_ARRIVAL <chr>, ARRIVAL_TIME <chr>, ARRIVAL_DELAY <dbl>,
# DIVERTED <dbl>, CANCELLED <dbl>, CANCELLATION_REASON <chr>,
# AIR_SYSTEM_DELAY <dbl>, SECURITY_DELAY <dbl>, AIRLINE_DELAY <dbl>,
# LATE_AIRCRAFT_DELAY <dbl>, WEATHER_DELAY <dbl>, and abbreviated variable …
The flights table is very large with close to 5800000 entries. For the sake of uploading a smaller file, I have edited the csv such as only month = 12 is included. To reduce on computation time, I am selecting only month equal to December for further analysis.
# A tibble: 6 × 31
YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT…¹ TAIL_…² ORIGI…³ DESTI…⁴ SCHED…⁵
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 2015 12 1 2 AA 1230 N3KSAA SEA DFW 0005
2 2015 12 1 2 DL 1426 N820DN SFO MSP 0005
3 2015 12 1 2 AA 2406 N850AA LAX DFW 0010
4 2015 12 1 2 UA 680 N38473 SFO ORD 0014
5 2015 12 1 2 AA 883 N199UW SFO CLT 0015
6 2015 12 1 2 AA 2020 N977UY PHX CLT 0015
# … with 21 more variables: DEPARTURE_TIME <chr>, DEPARTURE_DELAY <dbl>,
# TAXI_OUT <dbl>, WHEELS_OFF <chr>, SCHEDULED_TIME <dbl>, ELAPSED_TIME <dbl>,
# AIR_TIME <dbl>, DISTANCE <dbl>, WHEELS_ON <chr>, TAXI_IN <dbl>,
# SCHEDULED_ARRIVAL <chr>, ARRIVAL_TIME <chr>, ARRIVAL_DELAY <dbl>,
# DIVERTED <dbl>, CANCELLED <dbl>, CANCELLATION_REASON <chr>,
# AIR_SYSTEM_DELAY <dbl>, SECURITY_DELAY <dbl>, AIRLINE_DELAY <dbl>,
# LATE_AIRCRAFT_DELAY <dbl>, WEATHER_DELAY <dbl>, and abbreviated variable …
The resulting table has close to 479200 flights.
YEAR MONTH DAY DAY_OF_WEEK AIRLINE
Min. :2015 Min. :12 Min. : 1.00 Min. :1.000 Length:479230
1st Qu.:2015 1st Qu.:12 1st Qu.: 8.00 1st Qu.:2.000 Class :character
Median :2015 Median :12 Median :16.00 Median :4.000 Mode :character
Mean :2015 Mean :12 Mean :15.97 Mean :3.843
3rd Qu.:2015 3rd Qu.:12 3rd Qu.:23.00 3rd Qu.:5.000
Max. :2015 Max. :12 Max. :31.00 Max. :7.000
FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
Min. : 1 Length:479230 Length:479230 Length:479230
1st Qu.: 730 Class :character Class :character Class :character
Median :1742 Mode :character Mode :character Mode :character
Mean :2158
3rd Qu.:3100
Max. :8445
SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT
Length:479230 Length:479230 Min. : -82.00 Min. : 1.00
Class :character Class :character 1st Qu.: -4.00 1st Qu.: 11.00
Mode :character Mode :character Median : -1.00 Median : 14.00
Mean : 11.78 Mean : 16.21
3rd Qu.: 10.00 3rd Qu.: 19.00
Max. :1649.00 Max. :172.00
NA's :7679 NA's :7961
WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME AIR_TIME
Length:479230 Min. : 21.0 Min. : 15.0 Min. : 8.0
Class :character 1st Qu.: 88.0 1st Qu.: 84.0 1st Qu.: 61.0
Mode :character Median :126.0 Median :122.0 Median : 97.0
Mean :145.2 Mean :139.9 Mean :116.2
3rd Qu.:178.0 3rd Qu.:172.0 3rd Qu.:147.0
Max. :705.0 Max. :730.0 Max. :690.0
NA's :9513 NA's :9513
DISTANCE WHEELS_ON TAXI_IN SCHEDULED_ARRIVAL
Min. : 31.0 Length:479230 Min. : 1.000 Length:479230
1st Qu.: 383.0 Class :character 1st Qu.: 4.000 Class :character
Median : 672.0 Mode :character Median : 6.000 Mode :character
Mean : 837.8 Mean : 7.505
3rd Qu.:1076.0 3rd Qu.: 9.000
Max. :4983.0 Max. :248.000
NA's :8453
ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
Length:479230 Min. : -80.000 Min. :0.000000 Min. :0.00000
Class :character 1st Qu.: -14.000 1st Qu.:0.000000 1st Qu.:0.00000
Mode :character Median : -5.000 Median :0.000000 Median :0.00000
Mean : 6.093 Mean :0.003026 Mean :0.01682
3rd Qu.: 10.000 3rd Qu.:0.000000 3rd Qu.:0.00000
Max. :1636.000 Max. :1.000000 Max. :1.00000
NA's :9513
CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
Length:479230 Min. : 0.0 Min. : 0.0 Min. : 0.0
Class :character 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
Mode :character Median : 1.0 Median : 0.0 Median : 4.0
Mean : 13.3 Mean : 0.1 Mean : 19.6
3rd Qu.: 17.0 3rd Qu.: 0.0 3rd Qu.: 19.0
Max. :1049.0 Max. :154.0 Max. :1636.0
NA's :382458 NA's :382458 NA's :382458
LATE_AIRCRAFT_DELAY WEATHER_DELAY
Min. : 0.0 Min. : 0.0
1st Qu.: 0.0 1st Qu.: 0.0
Median : 6.0 Median : 0.0
Mean : 26.2 Mean : 3.4
3rd Qu.: 32.0 3rd Qu.: 0.0
Max. :1190.0 Max. :1211.0
NA's :382458 NA's :382458
From the summary table we can see that for AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY AND WEATHER_DELAY, the number of NA values is close to 80%. These columns will not be very useful for analysis and hence can be removed. Furthermore, the CANCELLATION_REASON column also has nearly 98% NA values, and hence can be removed.
[1] NA "A" "B" "C"
# A tibble: 4 × 2
CANCELLATION_REASON n
<chr> <int>
1 A 1579
2 B 5613
3 C 871
4 <NA> 471167
Creating a new field called date, which has the year, month and day in a single field.
Next, we can see that the time stamps in SCHEDULED_DEPARTURE, SCHEDULED_ARRIVAL,DEPARTURE_TIME, ARRIVAL_TIME have the format with the first two digits indicating the hour and the next two digits indicating the minutes. This can be converted into a more readable time stamp. TAXI_IN, TAXI_OUT, WHEELS_OFF, WHEELS_ON - these columns may not be needed for our analysis and can be removed.
flights_short <- flights_short %>%
mutate(SCHEDULED_DEPARTURE = str_c(substr(SCHEDULED_DEPARTURE, start=1, stop=2),
substr(SCHEDULED_DEPARTURE, start=3, stop=4),sep=":"), SCHEDULED_ARRIVAL = str_c(substr(SCHEDULED_ARRIVAL, start=1, stop=2),
substr(SCHEDULED_ARRIVAL, start=3, stop=4),sep=":"),
DEPARTURE_TIME = str_c(substr(DEPARTURE_TIME, start=1, stop=2),
substr(DEPARTURE_TIME, start=3, stop=4),sep=":"), ARRIVAL_TIME = str_c(substr(ARRIVAL_TIME, start=1, stop=2),substr(ARRIVAL_TIME, start=3, stop=4),sep=":")) %>%
select(-c(TAXI_OUT, TAXI_IN, WHEELS_OFF, WHEELS_ON, AIR_TIME, ELAPSED_TIME, SCHEDULED_TIME, FLIGHT_NUMBER,TAIL_NUMBER))
# A tibble: 0 × 13
# … with 13 variables: FLIGHT_DATE <date>, AIRLINE <chr>, ORIGIN_AIRPORT <chr>,
# DESTINATION_AIRPORT <chr>, SCHEDULED_DEPARTURE <chr>, DEPARTURE_TIME <chr>,
# DEPARTURE_DELAY <dbl>, DISTANCE <dbl>, SCHEDULED_ARRIVAL <chr>,
# ARRIVAL_TIME <chr>, ARRIVAL_DELAY <dbl>, DIVERTED <dbl>, CANCELLED <dbl>
We can create a new column to show the flights that were cancelled or diverted. 1 corresponds to diverted and 2 corresponds to cancelled and 3 corresponds to flight journey that has taken place.
There are some NA values in the DEPARTURE_DELAY. These null values can be removed from the data set as they are a small percentage when compared to the rest of the data. But some of these null values may also be because the flight was either cancelled or diverted. So just retaining the fields for now.
We can merge the airlines table with the current table to get the full name of the airlines. Merging the airports table with the current table to get the Origin and Destination City names as it will be easier to understand than the IATA code (left as an extension).
flights_short <- flights_short %>%
rename(
IATA_CODE = AIRLINE
)
flight_table = merge(x=flights_short, y=airlines, by="IATA_CODE", all.x=TRUE)
flight_table <- flight_table %>% select(-IATA_CODE)
colOrder <- c("FLIGHT_DATE", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
"STATUS", "DISTANCE", "SCHEDULED_DEPARTURE","DEPARTURE_TIME",
"DEPARTURE_DELAY","SCHEDULED_ARRIVAL","ARRIVAL_TIME","ARRIVAL_DELAY")
flight_table1 <- flight_table[, colOrder]
head(flight_table1)
FLIGHT_DATE AIRLINE ORIGIN_AIRPORT DESTINATION_AIRPORT STATUS
1 2015-12-01 American Airlines Inc. SEA DFW 3
2 2015-12-23 American Airlines Inc. DFW CLT 3
3 2015-12-01 American Airlines Inc. LAX DFW 3
4 2015-12-23 American Airlines Inc. CLT BUF 3
5 2015-12-01 American Airlines Inc. SFO CLT 3
6 2015-12-01 American Airlines Inc. PHX CLT 3
DISTANCE SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY SCHEDULED_ARRIVAL
1 1660 00:05 00:09 4 05:55
2 936 20:20 20:25 5 23:50
3 1235 00:10 00:07 -3 05:01
4 546 20:20 20:32 12 22:00
5 2296 00:15 00:13 -2 08:05
6 1773 00:15 00:11 -4 06:01
ARRIVAL_TIME ARRIVAL_DELAY
1 05:38 -17
2 23:39 -11
3 04:53 -8
4 22:06 6
5 08:10 5
6 05:29 -32
One simple visualization is to plot the different airports that are there in the US from the airports table on to a geographical map.
# A tibble: 6 × 7
IATA_CODE AIRPORT CITY STATE COUNTRY LATIT…¹ LONGI…²
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ABE Lehigh Valley International Air… Alle… PA USA 40.7 -75.4
2 ABI Abilene Regional Airport Abil… TX USA 32.4 -99.7
3 ABQ Albuquerque International Sunpo… Albu… NM USA 35.0 -107.
4 ABR Aberdeen Regional Airport Aber… SD USA 45.4 -98.4
5 ABY Southwest Georgia Regional Airp… Alba… GA USA 31.5 -84.2
6 ACK Nantucket Memorial Airport Nant… MA USA 41.3 -70.1
# … with abbreviated variable names ¹LATITUDE, ²LONGITUDE
Attaching package: 'maps'
The following object is masked from 'package:purrr':
map
ggplot() +
geom_polygon(data = USA, aes(x=long, y = lat, group = group), fill="lightgreen", alpha=0.5) +
geom_point(data=airport_loc, aes(x=LONGITUDE, y=LATITUDE), size = 0.1) +
theme_void() + coord_map() + ylim(NA,150) + xlim(NA, 5) + ggtitle("Location of airports in the USA") + theme(plot.title = element_text(lineheight=1, face="bold"))
Error in `mproject()`:
! The package `mapproj` is required for `coord_map()`
The above graph is just to give a rough idea about where the airports are roughly located in the USA. We can see that the airports are present in Alaska and in all the other island states of the USA.
The table flight_table can be further cleaned based on the requirements of the analysis questions. This data can provide the analysis to the following questions.
In the month of December, what percentage of flight trips were cancelled, or diverted? Which airlines had the most number of cancellations or diversions?
Descriptive statistics about the most popular airlines or the most popular destinations. What % of the flights belonged to which airline?
Percentage of flight journeys that occurred on weekends vs on weekdays.
What is the mean delay of various airlines? (have to include both departure and arrival delay) What % of flights had 0 delay in departure or arrival time? What % of flights took off much early than the scheduled time or landed earlier than the scheduled arrival? Are departure delays more than arrival delays? What % of flights are able to compensate the departure delay by arriving early?
How many different airports do each of the airlines visit? This can show how expanded each airline is.
---
title: "Homework 2"
author: "Vishnupriya Varadharaju"
desription: "Homework 2"
date: "12/16/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw2
- 2015_Flight_Delay_Cancellation
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
I'm using the 2015 Flights Delay and Cancellation Dataset. This dataset has three different files. One is flights.csv which has details of flight names and their ID. The second file is airports.csv, which has details of the airport name, location, city, airlines that fly there and the latitude and longitude. The third file is flights.csv which has details about the date of a trip, the airline name, number, origin and destination airport, scheduled departure time and much more.
## Reading in the dataset
```{r}
#reading in airlines csv
airlines <- read_csv('_data/2015_flight_cancellation/airlines.csv',show_col_types = FALSE)
#reading in airports csv
airports <- read_csv('_data/2015_flight_cancellation/airports.csv', show_col_types = FALSE)
#reading in flights csv
flights <- read_csv('_data/2015_flight_cancellation/flights.csv', show_col_types = FALSE)
```
```{r}
#trying to get all the col names
colnames(airlines)
colnames(airports)
```
```{r}
#trying to get all the col names
colnames(flights)
```
## Describe the data
As stated previously the main table 'flights' has entries that correspond to a flight. The details of year, month, day and day_of_week tell the date of the flight. The airline name, flight number, tail number are to do with the aircraft details. The scheduled departure and scheduled arrival, where the actual times at which the flight should take off or land. The departure time and the arrival time are the actual time at which the flight took off or landed. The time difference between the scheduled and actual time is also given. The taxi-in time is the time take between touch down and the aircraft arriving at the gate. The taxi-out time is the time taken from the gate to wheels taking off. The time of travel, the distance covered is also given. Details about aircraft diversion, cancellation, cancellation reasons are also present.
A new point that I learnt here is the difference between flight number and tail number. A tail number is a number given to each physical aircraft. Whereas, flight number is a number given to an aircraft on a particular route. So flights with different tail numbers can have the same flight number.
```{r}
head(flights)
```
## Tidying data
The flights table is very large with close to 5800000 entries. For the sake of uploading a smaller file, I have edited the csv such as only month = 12 is included. To reduce on computation time, I am selecting only month equal to December for further analysis.
```{r}
flights_short <- flights %>% filter(MONTH == 12)
head(flights_short)
```
The resulting table has close to 479200 flights.
```{r}
summary(flights_short)
```
From the summary table we can see that for AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY AND WEATHER_DELAY, the number of NA values is close to 80%. These columns will not be very useful for analysis and hence can be removed. Furthermore, the CANCELLATION_REASON column also has nearly 98% NA values, and hence can be removed.
```{r}
(382458/479230) * 100
```
```{r}
unique(flights_short$CANCELLATION_REASON)
count(flights_short,CANCELLATION_REASON)
```
Creating a new field called date, which has the year, month and day in a single field.
```{r}
flights_short <- flights_short %>%
mutate(FLIGHT_DATE = str_c(DAY,
MONTH,
YEAR, sep="/"),
.before = YEAR,
FLIGHT_DATE = dmy(FLIGHT_DATE)) %>% select(-c(YEAR, MONTH, DAY,AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY,WEATHER_DELAY, DAY_OF_WEEK,CANCELLATION_REASON))
```
Next, we can see that the time stamps in SCHEDULED_DEPARTURE, SCHEDULED_ARRIVAL,DEPARTURE_TIME, ARRIVAL_TIME have the format with the first two digits indicating the hour and the next two digits indicating the minutes. This can be converted into a more readable time stamp. TAXI_IN, TAXI_OUT, WHEELS_OFF, WHEELS_ON - these columns may not be needed for our analysis and can be removed.
```{r}
flights_short <- flights_short %>%
mutate(SCHEDULED_DEPARTURE = str_c(substr(SCHEDULED_DEPARTURE, start=1, stop=2),
substr(SCHEDULED_DEPARTURE, start=3, stop=4),sep=":"), SCHEDULED_ARRIVAL = str_c(substr(SCHEDULED_ARRIVAL, start=1, stop=2),
substr(SCHEDULED_ARRIVAL, start=3, stop=4),sep=":"),
DEPARTURE_TIME = str_c(substr(DEPARTURE_TIME, start=1, stop=2),
substr(DEPARTURE_TIME, start=3, stop=4),sep=":"), ARRIVAL_TIME = str_c(substr(ARRIVAL_TIME, start=1, stop=2),substr(ARRIVAL_TIME, start=3, stop=4),sep=":")) %>%
select(-c(TAXI_OUT, TAXI_IN, WHEELS_OFF, WHEELS_ON, AIR_TIME, ELAPSED_TIME, SCHEDULED_TIME, FLIGHT_NUMBER,TAIL_NUMBER))
```
```{r}
flights_short %>% filter(DIVERTED == 1 & CANCELLED == 1)
```
We can create a new column to show the flights that were cancelled or diverted. 1 corresponds to diverted and 2 corresponds to cancelled and 3 corresponds to flight journey that has taken place.
```{r}
flights_short <- flights_short %>% mutate(STATUS = case_when(
"DEPARTED" == 1 ~ 1,
"CANCELLED" == 1 ~ 2,
TRUE ~ 3
)) %>% select(-c(DIVERTED, CANCELLED))
```
There are some NA values in the DEPARTURE_DELAY. These null values can be removed from the data set as they are a small percentage when compared to the rest of the data. But some of these null values may also be because the flight was either cancelled or diverted. So just retaining the fields for now.
```{r}
flights_short <- flights_short %>% filter(is.na(DEPARTURE_DELAY) == FALSE & is.na(DEPARTURE_TIME) == FALSE & STATUS == 3) %>% filter(is.na(ARRIVAL_DELAY) == FALSE & is.na(ARRIVAL_TIME) == FALSE & STATUS == 3)
#summary(flights_short)
```
We can merge the airlines table with the current table to get the full name of the airlines. Merging the airports table with the current table to get the Origin and Destination City names as it will be easier to understand than the IATA code (left as an extension).
```{r}
flights_short <- flights_short %>%
rename(
IATA_CODE = AIRLINE
)
flight_table = merge(x=flights_short, y=airlines, by="IATA_CODE", all.x=TRUE)
flight_table <- flight_table %>% select(-IATA_CODE)
colOrder <- c("FLIGHT_DATE", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
"STATUS", "DISTANCE", "SCHEDULED_DEPARTURE","DEPARTURE_TIME",
"DEPARTURE_DELAY","SCHEDULED_ARRIVAL","ARRIVAL_TIME","ARRIVAL_DELAY")
flight_table1 <- flight_table[, colOrder]
head(flight_table1)
```
## Visualizing the data
One simple visualization is to plot the different airports that are there in the US from the airports table on to a geographical map.
```{r}
head(airports)
```
```{r}
library(maps)
USA <- map_data("world") %>% filter(region=="USA")
#filter out those airports which do not have a latitude and longitude
airport_loc <- airports %>% filter(is.na(LATITUDE) == FALSE, is.na(LONGITUDE) == FALSE) %>% select(-c(AIRPORT))
```
```{r}
ggplot() +
geom_polygon(data = USA, aes(x=long, y = lat, group = group), fill="lightgreen", alpha=0.5) +
geom_point(data=airport_loc, aes(x=LONGITUDE, y=LATITUDE), size = 0.1) +
theme_void() + coord_map() + ylim(NA,150) + xlim(NA, 5) + ggtitle("Location of airports in the USA") + theme(plot.title = element_text(lineheight=1, face="bold"))
```
The above graph is just to give a rough idea about where the airports are roughly located in the USA. We can see that the airports are present in Alaska and in all the other island states of the USA.
## Research Questions
The table flight_table can be further cleaned based on the requirements of the analysis questions.
This data can provide the analysis to the following questions.
1. In the month of December, what percentage of flight trips were cancelled, or diverted? Which airlines had the most number of cancellations or diversions?
2. Descriptive statistics about the most popular airlines or the most popular destinations. What % of the flights belonged to which airline?
3. Percentage of flight journeys that occurred on weekends vs on weekdays.
4. What is the mean delay of various airlines? (have to include both departure and arrival delay) What % of flights had 0 delay in departure or arrival time? What % of flights took off much early than the scheduled time or landed earlier than the scheduled arrival? Are departure delays more than arrival delays? What % of flights are able to compensate the departure delay by arriving early?
5. How many different airports do each of the airlines visit? This can show how expanded each airline is.