DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Homework 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Overview
  • Reading in the dataset
  • Describe the data
  • Tidying data
  • Visualizing the data
  • Research Questions

Homework 2

  • Show All Code
  • Hide All Code

  • View Source
hw2
2015_Flight_Delay_Cancellation
Author

Vishnupriya Varadharaju

Published

December 16, 2022

Code
library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE)

Overview

I’m using the 2015 Flights Delay and Cancellation Dataset. This dataset has three different files. One is flights.csv which has details of flight names and their ID. The second file is airports.csv, which has details of the airport name, location, city, airlines that fly there and the latitude and longitude. The third file is flights.csv which has details about the date of a trip, the airline name, number, origin and destination airport, scheduled departure time and much more.

Reading in the dataset

Code
#reading in airlines csv
airlines <- read_csv('_data/2015_flight_cancellation/airlines.csv',show_col_types = FALSE)
#reading in airports csv
airports <- read_csv('_data/2015_flight_cancellation/airports.csv', show_col_types = FALSE)
#reading in flights csv
flights <- read_csv('_data/2015_flight_cancellation/flights.csv', show_col_types = FALSE)
Code
#trying to get all the col names 
colnames(airlines)
[1] "IATA_CODE" "AIRLINE"  
Code
colnames(airports)
[1] "IATA_CODE" "AIRPORT"   "CITY"      "STATE"     "COUNTRY"   "LATITUDE" 
[7] "LONGITUDE"
Code
#trying to get all the col names 
colnames(flights)
 [1] "YEAR"                "MONTH"               "DAY"                
 [4] "DAY_OF_WEEK"         "AIRLINE"             "FLIGHT_NUMBER"      
 [7] "TAIL_NUMBER"         "ORIGIN_AIRPORT"      "DESTINATION_AIRPORT"
[10] "SCHEDULED_DEPARTURE" "DEPARTURE_TIME"      "DEPARTURE_DELAY"    
[13] "TAXI_OUT"            "WHEELS_OFF"          "SCHEDULED_TIME"     
[16] "ELAPSED_TIME"        "AIR_TIME"            "DISTANCE"           
[19] "WHEELS_ON"           "TAXI_IN"             "SCHEDULED_ARRIVAL"  
[22] "ARRIVAL_TIME"        "ARRIVAL_DELAY"       "DIVERTED"           
[25] "CANCELLED"           "CANCELLATION_REASON" "AIR_SYSTEM_DELAY"   
[28] "SECURITY_DELAY"      "AIRLINE_DELAY"       "LATE_AIRCRAFT_DELAY"
[31] "WEATHER_DELAY"      

Describe the data

As stated previously the main table ‘flights’ has entries that correspond to a flight. The details of year, month, day and day_of_week tell the date of the flight. The airline name, flight number, tail number are to do with the aircraft details. The scheduled departure and scheduled arrival, where the actual times at which the flight should take off or land. The departure time and the arrival time are the actual time at which the flight took off or landed. The time difference between the scheduled and actual time is also given. The taxi-in time is the time take between touch down and the aircraft arriving at the gate. The taxi-out time is the time taken from the gate to wheels taking off. The time of travel, the distance covered is also given. Details about aircraft diversion, cancellation, cancellation reasons are also present.

A new point that I learnt here is the difference between flight number and tail number. A tail number is a number given to each physical aircraft. Whereas, flight number is a number given to an aircraft on a particular route. So flights with different tail numbers can have the same flight number.

Code
head(flights)
# A tibble: 6 × 31
   YEAR MONTH   DAY DAY_OF_WEEK AIRLINE FLIGHT…¹ TAIL_…² ORIGI…³ DESTI…⁴ SCHED…⁵
  <dbl> <dbl> <dbl>       <dbl> <chr>      <dbl> <chr>   <chr>   <chr>   <chr>  
1  2015    12     1           2 AA          1230 N3KSAA  SEA     DFW     0005   
2  2015    12     1           2 DL          1426 N820DN  SFO     MSP     0005   
3  2015    12     1           2 AA          2406 N850AA  LAX     DFW     0010   
4  2015    12     1           2 UA           680 N38473  SFO     ORD     0014   
5  2015    12     1           2 AA           883 N199UW  SFO     CLT     0015   
6  2015    12     1           2 AA          2020 N977UY  PHX     CLT     0015   
# … with 21 more variables: DEPARTURE_TIME <chr>, DEPARTURE_DELAY <dbl>,
#   TAXI_OUT <dbl>, WHEELS_OFF <chr>, SCHEDULED_TIME <dbl>, ELAPSED_TIME <dbl>,
#   AIR_TIME <dbl>, DISTANCE <dbl>, WHEELS_ON <chr>, TAXI_IN <dbl>,
#   SCHEDULED_ARRIVAL <chr>, ARRIVAL_TIME <chr>, ARRIVAL_DELAY <dbl>,
#   DIVERTED <dbl>, CANCELLED <dbl>, CANCELLATION_REASON <chr>,
#   AIR_SYSTEM_DELAY <dbl>, SECURITY_DELAY <dbl>, AIRLINE_DELAY <dbl>,
#   LATE_AIRCRAFT_DELAY <dbl>, WEATHER_DELAY <dbl>, and abbreviated variable …

Tidying data

The flights table is very large with close to 5800000 entries. For the sake of uploading a smaller file, I have edited the csv such as only month = 12 is included. To reduce on computation time, I am selecting only month equal to December for further analysis.

Code
flights_short <- flights %>% filter(MONTH == 12)
head(flights_short)
# A tibble: 6 × 31
   YEAR MONTH   DAY DAY_OF_WEEK AIRLINE FLIGHT…¹ TAIL_…² ORIGI…³ DESTI…⁴ SCHED…⁵
  <dbl> <dbl> <dbl>       <dbl> <chr>      <dbl> <chr>   <chr>   <chr>   <chr>  
1  2015    12     1           2 AA          1230 N3KSAA  SEA     DFW     0005   
2  2015    12     1           2 DL          1426 N820DN  SFO     MSP     0005   
3  2015    12     1           2 AA          2406 N850AA  LAX     DFW     0010   
4  2015    12     1           2 UA           680 N38473  SFO     ORD     0014   
5  2015    12     1           2 AA           883 N199UW  SFO     CLT     0015   
6  2015    12     1           2 AA          2020 N977UY  PHX     CLT     0015   
# … with 21 more variables: DEPARTURE_TIME <chr>, DEPARTURE_DELAY <dbl>,
#   TAXI_OUT <dbl>, WHEELS_OFF <chr>, SCHEDULED_TIME <dbl>, ELAPSED_TIME <dbl>,
#   AIR_TIME <dbl>, DISTANCE <dbl>, WHEELS_ON <chr>, TAXI_IN <dbl>,
#   SCHEDULED_ARRIVAL <chr>, ARRIVAL_TIME <chr>, ARRIVAL_DELAY <dbl>,
#   DIVERTED <dbl>, CANCELLED <dbl>, CANCELLATION_REASON <chr>,
#   AIR_SYSTEM_DELAY <dbl>, SECURITY_DELAY <dbl>, AIRLINE_DELAY <dbl>,
#   LATE_AIRCRAFT_DELAY <dbl>, WEATHER_DELAY <dbl>, and abbreviated variable …

The resulting table has close to 479200 flights.

Code
summary(flights_short)
      YEAR          MONTH         DAY         DAY_OF_WEEK      AIRLINE         
 Min.   :2015   Min.   :12   Min.   : 1.00   Min.   :1.000   Length:479230     
 1st Qu.:2015   1st Qu.:12   1st Qu.: 8.00   1st Qu.:2.000   Class :character  
 Median :2015   Median :12   Median :16.00   Median :4.000   Mode  :character  
 Mean   :2015   Mean   :12   Mean   :15.97   Mean   :3.843                     
 3rd Qu.:2015   3rd Qu.:12   3rd Qu.:23.00   3rd Qu.:5.000                     
 Max.   :2015   Max.   :12   Max.   :31.00   Max.   :7.000                     
                                                                               
 FLIGHT_NUMBER  TAIL_NUMBER        ORIGIN_AIRPORT     DESTINATION_AIRPORT
 Min.   :   1   Length:479230      Length:479230      Length:479230      
 1st Qu.: 730   Class :character   Class :character   Class :character   
 Median :1742   Mode  :character   Mode  :character   Mode  :character   
 Mean   :2158                                                            
 3rd Qu.:3100                                                            
 Max.   :8445                                                            
                                                                         
 SCHEDULED_DEPARTURE DEPARTURE_TIME     DEPARTURE_DELAY      TAXI_OUT     
 Length:479230       Length:479230      Min.   : -82.00   Min.   :  1.00  
 Class :character    Class :character   1st Qu.:  -4.00   1st Qu.: 11.00  
 Mode  :character    Mode  :character   Median :  -1.00   Median : 14.00  
                                        Mean   :  11.78   Mean   : 16.21  
                                        3rd Qu.:  10.00   3rd Qu.: 19.00  
                                        Max.   :1649.00   Max.   :172.00  
                                        NA's   :7679      NA's   :7961    
  WHEELS_OFF        SCHEDULED_TIME   ELAPSED_TIME      AIR_TIME    
 Length:479230      Min.   : 21.0   Min.   : 15.0   Min.   :  8.0  
 Class :character   1st Qu.: 88.0   1st Qu.: 84.0   1st Qu.: 61.0  
 Mode  :character   Median :126.0   Median :122.0   Median : 97.0  
                    Mean   :145.2   Mean   :139.9   Mean   :116.2  
                    3rd Qu.:178.0   3rd Qu.:172.0   3rd Qu.:147.0  
                    Max.   :705.0   Max.   :730.0   Max.   :690.0  
                                    NA's   :9513    NA's   :9513   
    DISTANCE       WHEELS_ON            TAXI_IN        SCHEDULED_ARRIVAL 
 Min.   :  31.0   Length:479230      Min.   :  1.000   Length:479230     
 1st Qu.: 383.0   Class :character   1st Qu.:  4.000   Class :character  
 Median : 672.0   Mode  :character   Median :  6.000   Mode  :character  
 Mean   : 837.8                      Mean   :  7.505                     
 3rd Qu.:1076.0                      3rd Qu.:  9.000                     
 Max.   :4983.0                      Max.   :248.000                     
                                     NA's   :8453                        
 ARRIVAL_TIME       ARRIVAL_DELAY         DIVERTED          CANCELLED      
 Length:479230      Min.   : -80.000   Min.   :0.000000   Min.   :0.00000  
 Class :character   1st Qu.: -14.000   1st Qu.:0.000000   1st Qu.:0.00000  
 Mode  :character   Median :  -5.000   Median :0.000000   Median :0.00000  
                    Mean   :   6.093   Mean   :0.003026   Mean   :0.01682  
                    3rd Qu.:  10.000   3rd Qu.:0.000000   3rd Qu.:0.00000  
                    Max.   :1636.000   Max.   :1.000000   Max.   :1.00000  
                    NA's   :9513                                           
 CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY   AIRLINE_DELAY   
 Length:479230       Min.   :   0.0   Min.   :  0.0    Min.   :   0.0  
 Class :character    1st Qu.:   0.0   1st Qu.:  0.0    1st Qu.:   0.0  
 Mode  :character    Median :   1.0   Median :  0.0    Median :   4.0  
                     Mean   :  13.3   Mean   :  0.1    Mean   :  19.6  
                     3rd Qu.:  17.0   3rd Qu.:  0.0    3rd Qu.:  19.0  
                     Max.   :1049.0   Max.   :154.0    Max.   :1636.0  
                     NA's   :382458   NA's   :382458   NA's   :382458  
 LATE_AIRCRAFT_DELAY WEATHER_DELAY   
 Min.   :   0.0      Min.   :   0.0  
 1st Qu.:   0.0      1st Qu.:   0.0  
 Median :   6.0      Median :   0.0  
 Mean   :  26.2      Mean   :   3.4  
 3rd Qu.:  32.0      3rd Qu.:   0.0  
 Max.   :1190.0      Max.   :1211.0  
 NA's   :382458      NA's   :382458  

From the summary table we can see that for AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY AND WEATHER_DELAY, the number of NA values is close to 80%. These columns will not be very useful for analysis and hence can be removed. Furthermore, the CANCELLATION_REASON column also has nearly 98% NA values, and hence can be removed.

Code
(382458/479230) * 100
[1] 79.80677
Code
unique(flights_short$CANCELLATION_REASON)
[1] NA  "A" "B" "C"
Code
count(flights_short,CANCELLATION_REASON)
# A tibble: 4 × 2
  CANCELLATION_REASON      n
  <chr>                <int>
1 A                     1579
2 B                     5613
3 C                      871
4 <NA>                471167

Creating a new field called date, which has the year, month and day in a single field.

Code
flights_short <- flights_short %>% 
  mutate(FLIGHT_DATE = str_c(DAY,
                              MONTH,
                              YEAR, sep="/"),
         .before = YEAR,
         FLIGHT_DATE = dmy(FLIGHT_DATE)) %>% select(-c(YEAR, MONTH, DAY,AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY,WEATHER_DELAY, DAY_OF_WEEK,CANCELLATION_REASON))

Next, we can see that the time stamps in SCHEDULED_DEPARTURE, SCHEDULED_ARRIVAL,DEPARTURE_TIME, ARRIVAL_TIME have the format with the first two digits indicating the hour and the next two digits indicating the minutes. This can be converted into a more readable time stamp. TAXI_IN, TAXI_OUT, WHEELS_OFF, WHEELS_ON - these columns may not be needed for our analysis and can be removed.

Code
flights_short <- flights_short %>% 
  mutate(SCHEDULED_DEPARTURE = str_c(substr(SCHEDULED_DEPARTURE, start=1, stop=2),
                              substr(SCHEDULED_DEPARTURE, start=3, stop=4),sep=":"), SCHEDULED_ARRIVAL = str_c(substr(SCHEDULED_ARRIVAL, start=1, stop=2),
                              substr(SCHEDULED_ARRIVAL, start=3, stop=4),sep=":"),
         DEPARTURE_TIME = str_c(substr(DEPARTURE_TIME, start=1, stop=2),
                              substr(DEPARTURE_TIME, start=3, stop=4),sep=":"), ARRIVAL_TIME = str_c(substr(ARRIVAL_TIME, start=1, stop=2),substr(ARRIVAL_TIME, start=3, stop=4),sep=":")) %>% 
  select(-c(TAXI_OUT, TAXI_IN, WHEELS_OFF, WHEELS_ON, AIR_TIME, ELAPSED_TIME, SCHEDULED_TIME, FLIGHT_NUMBER,TAIL_NUMBER))
Code
flights_short %>% filter(DIVERTED == 1 & CANCELLED == 1) 
# A tibble: 0 × 13
# … with 13 variables: FLIGHT_DATE <date>, AIRLINE <chr>, ORIGIN_AIRPORT <chr>,
#   DESTINATION_AIRPORT <chr>, SCHEDULED_DEPARTURE <chr>, DEPARTURE_TIME <chr>,
#   DEPARTURE_DELAY <dbl>, DISTANCE <dbl>, SCHEDULED_ARRIVAL <chr>,
#   ARRIVAL_TIME <chr>, ARRIVAL_DELAY <dbl>, DIVERTED <dbl>, CANCELLED <dbl>

We can create a new column to show the flights that were cancelled or diverted. 1 corresponds to diverted and 2 corresponds to cancelled and 3 corresponds to flight journey that has taken place.

Code
flights_short <- flights_short %>% mutate(STATUS = case_when(
  "DEPARTED" == 1 ~ 1,
  "CANCELLED" == 1 ~ 2,
  TRUE ~ 3
)) %>% select(-c(DIVERTED, CANCELLED))

There are some NA values in the DEPARTURE_DELAY. These null values can be removed from the data set as they are a small percentage when compared to the rest of the data. But some of these null values may also be because the flight was either cancelled or diverted. So just retaining the fields for now.

Code
flights_short <- flights_short %>% filter(is.na(DEPARTURE_DELAY) == FALSE & is.na(DEPARTURE_TIME) == FALSE & STATUS == 3) %>% filter(is.na(ARRIVAL_DELAY) == FALSE & is.na(ARRIVAL_TIME) == FALSE & STATUS == 3)
#summary(flights_short)

We can merge the airlines table with the current table to get the full name of the airlines. Merging the airports table with the current table to get the Origin and Destination City names as it will be easier to understand than the IATA code (left as an extension).

Code
flights_short <- flights_short %>% 
  rename(
    IATA_CODE = AIRLINE
    )

flight_table = merge(x=flights_short, y=airlines, by="IATA_CODE", all.x=TRUE)
flight_table <- flight_table %>% select(-IATA_CODE) 

colOrder <- c("FLIGHT_DATE", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
              "STATUS", "DISTANCE", "SCHEDULED_DEPARTURE","DEPARTURE_TIME",
              "DEPARTURE_DELAY","SCHEDULED_ARRIVAL","ARRIVAL_TIME","ARRIVAL_DELAY")
            
flight_table1 <- flight_table[, colOrder]
head(flight_table1)
  FLIGHT_DATE                AIRLINE ORIGIN_AIRPORT DESTINATION_AIRPORT STATUS
1  2015-12-01 American Airlines Inc.            SEA                 DFW      3
2  2015-12-23 American Airlines Inc.            DFW                 CLT      3
3  2015-12-01 American Airlines Inc.            LAX                 DFW      3
4  2015-12-23 American Airlines Inc.            CLT                 BUF      3
5  2015-12-01 American Airlines Inc.            SFO                 CLT      3
6  2015-12-01 American Airlines Inc.            PHX                 CLT      3
  DISTANCE SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY SCHEDULED_ARRIVAL
1     1660               00:05          00:09               4             05:55
2      936               20:20          20:25               5             23:50
3     1235               00:10          00:07              -3             05:01
4      546               20:20          20:32              12             22:00
5     2296               00:15          00:13              -2             08:05
6     1773               00:15          00:11              -4             06:01
  ARRIVAL_TIME ARRIVAL_DELAY
1        05:38           -17
2        23:39           -11
3        04:53            -8
4        22:06             6
5        08:10             5
6        05:29           -32

Visualizing the data

One simple visualization is to plot the different airports that are there in the US from the airports table on to a geographical map.

Code
head(airports)
# A tibble: 6 × 7
  IATA_CODE AIRPORT                          CITY  STATE COUNTRY LATIT…¹ LONGI…²
  <chr>     <chr>                            <chr> <chr> <chr>     <dbl>   <dbl>
1 ABE       Lehigh Valley International Air… Alle… PA    USA        40.7   -75.4
2 ABI       Abilene Regional Airport         Abil… TX    USA        32.4   -99.7
3 ABQ       Albuquerque International Sunpo… Albu… NM    USA        35.0  -107. 
4 ABR       Aberdeen Regional Airport        Aber… SD    USA        45.4   -98.4
5 ABY       Southwest Georgia Regional Airp… Alba… GA    USA        31.5   -84.2
6 ACK       Nantucket Memorial Airport       Nant… MA    USA        41.3   -70.1
# … with abbreviated variable names ¹​LATITUDE, ²​LONGITUDE
Code
library(maps)

Attaching package: 'maps'
The following object is masked from 'package:purrr':

    map
Code
USA <- map_data("world") %>% filter(region=="USA")

#filter out those airports which do not have a latitude and longitude
airport_loc <- airports %>% filter(is.na(LATITUDE) == FALSE, is.na(LONGITUDE) == FALSE) %>% select(-c(AIRPORT))
Code
ggplot() +
  geom_polygon(data = USA, aes(x=long, y = lat, group = group), fill="lightgreen", alpha=0.5) +
  geom_point(data=airport_loc, aes(x=LONGITUDE, y=LATITUDE), size = 0.1) +
  theme_void() + coord_map() + ylim(NA,150) + xlim(NA, 5) + ggtitle("Location of airports in the USA") + theme(plot.title = element_text(lineheight=1, face="bold"))
Error in `mproject()`:
! The package `mapproj` is required for `coord_map()`

The above graph is just to give a rough idea about where the airports are roughly located in the USA. We can see that the airports are present in Alaska and in all the other island states of the USA.

Research Questions

The table flight_table can be further cleaned based on the requirements of the analysis questions. This data can provide the analysis to the following questions.

  1. In the month of December, what percentage of flight trips were cancelled, or diverted? Which airlines had the most number of cancellations or diversions?

  2. Descriptive statistics about the most popular airlines or the most popular destinations. What % of the flights belonged to which airline?

  3. Percentage of flight journeys that occurred on weekends vs on weekdays.

  4. What is the mean delay of various airlines? (have to include both departure and arrival delay) What % of flights had 0 delay in departure or arrival time? What % of flights took off much early than the scheduled time or landed earlier than the scheduled arrival? Are departure delays more than arrival delays? What % of flights are able to compensate the departure delay by arriving early?

  5. How many different airports do each of the airlines visit? This can show how expanded each airline is.

Source Code
---
title: "Homework 2"
author: "Vishnupriya Varadharaju"
desription: "Homework 2"
date: "12/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - 2015_Flight_Delay_Cancellation
---

```{r}
#| label: setup
#| warning: false

library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE)
```

## Overview

I'm using the 2015 Flights Delay and Cancellation Dataset. This dataset has three different files. One is flights.csv which has details of flight names and their ID. The second file is airports.csv, which has details of the airport name, location, city, airlines that fly there and the latitude and longitude. The third file is flights.csv which has details about the date of a trip, the airline name, number, origin and destination airport, scheduled departure time and much more.

## Reading in the dataset

```{r}
#reading in airlines csv
airlines <- read_csv('_data/2015_flight_cancellation/airlines.csv',show_col_types = FALSE)
#reading in airports csv
airports <- read_csv('_data/2015_flight_cancellation/airports.csv', show_col_types = FALSE)
#reading in flights csv
flights <- read_csv('_data/2015_flight_cancellation/flights.csv', show_col_types = FALSE)
```

```{r}
#trying to get all the col names 
colnames(airlines)
colnames(airports)
```

```{r}
#trying to get all the col names 
colnames(flights)
```

## Describe the data

As stated previously the main table 'flights' has entries that correspond to a flight. The details of year, month, day and day_of_week tell the date of the flight. The airline name, flight number, tail number are to do with the aircraft details. The scheduled departure and scheduled arrival, where the actual times at which the flight should take off or land. The departure time and the arrival time are the actual time at which the flight took off or landed. The time difference between the scheduled and actual time is also given. The taxi-in time is the time take between touch down and the aircraft arriving at the gate. The taxi-out time is the time taken from the gate to wheels taking off. The time of travel, the distance covered is also given. Details about aircraft diversion, cancellation, cancellation reasons are also present. 

A new point that I learnt here is the difference between flight number and tail number. A tail number is a number given to each physical aircraft. Whereas, flight number is a number given to an aircraft on a particular route. So flights with different tail numbers can have the same flight number.

```{r}
head(flights)
```

## Tidying data

The flights table is very large with close to 5800000 entries. For the sake of uploading a smaller file, I have edited the csv such as only month = 12 is included. To reduce on computation time, I am selecting only month equal to December for further analysis.

```{r}
flights_short <- flights %>% filter(MONTH == 12)
head(flights_short)
```
The resulting table has close to 479200 flights. 

```{r}
summary(flights_short)
```

From the summary table we can see that for AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY AND WEATHER_DELAY, the number of NA values is close to 80%. These columns will not be very useful for analysis and hence can be removed. Furthermore, the CANCELLATION_REASON column also has nearly 98% NA values, and hence can be removed.

```{r}
(382458/479230) * 100
```

```{r}
unique(flights_short$CANCELLATION_REASON)
count(flights_short,CANCELLATION_REASON)
```

Creating a new field called date, which has the year, month and day in a single field.
```{r}
flights_short <- flights_short %>% 
  mutate(FLIGHT_DATE = str_c(DAY,
                              MONTH,
                              YEAR, sep="/"),
         .before = YEAR,
         FLIGHT_DATE = dmy(FLIGHT_DATE)) %>% select(-c(YEAR, MONTH, DAY,AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY,WEATHER_DELAY, DAY_OF_WEEK,CANCELLATION_REASON))
```

Next, we can see that the time stamps in SCHEDULED_DEPARTURE, SCHEDULED_ARRIVAL,DEPARTURE_TIME, ARRIVAL_TIME have the format with the first two digits indicating the hour and the next two digits indicating the minutes. This can be converted into a more readable time stamp. TAXI_IN, TAXI_OUT, WHEELS_OFF, WHEELS_ON - these columns may not be needed for our analysis and can be removed. 


```{r}
flights_short <- flights_short %>% 
  mutate(SCHEDULED_DEPARTURE = str_c(substr(SCHEDULED_DEPARTURE, start=1, stop=2),
                              substr(SCHEDULED_DEPARTURE, start=3, stop=4),sep=":"), SCHEDULED_ARRIVAL = str_c(substr(SCHEDULED_ARRIVAL, start=1, stop=2),
                              substr(SCHEDULED_ARRIVAL, start=3, stop=4),sep=":"),
         DEPARTURE_TIME = str_c(substr(DEPARTURE_TIME, start=1, stop=2),
                              substr(DEPARTURE_TIME, start=3, stop=4),sep=":"), ARRIVAL_TIME = str_c(substr(ARRIVAL_TIME, start=1, stop=2),substr(ARRIVAL_TIME, start=3, stop=4),sep=":")) %>% 
  select(-c(TAXI_OUT, TAXI_IN, WHEELS_OFF, WHEELS_ON, AIR_TIME, ELAPSED_TIME, SCHEDULED_TIME, FLIGHT_NUMBER,TAIL_NUMBER))
```


```{r}
flights_short %>% filter(DIVERTED == 1 & CANCELLED == 1) 
```
We can create a new column to show the flights that were cancelled or diverted. 1 corresponds to diverted and 2 corresponds to cancelled and 3 corresponds to flight journey that has taken place.

```{r}
flights_short <- flights_short %>% mutate(STATUS = case_when(
  "DEPARTED" == 1 ~ 1,
  "CANCELLED" == 1 ~ 2,
  TRUE ~ 3
)) %>% select(-c(DIVERTED, CANCELLED))
```

There are some NA values in the DEPARTURE_DELAY. These null values can be removed from the data set as they are a small percentage when compared to the rest of the data. But some of these null values may also be because the flight was either cancelled or diverted. So just retaining the fields for now.

```{r}
flights_short <- flights_short %>% filter(is.na(DEPARTURE_DELAY) == FALSE & is.na(DEPARTURE_TIME) == FALSE & STATUS == 3) %>% filter(is.na(ARRIVAL_DELAY) == FALSE & is.na(ARRIVAL_TIME) == FALSE & STATUS == 3)
#summary(flights_short)
```

We can merge the airlines table with the current table to get the full name of the airlines. Merging the airports table with the current table to get the Origin and Destination City names as it will be easier to understand than the IATA code (left as an extension).

```{r}
flights_short <- flights_short %>% 
  rename(
    IATA_CODE = AIRLINE
    )

flight_table = merge(x=flights_short, y=airlines, by="IATA_CODE", all.x=TRUE)
flight_table <- flight_table %>% select(-IATA_CODE) 

colOrder <- c("FLIGHT_DATE", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
              "STATUS", "DISTANCE", "SCHEDULED_DEPARTURE","DEPARTURE_TIME",
              "DEPARTURE_DELAY","SCHEDULED_ARRIVAL","ARRIVAL_TIME","ARRIVAL_DELAY")
            
flight_table1 <- flight_table[, colOrder]
head(flight_table1)
```

## Visualizing the data

One simple visualization is to plot the different airports that are there in the US from the airports table on to a geographical map.

```{r}
head(airports)
```


```{r}
library(maps)
USA <- map_data("world") %>% filter(region=="USA")

#filter out those airports which do not have a latitude and longitude
airport_loc <- airports %>% filter(is.na(LATITUDE) == FALSE, is.na(LONGITUDE) == FALSE) %>% select(-c(AIRPORT))
```

```{r}
ggplot() +
  geom_polygon(data = USA, aes(x=long, y = lat, group = group), fill="lightgreen", alpha=0.5) +
  geom_point(data=airport_loc, aes(x=LONGITUDE, y=LATITUDE), size = 0.1) +
  theme_void() + coord_map() + ylim(NA,150) + xlim(NA, 5) + ggtitle("Location of airports in the USA") + theme(plot.title = element_text(lineheight=1, face="bold"))
```


The above graph is just to give a rough idea about where the airports are roughly located in the USA. We can see that the airports are present in Alaska and in all the other island states of the USA. 

## Research Questions

The table flight_table can be further cleaned based on the requirements of the analysis questions.
This data can provide the analysis to the following questions.

1. In the month of December, what percentage of flight trips were cancelled, or diverted? Which airlines had the most number of cancellations or diversions?

2. Descriptive statistics about the most popular airlines or the most popular destinations. What % of the flights belonged to which airline?

3. Percentage of flight journeys that occurred on weekends vs on weekdays.

4. What is the mean delay of various airlines? (have to include both departure and arrival delay) What % of flights had 0 delay in departure or arrival time? What % of flights took off much early than the scheduled time or landed earlier than the scheduled arrival? Are departure delays more than arrival delays? What % of flights are able to compensate the departure delay by arriving early?

5. How many different airports do each of the airlines visit? This can show how expanded each airline is.