DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Homework 3

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Instructions
  • Reading in the Data
  • Brief Description of the data
  • Narrative about the data set..
  • Potential Research Questions

Homework 3

  • Show All Code
  • Hide All Code

  • View Source
HW3
Used Vehicle Data
Owen Tibby
Author

Owen Tibby

Published

November 30, 2022

Code
library(tidyverse)
library(tidyr)
library(summarytools)
library(dplyr)
library(readr)
library(readxl)
library(stringr)
library(lubridate)
library(vindecodr)
Error in library(vindecodr): there is no package called 'vindecodr'
Code
library(plotrix)
library(sparklyr)
Error in library(sparklyr): there is no package called 'sparklyr'
Code
library(purrr)



knitr::opts_chunk$set(echo = TRUE)

Instructions

The emphasis in this homework is on exploratory data analysis using both graphics and statistics. You should build on your prior homework - incorporating any feedback and adjusting the code and text as needed. These homeworks are intended to be cumulative. Therefore, while it is fine to switch datasets, you will need to include all of the information from HW1 for your new (or old) dataset in this hw submission as well.

  • Include descriptive statistics (e.g, mean, median, and standard deviation for numerical variables, and frequencies and/or mode for categorical variables

  • Include relevant visualizations using ggplot2 to complement these descriptive statistics. Be sure to use faceting, coloring, and titles as needed. Each visualization should be accompanied by descriptive text that highlights:

  • the variable(s) used

  • what questions might be answered with the visualizations

  • what conclusions you can draw

  • Use group_by() and summarise() to compute descriptive stats and/or visualizations for any relevant groupings. For example, if you were interested in how average income varies by state, you might compute mean income for all states combined, and then compare this to the range and distribution of mean income for each individual state in the US.

  • Identify limitations of your visualization, such as:

  • What questions are left unanswered with your visualizations

  • What about the visualizations may be unclear to a naive viewer

  • How could you improve the visualizations for the final project

Reading in the Data

Code
# #RStudio Desktop
carprices <- read_csv("car_prices.csv", col_types = cols(mmr = col_number(), year = col_character(), sellingprice = col_number(), ...17 = col_skip()))
ca_catalog <- read_csv("Catalog.csv", col_types = cols(Brand = col_skip(), ModelName = col_skip()))  #, Trim = col_skip()))
exchange_rate <- read_excel("HistoricalPrices.xlsx", col_types = c("numeric", "text", "skip", "skip", "skip", "numeric"))
# # #RStudio Cloud
#  carprices <-read_csv("/cloud/project/Final_project/car_prices.csv",
#      col_types = cols(mmr = col_number(), year = col_character(), sellingprice = col_number(), ...17 = col_skip()))
#   ca_catalog <- read_csv("/cloud/project/Final_project/Catalog.csv", col_types = cols(Brand = col_skip(), ModelName = col_skip()))  #, Trim = col_skip()))
#  exchange_rate <- read_excel("/cloud/project/Final_project/HistoricalPrices.xlsx",  col_types = c("numeric", "text", "skip", "skip", "skip", "numeric"))

Brief Description of the data

This data set contains historical used vehicle auction prices that took place between 1982 and 2015.

First, lets take a look at the variables in the data set.

Code
summary(carprices)
     year               make              model               trim          
 Length:558837      Length:558837      Length:558837      Length:558837     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
     body           transmission           vin               state          
 Length:558837      Length:558837      Length:558837      Length:558837     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   condition        odometer         color             interior        
 Min.   :1.000   Min.   :     1   Length:558837      Length:558837     
 1st Qu.:2.700   1st Qu.: 28371   Class :character   Class :character  
 Median :3.600   Median : 52254   Mode  :character   Mode  :character  
 Mean   :3.425   Mean   : 68320                                        
 3rd Qu.:4.200   3rd Qu.: 99109                                        
 Max.   :5.000   Max.   :999999                                        
 NA's   :11820   NA's   :94                                            
    seller               mmr          sellingprice      saledate        
 Length:558837      Min.   :    25   Min.   :     1   Length:558837     
 Class :character   1st Qu.:  7100   1st Qu.:  6900   Class :character  
 Mode  :character   Median : 12250   Median : 12100   Mode  :character  
                    Mean   : 13769   Mean   : 13611                     
                    3rd Qu.: 18300   3rd Qu.: 18200                     
                    Max.   :182000   Max.   :230000                     
                    NA's   :26                                          

## Cleaning the Data

I’ve noticed a few records belonging to Canadian provinces. “ab” is Alberta, “on” is Ontario, “qc” is Quebec & “ns” represents Nova Scotia.

Code
#Observing the amount of NAs in our most important variables
# carprices %>% 
#   filter(is.na(model), is.na(make)) %>% 
#   dim()



#Justify dropping NAs... Why did i exclude it

carprices <- carprices %>% 
  filter(!is.na(make), !is.na(model), !is.na(odometer), !is.na(mmr)) 
  #dfSummary(carprices)

#Observing all records for model year 2007 and newer
carprices<- carprices %>% 
  filter(`year`> 2006, sellingprice>100, condition>2.41)
  dim(carprices)
[1] 388155     16
Code
#Checking the consistency in the names of states and body types
  carprices %>% 
  select(state) %>% 
  distinct()
# A tibble: 38 × 1
   state
   <chr>
 1 ca   
 2 tx   
 3 pa   
 4 mn   
 5 az   
 6 tn   
 7 md   
 8 ne   
 9 oh   
10 fl   
# … with 28 more rows
Code
carprices$state <- toupper(carprices$state)



carprices %>% 
  select(body) %>% 
  distinct() %>%print()
# A tibble: 83 × 1
   body         
   <chr>        
 1 SUV          
 2 Sedan        
 3 Convertible  
 4 Wagon        
 5 Coupe        
 6 Crew Cab     
 7 G Coupe      
 8 G Sedan      
 9 Hatchback    
10 Elantra Coupe
# … with 73 more rows
Code
carprices$body <- toupper(carprices$body)

I notice that there are 76 distinct values for body. I’ll explore and condense them into fewer categories in a new column called “Body_type” below

Code
#Checking the consistency in the names of Body types}
carprices %>% 
select(body) %>% 
distinct()
# A tibble: 43 × 1
   body         
   <chr>        
 1 SUV          
 2 SEDAN        
 3 CONVERTIBLE  
 4 WAGON        
 5 COUPE        
 6 CREW CAB     
 7 G COUPE      
 8 G SEDAN      
 9 HATCHBACK    
10 ELANTRA COUPE
# … with 33 more rows
Code
#Creating a Body Type column
carprices <- mutate(carprices, Body_Type = case_when(str_detect(body, "EDAN")~ "Sedan", str_detect(body, "SUV") ~ "SUV", str_detect(body, "AB") ~ "Pickup", str_detect(body, "OUP")~ "Coupe", str_detect(`body`, "VAN")~ "Minivan", str_detect(`body`,"ONV")~ "Convertible", str_detect(`body`,"AGO") ~ "Wagon", str_detect(`body`,"ATCH")~ "Hatchback", str_detect(`body`,"CREW")~ "Pickup" ))
# print(head(carprices))


count(carprices)
# A tibble: 1 × 1
       n
   <int>
1 388155

Next step is to link multiple files to get the MSRP from a historical catalog. We’ll need this in order to get AV/MSRP ratios.

Code
#Creating a Key in the CDN catalog
ca_catalog <-  ca_catalog %>% 
mutate(YearOfLaunch= `Model Year`-1)




#Creating Catalog with MSRP in CAD
ca_catalog <- ca_catalog %>% 
group_by(`Combined`, `YearOfLaunch`, `Segment` ) %>% 
summarise( Trim_Count=n(), AVG_MSRP_CAD= mean(`MSRP`, na.rm= TRUE), MSRP_Range_CAD= (max(`MSRP`)- min(`MSRP`)), StdDev_MSRP_CAD= sd(`MSRP`, na.rm=FALSE), Error.MSRP_CAD= std.error(`MSRP`, na.rm=FALSE))%>% 
ungroup()
Code
#Joining the data to convert CAD MSRP to USD MSRP later
ca_catalog <- left_join(ca_catalog, exchange_rate, c("YearOfLaunch"= "Year"))


#Creating US$ Catalog and Converting to MSRP from CAD to USD
us_catalog <- ca_catalog %>% 
  mutate(AVG_MSRP= round((`AVG_MSRP_CAD` / `Close`), digits = 0), MSRP_Range= `MSRP_Range_CAD` / `Close`, StdDev_MSRP= `StdDev_MSRP_CAD`/ `Close`, Error.MSRP= `Error.MSRP_CAD`/ `Close` )

#Finalizing US Catalog for Joining
us_catalog <- us_catalog %>% 
select(Combined, Segment, AVG_MSRP, MSRP_Range, Trim_Count, StdDev_MSRP, Error.MSRP) %>% 
  mutate(`Error %`= round((`Error.MSRP`/ `AVG_MSRP` * 100), digits = 1), `%MSRP_Range` = round((`MSRP_Range`/ `AVG_MSRP` *100), digits=1))

#Creating a common key to join carprices and us_catalog
carprices<- carprices %>% 
  mutate(year2=year, make2=make, model2=model)
carprices <- carprices %>%  unite(`YMM`, `year2`, `make2`, `model2`, sep=" ")
Code
#Joining Data with us_catalog
# carprices <- head(carprices) 
carprices <- left_join(carprices, us_catalog, c("YMM"= "Combined" )) 



#saving our raw records beofre filtering
raw_carprices <- carprices


#Filtering out the keys that were not mapped (Run these false keys through a vin decoder to get exact matches)
false_keys <- carprices %>% 
  filter(is.na(AVG_MSRP))



#Removing NAs in MSRP from main dataset

carprices <- carprices %>% 
  filter(!is.na(AVG_MSRP)) 

#replace_na(carprices$StdDev_MSRP, 0)
print(head(carprices))
# A tibble: 6 × 26
  year  make  model    trim      body  trans…¹ vin   state condi…² odome…³ color
  <chr> <chr> <chr>    <chr>     <chr> <chr>   <chr> <chr>   <dbl>   <dbl> <chr>
1 2015  Kia   Sorento  LX        SUV   automa… 5xyk… CA        5     16639 white
2 2015  Kia   Sorento  LX        SUV   automa… 5xyk… CA        5      9393 white
3 2014  BMW   3 Series 328i SUL… SEDAN automa… wba3… CA        4.5    1331 gray 
4 2015  Volvo S60      T5        SEDAN automa… yv16… CA        4.1   14282 white
5 2014  BMW   M5       Base      SEDAN automa… wbsf… CA        3.4   14943 black
6 2014  Audi  A4       2.0T Pre… SEDAN automa… wauf… CA        4.2    9557 white
# … with 15 more variables: interior <chr>, seller <chr>, mmr <dbl>,
#   sellingprice <dbl>, saledate <chr>, Body_Type <chr>, YMM <chr>,
#   Segment <chr>, AVG_MSRP <dbl>, MSRP_Range <dbl>, Trim_Count <int>,
#   StdDev_MSRP <dbl>, Error.MSRP <dbl>, `Error %` <dbl>, `%MSRP_Range` <dbl>,
#   and abbreviated variable names ¹​transmission, ²​condition, ³​odometer
Code
# To clean Year, make, Model fields in false_keys, Run Below 

VinIn <-false_keys %>% select(vin) 
VinIn <- VinIn$vin

decode <- function(x) {
 VIN<- decode_vin(x)}

decoded_keys <- map_dfr(VinIn,  decode)

#Saving the above output as a CSV
write.csv(decoded_keys, "C:\\Users\\Owen-Matthew\\Documents\\Github\\601_Fall_2022\\posts\\decoded_keys.csv", row.names=FALSE)
Code
false_keys %>% 
  select(vin) %>% 
  distinct() %>% 
  dim()
[1] 46123     1
Code
# #R Studio Desktop Read In
# #Cleaning False keys
action_keys <- read_csv("decoded_keys.csv")


# #R Studio Cloud Read In
# #Cleaning False keys
# action_keys <- read_csv("/cloud/project/Final_project/decoded_keys.csv")


a_keys <- action_keys

a_keys <-  a_keys %>% 
  filter(!is.na(year)) %>% 
filter(!is.na(make)) %>% 
filter(!is.na(model)) %>% 
 filter(!is.na(VIN)) 

#Join with Catalog
a_keys <-  a_keys %>% 
  select(model_year, make,model, VIN) %>% 
  mutate(year= model_year, make2=make, model2=model) %>% 
  unite (YMM, model_year, make2, model2, sep = " ")
a_keys <- inner_join(a_keys, us_catalog, c("YMM"= "Combined" )) 

#Checking to see if a vehicle was auctioned multiple times, to see if we need to include sale date as a matching key
unique_check <- a_keys %>% 
  select(`VIN`) %>% 
  distinct() %>% 
  dim() 
Code
#Selecting useful columns in false_keys
false_keys <- false_keys %>%
select(trim, body, Body_Type, transmission, vin, state , condition, odometer, color, interior,seller, mmr, sellingprice, saledate)

#Join with false keys to get sale records
cleaned_keys= inner_join(a_keys, false_keys, c("VIN" = "vin")) %>% 
  select(1:26) %>% 
 rename("vin" = "VIN")


#get column names & order from carprices
check_cols <-  colnames(carprices) %>% as.factor()

#checking to see if the column order is the same
cleaned_keys <- cleaned_keys[, check_cols]
print(head(cleaned_keys))
# A tibble: 6 × 26
   year make     model trim      body  trans…¹ vin   state condi…² odome…³ color
  <dbl> <chr>    <chr> <chr>     <chr> <chr>   <chr> <chr>   <dbl>   <dbl> <chr>
1  2013 INFINITI G37   G37 Sport G CO… automa… jn1c… CA        4.5   14716 white
2  2013 INFINITI G37   G37 Jour… G SE… automa… jn1c… CA        4.6      68 black
3  2013 INFINITI G37   G37 Jour… G SE… automa… jn1c… CA        4.5   24112 white
4  2013 INFINITI G37   G37 Sport G CO… automa… jn1c… CA        4.7    4859 black
5  2013 INFINITI G37   G37 Sport G CO… automa… jn1c… CA        4.5   21126 black
6  2013 INFINITI G37   G37 Jour… G SE… automa… jn1c… CA        4.2   26675 blue 
# … with 15 more variables: interior <chr>, seller <chr>, mmr <dbl>,
#   sellingprice <dbl>, saledate <chr>, Body_Type <chr>, YMM <chr>,
#   Segment <chr>, AVG_MSRP <dbl>, MSRP_Range <dbl>, Trim_Count <int>,
#   StdDev_MSRP <dbl>, Error.MSRP <dbl>, `Error %` <dbl>, `%MSRP_Range` <dbl>,
#   and abbreviated variable names ¹​transmission, ²​condition, ³​odometer
Code
print(head(carprices))
# A tibble: 6 × 26
  year  make  model    trim      body  trans…¹ vin   state condi…² odome…³ color
  <chr> <chr> <chr>    <chr>     <chr> <chr>   <chr> <chr>   <dbl>   <dbl> <chr>
1 2015  Kia   Sorento  LX        SUV   automa… 5xyk… CA        5     16639 white
2 2015  Kia   Sorento  LX        SUV   automa… 5xyk… CA        5      9393 white
3 2014  BMW   3 Series 328i SUL… SEDAN automa… wba3… CA        4.5    1331 gray 
4 2015  Volvo S60      T5        SEDAN automa… yv16… CA        4.1   14282 white
5 2014  BMW   M5       Base      SEDAN automa… wbsf… CA        3.4   14943 black
6 2014  Audi  A4       2.0T Pre… SEDAN automa… wauf… CA        4.2    9557 white
# … with 15 more variables: interior <chr>, seller <chr>, mmr <dbl>,
#   sellingprice <dbl>, saledate <chr>, Body_Type <chr>, YMM <chr>,
#   Segment <chr>, AVG_MSRP <dbl>, MSRP_Range <dbl>, Trim_Count <int>,
#   StdDev_MSRP <dbl>, Error.MSRP <dbl>, `Error %` <dbl>, `%MSRP_Range` <dbl>,
#   and abbreviated variable names ¹​transmission, ²​condition, ³​odometer
Code
#We managed to clean 30% of the previous NA records; 13632 of 46608 records
Code
#reorganize cleaned keys to rbind with carprices data
carprices <-  rbind(carprices, cleaned_keys)
dim(carprices)
[1] 367892     26
Code
#Checking for NA values in AVG_MSRP and YMM
# carprices %>% 
#   filter(is.na(AVG_MSRP)) %>% 
#   print()
# carprices %>% 
#   filter(is.na(YMM)) %>%
#   print()

#Removing na records in Error%
carprices$`Error %` <- carprices$`Error %` %>% replace_na(0)

carprices <- carprices %>% filter(`Error %`< 7)

#Removing 677 NA records for Bodytype
na.body <- carprices %>%  filter(is.na(Body_Type))

carprices <- carprices %>%  filter(!is.na(Body_Type))

•

Code
# Displaying the Canadian records
suppressWarnings({
  
  carprices %>%
  filter(state== c("AB", "ON", "QC", "NS")) 

# Excluding Canadian records
carprices<-carprices %>%
filter(state!= c("AB", "ON", "QC", "NS"))

})

I’ll create a new variable called Launch_date. Since most vehicles of model year x are released in the fall of the the previous year, I’ll follow that logic and assume that for all makes.

Code
#Creating Launch Date Step 1
carprices$year <- as.numeric(as.character(carprices$year))

carprices<- carprices %>% 
  mutate(launch_month= "9", launch_day= "1", launch_year= (`year`-1))

#Creating Launch Date Step 2
carprices <- carprices %>% 
mutate(launch_date= make_date(year= `launch_year`, month = `launch_month`, day=`launch_day`))


#Converting saledate to a date variable
carprices <- carprices %>% 
  separate(saledate, into= c("ddd", "sale_month", "sale_day", "sale_year", "sale_time", "GMT", "TMZ"), sep=" ")   

#Recoding the format of the sale month
carprices <- mutate(carprices, sale_month2= recode(sale_month, `Jan` ="1", `Feb`= "2", `Mar`="3", `Apr`="4", `May`="5", `Jun`="6", `Jul`="7", `Aug`="8", `Sep`="9", `Oct`="10", `Nov`="11", `Dec`="12" ))

#Creating Date_Sold
#carprices <- mutate(carprices, sale_year2=sale_year)
carprices <- mutate(carprices, sale_year2=sale_year, Date_Sold = make_date(year=sale_year2, month = sale_month2, day = sale_day))

# Tidyiing up: selecting important columns
carprices <- carprices %>% 
select(year, make, model, trim, transmission, Body_Type, state, condition, odometer, color, mmr, sellingprice, sale_year, launch_date, Date_Sold, AVG_MSRP, Trim_Count, `Error %`, `%MSRP_Range`, Segment ) 
#print(head(carprices))

Calculating the age of the vehicle at the time of sale:

Code
#Age of vehicle
carprices <- mutate(carprices, Age_months= round(as.numeric(difftime(Date_Sold, launch_date, units= "days"))/30))

#Checking for negatives in Vehicle Age
carprices %>% 
  select(Age_months) %>% 
  filter(Age_months<0)# %>% 
# A tibble: 5 × 1
  Age_months
       <dbl>
1         -8
2         -8
3         -8
4         -8
5         -8
Code
  #print.difftime()

#Recoding Vehicle Age
carprices<-carprices %>% 
  mutate(Age_months=recode(Age_months, `-8`=0))

#Creating Age in years
carprices<-carprices %>% 
  mutate(Age_years=round(Age_months/12, 0))
print(head(carprices))
# A tibble: 6 × 22
   year make  model    trim    trans…¹ Body_…² state condi…³ odome…⁴ color   mmr
  <dbl> <chr> <chr>    <chr>   <chr>   <chr>   <chr>   <dbl>   <dbl> <chr> <dbl>
1  2015 Kia   Sorento  LX      automa… SUV     CA        5     16639 white 20500
2  2015 Kia   Sorento  LX      automa… SUV     CA        5      9393 white 20800
3  2014 BMW   3 Series 328i S… automa… Sedan   CA        4.5    1331 gray  31900
4  2015 Volvo S60      T5      automa… Sedan   CA        4.1   14282 white 27500
5  2014 BMW   M5       Base    automa… Sedan   CA        3.4   14943 black 69000
6  2014 Audi  A4       2.0T P… automa… Sedan   CA        4.2    9557 white 32100
# … with 11 more variables: sellingprice <dbl>, sale_year <chr>,
#   launch_date <date>, Date_Sold <date>, AVG_MSRP <dbl>, Trim_Count <int>,
#   `Error %` <dbl>, `%MSRP_Range` <dbl>, Segment <chr>, Age_months <dbl>,
#   Age_years <dbl>, and abbreviated variable names ¹​transmission, ²​Body_Type,
#   ³​condition, ⁴​odometer
Code
#Retained value
carprices <- carprices %>% 
mutate(RV_percent= round(`sellingprice` / `AVG_MSRP` *100, digits = 1))

carprices <-  carprices %>% filter(RV_percent < 125)

print(head(carprices))
# A tibble: 6 × 23
   year make  model    trim    trans…¹ Body_…² state condi…³ odome…⁴ color   mmr
  <dbl> <chr> <chr>    <chr>   <chr>   <chr>   <chr>   <dbl>   <dbl> <chr> <dbl>
1  2015 Kia   Sorento  LX      automa… SUV     CA        5     16639 white 20500
2  2015 Kia   Sorento  LX      automa… SUV     CA        5      9393 white 20800
3  2014 BMW   3 Series 328i S… automa… Sedan   CA        4.5    1331 gray  31900
4  2015 Volvo S60      T5      automa… Sedan   CA        4.1   14282 white 27500
5  2014 BMW   M5       Base    automa… Sedan   CA        3.4   14943 black 69000
6  2014 Audi  A4       2.0T P… automa… Sedan   CA        4.2    9557 white 32100
# … with 12 more variables: sellingprice <dbl>, sale_year <chr>,
#   launch_date <date>, Date_Sold <date>, AVG_MSRP <dbl>, Trim_Count <int>,
#   `Error %` <dbl>, `%MSRP_Range` <dbl>, Segment <chr>, Age_months <dbl>,
#   Age_years <dbl>, RV_percent <dbl>, and abbreviated variable names
#   ¹​transmission, ²​Body_Type, ³​condition, ⁴​odometer
Code
#How does the distriburion of RV% look?
summary(carprices$RV_percent)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.90   37.00   47.20   47.98   58.40  124.70 

Narrative about the data set..

This data set contains historical used vehicle auction prices that took place between 1982 and 2015. The data was collected from all Manheim auction houses across the 34 states in the USA. Each observation details a wholesale auction transaction that includes: the vehicle information, state, selling price and the market average price i.e the Manheim Market Report price (MMR) and other specifics about the sale.

Potential Research Questions

Code
#Establishing vehicle groups
#Error in Nissan records
mainstream <- c("Toyota", "Honda", "Hyundai", "Nissan", "Hyundai", "Kia", "Mazda", "Subaru", "Volkswagen")
luxury <- c("BMW", "Audi", "Mercedes Benz", "Porsche", "Lexus", "INFINITI", "Acura", "Genesis", "Volvo", "Cadillac", "Lincoln")
passenger <-  c("Sedan", "SUV", "Coupe", "Convertible", "Wagon", "Hatchback")
Japanese4 <-  c("Toyota", "Honda", "Nissan", "Mazda")
Korean <- c("Kia", "Hyundai", "Genesis")
German <- c("Audi", "BMW", "Mercedes Benz")
  1. Is there a correlation between mileage, age and condition?
Code
plot2 <- carprices %>% 
  filter(make== c(`mainstream`, `luxury`), Body_Type== c(passenger) )
Warning in make == c(mainstream, luxury): longer object length is not a multiple
of shorter object length
Warning in Body_Type == c(passenger): longer object length is not a multiple of
shorter object length
Code
ggplot(data = plot2) + aes(x =condition , y = (odometer), color= Age_years) +
 geom_point(position = "jitter")+
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

  1. How does mileage affect vehicle sale price/ RV?
Code
plot2 <- carprices %>% 
  filter(make== c(`mainstream`, `luxury`), Body_Type== c(passenger), RV_percent >15) %>% 
  arrange(desc(RV_percent)) 
Warning in make == c(mainstream, luxury): longer object length is not a multiple
of shorter object length
Warning in Body_Type == c(passenger): longer object length is not a multiple of
shorter object length
Code
ggplot(data = plot2) + aes(y = RV_percent, x = (odometer)) +
 geom_point(position = "jitter")+
geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Code
#Control for Age & condition
  1. What is there a correlation between condition and Retained value?
Code
plot2 <- carprices %>% 
  filter(make== c(`mainstream`, `luxury`), Body_Type== c(passenger) ) %>% 
group_by(Age_years)
Warning in make == c(mainstream, luxury): longer object length is not a multiple
of shorter object length
Warning in Body_Type == c(passenger): longer object length is not a multiple of
shorter object length
Code
ggplot(data = plot2) + aes(x =condition, y = (RV_percent)) +
 geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

  1. How does the transaction volume vary by age across volume states
Code
#Analyzing the number of transactions by state by the age 


vol.states<- carprices %>% 
  filter(Age_years < 7, Age_years >0) %>% 
  group_by(Age_years, state) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  slice(1:5)


vol.states %>% 
  ggplot()+ aes(x=reorder(state, -n), y= n) +
  geom_bar(stat = "identity", fill= "blue") +
labs(x= "State", y="Number of Transactions") +
facet_wrap(vars(Age_years))+
ggtitle("Number of Transactions by State faceted by Vehicle Age")

From the above, we see that most vehicles are auctioned between ages 2 and 4 years. This corresponds to the popular lease terms of 24 to 48 months.

  1. Which segment retains the best value?
Code
by_segment <- carprices %>% 
  filter(Age_years < 8, Age_years>0, !is.na(Segment), Body_Type== passenger, make==c(mainstream, luxury)) %>% 
  group_by(Age_years, Segment, sale_year, Age_months) %>% 

  summarise(Segment_RV= round(mean(RV_percent), digits=1)) %>% 
  arrange(desc(Segment_RV)) %>%
  arrange((Age_years)) %>% 
  arrange((sale_year)) %>% 
 slice(1:5) 

  print(head(by_segment))
# A tibble: 6 × 5
# Groups:   Age_years, Segment, sale_year [2]
  Age_years Segment     sale_year Age_months Segment_RV
      <dbl> <chr>       <chr>          <dbl>      <dbl>
1         1 Compact Car 2014              16       70.3
2         1 Compact Car 2015               8       94.3
3         1 Compact Car 2015              10       80.5
4         1 Compact Car 2015               9       77.4
5         1 Compact Car 2015              17       65.2
6         1 Compact Car 2015              16       63.3
Code
ggplot(data = by_segment) + 
  geom_smooth(mapping = aes(x = Age_months, y = Segment_RV, color = `Segment`), se= F) #+

Code
  # facet_wrap(~sale_year)
Code
#Finding the Segments with the highest RV
best_value_segments <- carprices %>%
  filter(Age_years> 1, Age_years <6, Body_Type== passenger, `Error %` < 5, !is.na(Segment)) %>% 
  group_by(Segment,Age_years,  RV_percent, odometer, condition, model) %>% 
   tally() %>% 
  summarise(RV_percentage = mean(RV_percent), odometer=mean(odometer),condition=mean(condition), .groups = "keep") %>% 
  tally() %>% 
  
  summarise(Segment=Segment, Age_years= Age_years, RV_percent= mean(RV_percent), odometer=mean(odometer),condition=mean(condition),  n = sum(n), .groups = "keep") %>%
  ungroup() %>% 
   group_by(Segment, Age_years) %>% 
  summarise(Segment=Segment, Age_years= Age_years, RV_percent= mean(RV_percent),Mileage=mean(odometer), Condition=mean(condition),  Sample_Size = sum(n), .groups = "keep") %>%
  slice(1:1) %>% 
  arrange(desc(RV_percent)) %>% 
  arrange((Age_years))
Warning in Body_Type == passenger: longer object length is not a multiple of
shorter object length
Code
print(best_value_segments)
# A tibble: 36 × 6
# Groups:   Segment, Age_years [36]
   Segment                  Age_years RV_percent Mileage Condition Sample_Size
   <chr>                        <dbl>      <dbl>   <dbl>     <dbl>       <int>
 1 Mid-Size SUV/Crossover           2       63.2  31575.      4.16        1658
 2 Subcompact SUV/Crossover         2       62.0  25790.      4.18         381
 3 Compact SUV/Crossover            2       61.2  31791.      4.18        1892
 4 Full-Size SUV/Crossover          2       61.1  35426.      4.13         415
 5 Subcompact Car                   2       59.9  27101.      3.97         992
 6 Compact Car                      2       57.9  28675.      3.94        3189
 7 Sports Car                       2       52.1  27601.      4.00         651
 8 Mid-Size Car                     2       50.5  29878.      3.99        4368
 9 Full-Size Car                    2       47.8  33692.      3.95         868
10 Mid-Size SUV/Crossover           3       54.1  40314.      4.00         919
# … with 26 more rows
Code
#Removing a few missing fields in Segments that were missing in Catalog
BVS <- best_value_segments %>% 
  filter(!is.na(Segment)) 
  
# ggplot(data = BVS) + 
#   geom_point(mapping = aes(x = Age_years, y = RV_percent, color = `Segment`)) +
# geom_smooth(mapping = aes(x = Age_years, y = RV_percent, color = `Segment`), se= F)
 
BVS %>% 
ggplot() + aes(x = Age_years, y =(RV_percent)) +
  geom_bar(stat= "identity", fill="darkred") +
  facet_wrap(~Segment)

Code
#Volume models

vol.models<- carprices %>% 
filter( Age_years >0) %>% 
  group_by(model) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  ungroup() %>% 
  slice(1:10)

vol.models.passenger <- carprices %>% 
filter(Age_years >0, Body_Type== passenger) %>% 
  group_by(model) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  ungroup() %>% 
  slice(1:10)
Warning in Body_Type == passenger: longer object length is not a multiple of
shorter object length
Code
popular.models <-  vol.models.passenger$model


vol.models %>% 
  ggplot()+ aes(x=reorder(model, -n), y= n) +
  geom_bar(stat = "identity", fill= "blue") +
labs(x= "Model", y="Number of Wholesale Transactions")+
  ggtitle("Top Selling Used Vehicles")

Code
vol.models.passenger %>% 
  ggplot()+ aes(x=reorder(model, -n), y= n) +
  geom_bar(stat = "identity", fill= "blue") +
labs(x= "Model", y="Number of Transactions for Top  Selling Models")

Do the most popular vehicles retain the best value or is it the opposite? Let’s explore below:

Code
#Finding the models with the highest RV
best_value_models <- carprices %>%
  filter(Age_years> 1, Age_years <6, Body_Type== passenger, `Error %` < 5) %>% 
  group_by(make, model,Age_years,  RV_percent, odometer, condition, Segment) %>% 
   tally() %>% 
  summarise(RV_percentage = mean(RV_percent), odometer=mean(odometer), condition=mean(condition), .groups = "keep") %>% 
  tally() %>% 
  
  summarise(make=make, model=model, Age_years= Age_years, RV_percent= mean(RV_percent), odometer=mean(odometer),condition=mean(condition),  n = sum(n), .groups = "keep") %>%
  ungroup() %>% 
   group_by(make, model, Age_years) %>% 
  summarise(make=make, model=model, Age_years= Age_years, RV_percent= mean(RV_percent), mileage=mean(odometer), condition=mean(condition),  n = sum(n), .groups = "keep") %>%
  slice(1:1) %>% 
  arrange(desc(RV_percent)) %>% 
  
  #Filtering a sample size 20 or more
  filter(n>19) %>%
arrange((Age_years))
Warning in Body_Type == passenger: longer object length is not a multiple of
shorter object length
Code
     print(best_value_models)
# A tibble: 366 × 7
# Groups:   make, model, Age_years [366]
   make   model        Age_years RV_percent mileage condition     n
   <chr>  <chr>            <dbl>      <dbl>   <dbl>     <dbl> <int>
 1 Jeep   Wrangler             2       86.0  22507.      4.55   119
 2 Toyota 4Runner              2       85.6  26949.      4.47    30
 3 Subaru XV Crosstrek         2       83.0  15515       4.68    21
 4 Jeep   Cherokee             2       77.5  18633.      4.29    43
 5 Toyota RAV4                 2       72.9  27580.      4.19    72
 6 Toyota Highlander           2       72.0  28409.      4.10    33
 7 Nissan Versa Note           2       71.9  17331.      4.16    54
 8 Subaru Impreza              2       71.4  16454.      4.22    23
 9 BMW    X3                   2       69.9  20962.      4.20    22
10 Subaru Outback              2       69.3  28139.      4.29    69
# … with 356 more rows

How does the RV% compare among volume models?

Code
# vol.models %>% 
#   ggplot()+ aes(y= RV_percent , x=Age_months)

Popular makes

Code
#Distribution of makes

vol.brands<- carprices %>% 
  filter(Age_years < 7, Age_years >0) %>% 
  group_by(make) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  slice(1:10)

print(vol.brands)
# A tibble: 10 × 2
   make          n
   <chr>     <int>
 1 Ford      65459
 2 Nissan    35946
 3 Chevrolet 32329
 4 Toyota    23732
 5 Dodge     21475
 6 Hyundai   15149
 7 Honda     13749
 8 Kia       13547
 9 INFINITI  12451
10 BMW       11348
Code
vol.brands %>% 
  ggplot()+ aes(x=reorder(make, -n), y= n) +
  geom_bar(stat = "identity", fill= "blue") +
labs(x= "Brand", y="Number of Transactions") +
#facet_wrap(vars(Age_years))+
ggtitle("Popular Brands")

Code
#Japanese makes
plot <- carprices %>% 
  filter(Age_years<7, make== c("Toyota", "Honda", "Nissan", "Hyundai"), Body_Type== ("Sedan"))

ggplot(data = plot) + 
  geom_smooth(mapping = aes(x = Age_months, y = RV_percent, color = `make`))
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Best mainstream cars to buy post-2015?

Code
carprices %>% 
  filter(make== mainstream, Age_years == 2, `%MSRP_Range` <50, Body_Type== passenger) %>% 
  group_by(model) %>% 
  arrange(desc(RV_percent)) %>% 
  slice(1:10)
# A tibble: 198 × 23
# Groups:   model [25]
    year make    model  trim   trans…¹ Body_…² state condi…³ odome…⁴ color   mmr
   <dbl> <chr>   <chr>  <chr>  <chr>   <chr>   <chr>   <dbl>   <dbl> <chr> <dbl>
 1  2014 Hyundai Accent GS     automa… Hatchb… TN        4.6    6566 blue  11300
 2  2013 Hyundai Accent GS     automa… Hatchb… FL        4.8   14730 silv…  9875
 3  2013 Hyundai Accent GS     automa… Hatchb… FL        2.7   21101 black  9350
 4  2014 Hyundai Accent GS     <NA>    Hatchb… ON        2.8   40860 black 11100
 5  2014 Nissan  Altima 2.5 S  automa… Sedan   WA        4.1    2371 white 17050
 6  2014 Nissan  Altima 2.5 S  automa… Sedan   TN        4.1   22847 —     15600
 7  2014 Nissan  Altima 2.5 S  automa… Sedan   TN        4.7   27050 black 15250
 8  2013 Nissan  Altima 3.5 SL automa… Sedan   GA        3.7   19959 blue  16800
 9  2013 Nissan  Altima 3.5 SL automa… Sedan   MO        4.7   18735 white 19200
10  2014 Nissan  Altima 2.5 S  <NA>    Sedan   CA        2.9    2030 white 17250
# … with 188 more rows, 12 more variables: sellingprice <dbl>, sale_year <chr>,
#   launch_date <date>, Date_Sold <date>, AVG_MSRP <dbl>, Trim_Count <int>,
#   `Error %` <dbl>, `%MSRP_Range` <dbl>, Segment <chr>, Age_months <dbl>,
#   Age_years <dbl>, RV_percent <dbl>, and abbreviated variable names
#   ¹​transmission, ²​Body_Type, ³​condition, ⁴​odometer
Code
#Monthly mileage by state
plot7 <- carprices %>% 
  filter(make== c(`mainstream`, `luxury`), Body_Type== c(passenger) )

ggplot(data = plot7) + aes(x =state , y = ((odometer)/ Age_months)) +
 geom_histogram(stat = "identity") 

:::

Source Code
---
title: "Homework 3"
author: "Owen Tibby"
desription: "Homework 3"
date: "11/30/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - HW3
  - Used Vehicle Data
  - Owen Tibby
---

```{r}
#| label: setup
#| warning: false

library(tidyverse)
library(tidyr)
library(summarytools)
library(dplyr)
library(readr)
library(readxl)
library(stringr)
library(lubridate)
library(vindecodr)
library(plotrix)
library(sparklyr)
library(purrr)



knitr::opts_chunk$set(echo = TRUE)
```

## Instructions

The emphasis in this homework is on exploratory data analysis using both graphics and statistics. You should build on your prior homework - incorporating any feedback and adjusting the code and text as needed. These homeworks are intended to be cumulative. Therefore, while it is fine to switch datasets, you will need to include all of the information from HW1 for your new (or old) dataset in this hw submission as well.

-   Include descriptive statistics (e.g, mean, median, and standard deviation for numerical variables, and frequencies and/or mode for categorical variables

-   Include relevant visualizations using ggplot2 to complement these descriptive statistics. Be sure to use faceting, coloring, and titles as needed. Each visualization should be accompanied by descriptive text that highlights:

-   the variable(s) used

-   what questions might be answered with the visualizations

-   what conclusions you can draw

-   Use group_by() and summarise() to compute descriptive stats and/or visualizations for any relevant groupings. For example, if you were interested in how average income varies by state, you might compute mean income for all states combined, and then compare this to the range and distribution of mean income for each individual state in the US.

-   Identify limitations of your visualization, such as:

-   What questions are left unanswered with your visualizations

-   What about the visualizations may be unclear to a naive viewer

-   How could you improve the visualizations for the final project


## 

## Reading in the Data


```{r message=FALSE, warning=FALSE}


# #RStudio Desktop
carprices <- read_csv("car_prices.csv", col_types = cols(mmr = col_number(), year = col_character(), sellingprice = col_number(), ...17 = col_skip()))
ca_catalog <- read_csv("Catalog.csv", col_types = cols(Brand = col_skip(), ModelName = col_skip()))  #, Trim = col_skip()))
exchange_rate <- read_excel("HistoricalPrices.xlsx", col_types = c("numeric", "text", "skip", "skip", "skip", "numeric"))
# # #RStudio Cloud
#  carprices <-read_csv("/cloud/project/Final_project/car_prices.csv",
#      col_types = cols(mmr = col_number(), year = col_character(), sellingprice = col_number(), ...17 = col_skip()))
#   ca_catalog <- read_csv("/cloud/project/Final_project/Catalog.csv", col_types = cols(Brand = col_skip(), ModelName = col_skip()))  #, Trim = col_skip()))
#  exchange_rate <- read_excel("/cloud/project/Final_project/HistoricalPrices.xlsx",  col_types = c("numeric", "text", "skip", "skip", "skip", "numeric"))

```

## Brief Description of the data

This data set contains historical used vehicle auction prices that took place between 1982 and 2015.

First, lets take a look at the variables in the data set.

```{r message=FALSE, warning=FALSE}
summary(carprices)
```


 ## Cleaning the Data

I've noticed a few records belonging to Canadian provinces. "ab" is Alberta, "on" is Ontario, "qc" is Quebec & "ns" represents Nova Scotia.

```{r}
#Observing the amount of NAs in our most important variables
# carprices %>% 
#   filter(is.na(model), is.na(make)) %>% 
#   dim()



#Justify dropping NAs... Why did i exclude it

carprices <- carprices %>% 
  filter(!is.na(make), !is.na(model), !is.na(odometer), !is.na(mmr)) 
  #dfSummary(carprices)

#Observing all records for model year 2007 and newer
carprices<- carprices %>% 
  filter(`year`> 2006, sellingprice>100, condition>2.41)
  dim(carprices)
```

```{r}
#Checking the consistency in the names of states and body types
  carprices %>% 
  select(state) %>% 
  distinct()

carprices$state <- toupper(carprices$state)



carprices %>% 
  select(body) %>% 
  distinct() %>%print()

carprices$body <- toupper(carprices$body)


```

I notice that there are 76 distinct values for body. I'll explore and condense them into fewer categories in a new column called "Body_type" below

```{r}
#Checking the consistency in the names of Body types}
carprices %>% 
select(body) %>% 
distinct()



#Creating a Body Type column
carprices <- mutate(carprices, Body_Type = case_when(str_detect(body, "EDAN")~ "Sedan", str_detect(body, "SUV") ~ "SUV", str_detect(body, "AB") ~ "Pickup", str_detect(body, "OUP")~ "Coupe", str_detect(`body`, "VAN")~ "Minivan", str_detect(`body`,"ONV")~ "Convertible", str_detect(`body`,"AGO") ~ "Wagon", str_detect(`body`,"ATCH")~ "Hatchback", str_detect(`body`,"CREW")~ "Pickup" ))
# print(head(carprices))


count(carprices)
```

Next step is to link multiple files to get the MSRP from a historical catalog. We'll need this in order to get AV/MSRP ratios.

```{r message=FALSE, warning=FALSE}
#Creating a Key in the CDN catalog
ca_catalog <-  ca_catalog %>% 
mutate(YearOfLaunch= `Model Year`-1)




#Creating Catalog with MSRP in CAD
ca_catalog <- ca_catalog %>% 
group_by(`Combined`, `YearOfLaunch`, `Segment` ) %>% 
summarise( Trim_Count=n(), AVG_MSRP_CAD= mean(`MSRP`, na.rm= TRUE), MSRP_Range_CAD= (max(`MSRP`)- min(`MSRP`)), StdDev_MSRP_CAD= sd(`MSRP`, na.rm=FALSE), Error.MSRP_CAD= std.error(`MSRP`, na.rm=FALSE))%>% 
ungroup()
```

```{r}
#Joining the data to convert CAD MSRP to USD MSRP later
ca_catalog <- left_join(ca_catalog, exchange_rate, c("YearOfLaunch"= "Year"))


#Creating US$ Catalog and Converting to MSRP from CAD to USD
us_catalog <- ca_catalog %>% 
  mutate(AVG_MSRP= round((`AVG_MSRP_CAD` / `Close`), digits = 0), MSRP_Range= `MSRP_Range_CAD` / `Close`, StdDev_MSRP= `StdDev_MSRP_CAD`/ `Close`, Error.MSRP= `Error.MSRP_CAD`/ `Close` )

#Finalizing US Catalog for Joining
us_catalog <- us_catalog %>% 
select(Combined, Segment, AVG_MSRP, MSRP_Range, Trim_Count, StdDev_MSRP, Error.MSRP) %>% 
  mutate(`Error %`= round((`Error.MSRP`/ `AVG_MSRP` * 100), digits = 1), `%MSRP_Range` = round((`MSRP_Range`/ `AVG_MSRP` *100), digits=1))

#Creating a common key to join carprices and us_catalog
carprices<- carprices %>% 
  mutate(year2=year, make2=make, model2=model)
carprices <- carprices %>%  unite(`YMM`, `year2`, `make2`, `model2`, sep=" ")
```

```{r}
#Joining Data with us_catalog
# carprices <- head(carprices) 
carprices <- left_join(carprices, us_catalog, c("YMM"= "Combined" )) 



#saving our raw records beofre filtering
raw_carprices <- carprices


#Filtering out the keys that were not mapped (Run these false keys through a vin decoder to get exact matches)
false_keys <- carprices %>% 
  filter(is.na(AVG_MSRP))



#Removing NAs in MSRP from main dataset

carprices <- carprices %>% 
  filter(!is.na(AVG_MSRP)) 

#replace_na(carprices$StdDev_MSRP, 0)
print(head(carprices))
```


```{r eval=FALSE }
# To clean Year, make, Model fields in false_keys, Run Below 

VinIn <-false_keys %>% select(vin) 
VinIn <- VinIn$vin

decode <- function(x) {
 VIN<- decode_vin(x)}

decoded_keys <- map_dfr(VinIn,  decode)

#Saving the above output as a CSV
write.csv(decoded_keys, "C:\\Users\\Owen-Matthew\\Documents\\Github\\601_Fall_2022\\posts\\decoded_keys.csv", row.names=FALSE)



```

```{r message=FALSE, warning=FALSE}
false_keys %>% 
  select(vin) %>% 
  distinct() %>% 
  dim()
# #R Studio Desktop Read In
# #Cleaning False keys
action_keys <- read_csv("decoded_keys.csv")


# #R Studio Cloud Read In
# #Cleaning False keys
# action_keys <- read_csv("/cloud/project/Final_project/decoded_keys.csv")


a_keys <- action_keys

a_keys <-  a_keys %>% 
  filter(!is.na(year)) %>% 
filter(!is.na(make)) %>% 
filter(!is.na(model)) %>% 
 filter(!is.na(VIN)) 

#Join with Catalog
a_keys <-  a_keys %>% 
  select(model_year, make,model, VIN) %>% 
  mutate(year= model_year, make2=make, model2=model) %>% 
  unite (YMM, model_year, make2, model2, sep = " ")
a_keys <- inner_join(a_keys, us_catalog, c("YMM"= "Combined" )) 

#Checking to see if a vehicle was auctioned multiple times, to see if we need to include sale date as a matching key
unique_check <- a_keys %>% 
  select(`VIN`) %>% 
  distinct() %>% 
  dim() 
```

```{r}
#Selecting useful columns in false_keys
false_keys <- false_keys %>%
select(trim, body, Body_Type, transmission, vin, state , condition, odometer, color, interior,seller, mmr, sellingprice, saledate)

#Join with false keys to get sale records
cleaned_keys= inner_join(a_keys, false_keys, c("VIN" = "vin")) %>% 
  select(1:26) %>% 
 rename("vin" = "VIN")


#get column names & order from carprices
check_cols <-  colnames(carprices) %>% as.factor()

#checking to see if the column order is the same
cleaned_keys <- cleaned_keys[, check_cols]
print(head(cleaned_keys))
print(head(carprices))


#We managed to clean 30% of the previous NA records; 13632 of 46608 records
```

```{r}
#reorganize cleaned keys to rbind with carprices data
carprices <-  rbind(carprices, cleaned_keys)
dim(carprices)

#Checking for NA values in AVG_MSRP and YMM
# carprices %>% 
#   filter(is.na(AVG_MSRP)) %>% 
#   print()
# carprices %>% 
#   filter(is.na(YMM)) %>%
#   print()

#Removing na records in Error%
carprices$`Error %` <- carprices$`Error %` %>% replace_na(0)

carprices <- carprices %>% filter(`Error %`< 7)

#Removing 677 NA records for Bodytype
na.body <- carprices %>%  filter(is.na(Body_Type))

carprices <- carprices %>%  filter(!is.na(Body_Type))



```

•

```{r}
# Displaying the Canadian records
suppressWarnings({
  
  carprices %>%
  filter(state== c("AB", "ON", "QC", "NS")) 

# Excluding Canadian records
carprices<-carprices %>%
filter(state!= c("AB", "ON", "QC", "NS"))

})

```

I'll create a new variable called Launch_date. Since most vehicles of model year x are released in the fall of the the previous year, I'll follow that logic and assume that for all makes.

```{r}
#Creating Launch Date Step 1
carprices$year <- as.numeric(as.character(carprices$year))

carprices<- carprices %>% 
  mutate(launch_month= "9", launch_day= "1", launch_year= (`year`-1))

#Creating Launch Date Step 2
carprices <- carprices %>% 
mutate(launch_date= make_date(year= `launch_year`, month = `launch_month`, day=`launch_day`))


#Converting saledate to a date variable
carprices <- carprices %>% 
  separate(saledate, into= c("ddd", "sale_month", "sale_day", "sale_year", "sale_time", "GMT", "TMZ"), sep=" ")   

#Recoding the format of the sale month
carprices <- mutate(carprices, sale_month2= recode(sale_month, `Jan` ="1", `Feb`= "2", `Mar`="3", `Apr`="4", `May`="5", `Jun`="6", `Jul`="7", `Aug`="8", `Sep`="9", `Oct`="10", `Nov`="11", `Dec`="12" ))

#Creating Date_Sold
#carprices <- mutate(carprices, sale_year2=sale_year)
carprices <- mutate(carprices, sale_year2=sale_year, Date_Sold = make_date(year=sale_year2, month = sale_month2, day = sale_day))

# Tidyiing up: selecting important columns
carprices <- carprices %>% 
select(year, make, model, trim, transmission, Body_Type, state, condition, odometer, color, mmr, sellingprice, sale_year, launch_date, Date_Sold, AVG_MSRP, Trim_Count, `Error %`, `%MSRP_Range`, Segment ) 
#print(head(carprices))
```

Calculating the age of the vehicle at the time of sale:

```{r}
#Age of vehicle
carprices <- mutate(carprices, Age_months= round(as.numeric(difftime(Date_Sold, launch_date, units= "days"))/30))

#Checking for negatives in Vehicle Age
carprices %>% 
  select(Age_months) %>% 
  filter(Age_months<0)# %>% 
  #print.difftime()

#Recoding Vehicle Age
carprices<-carprices %>% 
  mutate(Age_months=recode(Age_months, `-8`=0))

#Creating Age in years
carprices<-carprices %>% 
  mutate(Age_years=round(Age_months/12, 0))
print(head(carprices))

```

```{r Calculating Retained Value% }
#Retained value
carprices <- carprices %>% 
mutate(RV_percent= round(`sellingprice` / `AVG_MSRP` *100, digits = 1))

carprices <-  carprices %>% filter(RV_percent < 125)

print(head(carprices))


#How does the distriburion of RV% look?
summary(carprices$RV_percent)



```


## Narrative about the data set..

This data set contains historical used vehicle auction prices that took place between 1982 and 2015. The data was collected from all Manheim auction houses across the 34 states in the USA. Each observation details a wholesale auction transaction that includes: the vehicle information, state, selling price and the market average price i.e the Manheim Market Report price (MMR) and other specifics about the sale.


## Potential Research Questions

```{r}
#Establishing vehicle groups
#Error in Nissan records
mainstream <- c("Toyota", "Honda", "Hyundai", "Nissan", "Hyundai", "Kia", "Mazda", "Subaru", "Volkswagen")
luxury <- c("BMW", "Audi", "Mercedes Benz", "Porsche", "Lexus", "INFINITI", "Acura", "Genesis", "Volvo", "Cadillac", "Lincoln")
passenger <-  c("Sedan", "SUV", "Coupe", "Convertible", "Wagon", "Hatchback")
Japanese4 <-  c("Toyota", "Honda", "Nissan", "Mazda")
Korean <- c("Kia", "Hyundai", "Genesis")
German <- c("Audi", "BMW", "Mercedes Benz")




```

1.  Is there a correlation between mileage, age and condition?

```{r}
plot2 <- carprices %>% 
  filter(make== c(`mainstream`, `luxury`), Body_Type== c(passenger) )

ggplot(data = plot2) + aes(x =condition , y = (odometer), color= Age_years) +
 geom_point(position = "jitter")+
  geom_smooth()
```

2.  How does mileage affect vehicle sale price/ RV?

```{r}
plot2 <- carprices %>% 
  filter(make== c(`mainstream`, `luxury`), Body_Type== c(passenger), RV_percent >15) %>% 
  arrange(desc(RV_percent)) 

ggplot(data = plot2) + aes(y = RV_percent, x = (odometer)) +
 geom_point(position = "jitter")+
geom_smooth()
 

#Control for Age & condition


```



3. What is there a correlation between condition and Retained value?

```{r}
plot2 <- carprices %>% 
  filter(make== c(`mainstream`, `luxury`), Body_Type== c(passenger) ) %>% 
group_by(Age_years)

ggplot(data = plot2) + aes(x =condition, y = (RV_percent)) +
 geom_smooth()
 
```



4.  How does the transaction volume vary by age across volume states

```{r}
#Analyzing the number of transactions by state by the age 


vol.states<- carprices %>% 
  filter(Age_years < 7, Age_years >0) %>% 
  group_by(Age_years, state) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  slice(1:5)


vol.states %>% 
  ggplot()+ aes(x=reorder(state, -n), y= n) +
  geom_bar(stat = "identity", fill= "blue") +
labs(x= "State", y="Number of Transactions") +
facet_wrap(vars(Age_years))+
ggtitle("Number of Transactions by State faceted by Vehicle Age")


```

From the above, we see that most vehicles are auctioned between ages 2 and 4 years. This corresponds to the popular lease terms of 24 to 48 months.

2.  Which segment retains the best value?

```{r message=FALSE, warning=FALSE}


by_segment <- carprices %>% 
  filter(Age_years < 8, Age_years>0, !is.na(Segment), Body_Type== passenger, make==c(mainstream, luxury)) %>% 
  group_by(Age_years, Segment, sale_year, Age_months) %>% 

  summarise(Segment_RV= round(mean(RV_percent), digits=1)) %>% 
  arrange(desc(Segment_RV)) %>%
  arrange((Age_years)) %>% 
  arrange((sale_year)) %>% 
 slice(1:5) 

  print(head(by_segment))

ggplot(data = by_segment) + 
  geom_smooth(mapping = aes(x = Age_months, y = Segment_RV, color = `Segment`), se= F) #+
  # facet_wrap(~sale_year)


```



```{r}
#Finding the Segments with the highest RV
best_value_segments <- carprices %>%
  filter(Age_years> 1, Age_years <6, Body_Type== passenger, `Error %` < 5, !is.na(Segment)) %>% 
  group_by(Segment,Age_years,  RV_percent, odometer, condition, model) %>% 
   tally() %>% 
  summarise(RV_percentage = mean(RV_percent), odometer=mean(odometer),condition=mean(condition), .groups = "keep") %>% 
  tally() %>% 
  
  summarise(Segment=Segment, Age_years= Age_years, RV_percent= mean(RV_percent), odometer=mean(odometer),condition=mean(condition),  n = sum(n), .groups = "keep") %>%
  ungroup() %>% 
   group_by(Segment, Age_years) %>% 
  summarise(Segment=Segment, Age_years= Age_years, RV_percent= mean(RV_percent),Mileage=mean(odometer), Condition=mean(condition),  Sample_Size = sum(n), .groups = "keep") %>%
  slice(1:1) %>% 
  arrange(desc(RV_percent)) %>% 
  arrange((Age_years))
     
print(best_value_segments)

#Removing a few missing fields in Segments that were missing in Catalog
BVS <- best_value_segments %>% 
  filter(!is.na(Segment)) 
  
# ggplot(data = BVS) + 
#   geom_point(mapping = aes(x = Age_years, y = RV_percent, color = `Segment`)) +
# geom_smooth(mapping = aes(x = Age_years, y = RV_percent, color = `Segment`), se= F)
 
BVS %>% 
ggplot() + aes(x = Age_years, y =(RV_percent)) +
  geom_bar(stat= "identity", fill="darkred") +
  facet_wrap(~Segment)

```



```{r}
#Volume models

vol.models<- carprices %>% 
filter( Age_years >0) %>% 
  group_by(model) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  ungroup() %>% 
  slice(1:10)

vol.models.passenger <- carprices %>% 
filter(Age_years >0, Body_Type== passenger) %>% 
  group_by(model) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  ungroup() %>% 
  slice(1:10)

popular.models <-  vol.models.passenger$model


vol.models %>% 
  ggplot()+ aes(x=reorder(model, -n), y= n) +
  geom_bar(stat = "identity", fill= "blue") +
labs(x= "Model", y="Number of Wholesale Transactions")+
  ggtitle("Top Selling Used Vehicles")


vol.models.passenger %>% 
  ggplot()+ aes(x=reorder(model, -n), y= n) +
  geom_bar(stat = "identity", fill= "blue") +
labs(x= "Model", y="Number of Transactions for Top  Selling Models")

```

Do the most popular vehicles retain the best value or is it the opposite? Let's explore below:



```{r}
#Finding the models with the highest RV
best_value_models <- carprices %>%
  filter(Age_years> 1, Age_years <6, Body_Type== passenger, `Error %` < 5) %>% 
  group_by(make, model,Age_years,  RV_percent, odometer, condition, Segment) %>% 
   tally() %>% 
  summarise(RV_percentage = mean(RV_percent), odometer=mean(odometer), condition=mean(condition), .groups = "keep") %>% 
  tally() %>% 
  
  summarise(make=make, model=model, Age_years= Age_years, RV_percent= mean(RV_percent), odometer=mean(odometer),condition=mean(condition),  n = sum(n), .groups = "keep") %>%
  ungroup() %>% 
   group_by(make, model, Age_years) %>% 
  summarise(make=make, model=model, Age_years= Age_years, RV_percent= mean(RV_percent), mileage=mean(odometer), condition=mean(condition),  n = sum(n), .groups = "keep") %>%
  slice(1:1) %>% 
  arrange(desc(RV_percent)) %>% 
  
  #Filtering a sample size 20 or more
  filter(n>19) %>%
arrange((Age_years))
     print(best_value_models)
  
```


How does the RV% compare among volume models?

```{r}
# vol.models %>% 
#   ggplot()+ aes(y= RV_percent , x=Age_months)

```



Popular makes

```{r}
#Distribution of makes

vol.brands<- carprices %>% 
  filter(Age_years < 7, Age_years >0) %>% 
  group_by(make) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  slice(1:10)

print(vol.brands)

vol.brands %>% 
  ggplot()+ aes(x=reorder(make, -n), y= n) +
  geom_bar(stat = "identity", fill= "blue") +
labs(x= "Brand", y="Number of Transactions") +
#facet_wrap(vars(Age_years))+
ggtitle("Popular Brands")
```

```{r warning=FALSE}
#Japanese makes
plot <- carprices %>% 
  filter(Age_years<7, make== c("Toyota", "Honda", "Nissan", "Hyundai"), Body_Type== ("Sedan"))

ggplot(data = plot) + 
  geom_smooth(mapping = aes(x = Age_months, y = RV_percent, color = `make`))

```


Best mainstream cars to buy post-2015?

```{r warning=FALSE}
carprices %>% 
  filter(make== mainstream, Age_years == 2, `%MSRP_Range` <50, Body_Type== passenger) %>% 
  group_by(model) %>% 
  arrange(desc(RV_percent)) %>% 
  slice(1:10)



  

  

```





 
```{r eval=FALSE}

#Monthly mileage by state
plot7 <- carprices %>% 
  filter(make== c(`mainstream`, `luxury`), Body_Type== c(passenger) )

ggplot(data = plot7) + aes(x =state , y = ((odometer)/ Age_months)) +
 geom_histogram(stat = "identity") 

```



:::

##