Data Analytics and Computational Social Science: project

Xiaotong

Introduction

As the economy has developed, car ownership has grown rapidly, and nearly 70% of profits in the automobile industry chain now come from circulation and after-sales service rather than production. Used cars are the most important link in the automobile circulation value chain, and China’s used car market has large growth potential. Rapid growth in used car circulation promotes circular consumption and the healthy development of the automobile industry. It also plays an important role in creating jobs, supporting stable economic growth, and improving living standards.

Against this background, this report analyzes used car data and explores how factors such as car type and displacement affect the price and value retention rate of used cars.

Data

Read data

The data comes from the China Used Car Association. The built-in summary() function gives a quick overview of the data. There are 15 initial variables in total. The columns are assigned short English names with names(), and the second column (brand) is then dropped.

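library(dplyr); library(ggplot2)  # needed for the summaries and plots below
# Choose the CSV interactively; "NA" and empty strings are read as missing
data <- read.csv(file = file.choose(), header = TRUE, na.strings = c("NA", ""))

names(data) <- c("Vehicle.brand","brand","location","style","type","emission","gear","model","time","kilometres","price","original.price","hedge.ratio","group1","group2")
data <- data[, -2]  # drop the second column ("brand")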
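The text mentions summary() for a first look at the data, but the call itself is not shown. A minimal sketch follows, using a small made-up CSV written to a temporary file. The two rows and three columns are placeholders for illustration, not the real data set.

```r
# Hypothetical stand-in for the real file: three columns, one empty cell
tmp <- tempfile(fileext = ".csv")
writeLines(c("brand,price,original.price",
             "A,59.5,132.4",
             "B,,151.7"), tmp)

# Same options as in the report: header row, "" and "NA" read as missing
demo <- read.csv(tmp, header = TRUE, na.strings = c("NA", ""))

str(demo)              # column types and first values
summary(demo)          # per-column summaries, including NA counts
colSums(is.na(demo))   # number of missing values per column
```

Checking the missing-value counts at this stage shows how many rows a later na.omit() call is likely to remove.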

Clean data

Because there are many car models, they are sorted by frequency in descending order. Only the ten most frequent brands are kept for analysis, and all other brands are recoded as “others”. Since the model can be determined from the brand and style of the vehicle, the redundant columns are deleted, and rows with missing values are removed as well.

# Recode every brand outside the ten most frequent as "others"
brand_10 <- sort(table(data$Vehicle.brand), decreasing = TRUE)
# The ten brands kept: BMW 5 Series, VW Magotan, Ford Focus, VW Sagitar,
# Audi A6L, VW Golf, BMW 3 Series, Mercedes-Benz C-Class, VW Tiguan, Ford Mondeo
top_brands <- c("宝马5系", "大众迈腾", "福特福克斯", "大众速腾", "奥迪A6L",
                "大众高尔夫", "宝马3系", "奔驰C级", "大众途观", "福特蒙迪欧")
b <- which(!is.na(data$Vehicle.brand) & !(data$Vehicle.brand %in% top_brands))
data[, 1] <- as.character(data[, 1])
data[b, 1] <- "others"
data[, 1] <- as.factor(data[, 1])
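The recoding above lists the ten brand names by hand. As a hedged alternative, the names could be read directly from the sorted frequency table. The sketch below uses a toy vector and a hypothetical `n_top` setting (2 here, 10 in the report); it is an illustration, not the author's code.

```r
# Toy brand vector; in the report this would be data$Vehicle.brand
brands <- c("A", "A", "A", "B", "B", "C", "D", NA)
n_top <- 2   # hypothetical parameter; the report keeps 10

# Frequency table sorted in descending order (NA is not counted by default)
counts <- sort(table(brands), decreasing = TRUE)
top <- names(counts)[seq_len(min(n_top, length(counts)))]

# Recode everything outside the top group, leaving missing values as NA
recoded <- ifelse(is.na(brands) | brands %in% top, brands, "others")
recoded <- factor(recoded)
table(recoded, useNA = "ifany")
```

Deriving the names from the table keeps the recoding correct if the data set is updated and the ranking changes.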

# The model can be determined from the brand and style of the vehicle,
# so drop the "type" and "model" columns
data <- data[, !(names(data) %in% c("type", "model"))]

data=na.omit(data)
head(data)

  Vehicle.brand location  style emission gear      time kilometres
1        others     外国 2007款       6L 自动 2012/12/1        1.1
2        others     外国 2010款       6L 无级  2011/5/1        5.8
3        others     外国 2010款       6L 无级  2011/5/1        5.8
4        others     外国 2011款     6.2L 自动 2014/11/1        2.9
5        others     外国 2011款     6.2L 自动 2014/11/1        2.9
6        others     外国 2011款     6.2L 自动  2013/7/1        5.0
  price original.price hedge.ratio   group1   group2
1  59.5          132.4        0.45 小于10万 高级轿车
2  60.0          151.7        0.40 小于10万 高级轿车
3  60.0          151.7        0.40 小于10万 高级轿车
4  48.0           71.4        0.67 小于10万 高级轿车
5  48.0           71.4        0.67 小于10万 高级轿车
6  41.5           71.4        0.58 小于10万 高级轿车
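The hedge.ratio column appears to be the used price divided by the original price. This is an inference from the rows printed above, not something the source states. The sketch below recomputes the ratio for rows 1, 2, 4 and 6 of that output and compares it with the listed value.

```r
# Values copied from the head(data) output above
price          <- c(59.5, 60.0, 48.0, 41.5)
original.price <- c(132.4, 151.7, 71.4, 71.4)
hedge.ratio    <- c(0.45, 0.40, 0.67, 0.58)

# Assumed definition: value retention rate = used price / original price
recomputed <- round(price / original.price, 2)
data.frame(price, original.price, hedge.ratio, recomputed)

# TRUE if every recomputed value matches the listed one
all(abs(recomputed - hedge.ratio) < 1e-8)
```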

Visualization

Price

First, the used cars are grouped by style (model year), and the count, price, and value retention rate of each group are tabulated.

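# Per-style count, price statistics, and value retention statistics.
# Note: na.rm = TRUE is not passed to mean() or sd() here; summarise() stores
# it as a constant column named na.rm, which is why it appears in the output.
car_style <- group_by(data, style) %>%
  summarise(n = n(), mean(price), median(price), sd(price),
            mean_hedge.ratio = mean(hedge.ratio), sd_hedge.ratio = sd(hedge.ratio),
            na.rm = TRUE) %>%
  mutate(freq = n / sum(n))
car_style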

# A tibble: 12 x 9
   style      n `mean(price)` `median(price)` `sd(price)`
   <chr>  <int>         <dbl>           <dbl>       <dbl>
 1 2005款     1         38.5             38.5       NA   
 2 2006款     8          9.45            10.8        4.39
 3 2007款    35         11.1              7.3       11.5 
 4 2008款   111          6.59             5.1        3.81
 5 2009款   197          7.79             6.5        5.05
 6 2010款   262         13.7             10.5       14.1 
 7 2011款   276         14.9             11         12.2 
 8 2012款   362         16.9             11.1       16.1 
 9 2013款   370         20.9             17.5       13.8 
10 2014款   253         27.6             18.5       22.1 
11 2015款   128         24.6             16.5       27.6 
12 2016款    22         25.7             19         19.4 
# ... with 4 more variables: mean_hedge.ratio <dbl>,
#   sd_hedge.ratio <dbl>, na.rm <lgl>, freq <dbl>
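In the summarise() call above, na.rm = TRUE ends up as a column of its own (the na.rm <lgl> entry in the printed output), and the price statistics keep their raw expressions as names. A hedged variant that passes na.rm to each function and names every column might look like the sketch below. The toy data frame and the name `car_style_named` are made up for illustration.

```r
library(dplyr)

# Toy data with the same column names as the report; values are made up
toy <- data.frame(
  style       = c("2010", "2010", "2011", "2011", "2011"),
  price       = c(10, 14, 12, NA, 18),
  hedge.ratio = c(0.40, 0.50, 0.55, 0.60, 0.65)
)

car_style_named <- toy %>%
  group_by(style) %>%
  summarise(
    n                = n(),
    mean_price       = mean(price, na.rm = TRUE),
    median_price     = median(price, na.rm = TRUE),
    sd_price         = sd(price, na.rm = TRUE),
    mean_hedge.ratio = mean(hedge.ratio, na.rm = TRUE),
    sd_hedge.ratio   = sd(hedge.ratio, na.rm = TRUE)
  ) %>%
  mutate(freq = n / sum(n))

car_style_named
```

Because the report already calls na.omit() before this step, na.rm makes no difference to its numbers; the main gain is cleaner column names.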

The table shows that, in general, newer styles of used cars have higher prices and higher value retention rates. Most of the cars are 2009 to 2014 models, and there are far fewer models from before 2008 or after 2015. The next step is to visualize the data and analyze it further.

The figure below shows the price distribution for each style of used car. Most cars cost no more than 400,000 (40 in the data, which records prices in units of 10,000), and the majority fall between 0 and 200,000. The number of used cars rises markedly from the 2009 model onward and drops sharply with the 2015 model.

ggplot(data, aes(price, fill =style)) + geom_histogram(color="black")+
  labs(title = "price") + 
  theme(axis.text=element_text(size=6)) +
  facet_wrap(vars(style), scales = "free")
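The statement that most prices stay below 400,000 can also be checked numerically rather than read off the histograms. The sketch below shows one way to do it with cut() and prop.table(). The prices are simulated for illustration, so the shares it prints say nothing about the real data.

```r
set.seed(1)
# Made-up right-skewed prices in units of 10,000 (not the real data)
price <- round(rlnorm(500, meanlog = 2.5, sdlog = 0.7), 1)

# Bands: up to 20, 20 to 40, above 40 (200,000 and 400,000 in full units)
band <- cut(price, breaks = c(0, 20, 40, Inf),
            labels = c("0-20", "20-40", "40+"), right = TRUE)

# Share of listings in each band
round(prop.table(table(band)), 3)

# Share of listings at or below 40 (400,000)
mean(price <= 40)
```

Applied to the real data, the same two calls on data$price would put a number on "most" and "the majority".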

Value Retention Rate

First, the density of the value retention rate (the hedge.ratio column, labeled “hedging rate” in the plots) is plotted. The distribution is roughly normal: very high and very low value retention rates are both relatively rare in the used car market.

ggplot(data,aes(x=hedge.ratio))+geom_density(fill="pink")+
  labs(title="Hedging Rate Density Plot",x="Hedging Rate", y="Density") +
  theme_bw()
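The claim of rough normality rests on the shape of the density plot. A more formal check could use a Shapiro-Wilk test and a Q-Q plot, as in the sketch below. The retention rates here are simulated, so the result only illustrates the procedure. Note that shapiro.test() accepts between 3 and 5000 observations.

```r
set.seed(42)
# Made-up retention rates clustered around 0.5 (not the real data)
hedge.ratio <- rnorm(300, mean = 0.5, sd = 0.1)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(hedge.ratio)
sw$p.value

# Q-Q plot: points close to the line support the normal approximation
qqnorm(hedge.ratio)
qqline(hedge.ratio, col = "red")
```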

Next, the value retention rate is plotted for used cars of different styles and origins. For both domestic and imported cars, newer cars generally retain more of their value, although the 2006 and 2007 models are exceptions. A second plot shows the mean value retention rate for each style, with error bars of one standard deviation.

ggplot(data, aes(x=style, y=hedge.ratio,color = style)) + 
  geom_boxplot()+
  geom_violin(fill = "gray80", size = .5, alpha = .5)+
  stat_summary(fun = "mean",geom="point")+
  labs(title = "hedging rate of different cars",x = "style", y = "hedging rate") + 
  theme(axis.text = element_text(size = 4))+
  facet_wrap(vars(location), scales = "free")

ggplot(car_style,aes(style,mean_hedge.ratio,color=style))+
  geom_col(fill = "gray60", size = .8, alpha = .6)+
  geom_errorbar(aes(style,
                    ymin=mean_hedge.ratio-sd_hedge.ratio,
                    ymax=mean_hedge.ratio+sd_hedge.ratio))+
  labs(title="Uncertainty of hedging rate",x="style", y="hedge.ratio")+
  theme(axis.text=element_text(size=6))
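The error bars above span one standard deviation, which describes the spread within a group but not how precisely the group mean is estimated. For small groups such as the 2006 models (n = 8), the mean is much less certain, which may help explain the irregular early years. The sd of hedge.ratio is cut off in the printed table, so this sketch uses the price columns from that table instead, purely as an illustration of the standard error calculation.

```r
# n and sd(price) for two model years, copied from the summary table above
n        <- c("2006" = 8,    "2013" = 370)
sd_price <- c("2006" = 4.39, "2013" = 13.8)

# Standard error of the mean: sd / sqrt(n)
se_price <- sd_price / sqrt(n)
round(se_price, 2)

# Standard error relative to the group means in the same table (9.45, 20.9)
round(se_price / c(9.45, 20.9), 3)
```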

Finally, the relationship between mileage and value retention rate is plotted with a fitted linear trend. As mileage increases, the value retention rate decreases.

ggplot(data, aes(x=kilometres, y=hedge.ratio, color=kilometres)) + 
  geom_jitter(width = .25, alpha = .3)+
  stat_smooth(method = "lm", se = FALSE,
               color = "red", size = 1.3) +
  labs(title="hedging rate of driven distance",
       x="driven distance", y="hedging rate") +
  theme(axis.text=element_text(size=4))
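The red line drawn by stat_smooth(method = "lm") is an ordinary least-squares fit, so its slope can also be estimated directly with lm(). The sketch below does this on simulated data with an assumed slope of -0.03; the numbers are placeholders and not estimates from the real data set.

```r
set.seed(7)
# Made-up data: retention falls by about 0.03 per unit of mileage, plus noise
kilometres  <- runif(200, min = 0, max = 15)
hedge.ratio <- 0.7 - 0.03 * kilometres + rnorm(200, sd = 0.05)
toy <- data.frame(kilometres, hedge.ratio)

# Same kind of model that stat_smooth(method = "lm") draws
fit <- lm(hedge.ratio ~ kilometres, data = toy)
coef(fit)                       # intercept and slope
confint(fit)["kilometres", ]    # 95% interval for the slope
summary(fit)$r.squared          # share of variance explained
```

A slope whose confidence interval lies entirely below zero would support the downward trend seen in the plot.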

Conclusion

From the above analysis, the following conclusions can be drawn:

The number of used cars is highest for the 2008 to 2015 models.
Most used cars are priced between 50,000 and 250,000, with a value retention rate of 40% to 60%.
The newer the model year, the higher the price and value retention rate generally are.
As a used car's mileage increases, its value retention rate tends to fall.

The data set contains other independent variables, such as whether the car has an automatic transmission, that may also affect the price and value retention rate of used cars. This part of the analysis will be completed after class.

Bibliography

1. China Automobile Dealers Association (dataset source)

2. Hadley Wickham & Garrett Grolemund, R for Data Science
