Data Analytics and Computational Social Science: hw_4

Xiaotong

data <- read.csv(file = file.choose(), header = TRUE)  # pick the CSV file interactively
data <- data[, -1]  # drop the first column
names(data)

 [1] "Vehicle.brand"  "location"       "style"         
 [4] "emission"       "gear"           "model"         
 [7] "time"           "kilometres"     "price"         
[10] "original.price" "hedge.ratio"    "group1"        
[13] "group2"

head(data)

  Vehicle.brand location  style emission gear        model      time
1        others     外国 2011款     6.2L 自动       (进口) 2014/11/1
2        others     外国 2011款     6.2L 自动       (进口) 2014/11/1
3        others     外国 2011款     6.2L 自动       (进口)  2013/7/1
4        others     外国 2011款     6.2L 自动       (进口) 2013/12/1
5        others     外国 2010款     6.0T 自动       (进口)  2010/3/1
6        others     外国 2010款     5.7L 自动 豪华型(进口)  2012/9/1
  kilometres price original.price hedge.ratio   group1   group2
1        2.9  48.0           71.4        0.67 小于10万 高级轿车
2        2.9  48.0           71.4        0.67 小于10万 高级轿车
3        5.0  41.5           71.4        0.58 小于10万 高级轿车
4        5.3  42.5           71.4        0.60 小于10万 高级轿车
5        4.3 100.0          358.0        0.28 小于10万 高级轿车
6        8.0  80.0          171.5        0.47 小于10万 高级轿车

In addition to overall means, medians, and SDs, use group_by() and summarise() to compute mean/median/SD for any relevant groupings.

library(dplyr)
car_style <- group_by(data, style)
# na.rm = TRUE belongs inside each summary function; passed to summarise()
# directly it only creates an extra column named na.rm
summarise(car_style,
          mean_price   = mean(price, na.rm = TRUE),
          median_price = median(price, na.rm = TRUE),
          sd_price     = sd(price, na.rm = TRUE))

# A tibble: 12 x 4
   style  mean_price median_price sd_price
   <chr>       <dbl>        <dbl>    <dbl>
 1 2005款      38.5          38.5    NA   
 2 2006款      11.9          11       1.95
 3 2007款      12.1           8.5     9.07
 4 2008款       6.99          5.8     3.60
 5 2009款       8.81          8.8     5.18
 6 2010款      14.6          13.6    11.4 
 7 2011款      16.8          12.6    12.1 
 8 2012款      18.4          14.5    12.6 
 9 2013款      23.1          19.2    13.2 
10 2014款      32.4          30.8    20.7 
11 2015款      30.5          18.8    28.4 
12 2016款      21.7          17      14.2 
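The prompt also asks for overall statistics, which a grouped call does not produce. A minimal sketch, assuming a small made-up data frame `toy` in place of the real data (its values are illustrative only): calling summarise() on an ungrouped data frame collapses it to a single row.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative, not from the real dataset
toy <- data.frame(
  style = c("2010", "2010", "2011", "2011"),
  price = c(10, 14, 20, NA)
)

# No group_by(): summarise() collapses the whole data frame to one row
overall <- summarise(
  toy,
  mean_price   = mean(price, na.rm = TRUE),
  median_price = median(price, na.rm = TRUE),
  sd_price     = sd(price, na.rm = TRUE)
)
overall
```

Applied to the real data, the same call with `data` in place of `toy` would give the overall figures to compare against the per-style rows.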

data %>%
  group_by(style) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

# A tibble: 12 x 3
   style      n     freq
   <chr>  <int>    <dbl>
 1 2005款     1 0.000801
 2 2006款     5 0.00400 
 3 2007款    19 0.0152  
 4 2008款    79 0.0633  
 5 2009款   125 0.100   
 6 2010款   158 0.127   
 7 2011款   190 0.152   
 8 2012款   230 0.184   
 9 2013款   228 0.183   
10 2014款   142 0.114   
11 2015款    60 0.0480  
12 2016款    12 0.00961

Question

What variable(s) are you visualizing?

The price of the used cars.

What question(s) are you attempting to answer with the visualization?

Does the price of used cars follow a recognizable distribution?

What conclusions can you make from the visualization?

Most used cars on the market are priced between 0 and 250,000, while used cars priced above 750,000 are relatively rare.
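The conclusion above can also be checked numerically rather than read off the histogram. A minimal sketch in base R, assuming (this is not stated in the source) that `price` is recorded in units of 10,000, so that 250,000 and 750,000 correspond to 25 and 75. The `toy_price` vector is made up for illustration.

```r
# Hypothetical prices, assumed to be in units of 10,000
toy_price <- c(5.8, 8.5, 12.6, 18.8, 30.8, 48.0, 80.0, 100.0)

# mean() of a logical vector gives the proportion of TRUE values
share_low  <- mean(toy_price <= 25)  # priced at or below 250,000
share_high <- mean(toy_price > 75)   # priced above 750,000

c(low = share_low, high = share_high)
```

With the real data, `data$price` would replace `toy_price`, provided the unit assumption holds.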

library(ggplot2)
# Every layer must be joined with `+`; otherwise the later lines run on their own
ggplot(data, aes(price, fill = style)) +
  geom_histogram() +
  labs(title = "price") +
  theme_bw() +
  facet_wrap(vars(style), scales = "free")

Question

What variable(s) are you visualizing?

The style (model year) of the used cars and their value retention rate (hedge.ratio).

What question(s) are you attempting to answer with the visualization?

Is there a relationship between the style of a used car and its value retention rate?

What conclusions can you make from the visualization?

As the figure shows, the more recent the style of the used car, the higher its value retention rate.

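ggplot(data, aes(x = style, y = hedge.ratio)) +
  geom_boxplot() +
  # violin area scaled by the number of cars in each style
  geom_violin(scale = "count", fill = "lightblue", alpha = .3) +
  # mark the mean of each style with a point
  stat_summary(fun = "mean", geom = "point") +
  labs(title = "Value retention rate of different cars", x = "style", y = "value retention rate") +
  theme_bw()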

Question

• What questions are left unanswered with your visualizations?

No questions are left unanswered.

• What about the visualizations may be unclear to a naive viewer?

I think the visualizations are clear as they are.

• How could you improve the visualizations for the final project?

The final project will include more visualizations for the analysis. Because the plot shows the mean value retention rate for each style, the error around those means will also be visualized.
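One possible way to visualize the error around the group means is sketched here, assuming ggplot2's `mean_se()` helper (mean plus or minus one standard error) is an acceptable measure of error. The `toy` data frame and its values are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  style       = rep(c("2010", "2011", "2012"), each = 4),
  hedge.ratio = c(0.30, 0.35, 0.40, 0.45,
                  0.45, 0.50, 0.55, 0.60,
                  0.55, 0.60, 0.65, 0.70)
)

p <- ggplot(toy, aes(x = style, y = hedge.ratio)) +
  # mean_se() returns the mean plus/minus one standard error per group
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  # the group mean itself, drawn as a point
  stat_summary(fun = mean, geom = "point") +
  labs(x = "style", y = "value retention rate") +
  theme_bw()
p
```

A confidence interval or the standard deviation could be used instead by swapping the function passed to `fun.data`; which measure suits the final project is left open.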
