hw_4

Visualize data

Xiaotong
2022/4/14
data = read.csv(file = file.choose(), header = TRUE)  # choose the CSV file interactively
data = data[, -1]  # drop the first column
names(data)
 [1] "Vehicle.brand"  "location"       "style"         
 [4] "emission"       "gear"           "model"         
 [7] "time"           "kilometres"     "price"         
[10] "original.price" "hedge.ratio"    "group1"        
[13] "group2"        
head(data)
  Vehicle.brand location  style emission gear        model      time
1        others     外国 2011款     6.2L 自动       (进口) 2014/11/1
2        others     外国 2011款     6.2L 自动       (进口) 2014/11/1
3        others     外国 2011款     6.2L 自动       (进口)  2013/7/1
4        others     外国 2011款     6.2L 自动       (进口) 2013/12/1
5        others     外国 2010款     6.0T 自动       (进口)  2010/3/1
6        others     外国 2010款     5.7L 自动 豪华型(进口)  2012/9/1
  kilometres price original.price hedge.ratio   group1   group2
1        2.9  48.0           71.4        0.67 小于10万 高级轿车
2        2.9  48.0           71.4        0.67 小于10万 高级轿车
3        5.0  41.5           71.4        0.58 小于10万 高级轿车
4        5.3  42.5           71.4        0.60 小于10万 高级轿车
5        4.3 100.0          358.0        0.28 小于10万 高级轿车
6        8.0  80.0          171.5        0.47 小于10万 高级轿车
Note: the data values are in Chinese. 外国 = foreign; 2011款 = 2011 model; 自动 = automatic; (进口) = (imported); 豪华型 = luxury edition; 小于10万 = under 100,000; 高级轿车 = premium sedan.
In addition to overall means, medians, and SDs, use group_by() and summarise() to compute
mean/median/SD for any relevant groupings.
library(dplyr)
car_style <- group_by(data, style)
summarise(car_style,
          mean = mean(price, na.rm = TRUE),
          median = median(price, na.rm = TRUE),
          sd = sd(price, na.rm = TRUE))
# A tibble: 12 x 4
   style   mean median    sd
   <chr>  <dbl>  <dbl> <dbl>
 1 2005款 38.5    38.5 NA   
 2 2006款 11.9    11    1.95
 3 2007款 12.1     8.5  9.07
 4 2008款  6.99    5.8  3.60
 5 2009款  8.81    8.8  5.18
 6 2010款 14.6    13.6 11.4 
 7 2011款 16.8    12.6 12.1 
 8 2012款 18.4    14.5 12.6 
 9 2013款 23.1    19.2 13.2 
10 2014款 32.4    30.8 20.7 
11 2015款 30.5    18.8 28.4 
12 2016款 21.7    17   14.2 
data %>%
  group_by(style) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
# A tibble: 12 x 3
   style      n     freq
   <chr>  <int>    <dbl>
 1 2005款     1 0.000801
 2 2006款     5 0.00400 
 3 2007款    19 0.0152  
 4 2008款    79 0.0633  
 5 2009款   125 0.100   
 6 2010款   158 0.127   
 7 2011款   190 0.152   
 8 2012款   230 0.184   
 9 2013款   228 0.183   
10 2014款   142 0.114   
11 2015款    60 0.0480  
12 2016款    12 0.00961 
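The prompt also asks for overall statistics, which can be computed with the same `summarise()` call on the ungrouped data. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration, not taken from the car data). It places `na.rm = TRUE` inside each function call, since an argument passed to `summarise()` at the top level becomes a new column, and it shows that the SD of a group with a single observation is `NA`.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  style = c("2010", "2010", "2011", "2011", "2011", "2012"),
  price = c(10, 14, 12, NA, 20, 30)
)

# Overall statistics: na.rm belongs inside each function call.
overall <- toy %>%
  summarise(
    n      = n(),
    mean   = mean(price, na.rm = TRUE),
    median = median(price, na.rm = TRUE),
    sd     = sd(price, na.rm = TRUE)
  )

# Grouped statistics plus group size and share in one pass.
by_style <- toy %>%
  group_by(style) %>%
  summarise(
    n      = n(),
    mean   = mean(price, na.rm = TRUE),
    median = median(price, na.rm = TRUE),
    sd     = sd(price, na.rm = TRUE)
  ) %>%
  mutate(freq = n / sum(n))

print(overall)
print(by_style)  # the "2012" group has one row, so its sd is NA
```

Computing the counts alongside the statistics makes it easier to see which group means rest on very few cars.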

Question

What variable(s) are you visualizing?

The price of used cars.

What question(s) are you attempting to answer with the visualization?

Do used-car prices follow a recognizable distribution?

What conclusions can you make from the visualization?

Used cars priced between 0 and 250,000 are the most common in the market, while used cars priced above 750,000 are relatively rare.

library(ggplot2)
ggplot(data, aes(price, fill = style)) + geom_histogram() +
  labs(title = "price") + 
  theme_bw() +
  facet_wrap(vars(style), scales = "free")
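The claim about common and rare price ranges can also be checked numerically by binning prices and computing the share in each band. This is only a sketch: the `price` vector below is hypothetical, and it assumes prices are recorded in units of 10,000, so 250,000 and 750,000 correspond to 25 and 75.

```r
# Hypothetical prices, assumed to be in units of 10,000.
price <- c(3.5, 8, 12.6, 18.8, 24, 30.8, 48, 80, 100)

# Bin prices into bands; right = FALSE makes each band include its lower bound.
bands <- cut(
  price,
  breaks = c(0, 25, 50, 75, Inf),
  labels = c("0-25", "25-50", "50-75", "75+"),
  right  = FALSE
)

# Share of cars in each band.
share <- prop.table(table(bands))
print(round(share, 3))
```

Applied to the real `data$price` column, the same call would put a number on "more common" and "relatively rare".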

Question

What variable(s) are you visualizing?

The model year of the used car (style) and its value retention rate (hedge.ratio).

What question(s) are you attempting to answer with the visualization?

Is there a relationship between a used car's model year and its value retention rate?

What conclusions can you make from the visualization?

As the figure shows, the more recent the model year, the higher the value retention rate.

ggplot(data, aes(x=style, y=hedge.ratio)) + geom_boxplot()+
  geom_violin(scale="count",fill="lightblue",alpha=.3)+
  stat_summary(fun = "mean",geom="point")+
  labs(title="Value retention rate of different cars",x="style", y="hedging rate") + 
  theme_bw()
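One way to check the trend numerically is to turn the style label into a numeric year and compare retention rates across years. The sketch below uses a hypothetical toy data frame with made-up values. The digit-stripping step assumes every label starts with a four-digit year, as in the output shown earlier.

```r
# Hypothetical toy data mimicking the style labels and the hedge.ratio column.
toy <- data.frame(
  style       = c("2009款", "2009款", "2011款", "2011款", "2013款", "2013款"),
  hedge.ratio = c(0.30, 0.34, 0.45, 0.49, 0.60, 0.66)
)

# Keep only the digits of the label, then convert to an integer year.
toy$year <- as.integer(gsub("[^0-9]", "", toy$style))

# Median retention rate per model year, ordered by year.
med <- tapply(toy$hedge.ratio, toy$year, median)
print(med)

# Rank correlation between year and retention rate;
# a positive value means newer model years retain more value.
rho <- cor(toy$year, toy$hedge.ratio, method = "spearman")
print(rho)
```

A rank correlation is used here because it only assumes a monotonic relationship, not a linear one.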

Question

• What questions are left unanswered by your visualizations?

No questions are left unanswered.

• What about the visualizations may be unclear to a naive viewer?

I think the visualization is pretty good.

• How could you improve the visualizations for the final project?

The final project will include more visualizations for the analysis. Because the current plot shows only the mean retention rate, the error around that mean will also be visualized later.
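The planned error visualization could be sketched as follows. The data frame `toy`, the helper `se()`, and the column names `mean_ratio` and `se_ratio` are all hypothetical, and the standard error of the mean is only one possible choice of error measure. The plotting step runs only if ggplot2 is installed.

```r
# Hypothetical toy data standing in for style and hedge.ratio.
toy <- data.frame(
  style       = rep(c("2010", "2012", "2014"), each = 4),
  hedge.ratio = c(0.30, 0.35, 0.40, 0.35,
                  0.45, 0.50, 0.55, 0.50,
                  0.60, 0.70, 0.65, 0.65)
)

# Standard error of the mean: sd / sqrt(n).
se <- function(x) sd(x) / sqrt(length(x))

# Per-group mean and standard error.
means <- tapply(toy$hedge.ratio, toy$style, mean)
ses   <- tapply(toy$hedge.ratio, toy$style, se)
stats <- data.frame(
  style      = names(means),
  mean_ratio = as.numeric(means),
  se_ratio   = as.numeric(ses)
)
print(stats)

# Plot the means with error bars of plus or minus one standard error.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(stats, aes(x = style, y = mean_ratio)) +
    geom_point() +
    geom_errorbar(aes(ymin = mean_ratio - se_ratio,
                      ymax = mean_ratio + se_ratio),
                  width = 0.2) +
    labs(x = "style", y = "mean value retention rate") +
    theme_bw()
  print(p)
}
```

Groups with very few cars will have wide or undefined error bars, which is itself useful information for the viewer.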

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Xiaotong (2022, May 11). Data Analytics and Computational Social Science: hw_4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomtoni901233/

BibTeX citation

@misc{xiaotong2022hw_4,
  author = {Xiaotong, },
  title = {Data Analytics and Computational Social Science: hw_4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomtoni901233/},
  year = {2022}
}