HW5 by Guodong Zhang

This is my Homework 5 for DACSS 601.

Guodong Zhang
2022-01-12

1. Descriptive Statistics

First, read in the Chinese Real National Income dataset.

library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2
incoming_data <- read_csv("C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv", show_col_types = FALSE)
incoming_data
# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1  1952       100       100          100      100       100 
 2  1953       102.      133          138.     134.      120 
 3  1954       103.      136.         133.     159.      136 
 4  1955       112.      138.         152.     169.      140 
 5  1956       116.      147.         262.     219.      164 
 6  1957       120.      147.         243.     244.      176 
 7  1958       120.      156.         367      384.      271.
 8  1959       101.      170.         389.     502.      356.
 9  1960        83.6     164.         394      541.      384.
10  1961        84.7     130.         130.     316.      221.
# ... with 27 more rows

Compute the mean of each variable:

summarise(incoming_data, across(everything(), mean))
# A tibble: 1 x 6
  index agriculture commerce construction industry transport
  <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
1  1970        152.     261.         550.    1244.      449.

Compute the median of each variable:

summarise(incoming_data, across(everything(), median))
# A tibble: 1 x 6
  index agriculture commerce construction industry transport
  <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
1  1970        140.     199.          421      863      371.

Compute the standard deviation of each variable:

summarise(incoming_data, across(everything(), sd))
# A tibble: 1 x 6
  index agriculture commerce construction industry transport
  <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
1  10.8        54.4     176.         448.    1192.      329.
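
The three summaries above can also be computed in a single call. The following is a minimal sketch, not the code that produced the results above. It assumes dplyr 1.0 or later, uses a small toy tibble (`toy`, a hypothetical name) holding the first three rows of two columns as printed above (rounded), and excludes index because the mean of the year column is not an income statistic.

```r
library(dplyr)
library(tibble)

# Toy data: first three rows of two columns, rounded as printed above
toy <- tibble(
  index = c(1952, 1953, 1954),
  agriculture = c(100, 102, 103),
  industry = c(100, 134, 159)
)

# Apply mean, median, and sd to every column except index;
# .names controls the output column names, e.g. agriculture_mean
stats <- summarise(
  toy,
  across(-index,
         list(mean = mean, median = median, sd = sd),
         .names = "{.col}_{.fn}")
)
stats
```

With the full dataset, replacing `toy` with `incoming_data` should give one row with three columns per trade.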

2. Visualization One: Univariate

In this part, I derive a new dataset from the original one. The new dataset, named max_income_trade_data, records which trade (agriculture, commerce, construction, industry, or transport) has the highest income in each year.

# Names of the five trades, in the same order as columns 2 to 6
trade_names <- colnames(incoming_data)[2:6]
# For each row (year), find the position of the trade with the largest value
max_position <- apply(select(incoming_data, 2:6), 1, which.max)
# Pair each year with the name of its top trade
max_income_trade_data <- data.frame(
    index = incoming_data$index,
    trade = trade_names[max_position])

head(max_income_trade_data,5)
  index        trade
1  1952  agriculture
2  1953 construction
3  1954     industry
4  1955     industry
5  1956 construction
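
One detail worth checking in the output above is the 1952 row. All five trades are printed as 100 in that year, and which.max returns the position of the first maximum, so agriculture is reported only because it is the first column. A small sketch of that behavior, using the 1952 values as printed (`base_year`, `first_max`, and `all_max` are hypothetical names):

```r
# 1952 values as printed in the tibble above: every trade is 100
base_year <- c(agriculture = 100, commerce = 100, construction = 100,
               industry = 100, transport = 100)

# which.max() returns only the first position that attains the maximum
first_max <- names(which.max(base_year))

# Listing every trade that attains the maximum makes the tie visible
all_max <- names(base_year)[base_year == max(base_year)]

first_max
all_max
```

If ties matter for the analysis, the base year could be dropped or flagged before counting.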

Next, I counted the number of years in which each trade had the highest income and plotted the counts as a bar chart.

xaxis <- colnames(incoming_data)[2:6]
ggplot(max_income_trade_data["trade"],aes(trade,fill = trade)) +
    xlim(xaxis) +
    geom_bar() +
    labs(title="The number of years with the highest income for each trade") +
    facet_wrap(vars(trade), scales = "free_x") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

What is missing (if anything) in your analysis process so far?

I think nothing is missing.

What conclusions can you make about your research questions at this point?

In most years between 1952 and 1988, industry has the highest income. Agriculture and construction have the highest income in only a few years. Commerce and transport are never the trade with the highest income in any year.

What do you think a naive reader would need to fully understand your graphs?

I think a naive reader needs to know whose income the data describes and the range of years covered. In short, they should know the description of the dataset.

Is there anything you want to answer with your dataset, but can’t?

I’d like to show commerce and transport in the graph as well, even though their counts are 0, so that readers can see all five trades directly.
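
One possible way to keep the empty trades on the axis is to store trade as a factor with all five levels and tell the discrete scales not to drop unused levels. This is a hedged sketch rather than a tested replacement for the plot above. It uses a toy data frame (`toy_max`, a hypothetical name) built from the five rows printed by head() earlier, and it leaves out facet_wrap, since by default no facet is created for a level that has no rows.

```r
library(ggplot2)

trade_levels <- c("agriculture", "commerce", "construction",
                  "industry", "transport")

# Toy data: the five rows shown by head() above, with trade as a factor
# that carries all five levels, including the ones that never occur
toy_max <- data.frame(
  index = 1952:1956,
  trade = factor(c("agriculture", "construction", "industry",
                   "industry", "construction"),
                 levels = trade_levels)
)

# table() on a factor reports zero counts for unused levels
trade_counts <- table(toy_max$trade)

# drop = FALSE keeps unused factor levels on the x axis and in the fill scale
p <- ggplot(toy_max, aes(trade, fill = trade)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  scale_fill_discrete(drop = FALSE) +
  labs(title = "The number of years with the highest income for each trade")
```

Printing `p` should show five positions on the x axis, with no visible bar at commerce and transport because their counts are 0.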

3. Visualization Two: Bivariate

ggplot() +
    geom_point(data=incoming_data,aes(x=index,y=agriculture,color="agriculture")) +
    geom_smooth(data=incoming_data,aes(x=index,y=agriculture,color="agriculture"),method='loess',formula='y ~ x') +
    geom_point(data=incoming_data,aes(x=index,y=commerce,color="commerce")) +
    geom_smooth(data=incoming_data,aes(x=index,y=commerce,color="commerce"),method='loess',formula='y ~ x') + 
    geom_point(data=incoming_data,aes(x=index,y=construction,color="construction")) +
    geom_smooth(data=incoming_data,aes(x=index,y=construction,color="construction"),method='loess',formula='y ~ x') + 
    geom_point(data=incoming_data,aes(x=index,y=industry,color="industry")) +
    geom_smooth(data=incoming_data,aes(x=index,y=industry,color="industry"),method='loess',formula='y ~ x') + 
    geom_point(data=incoming_data,aes(x=index,y=transport,color="transport")) +
    geom_smooth(data=incoming_data,aes(x=index,y=transport,color="transport"),method='loess',formula='y ~ x') + 
    scale_color_manual("",values=c("agriculture"="red","commerce"="green","construction"="blue","industry"="hotpink","transport"="yellow")) +
    xlab("year") + ylab("income") +
    labs(title="year - income growth graph")

What is missing (if anything) in your analysis process so far?

I think nothing is missing.

What conclusions can you make about your research questions at this point?

Industry income grew the fastest, followed by construction, transport, and commerce. Agriculture income grew the slowest.

What do you think a naive reader would need to fully understand your graphs?

I think a naive reader only needs to know whose income the data describes.

Is there anything you want to answer with your dataset, but can’t?

I’d like to split the graph into subgraphs so that readers can compare one trade with another, but I don’t know how.
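
One possible way to get such subgraphs is to reshape the data to long form and facet by trade. The following is a minimal sketch, assuming tidyr and ggplot2 are available. `toy` and `income_long` are hypothetical names, and the toy values are the first five rows of two columns as printed in section 1 (rounded).

```r
library(tidyr)
library(ggplot2)

# Toy data: first five rows of two columns, rounded as printed in section 1
toy <- data.frame(
  index = 1952:1956,
  agriculture = c(100, 102, 103, 112, 116),
  industry = c(100, 134, 159, 169, 219)
)

# Long form: one row per (year, trade) pair
income_long <- pivot_longer(toy, cols = -index,
                            names_to = "trade", values_to = "income")

# One panel per trade; a single set of layers replaces the repeated
# geom_point() calls needed in wide form
p <- ggplot(income_long, aes(x = index, y = income, color = trade)) +
  geom_point() +
  facet_wrap(vars(trade)) +
  xlab("year") + ylab("income") +
  labs(title = "year - income growth graph")
```

With the full 37-row dataset, the loess layer from the original plot could be added back as a single geom_smooth(method = "loess", formula = y ~ x) call. Passing scales = "free_y" to facet_wrap would let each panel use its own income range, at the cost of making panels harder to compare directly.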

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Zhang (2022, Jan. 14). Data Analytics and Computational Social Science: HW5 by Guodong Zhang. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw5/

BibTeX citation

@misc{zhang2022hw5,
  author = {Zhang, Guodong},
  title = {Data Analytics and Computational Social Science: HW5 by Guodong Zhang},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw5/},
  year = {2022}
}