HW3 by Guodong Zhang

This is my Homework 3 for DACSS 601.

Guodong Zhang
2022/1/1

1. Identify the dataset.

The dataset, Chinese Real National Income Data, is from GitHub. It contains time series of real national income in China by sector, expressed as an index with 1952 = 100.

The variables in the dataset are as follows:

Variable      Data type  Description
index         Number     Row index of the observation.
agriculture   Number     Real national income in the agriculture sector.
commerce      Number     Real national income in the commerce sector.
construction  Number     Real national income in the construction sector.
industry      Number     Real national income in the industry sector.
transport     Number     Real national income in the transport sector.

2. Read in the dataset.

library(tidyverse)  # provides read_csv(), select(), and %>%
incoming_data <- read_csv("C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv", show_col_types = FALSE)
incoming_data
# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1     1       100       100          100      100       100 
 2     2       102.      133          138.     134.      120 
 3     3       103.      136.         133.     159.      136 
 4     4       112.      138.         152.     169.      140 
 5     5       116.      147.         262.     219.      164 
 6     6       120.      147.         243.     244.      176 
 7     7       120.      156.         367      384.      271.
 8     8       101.      170.         389.     502.      356.
 9     9        83.6     164.         394      541.      384.
10    10        84.7     130.         130.     316.      221.
# ... with 27 more rows

3. Research questions.

a. Which sector grew the fastest during this period?

incoming_data %>%
    select(2:6) %>%
    apply(1, which.max)
 [1] 1 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[34] 4 4 4 4

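As we can see, the 4th column, industry, has the highest index value in 34 of the 37 rows. The exceptions are row 1, where all sectors are tied at the base value of 100 (so which.max returns the first column), and rows 2 and 5, where construction leads. Because every series is indexed to 100 in 1952, the highest index value corresponds to the largest cumulative growth since the base year. Thus, industry grew the fastest over most of this period.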
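The same comparison can be made more readable by reporting sector names instead of column positions, and by computing the cumulative growth factor directly. The following is only a minimal sketch: it uses the first six rows of the printed tibble (with values rounded as displayed) instead of the full 37-row file, and the names income, leader, growth, and fastest are illustrative.

```r
# First six rows of the printed tibble (values rounded as displayed above)
income <- data.frame(
  agriculture  = c(100, 102, 103, 112, 116, 120),
  commerce     = c(100, 133, 136, 138, 147, 147),
  construction = c(100, 138, 133, 152, 262, 243),
  industry     = c(100, 134, 159, 169, 219, 244),
  transport    = c(100, 120, 136, 140, 164, 176)
)

# Name of the leading sector in each row
# (which.max resolves ties to the first column, as in row 1)
leader <- names(income)[apply(income, 1, which.max)]
leader_counts <- table(leader)
print(leader_counts)

# Cumulative growth factor: last index value divided by the base value
growth <- unlist(income[nrow(income), ]) / unlist(income[1, ])
fastest <- names(which.max(growth))
print(fastest)
```

On the full dataset, replacing income with incoming_data[2:6] should give the same kind of summary for all 37 rows.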

b. How strongly are these sectors correlated?

field_data <- incoming_data[2:6]
sectors <- names(field_data)
# Pearson correlation coefficient for every pair of sectors
for (i in 1:4) {
    for (j in (i+1):5) {
        x <- field_data[[i]]
        y <- field_data[[j]]
        cat("Correlation between ", sectors[i], " and ", sectors[j], ": ", sep="")
        cor(x, y) %>% cat('\n')
    }
}
Correlation between agriculture and commerce: 0.9641547 
Correlation between agriculture and construction: 0.955028 
Correlation between agriculture and industry: 0.9670784 
Correlation between agriculture and transport: 0.9521588 
Correlation between commerce and construction: 0.9880994 
Correlation between commerce and industry: 0.9883226 
Correlation between commerce and transport: 0.9881633 
Correlation between construction and industry: 0.9873144 
Correlation between construction and transport: 0.9936743 
Correlation between industry and transport: 0.9927794 

As we can see, every pair of the five sectors is strongly positively correlated, with all Pearson correlation coefficients above 0.95. The strongest correlation, 0.9937, is between construction and transport.
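The pairwise loop can also be expressed as a single correlation matrix, and cor.test() can be used when an actual p-value is wanted, since cor() returns only the coefficient. The following is a minimal sketch under the same assumption as before: it uses the first six rows of the printed tibble (rounded as displayed) instead of the full file, so its numbers will differ from the results above. The names income, cm, strongest, and test are illustrative.

```r
# First six rows of the printed tibble (values rounded as displayed above)
income <- data.frame(
  agriculture  = c(100, 102, 103, 112, 116, 120),
  commerce     = c(100, 133, 136, 138, 147, 147),
  construction = c(100, 138, 133, 152, 262, 243),
  industry     = c(100, 134, 159, 169, 219, 244),
  transport    = c(100, 120, 136, 140, 164, 176)
)

# 5 x 5 matrix of Pearson correlation coefficients for all sector pairs
cm <- cor(income)
print(round(cm, 3))

# Keep only the upper triangle so each pair is considered once,
# then locate the pair with the largest coefficient
cm_upper <- cm
cm_upper[lower.tri(cm_upper, diag = TRUE)] <- NA
idx <- which(cm_upper == max(cm_upper, na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
print(strongest)

# cor.test() reports both the coefficient and its p-value
test <- cor.test(income$construction, income$transport)
print(test$estimate)
print(test$p.value)
```

On the full dataset, cor(incoming_data[2:6]) should reproduce the ten coefficients listed above in matrix form.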

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Zhang (2022, Jan. 3). Data Analytics and Computational Social Science: HW3 by Guodong Zhang. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw3/

BibTeX citation

@misc{zhang2022hw3,
  author = {Zhang, Guodong},
  title = {Data Analytics and Computational Social Science: HW3 by Guodong Zhang},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw3/},
  year = {2022}
}