Data Analytics and Computational Social Science: HW3 by Guodong Zhang

Guodong Zhang

1. Identify the dataset.

The dataset, Chinese Real National Income Data, comes from GitHub and contains time series of real national income in China by sector, expressed as an index with 1952 = 100.

The variables in the dataset are as follows:

Variable	Data type	Description
index	Number	Row index of the observation.
agriculture	Number	Real national income in the agriculture sector (index, 1952 = 100).
commerce	Number	Real national income in the commerce sector (index, 1952 = 100).
construction	Number	Real national income in the construction sector (index, 1952 = 100).
industry	Number	Real national income in the industry sector (index, 1952 = 100).
transport	Number	Real national income in the transport sector (index, 1952 = 100).

2. Read in the dataset.

library(tidyverse)  # provides read_csv() and the %>% pipe used below

incoming_data <- read_csv("C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv", show_col_types = FALSE)
incoming_data

# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1     1       100       100          100      100       100 
 2     2       102.      133          138.     134.      120 
 3     3       103.      136.         133.     159.      136 
 4     4       112.      138.         152.     169.      140 
 5     5       116.      147.         262.     219.      164 
 6     6       120.      147.         243.     244.      176 
 7     7       120.      156.         367      384.      271.
 8     8       101.      170.         389.     502.      356.
 9     9        83.6     164.         394      541.      384.
10    10        84.7     130.         130.     316.      221.
# ... with 27 more rows
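
The index column on its own does not say which year a row belongs to. Since every series equals 100 in the first row and the base year is 1952, one plausible reading is that index 1 corresponds to 1952. The sketch below adds a year column under that assumption; the mapping is not confirmed by the dataset itself, and `toy` is a hypothetical stand-in for `incoming_data`.

```r
# Hypothetical stand-in with the same index column as the printed tibble (37 rows).
toy <- data.frame(index = 1:37)

# ASSUMPTION: index 1 is the base year 1952 and rows are consecutive years.
base_year <- 1952
toy$year <- base_year + toy$index - 1

range(toy$year)
```

If the assumption holds, the series would cover 1952 through 1988.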

3. Research questions.

a. Which field grew the fastest during this period?

incoming_data %>%
    select(2:6) %>%
    apply(1, which.max)

 [1] 1 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[34] 4 4 4 4

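As we can see, the 4th field, which is industry, has the highest index value in all but three of the 37 rows (rows 1, 2, and 5). Because every series starts at 100, the field with the highest index has grown the most since the start of the period. Thus, industry grew the fastest.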
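
A more direct way to compare growth is to compute a compound growth rate per period for each column from its first and last values. The sketch below does this on a small toy data frame; the values and the helper name `growth_rate` are illustrative assumptions, not taken from ChinaIncome.csv.

```r
# Toy data with the same layout as the sector columns (illustrative values only).
toy <- data.frame(
  agriculture = c(100, 110, 121),
  industry    = c(100, 150, 225)
)

# Compound growth rate per period: (last / first)^(1 / (n - 1)) - 1
growth_rate <- function(x) {
  n <- length(x)
  (x[n] / x[1])^(1 / (n - 1)) - 1
}

rates <- sapply(toy, growth_rate)   # one rate per column
fastest <- names(which.max(rates))  # column with the highest rate

rates
fastest
```

On the real data, the same call could be applied to `incoming_data[2:6]`.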

b. How strongly are these fields correlated with each other?

field_data = incoming_data[2:6]
# Print the Pearson correlation coefficient for every pair of fields
for (i in 1:4) {
    for (j in (i+1):5) {
        x = as.vector(unlist(field_data[i]))
        y = as.vector(unlist(field_data[j]))
        cat("Correlation between field ",i," and field ",j,": ",sep="")
        cor(x,y) %>% cat('\n')
    }
}

Correlation between field 1 and field 2: 0.9641547 
Correlation between field 1 and field 3: 0.955028 
Correlation between field 1 and field 4: 0.9670784 
Correlation between field 1 and field 5: 0.9521588 
Correlation between field 2 and field 3: 0.9880994 
Correlation between field 2 and field 4: 0.9883226 
Correlation between field 2 and field 5: 0.9881633 
Correlation between field 3 and field 4: 0.9873144 
Correlation between field 3 and field 5: 0.9936743 
Correlation between field 4 and field 5: 0.9927794

As we can see, every pair of the five fields is strongly correlated (all coefficients are above 0.95), and the 3rd field, construction, and the 5th field, transport, have the strongest correlation (about 0.994).
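
The values reported by `cor()` are correlation coefficients, not p-values. As a minimal sketch, the whole pairwise table can be obtained in one call by passing a data frame to `cor()`, and `cor.test()` returns both the coefficient and a p-value for a single pair. The toy data frame and its column names `a`, `b`, `c` are hypothetical and stand in for `incoming_data[2:6]`.

```r
# Toy data (illustrative values only, not from ChinaIncome.csv).
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 5, 4, 5),
  c = c(5, 3, 4, 1, 2)
)

# Full matrix of pairwise Pearson correlation coefficients.
cor_matrix <- cor(toy)

# Locate the strongest pair, looking only at the upper triangle
# so the diagonal and duplicate pairs are ignored.
m <- abs(cor_matrix)
m[!upper.tri(m)] <- NA
idx <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(m)[idx[1, "row"]], colnames(m)[idx[1, "col"]])

# cor.test() reports the coefficient (estimate) and a p-value for one pair.
test <- cor.test(toy$a, toy$b)

cor_matrix
strongest
test$estimate
test$p.value
```

Note that with trending time series such as these indices, high correlations are expected simply because all series rise over time, so the coefficients are best read as descriptive.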

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Zhang (2022, Jan. 3). Data Analytics and Computational Social Science: HW3 by Guodong Zhang. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw3/

BibTeX citation

@misc{zhang2022hw3,
  author = {Zhang, Guodong},
  title = {Data Analytics and Computational Social Science: HW3 by Guodong Zhang},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw3/},
  year = {2022}
}