This is my Homework 3 for DACSS 601.
The dataset, Chinese Real National Income Data, comes from GitHub and contains time series of real national income in China by sector, expressed as an index with 1952 = 100.
The variables in the dataset are as follows:
Variable | Data type | Description |
---|---|---|
index | Number | Sequential index of the observation. |
agriculture | Number | Real national income index for the agriculture sector (1952 = 100). |
commerce | Number | Real national income index for the commerce sector (1952 = 100). |
construction | Number | Real national income index for the construction sector (1952 = 100). |
industry | Number | Real national income index for the industry sector (1952 = 100). |
transport | Number | Real national income index for the transport sector (1952 = 100). |
library(tidyverse)  # provides read_csv() and the %>% pipe
incoming_data <- read_csv("C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv", show_col_types = FALSE)
incoming_data
# A tibble: 37 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 100 100 100 100 100
2 2 102. 133 138. 134. 120
3 3 103. 136. 133. 159. 136
4 4 112. 138. 152. 169. 140
5 5 116. 147. 262. 219. 164
6 6 120. 147. 243. 244. 176
7 7 120. 156. 367 384. 271.
8 8 101. 170. 389. 502. 356.
9 9 83.6 164. 394 541. 384.
10 10 84.7 130. 130. 316. 221.
# ... with 27 more rows
[1] 1 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[34] 4 4 4 4
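The output above gives, for each row, the position of the field with the largest value. The 4th field, industry, is the largest in 34 of the 37 rows. Because every series starts at 100 in 1952, the field with the highest index is the one with the greatest cumulative growth since 1952, so industry grew the fastest over this period.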
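The code that produced the row-wise result is not shown in the post. A minimal sketch of one way to obtain it, assuming a row-wise `which.max()` over the five sector columns, is given below. The data frame `sample_rows` is a hypothetical stand-in built from the first five rows of the rounded printout, not the full dataset.

```r
# First five rows of the sector indexes, copied from the rounded tibble
# printout (illustrative values only, not the full 37-row dataset).
sample_rows <- data.frame(
  agriculture  = c(100, 102, 103, 112, 116),
  commerce     = c(100, 133, 136, 138, 147),
  construction = c(100, 138, 133, 152, 262),
  industry     = c(100, 134, 159, 169, 219),
  transport    = c(100, 120, 136, 140, 164)
)

# For each row, find the position of the column with the largest index.
# which.max() returns the first maximum, so the all-100 base row yields 1.
leading_sector <- unname(apply(sample_rows, 1, which.max))
leading_sector
# → 1 3 4 4 3

# Count how often each sector leads in the sample.
leader_counts <- table(names(sample_rows)[leading_sector])
leader_counts
```

On these five rows the result matches the first five entries of the output above. Applied to `incoming_data[2:6]`, the same call would be expected to give the full 37-entry vector.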
field_data <- incoming_data[2:6]
sectors <- names(field_data)
# Pearson correlation coefficient for every pair of sectors.
# Note: cor() returns a correlation coefficient, not a p-value.
for (i in 1:4) {
  for (j in (i + 1):5) {
    x <- field_data[[i]]
    y <- field_data[[j]]
    cat("Correlation between ", sectors[i], " and ", sectors[j], ": ", cor(x, y), "\n", sep = "")
  }
}
Correlation between agriculture and commerce: 0.9641547
Correlation between agriculture and construction: 0.955028
Correlation between agriculture and industry: 0.9670784
Correlation between agriculture and transport: 0.9521588
Correlation between commerce and construction: 0.9880994
Correlation between commerce and industry: 0.9883226
Correlation between commerce and transport: 0.9881633
Correlation between construction and industry: 0.9873144
Correlation between construction and transport: 0.9936743
Correlation between industry and transport: 0.9927794
As we can see, every pair of the five fields is strongly correlated, with all correlation coefficients above 0.95. The strongest correlation, about 0.994, is between the 3rd field, construction, and the 5th field, transport.
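The values reported by `cor()` are Pearson correlation coefficients, not p-values. As a rough sketch of the difference, the block below computes the full correlation matrix in one call and then uses `cor.test()` to get both the coefficient and the p-value for one pair. The data frame `sample_rows` is a hypothetical stand-in built from the ten rounded rows shown in the printout, so its numbers will differ from the full-data results above.

```r
# Ten rows copied from the rounded tibble printout (illustrative only).
sample_rows <- data.frame(
  agriculture  = c(100, 102, 103, 112, 116, 120, 120, 101, 83.6, 84.7),
  commerce     = c(100, 133, 136, 138, 147, 147, 156, 170, 164, 130),
  construction = c(100, 138, 133, 152, 262, 243, 367, 389, 394, 130),
  industry     = c(100, 134, 159, 169, 219, 244, 384, 502, 541, 316),
  transport    = c(100, 120, 136, 140, 164, 176, 271, 356, 384, 221)
)

# Pairwise Pearson correlation coefficients for all sectors at once.
cor_matrix <- cor(sample_rows)
round(cor_matrix, 3)

# cor.test() reports the coefficient and the p-value of the test that
# the true correlation is zero.
ct <- cor.test(sample_rows$construction, sample_rows$transport)
ct$estimate  # correlation coefficient
ct$p.value   # p-value
```

With the real data, `cor(incoming_data[2:6])` would be expected to return the same ten coefficients as the loop, arranged as a symmetric 5 x 5 matrix.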
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zhang (2022, Jan. 3). Data Analytics and Computational Social Science: HW3 by Guodong Zhang. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw3/
BibTeX citation
@misc{zhang2022hw3, author = {Zhang, Guodong}, title = {Data Analytics and Computational Social Science: HW3 by Guodong Zhang}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw3/}, year = {2022} }