This is my Homework 4 for DACSS 601.
First, read in the Chinese Real National Income dataset.
incoming_data <- read_csv("C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv", show_col_types = FALSE)
incoming_data
# A tibble: 37 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1952 100 100 100 100 100
2 1953 102. 133 138. 134. 120
3 1954 103. 136. 133. 159. 136
4 1955 112. 138. 152. 169. 140
5 1956 116. 147. 262. 219. 164
6 1957 120. 147. 243. 244. 176
7 1958 120. 156. 367 384. 271.
8 1959 101. 170. 389. 502. 356.
9 1960 83.6 164. 394 541. 384.
10 1961 84.7 130. 130. 316. 221.
# ... with 27 more rows
Write a function to compute the mean, median, and standard deviation of the numerical variables. There are no categorical variables in the dataset.
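The definition of `statistics()` is not reproduced in this post. A minimal sketch of what such a function might look like, assuming it only prints the three values for one numeric vector (the body and the label text here are assumptions, not the original code):

```r
# Hypothetical reconstruction: print the mean, median, and standard deviation
# of a numeric vector, one value per line.
statistics <- function(x) {
  stopifnot(is.numeric(x))
  cat("mean: ", mean(x), "\n", sep = "")
  cat("median: ", median(x), "\n", sep = "")
  cat("std: ", sd(x), "\n", sep = "")
  # Return the values invisibly so they can also be reused in code.
  invisible(c(mean = mean(x), median = median(x), sd = sd(x)))
}

# Example call on the years 1952 to 1988, i.e. the values of the index column.
statistics(1952:1988)
```

Returning the values invisibly keeps the printed output unchanged while still letting the results be stored, for example with `res <- statistics(x)`.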
Use a for loop to process every variable in the dataset.
for (colname in names(incoming_data)) {
  cat(colname, ":\n", sep = "")
  statistics(incoming_data[[colname]])
}
index:
mean: 1970
mdian: 1970
std: 10.82436
agriculture:
mean: 151.9486
mdian: 139.8
std: 54.39186
commerce:
mean: 260.9865
mdian: 199.2
std: 176.233
construction:
mean: 549.8973
mdian: 421
std: 448.0518
industry:
mean: 1244.238
mdian: 863
std: 1191.844
transport:
mean: 449.4973
mdian: 370.8
std: 328.5462
The following uses the method from Tutorial 7 to compute the mean, median, and standard deviation.
Compute mean:
summarize_all(incoming_data, mean)
# A tibble: 1 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1970 152. 261. 550. 1244. 449.
Compute median:
summarize_all(incoming_data, median)
# A tibble: 1 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1970 140. 199. 421 863 371.
Compute standard deviation:
summarize_all(incoming_data, sd)
# A tibble: 1 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.8 54.4 176. 448. 1192. 329.
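In dplyr 1.0.0 and later, `summarize_all()` is superseded by `across()`, which can also compute all three statistics in a single call. A minimal sketch, assuming dplyr is installed and using a small stand-in data frame (`toy_data`, a hypothetical name, holding rounded values from the first three rows of the preview) rather than the real CSV:

```r
library(dplyr)

# Stand-in data: first three rows of the preview, rounded as printed.
toy_data <- data.frame(
  index       = c(1952, 1953, 1954),
  agriculture = c(100, 102, 103),
  commerce    = c(100, 133, 136)
)

# One row, with one column per variable/statistic pair
# (e.g. agriculture_mean, agriculture_median, agriculture_sd).
summary_stats <- toy_data %>%
  summarise(across(
    everything(),
    list(mean = mean, median = median, sd = sd),
    .names = "{.col}_{.fn}"
  ))

summary_stats
```

The same call should work on `incoming_data` unchanged, since every column there is numeric.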
fast_fields <- apply(select(incoming_data, 2:6), 1, which.max)
fast_fields <- data.frame(field = fast_fields)
ggplot(fast_fields, aes(field)) + geom_histogram(bins = 30)
The variable is the field with the highest income in each year. I derived it because none of the original variables is well suited to a single-variable visualization.
The question is which field had the highest income in the greatest number of years between 1952 and 1988.
The conclusion is that the 4th field, industry, had the highest income in the greatest number of years.
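Because `which.max` returns a column position (1 to 5), the histogram's x-axis shows numbers rather than field names. One possible way to make the figure easier to read, sketched here under the assumption that ggplot2 is installed and using a stand-in data frame (`toy_data`, a hypothetical name, holding rounded values from the first five rows of the preview) in place of the real data, is to map each position back to its column name and draw a bar chart:

```r
library(ggplot2)

# Stand-in data: first five rows of the preview, rounded as printed.
toy_data <- data.frame(
  index        = 1952:1956,
  agriculture  = c(100, 102, 103, 112, 116),
  commerce     = c(100, 133, 136, 138, 147),
  construction = c(100, 138, 133, 152, 262),
  industry     = c(100, 134, 159, 169, 219),
  transport    = c(100, 120, 136, 140, 164)
)

sectors <- toy_data[, 2:6]

# Position of the largest value in each row, translated to a column name.
# which.max returns the first position when there is a tie (as in 1952).
top_position <- apply(sectors, 1, which.max)
fast_fields <- data.frame(
  field = factor(names(sectors)[top_position], levels = names(sectors))
)

# A bar chart counts years per field; drop = FALSE keeps fields with no years.
p <- ggplot(fast_fields, aes(field)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  labs(x = "Field with the highest income", y = "Number of years")

p
```

Using a factor with all five field names as levels keeps the x-axis in the same order as the dataset columns, and `labs()` replaces the default axis titles with descriptive ones.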
ggplot(incoming_data, aes(transport, construction)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)
The variables are transport and construction.
The question is whether there is a relationship between transport and construction.
The conclusion is that the development of transport appears to promote the development of construction.
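To put a number on the relationship shown in the scatterplot, one option is a correlation coefficient. A minimal sketch, using only the rounded values for the first ten years from the preview rather than the full dataset, so the result is illustrative only:

```r
# Rounded values for 1952 to 1961, read off the printed preview.
transport    <- c(100, 120, 136, 140, 164, 176, 271, 356, 384, 221)
construction <- c(100, 138, 133, 152, 262, 243, 367, 389, 394, 130)

# Pearson correlation: values near 1 indicate a strong positive linear association.
r <- cor(transport, construction)
r

# Spearman's rank correlation is less sensitive to outliers and non-linearity.
rho <- cor(transport, construction, method = "spearman")
rho
```

A correlation measures association only. It does not by itself show that growth in one field drives growth in the other, since both series could simply be rising over the same period.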
The first visualization is difficult to understand without the text description (and maybe even with it). However, I don’t yet know how to modify the x-axis or add more detail to the figure.
My plan is to spend more time on ggplot2 and find better visualization methods. Alternatively, I could change the questions to ones that are easier to visualize.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zhang (2022, Jan. 11). Data Analytics and Computational Social Science: HW4 by Guodong Zhang. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw4/
BibTeX citation
@misc{zhang2022hw4,
  author = {Zhang, Guodong},
  title = {Data Analytics and Computational Social Science: HW4 by Guodong Zhang},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw4/},
  year = {2022}
}