This is my Homework 5 for DACSS 601.
First, read in the Chinese Real National Income dataset.
# read_csv(), select(), summarise() and ggplot() all come from the tidyverse
library(tidyverse)
incoming_data <- read_csv("C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv", show_col_types = FALSE)
incoming_data
# A tibble: 37 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1952 100 100 100 100 100
2 1953 102. 133 138. 134. 120
3 1954 103. 136. 133. 159. 136
4 1955 112. 138. 152. 169. 140
5 1956 116. 147. 262. 219. 164
6 1957 120. 147. 243. 244. 176
7 1958 120. 156. 367 384. 271.
8 1959 101. 170. 389. 502. 356.
9 1960 83.6 164. 394 541. 384.
10 1961 84.7 130. 130. 316. 221.
# ... with 27 more rows
Compute the mean of each variable:
summarise(incoming_data, across(everything(), mean))
# A tibble: 1 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1970 152. 261. 550. 1244. 449.
Compute the median of each variable:
summarise(incoming_data, across(everything(), median))
# A tibble: 1 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1970 140. 199. 421 863 371.
Compute the standard deviation of each variable:
summarise(incoming_data, across(everything(), sd))
# A tibble: 1 x 6
index agriculture commerce construction industry transport
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.8 54.4 176. 448. 1192. 329.
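The three summaries above could also be produced in a single call. The following is a minimal sketch, assuming dplyr 1.0 or later. The `demo` tibble is a hypothetical stand-in built from the first three printed rows of two columns (rounded as shown in the output), not the full dataset.

```r
library(dplyr)

# Hypothetical stand-in: first three rows of two columns, rounded as printed above
demo <- tibble(
  index       = c(1952, 1953, 1954),
  agriculture = c(100, 102, 103),
  commerce    = c(100, 133, 136)
)

# A named list inside across() applies every function to every selected column;
# -index leaves out the year column, which is a label rather than a measurement
stats <- summarise(
  demo,
  across(-index, list(mean = mean, median = median, sd = sd))
)

# Result columns are named <column>_<function>, for example commerce_mean
stats
```

The same call on `incoming_data` should give one row with fifteen columns, three per trade.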
In this part, I created a new dataset from the original one. The new dataset, named max_income_trade_data, records which trade (agriculture, commerce, construction, industry, or transport) has the highest income in each year.
# For each year (row), find the position (1-5) of the column with the highest income
max_income_col <- apply(select(incoming_data, 2:6), 1, which.max)
# Map each position to its trade name and pair it with the year.
# Indexing the column names directly avoids comparing the whole data frame
# (including the index column) against 1-5.
max_income_trade_data <- data.frame(
  index = incoming_data$index,
  trade = colnames(incoming_data)[2:6][max_income_col])
head(max_income_trade_data,5)
index trade
1 1952 agriculture
2 1953 construction
3 1954 industry
4 1955 industry
5 1956 construction
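One detail about `which.max` is worth noting: when several values tie for the maximum, it returns only the first position. Every series appears to equal 100 in 1952 (the base year), so that year is labelled agriculture simply because agriculture is the first column. A small sketch of this behaviour, using the 1952 row as printed earlier (the names `row_1952`, `first_max` and `tied` are introduced here for illustration only):

```r
# The 1952 row as printed in the tibble above: all five series equal 100
row_1952 <- c(agriculture = 100, commerce = 100, construction = 100,
              industry = 100, transport = 100)

# which.max() reports only the first of the tied maxima
first_max <- which.max(row_1952)
first_max

# Listing every trade that reaches the maximum exposes the tie
tied <- names(row_1952)[row_1952 == max(row_1952)]
tied
```

If ties matter for the interpretation, the 1952 row could be dropped or reported separately before counting.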
Next, I counted the number of years in which each trade had the highest income and plotted the counts.
xaxis <- colnames(incoming_data)[2:6]
ggplot(max_income_trade_data["trade"], aes(trade, fill = trade)) +
  xlim(xaxis) +
  geom_bar() +
  labs(title = "The number of years with the highest income for each trade") +
  facet_wrap(vars(trade), scales = "free_x") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
I think nothing is missing.
In most years between 1952 and 1988, industry has the highest income. Agriculture and construction have the highest income in only a few years. Commerce and transport never have the highest income in any year.
I think a naive reader needs to know whose income the data describes and the range of years it covers. In short, they should know the description of the dataset.
I’d like to draw the bars for commerce and transport in the graph, even though their counts are 0, so that readers can see all five trades directly.
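One possible way to keep the empty bars is to store `trade` as a factor that lists all five levels and ask the x scale not to drop unused levels. The sketch below assumes ggplot2 is installed. It uses a hypothetical five-row stand-in (`demo`) copied from the `head()` output above, and it leaves out `facet_wrap()`, since faceting by `trade` creates panels only for trades that occur in the data.

```r
library(ggplot2)

trades <- c("agriculture", "commerce", "construction", "industry", "transport")

# Hypothetical stand-in: the five rows shown by head(max_income_trade_data, 5)
demo <- data.frame(
  index = 1952:1956,
  trade = c("agriculture", "construction", "industry", "industry", "construction")
)

# A factor remembers all five levels, including those that never occur
demo$trade <- factor(demo$trade, levels = trades)

# table() confirms that commerce and transport are present with a count of 0
counts <- table(demo$trade)
counts

# drop = FALSE keeps the unused levels on the x axis, so the empty
# categories still get a labelled slot
p <- ggplot(demo, aes(trade, fill = trade)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  labs(title = "The number of years with the highest income for each trade")
p
```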
ggplot() +
  geom_point(data = incoming_data, aes(x = index, y = agriculture, color = "agriculture")) +
  geom_smooth(data = incoming_data, aes(x = index, y = agriculture, color = "agriculture"), method = 'loess', formula = 'y ~ x') +
  geom_point(data = incoming_data, aes(x = index, y = commerce, color = "commerce")) +
  geom_smooth(data = incoming_data, aes(x = index, y = commerce, color = "commerce"), method = 'loess', formula = 'y ~ x') +
  geom_point(data = incoming_data, aes(x = index, y = construction, color = "construction")) +
  geom_smooth(data = incoming_data, aes(x = index, y = construction, color = "construction"), method = 'loess', formula = 'y ~ x') +
  geom_point(data = incoming_data, aes(x = index, y = industry, color = "industry")) +
  geom_smooth(data = incoming_data, aes(x = index, y = industry, color = "industry"), method = 'loess', formula = 'y ~ x') +
  geom_point(data = incoming_data, aes(x = index, y = transport, color = "transport")) +
  geom_smooth(data = incoming_data, aes(x = index, y = transport, color = "transport"), method = 'loess', formula = 'y ~ x') +
  scale_color_manual("", values = c("agriculture" = "red", "commerce" = "green", "construction" = "blue", "industry" = "hotpink", "transport" = "yellow")) +
  xlab("year") + ylab("income") +
  labs(title = "year - income growth graph")
I think nothing is missing.
Industry income grows the fastest, followed by construction, transport, and commerce; agriculture income grows the slowest.
I think a naive reader only needs to know whose income the data describes.
I’d like to divide the graph into subgraphs so that readers can compare one trade with another, but I don’t know how.
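One approach that might work is to reshape the data into long format (one row per year and trade) and let `facet_wrap()` draw one panel per trade. This is only a sketch, assuming tidyr 1.0 or later and ggplot2. The `demo` data frame is a hypothetical stand-in holding the first five printed rows (rounded as shown in the output), and `geom_line()` replaces the loess smoother because five points are too few for a stable fit.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in: first five rows of the printed tibble, rounded as shown
demo <- data.frame(
  index        = 1952:1956,
  agriculture  = c(100, 102, 103, 112, 116),
  commerce     = c(100, 133, 136, 138, 147),
  construction = c(100, 138, 133, 152, 262),
  industry     = c(100, 134, 159, 169, 219),
  transport    = c(100, 120, 136, 140, 164)
)

# Long format: every column except index becomes a (trade, income) pair
long <- pivot_longer(demo, cols = -index,
                     names_to = "trade", values_to = "income")

# One layer per geom now covers all trades; facet_wrap() splits them into panels
p <- ggplot(long, aes(x = index, y = income, color = trade)) +
  geom_point() +
  geom_line() +
  facet_wrap(vars(trade)) +
  xlab("year") + ylab("income") +
  labs(title = "year - income growth graph")
p
```

On the full `incoming_data`, `geom_line()` could be swapped back for `geom_smooth(method = 'loess', formula = 'y ~ x')`. By default all panels share the same y axis, which keeps the trades directly comparable; passing `scales = "free_y"` to `facet_wrap()` would instead give each panel its own range.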
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zhang (2022, Jan. 14). Data Analytics and Computational Social Science: HW5 by Guodong Zhang. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw5/
BibTeX citation
@misc{zhang2022hw5,
  author = {Zhang, Guodong},
  title = {Data Analytics and Computational Social Science: HW5 by Guodong Zhang},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw5/},
  year = {2022}
}