This is my final paper for DACSS 601.
The national income[1] is the total income accruing to a country from economic activities in a year's time. It includes payments made to all resources in the form of wages, interest, rent, and profits.
In this project, I explore the national income of China, identify which trade plays a major role in its change, and draw my conclusions through visualization. Specifically, the project aims to answer the following research questions:
Which trade has the highest national income in the greatest number of years?
Which trade grows the fastest and which one grows the slowest?
I used R[2] as the programming language for this work. My knowledge of R and statistics comes from DACSS 601[3] and the course textbook[4].
The data I used are the Chinese Real National Income Data[5], which come from Rdatasets[6].
Rdatasets is a collection of over 1700 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.[6]
The Chinese Real National Income Data record the real national income of five trades (agriculture, industry, construction, transport, and commerce) in China between 1952 and 1988. The data take the first year, 1952, as the benchmark, in which the value for each of the five trades is 100.
The following code imports the data:
# The tidyverse provides read_csv(), select(), %>% and ggplot(), all used below
library(tidyverse)
incoming_data <- read_csv(
  "C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv",
  show_col_types = FALSE)
# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1     1       100       100          100      100       100
 2     2       102.      133          138.     134.      120
 3     3       103.      136.         133.     159.      136
 4     4       112.      138.         152.     169.      140
 5     5       116.      147.         262.     219.      164
 6     6       120.      147.         243.     244.      176
 7     7       120.      156.         367      384.      271.
 8     8       101.      170.         389.     502.      356.
 9     9        83.6     164.         394      541.      384.
10    10        84.7     130.         130.     316.      221.
# ... with 27 more rows
The first column of the original data is only a row index. To make the data clearer and easier to understand, I changed the first column from 1-37 to 1952-1988, so that it gives the year of each row.
# Convert the row index (1-37) to the calendar year (1952-1988) and rename the column
incoming_data$index <- incoming_data$index + 1951
names(incoming_data)[1] <- "year"
# A tibble: 37 x 6
    year agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1  1952       100       100          100      100       100
 2  1953       102.      133          138.     134.      120
 3  1954       103.      136.         133.     159.      136
 4  1955       112.      138.         152.     169.      140
 5  1956       116.      147.         262.     219.      164
 6  1957       120.      147.         243.     244.      176
 7  1958       120.      156.         367      384.      271.
 8  1959       101.      170.         389.     502.      356.
 9  1960        83.6     164.         394      541.      384.
10  1961        84.7     130.         130.     316.      221.
# ... with 27 more rows
Which trade has the highest national income in the greatest number of years?
First, I created a new dataset from the original one. The new dataset, named max_income_trade_data, records which trade (agriculture, commerce, construction, industry, or transport) has the highest national income in each year.
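# For each row (year), find the position (1-5) of the trade with the largest value
max_income_trade_data <- apply(select(incoming_data,2:6), 1, which.max)
# Pair each position with its year; the column names follow the output shown below
max_income_trade_data <- data.frame(
  year = incoming_data$year,
  trade = max_income_trade_data)
# Replace each position with the name of the corresponding trade
max_income_trade_data[max_income_trade_data==1] <- "agriculture"
max_income_trade_data[max_income_trade_data==2] <- "commerce"
max_income_trade_data[max_income_trade_data==3] <- "construction"
max_income_trade_data[max_income_trade_data==4] <- "industry"
max_income_trade_data[max_income_trade_data==5] <- "transport"
max_income_trade_data %>% head(10)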
year trade
1 1952 agriculture
2 1953 construction
3 1954 industry
4 1955 industry
5 1956 construction
6 1957 industry
7 1958 industry
8 1959 industry
9 1960 industry
10 1961 industry
As the output above shows, each row of the dataset records the trade with the highest national income in the corresponding year. I then counted the number of years in which each trade had the highest national income and plotted the counts:
xaxis <- colnames(incoming_data)[2:6]
ggplot(max_income_trade_data["trade"],aes(trade,fill = trade)) +
xlim(xaxis) +
geom_bar() +
labs(title="The number of years with the highest income for each trade") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
The plot shows that industry has the highest national income in most years between 1952 and 1988, that agriculture and construction have the highest national income in only a few years, and that commerce and transport are never the trade with the highest income in any year.
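The row-wise step behind this count can be checked on a small scale. The following is only a minimal sketch in base R: it rebuilds the first five rows of the table printed earlier (values rounded as displayed), and the names `first_rows`, `winner`, and `counts` are illustrative rather than part of the original analysis. It also shows that `which.max()` returns the first maximum when values tie, which is why 1952, where every trade equals 100, is labelled agriculture.

```r
# First five years, copied from the printed tibble (rounded values)
first_rows <- data.frame(
  year         = 1952:1956,
  agriculture  = c(100, 102, 103, 112, 116),
  commerce     = c(100, 133, 136, 138, 147),
  construction = c(100, 138, 133, 152, 262),
  industry     = c(100, 134, 159, 169, 219),
  transport    = c(100, 120, 136, 140, 164)
)

trade_names <- names(first_rows)[-1]

# Position of the largest value in each row; ties go to the first column
winner <- trade_names[apply(first_rows[, -1], 1, which.max)]

# Number of years in which each trade ranks first (zero counts are kept)
counts <- table(factor(winner, levels = trade_names))

print(data.frame(year = first_rows$year, trade = winner))
print(counts)
```

Because of the tie rule, one of the years counted for agriculture is the benchmark year itself, so that bar may overstate agriculture by one year.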
Which trade grows the fastest and which one grows the slowest?
First, I created a scatterplot showing all the data points in the dataset.
graph <- ggplot() +
scale_color_manual("",values=c("agriculture"="red","commerce"="green","construction"="blue","industry"="hotpink","transport"="yellow")) +
xlab("year") + ylab("national income") +
labs(title="year - national income growth graph") +
geom_point(data=incoming_data,aes(x=year,y=agriculture,color="agriculture")) +
geom_point(data=incoming_data,aes(x=year,y=commerce,color="commerce")) +
geom_point(data=incoming_data,aes(x=year,y=construction,color="construction")) +
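geom_point(data=incoming_data,aes(x=year,y=industry,color="industry")) +
geom_point(data=incoming_data,aes(x=year,y=transport,color="transport"))
# Display the scatterplot
graph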
Readers can already see the answer to the research question: industry grows the fastest and agriculture grows the slowest, since the data points of the five trades are clearly separated. However, to make the growth trends clearer, I added a smoothing layer for each trade as follows:
graph <- graph +
geom_smooth(data=incoming_data,aes(x=year,y=agriculture,color="agriculture"),method='loess',formula='y ~ x') +
geom_smooth(data=incoming_data,aes(x=year,y=commerce,color="commerce"),method='loess',formula='y ~ x') +
geom_smooth(data=incoming_data,aes(x=year,y=construction,color="construction"),method='loess',formula='y ~ x') +
geom_smooth(data=incoming_data,aes(x=year,y=industry,color="industry"),method='loess',formula='y ~ x') +
geom_smooth(data=incoming_data,aes(x=year,y=transport,color="transport"),method='loess',formula='y ~ x')
The plot shows that the national income of industry grows the fastest, followed by construction, transport, and commerce, while that of agriculture grows the slowest.
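A visual ranking like this can be complemented by a number such as the compound average annual growth rate. The sketch below is a hedged illustration, not part of the original analysis: the helper `avg_annual_growth()` is a hypothetical name, and the inputs are only the 1952 and 1961 rows printed earlier (rounded as displayed), so the result describes that sub-period and does not reproduce the 1952-1988 ranking.

```r
# Compound average annual growth rate between two index values
avg_annual_growth <- function(start_value, end_value, n_years) {
  (end_value / start_value)^(1 / n_years) - 1
}

# Index values for 1952 and 1961, copied from the printed rows (rounded)
index_1952 <- c(agriculture = 100, commerce = 100, construction = 100,
                industry = 100, transport = 100)
index_1961 <- c(agriculture = 84.7, commerce = 130, construction = 130,
                industry = 316, transport = 221)

growth <- avg_annual_growth(index_1952, index_1961, n_years = 1961 - 1952)

# Trades ordered from fastest to slowest over this sub-period only, in percent
print(round(sort(growth, decreasing = TRUE) * 100, 2))
```

For the full period, the same function could be given the 1988 row of incoming_data and n_years = 1988 - 1952.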
I chose the Chinese Real National Income Data for this analysis because, as a student from China, I wanted to check whether the statistics match my everyday experience in China. I searched online and found the Rdatasets collection, which includes the dataset on Chinese national income, and then decided to explore related questions.
The biggest challenge I met was how to count the number of years in Research Question 1. At first, I tried to get the counts directly from the original dataset (based on my solution to Homework 3[7]), but that was too hard for me. I then had the idea of creating a new dataset that stores which trade has the highest national income in each year, and then counting the number of years for each trade. The idea worked.
What I wished to know about Chinese national income was whether the data fit my everyday experience in China: rapid growth in all sectors of the economy except agriculture.
Furthermore, there were two things I wanted to know before enrolling in DACSS 601: 1. How to program in R and how to use RStudio. 2. The similarities and differences between R and Python. By the end of this course, I had largely mastered the use of R and RStudio. I also think that R and Python differ mainly in syntax, and that they offer very similar functions and usage for data analytics. However, R is much more specialized.
After finishing the project, I figured out everything I wished to know.
I have always wanted to explore the year-on-year growth of each trade. But this question requires a more complex R implementation, and I did not have enough time to learn how to complete it. I hope to learn enough in the next few months to be able to do so.
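One possible starting point for that future work is a minimal sketch in base R using `diff()`. The helper name `yoy_growth()` is hypothetical, and the input is only the industry column for 1952-1957 as printed earlier (rounded), so this illustrates the calculation rather than giving a finished analysis.

```r
# Year-on-year percentage change of an index series;
# the first year has no predecessor, so its value is NA
yoy_growth <- function(x) {
  change <- c(NA, diff(x) / head(x, -1) * 100)
  setNames(change, names(x))
}

# Industry index for 1952-1957, copied from the printed rows (rounded)
industry <- c(100, 134, 159, 169, 219, 244)
names(industry) <- 1952:1957

industry_yoy <- yoy_growth(industry)
print(round(industry_yoy, 1))
```

Applied to the full data, the same function could be mapped over the five trade columns, for example with sapply(incoming_data[2:6], yoy_growth).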
Combining Research Questions 1 and 2, we can verify the following point, which comes from the source of the dataset[8]: The economic development strategy of the People’s Republic of China during the three decades beginning in the early 1950s is characterized by a high rate of capital accumulation at the expense of consumption and the promotion of industry at the expense of agriculture. Therefore, the growth of industrial national income has been remarkable, the growth of commercial national income has been small, and the growth of agricultural national income has been almost negligible. This also matches my everyday experience in China.
The author of the dataset normalized the data: the initial value of each of the five trades is set to 100, so it is easy to see by what percentage the national income has grown.
However, the dataset loses the absolute values of the national income, which may cause problems. For example, as a traditional agricultural power, China may have had a large agricultural national income in 1952 while that of industry was very small, so the relative growth of industry is much greater than that of agriculture. Further research is needed to clarify this.
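The concern can be made concrete with a small sketch. All income levels below are HYPOTHETICAL numbers chosen only for illustration (the real levels are not in the dataset). They show how two trades with the same absolute increase can end up with very different index values once each series is rebased to 100.

```r
# HYPOTHETICAL base-year income levels in arbitrary units (not real data)
base_level <- c(agriculture = 1000, industry = 100)

# HYPOTHETICAL levels in some later year (not real data)
later_level <- c(agriculture = 1500, industry = 600)

# Rebase each series so that the base year equals 100
index_later <- later_level / base_level * 100

# Change measured in the original units
absolute_change <- later_level - base_level

print(index_later)
print(absolute_change)
```

In this made-up case both trades gain the same amount in absolute terms, yet the industry index is four times the agriculture index, which is the kind of effect the normalized data cannot rule out.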
[4] Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O'Reilly Media.
[6] Rdatasets
[8] Chow, G.C. (1993). Capital Formation and Economic Growth in China. Quarterly Journal of Economics, 108, 809–842.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zhang (2022, Jan. 25). Data Analytics and Computational Social Science: A Study of Chinese Real National Income Data. Retrieved from
BibTeX citation
@misc{zhang2022a,
  author = {Zhang, Guodong},
  title = {Data Analytics and Computational Social Science: A Study of Chinese Real National Income Data},
  url = {},
  year = {2022}
}