Data Analytics and Computational Social Science: A Study of Chinese Real National Income Data

Guodong Zhang

Introduction

National income^[1] is the total income accruing to a country from economic activities in a year's time. It includes payments made to all resources, whether in the form of wages, interest, rent, or profits.

In this project, I explore the national income of China, identify which trade plays a major role in its change, and draw my conclusions through visualization. Specifically, the project aims to answer the following research questions:

Which trade has the highest national income in the greatest number of years?
Which trade grows the fastest and which one grows the slowest?

I used R^[2] as the programming language for this work. My knowledge of R and statistics comes from DACSS 601^[3] and the course textbook^[4].

Data

The data I used is the Chinese Real National Income Data^[5], which comes from Rdatasets^[6].

Rdatasets is a collection of over 1700 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.^[6]

The Chinese Real National Income Data records the real national income of five trades (agriculture, industry, construction, transport, and commerce) in China between 1952 and 1988. The data takes the first year, 1952, as the benchmark: the value for each of the five trades is set to 100 in that year, so later values are expressed relative to 1952.
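
To make the benchmark convention concrete, the short sketch below rebases a made-up series so that its first value becomes 100. The numbers and the helper name rebase_to_100 are assumptions for illustration only; they are not taken from the dataset.

```r
# Hypothetical absolute income figures (made-up numbers, arbitrary units).
raw_income <- c(250, 255, 280, 300)

# Rebase a series so that its first observation equals 100.
# rebase_to_100 is an illustrative helper, not part of any package.
rebase_to_100 <- function(x) {
    100 * x / x[1]
}

income_index <- rebase_to_100(raw_income)
income_index
# 100 102 112 120
```

Read this way, an index value of 120 means the series is 20 percent above its level in the benchmark year.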

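The following code imports the data:

# read_csv() comes from readr; the tidyverse also provides select(), %>%, and ggplot(), which are used later
library(tidyverse)
incoming_data <- read_csv(
    "C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv",
    show_col_types = FALSE)
incoming_data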

# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1     1       100       100          100      100       100 
 2     2       102.      133          138.     134.      120 
 3     3       103.      136.         133.     159.      136 
 4     4       112.      138.         152.     169.      140 
 5     5       116.      147.         262.     219.      164 
 6     6       120.      147.         243.     244.      176 
 7     7       120.      156.         367      384.      271.
 8     8       101.      170.         389.     502.      356.
 9     9        83.6     164.         394      541.      384.
10    10        84.7     130.         130.     316.      221.
# ... with 27 more rows

The first column of the original data is only the row index. To make the data clearer and easier to understand, I changed the first column from 1-37 to 1952-1988, so that it holds the year of each row.

incoming_data$index <- incoming_data$index + 1951
names(incoming_data)[1] <- "year"
incoming_data

# A tibble: 37 x 6
    year agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1  1952       100       100          100      100       100 
 2  1953       102.      133          138.     134.      120 
 3  1954       103.      136.         133.     159.      136 
 4  1955       112.      138.         152.     169.      140 
 5  1956       116.      147.         262.     219.      164 
 6  1957       120.      147.         243.     244.      176 
 7  1958       120.      156.         367      384.      271.
 8  1959       101.      170.         389.     502.      356.
 9  1960        83.6     164.         394      541.      384.
10  1961        84.7     130.         130.     316.      221.
# ... with 27 more rows

Visualization

Research Question 1

Which trade has the highest national income in the greatest number of years?

First, I created a new dataset based on the original one. The new dataset, named max_income_trade_data, records which trade (agriculture, commerce, construction, industry, or transport) has the highest national income in each year.

# Columns 2 to 6 hold the five trades; which.max() gives the position of each row's
# maximum (the first position when values tie, as in 1952, where every trade is 100).
trade_names <- colnames(incoming_data)[2:6]
max_income_trade_data <- data.frame(
    year = incoming_data$year,
    trade = trade_names[apply(select(incoming_data, 2:6), 1, which.max)])

max_income_trade_data %>% head(10)

   year        trade
1  1952  agriculture
2  1953 construction
3  1954     industry
4  1955     industry
5  1956 construction
6  1957     industry
7  1958     industry
8  1959     industry
9  1960     industry
10 1961     industry

As the output above shows, each row of the dataset records the trade with the highest national income in the corresponding year. I then counted the number of years in which each trade had the highest national income and plotted the counts:

xaxis <- colnames(incoming_data)[2:6]
ggplot(max_income_trade_data["trade"],aes(trade,fill = trade)) +
    xlim(xaxis) +
    geom_bar() +
    labs(title="The number of years with the highest income for each trade") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

The plot shows that industry has the highest national income in most years between 1952 and 1988; agriculture and construction have the highest national income in only a few years; and commerce and transport are never the trade with the highest income in any year.
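
The counts behind the bar chart can also be inspected as a table. The sketch below is a minimal base R version that uses only the ten rows printed earlier (1952 to 1961, with values rounded as in the tibble output), so its counts cover that ten-year subset rather than the full 37 years. The object names first_ten and top_trade are hypothetical.

```r
# First ten rows of the data as printed earlier (values rounded as in the tibble output).
first_ten <- data.frame(
    year         = 1952:1961,
    agriculture  = c(100, 102, 103, 112, 116, 120, 120, 101, 83.6, 84.7),
    commerce     = c(100, 133, 136, 138, 147, 147, 156, 170, 164, 130),
    construction = c(100, 138, 133, 152, 262, 243, 367, 389, 394, 130),
    industry     = c(100, 134, 159, 169, 219, 244, 384, 502, 541, 316),
    transport    = c(100, 120, 136, 140, 164, 176, 271, 356, 384, 221)
)

trades <- names(first_ten)[-1]

# which.max() returns the FIRST maximum, so the 1952 tie (all values are 100)
# is credited to agriculture only because it is the first trade column.
top_trade <- trades[apply(first_ten[trades], 1, which.max)]

# Number of years in which each trade ranks first (subset 1952-1961 only).
table(factor(top_trade, levels = trades))
```

Because all five trades equal 100 in the benchmark year, agriculture is credited with 1952 only through this tie-breaking rule.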

Research Question 2

Which trade grows the fastest and which one grows the slowest?

First, I created a scatterplot to show all the data points in the dataset.

graph <- ggplot() +
    scale_color_manual("",values=c("agriculture"="red","commerce"="green","construction"="blue","industry"="hotpink","transport"="yellow")) +
    xlab("year") + ylab("national income") +
    labs(title="year - national income growth graph") +
    geom_point(data=incoming_data,aes(x=year,y=agriculture,color="agriculture")) +
    geom_point(data=incoming_data,aes(x=year,y=commerce,color="commerce")) +
    geom_point(data=incoming_data,aes(x=year,y=construction,color="construction")) +
    geom_point(data=incoming_data,aes(x=year,y=industry,color="industry")) +
    geom_point(data=incoming_data,aes(x=year,y=transport,color="transport"))
graph
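
An alternative to adding one layer per trade is to reshape the data into long format, so that a single layer can map color to a trade column. The sketch below illustrates the reshaping step in base R on a three-year excerpt of the values printed earlier. The names excerpt and long_data are hypothetical, and the plotting call is guarded because it assumes ggplot2 is installed.

```r
# Three-year, two-trade excerpt of the printed data (rounded values).
excerpt <- data.frame(
    year        = 1952:1954,
    agriculture = c(100, 102, 103),
    industry    = c(100, 134, 159)
)

# stack() puts all trade values in one column (values) and
# records the source column name in another (ind).
trade_cols <- c("agriculture", "industry")
stacked <- stack(excerpt[trade_cols])
long_data <- data.frame(
    year   = rep(excerpt$year, times = length(trade_cols)),
    income = stacked$values,
    trade  = stacked$ind
)
long_data

# With the long format, one geom_point() layer covers every trade.
if (requireNamespace("ggplot2", quietly = TRUE)) {
    p <- ggplot2::ggplot(long_data, ggplot2::aes(x = year, y = income, color = trade)) +
        ggplot2::geom_point()
}
```

The layered version above produces the same kind of plot; the long format mainly saves repetition when more trades or layers are added.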

Readers can already see the answer to the research question: industry grows the fastest and agriculture grows the slowest, as the data points of the five trades are clearly separated. However, to make the growth trends clearer, I added a smoothing layer for each trade as follows:

graph <- graph +
    geom_smooth(data=incoming_data,aes(x=year,y=agriculture,color="agriculture"),method='loess',formula='y ~ x') +
    geom_smooth(data=incoming_data,aes(x=year,y=commerce,color="commerce"),method='loess',formula='y ~ x') + 
    geom_smooth(data=incoming_data,aes(x=year,y=construction,color="construction"),method='loess',formula='y ~ x') + 
    geom_smooth(data=incoming_data,aes(x=year,y=industry,color="industry"),method='loess',formula='y ~ x') + 
    geom_smooth(data=incoming_data,aes(x=year,y=transport,color="transport"),method='loess',formula='y ~ x')
graph

The plot shows that the national income of industry grows the fastest, followed by construction, transport, and commerce, while that of agriculture grows the slowest.
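
The visual ranking can be complemented by a number. One common summary is the compound annual growth rate, (end / start)^(1 / n) - 1 over n years. The sketch below applies it to the 1952 and 1961 values printed earlier, so the resulting ranking describes only those first nine years and should not be read as the result for the whole 1952 to 1988 period. The helper cagr and the vector names are hypothetical.

```r
# Index values for 1952 and 1961, rounded as in the printed tibble.
start_1952 <- c(agriculture = 100, commerce = 100, construction = 100,
                industry = 100, transport = 100)
end_1961   <- c(agriculture = 84.7, commerce = 130, construction = 130,
                industry = 316, transport = 221)

# Compound annual growth rate over n_years (illustrative helper).
cagr <- function(start, end, n_years) {
    (end / start)^(1 / n_years) - 1
}

growth <- cagr(start_1952, end_1961, n_years = 1961 - 1952)

# Percent per year, fastest first.
round(100 * sort(growth, decreasing = TRUE), 2)
```

Applying the same function to the 1952 and 1988 rows of incoming_data would give the full-period rates.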

Reflection

Why This Dataset

I chose the Chinese Real National Income Data for this analysis because, as a student from China, I wanted to check whether the statistics match my everyday impressions of life in China. I searched online and found Rdatasets, which includes this dataset on Chinese national income, and decided to explore related questions.

Challenge

The biggest challenge I met was counting the number of years in Research Question 1. At first, I tried to get the counts directly from the original dataset (based on my solution to Homework 3^[7]), but that proved too hard for me. I then had the idea of creating a new dataset that stores which trade has the highest national income in each year and counting the number of years for each trade. The idea worked.

Wish To Know

What I wished to know about Chinese national income was whether the data fits my everyday impressions of China: rapid growth in all sectors of the economy except agriculture.

Furthermore, there were two things I wanted to know before enrolling in DACSS 601: 1. How to program in R and how to use RStudio. 2. The similarities and differences between R and Python. By the end of the course, I had basically mastered the use of R and RStudio. I also think that R and Python differ only in syntax and offer very similar functions and usage for data analytics. However, R is much more narrowly focused.

After finishing the project, I had found answers to everything I wished to know.

Next Step

I have always wanted to explore the year-on-year growth of each trade. However, this question requires a more complex R implementation, and I did not have enough time to learn how to do it. I hope to learn enough in the next few months to be able to do that.
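
One possible starting point, sketched here under the assumption that year-on-year growth means the percentage change from the previous year, needs only base R. The example uses the first five agriculture and industry values printed earlier, and yoy_growth is a hypothetical helper name.

```r
# Five-year, two-trade excerpt of the printed data (rounded values).
excerpt <- data.frame(
    year        = 1952:1956,
    agriculture = c(100, 102, 103, 112, 116),
    industry    = c(100, 134, 159, 169, 219)
)

# Percentage change from the previous year.
# The first year has no predecessor, so its growth is NA.
yoy_growth <- function(x) {
    c(NA, 100 * diff(x) / head(x, -1))
}

yoy <- data.frame(
    year        = excerpt$year,
    agriculture = yoy_growth(excerpt$agriculture),
    industry    = yoy_growth(excerpt$industry)
)
yoy
```

The same helper could be applied to each trade column of the full dataset, and the result plotted against year in the same way as the index values.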

Conclusion

Study Result

Combining Research Questions 1 and 2, we can verify the following point from the source of the dataset^[8]: The economic development strategy of the People’s Republic of China during the three decades beginning in the early 1950s is characterized by a high rate of capital accumulation at the expense of consumption and the promotion of industry at the expense of agriculture. Therefore, the growth of industrial national income has been remarkable, the growth of commercial national income is tiny, and the growth of agricultural national income is almost negligible. This also fits my everyday impressions of China.

Future Work

The author of the dataset normalized the data: the initial value of each of the five trades is set to 100, so it is easy to see by what percentage the national income has grown.

However, the dataset loses the exact values of the national income, which may cause problems. For example, as a traditionally agricultural country, China may have had a large agricultural national income in 1952 while that of industry was very small, which would make the growth of industry appear much greater than that of agriculture. Future research is needed to clarify this.
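
A small numerical illustration of this concern follows. The 1952 base values are entirely made up, since the dataset does not report absolute figures; only the 1957 index values come from the data printed earlier.

```r
# HYPOTHETICAL 1952 absolute incomes in arbitrary units; not from the dataset.
base_1952 <- c(agriculture = 500, industry = 50)

# Index values for 1957 as printed in the data (1952 = 100), rounded.
index_1957 <- c(agriculture = 120, industry = 244)

# Absolute level implied by each index under the hypothetical bases.
level_1957 <- base_1952 * index_1957 / 100

# Absolute change since 1952.
level_1957 - base_1952
# agriculture    industry
#         100          72
```

Under these assumed bases, agriculture would add more income in absolute terms than industry even though its index rises by 20 percent against 144 percent for industry, which is why the index alone cannot settle the question.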

Bibliography

[1] Concept of National Income
[2] The R Project for Statistical Computing
[3] DACSS 601: Data Science Fundamentals
[4] Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O'Reilly Media.
[5] Chinese Real National Income Data
[6] Rdatasets
[7] HW3 by Guodong Zhang
[8] Chow, G.C. (1993). Capital Formation and Economic Growth in China. Quarterly Journal of Economics, 108, 809–842.
