This is my Homework 6 for DACSS 601, and an incomplete draft of the final project.
National income is the total amount of income accruing to a country from economic activities over one year. It includes payments made to all resources in the form of wages, interest, rent, and profits [1].
My research questions are as follows:
Which trade has the highest national income in the greatest number of years?
Which trade grows the fastest and which one grows the slowest?
The data I used are the Chinese Real National Income Data [2], which come from Rdatasets [3].
Rdatasets is a collection of over 1700 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development [3].
The Chinese Real National Income Data record the real national income of five trades (agriculture, industry, construction, transport, and commerce) in China between 1952 and 1988. The data are indexed to the first year, 1952, in which the value for each of the five trades is set to 100.
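The following code imports the data:
library(tidyverse)  # provides read_csv(), select(), %>%, and ggplot()

# Read the ChinaIncome CSV file (adjust the path to your local copy)
incoming_data <- read_csv(
  "C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv",
  show_col_types = FALSE)
incoming_data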
# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1     1         100      100          100      100       100
 2     2        102.      133         138.     134.       120
 3     3        103.     136.         133.     159.       136
 4     4        112.     138.         152.     169.       140
 5     5        116.     147.         262.     219.       164
 6     6        120.     147.         243.     244.       176
 7     7        120.     156.          367     384.      271.
 8     8        101.     170.         389.     502.      356.
 9     9        83.6     164.          394     541.      384.
10    10        84.7     130.         130.     316.      221.
# ... with 27 more rows
The first column of the original data is only a row index. To make the data clearer and easier to understand, I converted the first column from 1-37 to 1952-1988, so that it gives the year of each row.
# Convert the row index (1-37) to the calendar year (1952-1988)
incoming_data$index <- incoming_data$index + 1951
names(incoming_data)[1] <- "year"
incoming_data
# A tibble: 37 x 6
    year agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1  1952         100      100          100      100       100
 2  1953        102.      133         138.     134.       120
 3  1954        103.     136.         133.     159.       136
 4  1955        112.     138.         152.     169.       140
 5  1956        116.     147.         262.     219.       164
 6  1957        120.     147.         243.     244.       176
 7  1958        120.     156.          367     384.      271.
 8  1959        101.     170.         389.     502.      356.
 9  1960        83.6     164.          394     541.      384.
10  1961        84.7     130.         130.     316.      221.
# ... with 27 more rows
Which trade has the highest national income in the greatest number of years?
First, I created a new dataset from the original one. The new dataset, named max_income_trade_data, records which trade (agriculture, commerce, construction, industry, or transport) has the highest national income in each year.
# Names of the five trade columns, in column order
trade_names <- names(incoming_data)[2:6]
# For each row, find the position of the largest value (ties resolve to the first column)
max_position <- apply(select(incoming_data, 2:6), 1, which.max)
# Pair each year with the name of its top trade
max_income_trade_data <- data.frame(
  year = incoming_data$year,
  trade = trade_names[max_position])
max_income_trade_data %>% head(10)
   year        trade
1  1952  agriculture
2  1953 construction
3  1954     industry
4  1955     industry
5  1956 construction
6  1957     industry
7  1958     industry
8  1959     industry
9  1960     industry
10 1961     industry
As we can see, each row of the dataset records the trade with the highest national income in the corresponding year. I then counted, for each trade, the number of years in which it had the highest national income, and plotted the counts:
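The code for the counting and plotting step is not shown in this draft. The following is a minimal sketch of one way to do it, assuming dplyr and ggplot2 are installed. The data frame max_trade_sample and the names trade_counts and count_plot are hypothetical; the sample holds only the ten rows printed above (1952-1961), not the full 37 years.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in: the ten rows printed above (1952-1961)
max_trade_sample <- data.frame(
  year = 1952:1961,
  trade = c("agriculture", "construction", "industry", "industry",
            "construction", "industry", "industry", "industry",
            "industry", "industry")
)

# Count the number of years in which each trade ranks first
trade_counts <- max_trade_sample %>%
  count(trade, name = "years")
print(trade_counts)

# Bar chart of the counts, one bar per trade that ranks first at least once
count_plot <- ggplot(trade_counts, aes(x = trade, y = years)) +
  geom_col() +
  xlab("trade") +
  ylab("number of years with the highest national income")
count_plot
```

Replacing max_trade_sample with max_income_trade_data would give the counts for the whole period. Trades that never rank first do not appear in the counts, so they get no bar in this sketch.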
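The plot shows that industry has the highest national income in most years between 1952 and 1988, that agriculture and construction have the highest national income in only a few years, and that commerce and transport never have the highest national income in any year. Note that in 1952 all five values are 100, so the year counted for agriculture there is a tie that which.max resolves in favor of the first column.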
Which trade grows the fastest and which one grows the slowest?
# Reshape to long format so one set of layers covers all five trades
incoming_data %>%
  tidyr::pivot_longer(-year, names_to = "trade", values_to = "income") %>%
  ggplot(aes(x = year, y = income, color = trade)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x) +
  scale_color_manual("", values = c("agriculture" = "red", "commerce" = "green",
    "construction" = "blue", "industry" = "hotpink", "transport" = "yellow")) +
  xlab("year") + ylab("national income") +
  labs(title = "year - national income growth graph")
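The plot answers the question visually. As a complement, the growth of each trade could also be summarized numerically. The sketch below is one possible approach, not part of the original analysis: it computes the overall growth factor and the implied average annual growth rate between the first and last year of a data frame. The names income_sample and growth_summary are hypothetical, and the sample uses only the 1952 and 1961 rows as printed above (rounded values), so its numbers describe that sub-period only.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample: the 1952 and 1961 rows as printed above (rounded)
income_sample <- tibble(
  year = c(1952, 1961),
  agriculture = c(100, 84.7),
  commerce = c(100, 130),
  construction = c(100, 130),
  industry = c(100, 316),
  transport = c(100, 221)
)

growth_summary <- income_sample %>%
  pivot_longer(-year, names_to = "trade", values_to = "income") %>%
  group_by(trade) %>%
  arrange(year, .by_group = TRUE) %>%
  summarise(
    n_years = last(year) - first(year),
    # Ratio of the last value to the first value
    growth_factor = last(income) / first(income),
    # Constant yearly rate that would produce the same overall growth
    annual_growth = growth_factor^(1 / n_years) - 1
  ) %>%
  arrange(desc(annual_growth))

print(growth_summary)
```

In this sub-period sample, industry has the largest growth factor (3.16) and agriculture has a factor below 1 (0.847). Passing the full incoming_data instead of income_sample would give the figures for 1952-1988.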
Combining Research Questions 1 and 2, we can verify the following point, which comes from the source of the dataset [X]: The economic development strategy of the People’s Republic of China during the three decades beginning in the early 1950s is characterized by a high rate of capital accumulation at the expense of consumption and the promotion of industry at the expense of agriculture. Therefore, the growth of industrial national income has been remarkable, the growth of commercial national income is small, and the growth of agricultural national income is almost negligible.
The author of the dataset normalized the data: the initial value of each of the five trades is set to 100, which makes it easy to see by what percentage national income has grown.
However, the dataset does not retain the absolute values of national income, which may cause problems. For example, as a traditionally agricultural country, China may have had a large agricultural national income in 1952 while its industrial national income was very small, so the indexed growth of industry appears much greater than that of agriculture. Further research is needed to clarify this.
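The effect of the normalization can be illustrated with a small sketch. The absolute values below are invented for illustration only and are not taken from the dataset; abs_income, to_index, indexed, and abs_change are hypothetical names. Each series is rebased so that its first-year value equals 100.

```r
# Hypothetical absolute incomes (arbitrary units), NOT taken from the dataset
abs_income <- data.frame(
  year = c(1952, 1953),
  agriculture = c(1000, 1020),
  industry = c(50, 67)
)

# Rebase a series so that its first value equals 100
to_index <- function(x) 100 * x / x[1]

indexed <- data.frame(
  year = abs_income$year,
  agriculture = to_index(abs_income$agriculture),
  industry = to_index(abs_income$industry)
)
print(indexed)

# Absolute change between the two years for each trade
abs_change <- c(
  agriculture = diff(abs_income$agriculture),
  industry = diff(abs_income$industry)
)
print(abs_change)
```

In this invented example the industry index rises from 100 to 134 while the agriculture index rises only from 100 to 102, yet the absolute gain is larger for agriculture (20 units) than for industry (17 units). The index alone cannot distinguish these cases.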
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zhang (2022, Jan. 20). Data Analytics and Computational Social Science: HW6 by Guodong Zhang. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw6/
BibTeX citation
@misc{zhang2022hw6,
  author = {Zhang, Guodong},
  title = {Data Analytics and Computational Social Science: HW6 by Guodong Zhang},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgdzhangdacss601hw6/},
  year = {2022}
}