Data Analytics and Computational Social Science: HW6 by Guodong Zhang

Guodong Zhang

Introduction

National income is the total income accruing to a country from economic activities in a year's time. It includes payments made to all resources in the form of wages, interest, rent, and profits.¹

Following are my research questions:

Which trade has the highest national income in the greatest number of years?
Which trade grows the fastest, and which grows the slowest?

Data

The data I used are the Chinese Real National Income Data², which come from Rdatasets³.

Rdatasets is a collection of over 1700 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.³

The Chinese Real National Income Data record the real national income of five trades (agriculture, industry, construction, transport, and commerce) in China between 1952 and 1988. The values are indexes that take the first year, 1952, as the benchmark: every trade is set to 100 in 1952.
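To make the benchmark concrete, here is a minimal sketch of how a base-100 index can be computed from raw values. The numbers in raw_income are hypothetical and are not taken from the dataset, and to_index is an illustrative helper, not a function from any package.

```r
# Hypothetical raw values for one trade (NOT from the dataset)
raw_income <- c(250, 255, 280, 300)

# Rescale a series so that its first observation equals 100
to_index <- function(x) x / x[1] * 100

income_index <- to_index(raw_income)
income_index
```

Read this way, an index value of 120 means the series is 20% above its level in the benchmark year.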

Following is the code to import the data:

library(tidyverse)  # provides read_csv(), select(), %>%, and ggplot()

incoming_data <- read_csv(
    "C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv",
    show_col_types = FALSE)
incoming_data

# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1     1       100       100          100      100       100 
 2     2       102.      133          138.     134.      120 
 3     3       103.      136.         133.     159.      136 
 4     4       112.      138.         152.     169.      140 
 5     5       116.      147.         262.     219.      164 
 6     6       120.      147.         243.     244.      176 
 7     7       120.      156.         367      384.      271.
 8     8       101.      170.         389.     502.      356.
 9     9        83.6     164.         394      541.      384.
10    10        84.7     130.         130.     316.      221.
# ... with 27 more rows

The first column of the original data is only a row index. To make the data clearer, I converted it from 1-37 to 1952-1988, so that it holds the year of each row.

# Convert the row index (1-37) to a calendar year (1952-1988)
incoming_data$index <- incoming_data$index + 1951
names(incoming_data)[1] <- "year"
incoming_data

# A tibble: 37 x 6
    year agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1  1952       100       100          100      100       100 
 2  1953       102.      133          138.     134.      120 
 3  1954       103.      136.         133.     159.      136 
 4  1955       112.      138.         152.     169.      140 
 5  1956       116.      147.         262.     219.      164 
 6  1957       120.      147.         243.     244.      176 
 7  1958       120.      156.         367      384.      271.
 8  1959       101.      170.         389.     502.      356.
 9  1960        83.6     164.         394      541.      384.
10  1961        84.7     130.         130.     316.      221.
# ... with 27 more rows

Visualization

Research Question 1

Which trade has the highest national income in the greatest number of years?

First, I created a new dataset based on the original one. The new dataset, named max_income_trade_data, records which trade (agriculture, commerce, construction, industry, or transport) has the highest national income in each year.

# Names of the five trade columns, in the order they appear in the data
trade_cols <- names(incoming_data)[2:6]
# For each year (row), find the column with the largest value.
# which.max() returns the first maximum, so the 1952 tie goes to agriculture.
max_income_trade_data <- data.frame(
    year = incoming_data$year,
    trade = trade_cols[apply(incoming_data[trade_cols], 1, which.max)])

max_income_trade_data %>% head(10)

   year        trade
1  1952  agriculture
2  1953 construction
3  1954     industry
4  1955     industry
5  1956 construction
6  1957     industry
7  1958     industry
8  1959     industry
9  1960     industry
10 1961     industry

As we can see, each row records the trade with the highest national income in the corresponding year. I then counted the number of years in which each trade had the highest national income and plotted the counts:

xaxis <- colnames(incoming_data)[2:6]
ggplot(max_income_trade_data["trade"],aes(trade,fill = trade)) +
    xlim(xaxis) +
    geom_bar() +
    labs(title="The number of years with the highest income for each trade") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

The plot shows that industry has the highest national income in most years between 1952 and 1988, that agriculture and construction have the highest national income in only a few years, and that commerce and transport never have the highest income in any year.
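The counts behind a bar chart like this can also be read as a table. The sketch below uses only the ten rows of max_income_trade_data printed earlier (1952 to 1961), so its counts cover that window and are not the full-period result; table() and sort() are base R functions.

```r
# First ten rows of the trade column, copied from the printed output (1952-1961)
trade <- c("agriculture", "construction", "industry", "industry",
           "construction", "industry", "industry", "industry",
           "industry", "industry")

# Tabulate how many years each trade ranks first, largest count first
trade_counts <- sort(table(trade), decreasing = TRUE)
trade_counts
```

Applying the same two calls to the full trade column would give the exact bar heights for all 37 years.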

Research Question 2

Which trade grows the fastest, and which grows the slowest?

# Reshape to long format: one row per year and trade
incoming_data %>%
    pivot_longer(-year, names_to = "trade", values_to = "income") %>%
    ggplot(aes(x = year, y = income, color = trade)) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x) +
    scale_color_manual("", values = c(
        "agriculture" = "red", "commerce" = "green", "construction" = "blue",
        "industry" = "hotpink", "transport" = "yellow")) +
    xlab("year") + ylab("national income") +
    labs(title = "year - national income growth graph")
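One way to put a number on "fastest" and "slowest" is a compound annual growth rate. This is a sketch of that measure, not the method used in this post. It uses the 1952 and 1961 index values from the printed tibble in the Data section, which are rounded, so the results are approximate and describe only that window rather than the whole period to 1988.

```r
# Index values for 1952 and 1961, read from the printed tibble (rounded)
start_index <- c(agriculture = 100, commerce = 100, construction = 100,
                 industry = 100, transport = 100)
end_index <- c(agriculture = 84.7, commerce = 130, construction = 130,
               industry = 316, transport = 221)
n_years <- 1961 - 1952

# Compound annual growth rate: (end / start)^(1 / n) - 1
cagr <- (end_index / start_index)^(1 / n_years) - 1
round(100 * cagr, 1)    # percent per year

names(which.max(cagr))  # fastest-growing trade over this window
names(which.min(cagr))  # slowest-growing trade over this window
```

Replacing end_index with the 1988 row and n_years with 1988 - 1952 would give the same measure for the full period.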

Reflection

Conclusion

Combining Research Questions 1 and 2, we can verify the following point from the source of the dataset^X: the economic development strategy of the People’s Republic of China during the three decades beginning in the early 1950s is characterized by a high rate of capital accumulation at the expense of consumption and the promotion of industry at the expense of agriculture. As a result, the growth of industrial national income has been remarkable, the growth of commercial national income has been small, and the growth of agricultural national income has been almost negligible.

Future Work

The author of the dataset normalized it: the value for each of the five trades is set to 100 in the first year, so it is easy to see by what percentage national income has grown.

However, the dataset does not retain the absolute values of national income, which may cause problems. For example, as a traditionally agricultural country, China may have had a large agricultural national income in 1952 and a very small industrial one, which would make the percentage growth of industry look much greater than that of agriculture. Future research is needed to clarify this.
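The concern can be illustrated with a small sketch. All numbers below are hypothetical (the dataset does not contain absolute levels); they only show how two trades with very different index growth can add the same absolute amount.

```r
# Hypothetical 1952 levels in arbitrary units (NOT from the dataset)
base_level <- c(agriculture = 500, industry = 50)
# Hypothetical index values in some later year (1952 = 100)
later_index <- c(agriculture = 120, industry = 300)

# Absolute level implied by the index, and the absolute increase since 1952
later_level <- base_level * later_index / 100
increase <- later_level - base_level
data.frame(base_level, later_index, later_level, increase)
```

With these made-up numbers, both trades add 100 units, although the index reaches only 120 for agriculture and 300 for industry.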

Bibliography
