Data Analytics and Computational Social Science: HW4 by Guodong Zhang

Guodong Zhang

1. Descriptive Statistics

First, read in the Chinese Real National Income dataset.

library(tidyverse)  # provides read_csv(), select(), summarize_all(), and ggplot()
incoming_data <- read_csv("C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv", show_col_types = FALSE)
incoming_data

# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1  1952       100       100          100      100       100 
 2  1953       102.      133          138.     134.      120 
 3  1954       103.      136.         133.     159.      136 
 4  1955       112.      138.         152.     169.      140 
 5  1956       116.      147.         262.     219.      164 
 6  1957       120.      147.         243.     244.      176 
 7  1958       120.      156.         367      384.      271.
 8  1959       101.      170.         389.     502.      356.
 9  1960        83.6     164.         394      541.      384.
10  1961        84.7     130.         130.     316.      221.
# ... with 27 more rows

Define a function that prints the mean, median, and standard deviation of a numerical variable. There are no categorical variables in the dataset.

statistics <- function(coldata) {
    cat("mean:\t",mean(coldata),'\n')
    cat("median:\t",median(coldata),'\n')
    cat("std:\t",sd(coldata),'\n')
}

Use a for loop to apply the function to every variable in the dataset.

for (colname in names(incoming_data)) {
    cat(colname,':\n',sep='')
    statistics(incoming_data[[colname]])
}

index:
mean:    1970
median:  1970
std:     10.82436
agriculture:
mean:    151.9486
median:  139.8
std:     54.39186
commerce:
mean:    260.9865
median:  199.2
std:     176.233
construction:
mean:    549.8973
median:  421
std:     448.0518
industry:
mean:    1244.238
median:  863
std:     1191.844
transport:
mean:    449.4973
median:  370.8
std:     328.5462

The following is the method from Tutorial 7 for computing the mean, median, and standard deviation.

Compute mean:

summarize_all(incoming_data, mean)

# A tibble: 1 x 6
  index agriculture commerce construction industry transport
  <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
1  1970        152.     261.         550.    1244.      449.

Compute median:

summarize_all(incoming_data, median)

# A tibble: 1 x 6
  index agriculture commerce construction industry transport
  <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
1  1970        140.     199.          421      863      371.

Compute standard deviation:

summarize_all(incoming_data, sd)

# A tibble: 1 x 6
  index agriculture commerce construction industry transport
  <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
1  10.8        54.4     176.         448.    1192.      329.
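The three separate calls can also be combined into one. The sketch below is only an illustration: it assumes dplyr 1.0 or later (where `across()` is available and `summarize_all()` is superseded), and it uses a small made-up tibble named `toy` in place of the real data. The names `toy`, `stats`, and `stats_long` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Made-up values for illustration only; this is NOT the ChinaIncome data
toy <- tibble(
  agriculture = c(100, 102, 104),
  commerce    = c(100, 130, 160)
)

# One row, with one column per variable/statistic pair,
# e.g. agriculture_mean, agriculture_median, agriculture_sd
stats <- toy %>%
  summarise(across(everything(),
                   list(mean = mean, median = median, sd = sd)))

# Reshape to one row per variable and statistic for easier reading.
# Splitting on "_" assumes the column names themselves contain no underscore.
stats_long <- stats %>%
  pivot_longer(everything(),
               names_to = c("variable", "statistic"),
               names_sep = "_",
               values_to = "value")

stats_long
```

Replacing `toy` with `incoming_data` should give the same numbers as the three tables above, arranged in a single tibble.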

2. Visualization One: Univariate

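# For each year (row), find the position of the largest value among columns 2 to 6
fast_fields <- apply(select(incoming_data,2:6), 1, which.max)
fast_fields <- data.frame(field=fast_fields)
# Count how many years each column position comes out on top
ggplot(fast_fields,aes(field)) + geom_histogram(bins=30)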

What variable(s) you are visualizing?

The variable is the field with the highest income in each year. It is derived from the five field columns, because none of the original variables on its own is well suited to a univariate plot.

What question(s) you are attempting to answer with the visualization?

The question is which field had the highest income in the greatest number of years between 1952 and 1988.

What conclusions you can make from the visualization?

The conclusion is that the 4th field, industry, had the highest income in the greatest number of years.

3. Visualization Two: Bivariate

ggplot(incoming_data,aes(transport,construction)) +
    geom_point() +
    geom_smooth(method='loess',formula='y ~ x')

What variable(s) you are visualizing?

The variables are transport and construction.

What question(s) you are attempting to answer with the visualization?

The question is whether there is a relationship between transport and construction.

What conclusions you can make from the visualization?

The conclusion is that construction tends to rise as transport rises, which suggests that the development of transport promotes the development of construction.
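A smoothed curve shows the shape of the relationship but not its strength. As a rough complement, a rank correlation can be computed with base R's `cor()`. The sketch below uses only the ten rows printed at the top of this post, and several of those values are rounded in the printed output, so the result is approximate and not a statistic for the full 37 years.

```r
# Values copied from the ten printed rows (1952 to 1961); entries shown with a
# trailing "." in the tibble output are rounded, so this is approximate
transport    <- c(100, 120, 136, 140, 164, 176, 271, 356, 384, 221)
construction <- c(100, 138, 133, 152, 262, 243, 367, 389, 394, 130)

# Spearman's rho measures monotonic association and is less sensitive
# than Pearson's r to the large jumps visible in these series
rho <- cor(transport, construction, method = "spearman")
rho
```

With the data loaded, the same call on the full columns would be `cor(incoming_data$transport, incoming_data$construction, method = "spearman")`. Both series grow over time, so a high correlation on its own does not show that one sector drives the other.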

4. Limitations

What about the visualizations may be unclear to a naive viewer?

The first visualization is difficult to understand without the text description (and perhaps even with it). I don’t yet know how to modify the x-axis or add more detail to the figure.
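One possible way to label the x-axis is to convert the column positions returned by `which.max` into column names and draw a bar chart instead of a histogram. This is a sketch under assumptions, not a tested replacement for the figure: the tibble `toy` and its values are made up, and the names `sectors`, `top_field`, and `p` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Made-up values for illustration only; this is NOT the ChinaIncome data
toy <- tibble(
  index       = 1952:1955,
  agriculture = c(100, 102, 103, 112),
  commerce    = c(100, 133, 136, 138),
  industry    = c(100, 134, 159, 169)
)

sectors <- select(toy, -index)

# which.max gives the position of the first maximum in each row;
# indexing names(sectors) turns that position into a column name
top_field <- names(sectors)[apply(sectors, 1, which.max)]

# A factor with all sectors as levels keeps never-winning sectors on the axis
fast_fields <- data.frame(field = factor(top_field, levels = names(sectors)))

p <- ggplot(fast_fields, aes(field)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  labs(x = "Sector with the highest income", y = "Number of years")
p
```

Because `which.max` returns the first maximum, a year in which every sector has the same value (such as 1952, where all sectors equal 100) is counted for the first column, agriculture.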

How could you improve the visualizations for the final project?

My plan is to spend more time with ggplot2 and find better ways to visualize the data. Alternatively, I could change the questions to ones that are easier to visualize.
