Data Analytics and Computational Social Science: Final Project

Hanyu

Introduction

With the widespread adoption of information technology and the rapid development of the Internet, e-commerce has gradually become a new model for conducting business. E-commerce lets users reach their chosen merchant through a computer or cell phone for fast, convenient transactions, while merchants can market goods and services nationwide through a website at fairly low cost and gradually develop a global target market.

E-commerce is an important part of strategic emerging industries and modern circulation methods, and an important vehicle for innovation in science, technology, and culture. Its vigorous development has helped to promote healthy economic and social development, expand consumer demand, and strengthen cities’ high-end resource pooling, industrial chain restructuring, and technological innovation. During the epidemic in particular, the real economy suffered huge losses, and the advantages of e-commerce became clear: high operational efficiency, low transaction costs, wide audience coverage, and close integration with the logistics industry and with technological, financial, and cultural innovation. In recent years, the volume of Internet transactions in China has continued to grow rapidly. As an important basic platform for China’s economic and social development, the Internet has further increased its influence on national economic growth and on social life, and it has become the most active area in China’s social development today.

E-commerce platforms serve both consumers and merchants. Merchants need to distinguish low-value consumers from high-value consumers through consumption data in order to gain more benefit. Consumers, in turn, want more personalized services that improve their experience and satisfaction. The platform therefore develops personalized service plans for consumers of different value and formulates corresponding marketing strategies for preference recommendation and promotion. It focuses limited marketing resources on high-value consumers while still serving different consumers individually, with the goal of maximizing profit for the platform and merchants.

Against this background, this project analyzes consumption data from the 11.11 shopping festival with the following objectives:

Classify and analyze product sales using Taobao Mall consumer data.
Analyze the characteristics of different consumers and compare the consumption value of different consumer categories.
Study the consumption habits of different consumers in order to provide personalized services to each consumer category.

Data

Read Data

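## Load packages (dplyr for glimpse and group_by, ggplot2 for plots), then read data
library(dplyr)
library(ggplot2)
data = read.csv(file = "business_data.csv")
summary(data)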

    User_ID         Product_ID           Gender         
 Min.   :1000001   Length:537577      Length:537577     
 1st Qu.:1001495   Class :character   Class :character  
 Median :1003031   Mode  :character   Mode  :character  
 Mean   :1002992                                        
 3rd Qu.:1004417                                        
 Max.   :1006040                                        
                                                        
     Age              Occupation     City_Category     
 Length:537577      Min.   : 0.000   Length:537577     
 Class :character   1st Qu.: 2.000   Class :character  
 Mode  :character   Median : 7.000   Mode  :character  
                    Mean   : 8.083                     
                    3rd Qu.:14.000                     
                    Max.   :20.000                     
                                                       
 Stay_In_Current_City_Years Marital_Status   Product_Category_1
 Length:537577              Min.   :0.0000   Min.   : 1.000    
 Class :character           1st Qu.:0.0000   1st Qu.: 1.000    
 Mode  :character           Median :0.0000   Median : 5.000    
                            Mean   :0.4088   Mean   : 5.296    
                            3rd Qu.:1.0000   3rd Qu.: 8.000    
                            Max.   :1.0000   Max.   :18.000    
                                                               
 Product_Category_2 Product_Category_3    Purchase    
 Min.   : 2.00      Min.   : 3.0       Min.   :  185  
 1st Qu.: 5.00      1st Qu.: 9.0       1st Qu.: 5866  
 Median : 9.00      Median :14.0       Median : 8062  
 Mean   : 9.84      Mean   :12.7       Mean   : 9334  
 3rd Qu.:15.00      3rd Qu.:16.0       3rd Qu.:12073  
 Max.   :18.00      Max.   :18.0       Max.   :23961  
 NA's   :166986     NA's   :373299
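The NA counts in the summary above correspond to roughly 31% missing values in Product_Category_2 and roughly 69% in Product_Category_3 (166,986 and 373,299 out of 537,577 rows). A minimal sketch of how such shares could be computed directly is given below. It uses a small made-up data frame named toy in place of business_data.csv, so the numbers it prints are illustrative only.

```r
# Hypothetical toy data standing in for the real file (values are made up).
toy <- data.frame(
  Product_Category_1 = c(3, 1, 12, 12, 8),
  Product_Category_2 = c(NA, 6, NA, 14, NA),
  Product_Category_3 = c(NA, 14, NA, NA, NA),
  Purchase           = c(8370, 15200, 1422, 1057, 7969)
)

# is.na() gives a logical matrix; colMeans() turns it into
# the share of missing values per column.
na_share <- colMeans(is.na(toy))
print(round(na_share, 2))
```

On the real data, the same two lines applied to data would give the per-column shares that motivate the cleaning step.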

Clean Data

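First, drop the Product_Category_3 column because it has too many missing values.

Second, fill in the missing values of Product_Category_2 with 0.

Third, drop the User_ID column, which is not used in the analysis below.

data = data[,-11]        # drop Product_Category_3 (column 11)
data[is.na(data)] <- 0   # the remaining NAs are all in Product_Category_2
data = data[,-1]         # drop User_ID (column 1)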
head(data)

  Product_ID Gender   Age Occupation City_Category
1  P00069042      F  0-17         10             A
2  P00248942      F  0-17         10             A
3  P00087842      F  0-17         10             A
4  P00085442      F  0-17         10             A
5  P00285442      M   55+         16             C
6  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Purchase
1                  0     8370
2                  6    15200
3                  0     1422
4                 14     1057
5                  0     7969
6                  2    15227
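Dropping columns by position works here, but it silently removes the wrong column if the file layout ever changes. The following is a sketch of a name-based variant of the same cleaning steps, shown on a made-up data frame (toy, drop_cols, and clean are illustrative names, not part of the original analysis).

```r
# Hypothetical toy data with the same column names as the real file.
toy <- data.frame(
  User_ID            = c(1000001, 1000001, 1000002),
  Product_ID         = c("P00069042", "P00248942", "P00285442"),
  Product_Category_2 = c(NA, 6, NA),
  Product_Category_3 = c(NA, 14, NA),
  Purchase           = c(8370, 15200, 7969)
)

# Drop columns by name instead of by position.
drop_cols <- c("User_ID", "Product_Category_3")
clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Fill missing values only in the column that is known to have them.
clean$Product_Category_2[is.na(clean$Product_Category_2)] <- 0

print(clean)
```

Filling only Product_Category_2 also makes it explicit that 0 is being used as a "no second category" marker rather than as a real category code.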

Display Data

glimpse() shows the data in a transposed view, with one line per column, which makes it easier to inspect.

glimpse(data)

Rows: 537,577
Columns: 10
$ Product_ID                 <chr> "P00069042", "P00248942", "P00087…
$ Gender                     <chr> "F", "F", "F", "F", "M", "M", "M"…
$ Age                        <chr> "0-17", "0-17", "0-17", "0-17", "…
$ Occupation                 <int> 10, 10, 10, 10, 16, 15, 7, 7, 7, …
$ City_Category              <chr> "A", "A", "A", "A", "C", "A", "B"…
$ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2…
$ Marital_Status             <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, …
$ Product_Category_1         <int> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5…
$ Product_Category_2         <dbl> 0, 6, 0, 14, 0, 2, 8, 15, 16, 0, …
$ Purchase                   <int> 8370, 15200, 1422, 1057, 7969, 15…

The original dataset has 12 variables and 537,577 rows; after cleaning, 10 columns remain. The 12 original variables are:

User_ID: The shopper’s code.
Product_ID: The product code.
Gender: The shopper’s gender.
Age: The shopper’s age group.
Occupation: The shopper’s occupation code.
City_Category: The category of the shopper’s city of residence.
Stay_In_Current_City_Years: The number of years spent in the current city.
Marital_Status: The shopper’s marital status.
Product_Category_1: The category of the product purchased.
Product_Category_2: A second category the product may belong to.
Product_Category_3: A third category the product may belong to.
Purchase: The amount of the purchase.

Visualization

Statistics for Age

age_group <- group_by(data, Age) %>%
  summarise(n = n(), 
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
age_group

# A tibble: 7 × 6
  Age        n Mea_Pur Med_Pur Sd_Pur   freq
  <chr>  <int>   <dbl>   <dbl>  <dbl>  <dbl>
1 0-17   14707   9020.    8009  5060. 0.0274
2 18-25  97634   9235.    8041  4996. 0.182 
3 26-35 214690   9315.    8043  4974. 0.399 
4 36-45 107499   9401.    8076  4978. 0.200 
5 46-50  44526   9285.    8050  4921. 0.0828
6 51-55  37618   9621.    8172  5035. 0.0700
7 55+    20903   9454.    8127  4939. 0.0389

Statistics for City Category

city_group <- group_by(data, City_Category) %>%
  summarise(n = n(), 
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
city_group

# A tibble: 3 × 6
  City_Category      n Mea_Pur Med_Pur Sd_Pur  freq
  <chr>          <int>   <dbl>   <dbl>  <dbl> <dbl>
1 A             144638   8958.    7941  4867. 0.269
2 B             226493   9199.    8015  4927. 0.421
3 C             166446   9844.    8618  5109. 0.310
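The age and city summaries apply the same statistics to different grouping columns, so they could share one helper. The sketch below assumes dplyr 1.0 or later and uses the embrace operator to pass the grouping column. The function name summarise_purchase and the toy data are hypothetical, introduced only for illustration.

```r
library(dplyr)

# Hypothetical toy data (values are made up).
toy <- data.frame(
  City_Category = c("A", "A", "B", "B", "B", "C", "C"),
  Purchase      = c(100, 300, 200, 400, 600, 500, 700)
)

# Count, mean, median, and standard deviation of Purchase per group,
# plus each group's share of all rows.
summarise_purchase <- function(df, group_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(
      n       = n(),
      Mea_Pur = mean(Purchase, na.rm = TRUE),
      Med_Pur = median(Purchase, na.rm = TRUE),
      Sd_Pur  = sd(Purchase, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(freq = n / sum(n))
}

city_toy <- summarise_purchase(toy, City_Category)
print(city_toy)
```

With the real data, summarise_purchase(data, Age) and summarise_purchase(data, City_Category) would be expected to reproduce the two tables, and the same call could be reused for Gender or Marital_Status.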

Purchase Amount Distribution

ggplot(data, aes(Purchase, fill = Age)) + 
  geom_histogram(color = "black", binwidth = 500)+
  labs(title = "Purchase Distribution",x = "purchase", y="count") + 
  theme(axis.text=element_text(size = 7)) +
  facet_wrap(vars(Gender), scales = "free")

This plot shows two quantities: the purchase amount and the number of purchases at each amount. It reveals how much people tend to spend per purchase on the day of the shopping festival. The purchase amounts roughly follow a bell-shaped distribution. Because the two panels use free scales, the counts have to be read from each panel’s own axis: men make almost three times as many purchases as women, which is consistent with the per-category counts by gender later in this section. Regardless of gender, the most common purchase amount is around 7,500.

However, in this plot both men and women show abnormally high counts at about 15,000 and abnormally low counts on either side of it. The data alone does not reveal the reason, so this question remains open.
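Observations read off a histogram, such as the gender split and the most common purchase amount, can also be checked numerically. The sketch below does this on made-up data. The bin edges (multiples of 500) are an assumption and may not line up exactly with the bins ggplot2 draws for binwidth = 500, and the names toy, bin_start, and modal_start are illustrative.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  Gender   = c("M", "M", "M", "F"),
  Purchase = c(7600, 7900, 15100, 7700)
)

# Share of purchases made by each gender.
gender_share <- prop.table(table(toy$Gender))
print(gender_share)

# Assign each purchase to a 500-wide bin and find the most frequent bin.
bin_start   <- floor(toy$Purchase / 500) * 500
modal_start <- as.numeric(names(which.max(table(bin_start))))
print(modal_start)
```

Running the same steps on data would give the actual gender shares and the lower edge of the most populated bin, which can then be compared with the description of the plot.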

Consumer Favorite Product Category

PC_group <- data %>% 
  group_by(Product_Category_1) %>%
  count()
# Bars are ordered by category number; the fill is a fixed color, so it is set outside aes()
ggplot(PC_group, aes(x = factor(Product_Category_1),
                     y = n)) +
  geom_col(fill = "orange") +
  labs(x = "Product Category",
       y = "Count",
       title = "Consumer Favorite Product Category")

The plot shows that consumers’ favorite product category is No. 1, which is significantly ahead of the others. It is followed by No. 2, No. 5, and No. 8, which sell at moderate levels, while No. 10, No. 11, No. 12, and No. 13 attract very few purchases.

Different Gender’s Preference for Various Products

G_P_group <- data %>%
  group_by(Gender, Product_Category_1) %>%
  count()
ggplot(G_P_group, aes(x = as.factor(Product_Category_1),
                      y = n,
                      fill = as.factor(Gender))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Product Category", y = "Count", fill = "gender",
       title = "Different gender's preference for various products")

Across product categories, male consumers generally make more purchases than female consumers, and the gap is largest for No. 1.
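Raw counts mix two effects: how many purchases each gender makes overall and which categories each gender prefers. One way to separate them is to compare each category’s share within a gender. The sketch below assumes dplyr and uses made-up data; gender_share is an illustrative name.

```r
library(dplyr)

# Hypothetical toy data (values are made up).
toy <- data.frame(
  Gender             = c("F", "F", "M", "M", "M", "M"),
  Product_Category_1 = c(1, 5, 1, 1, 5, 8)
)

# Count purchases per gender and category, then convert the counts
# to shares within each gender so the two genders are comparable.
gender_share <- toy %>%
  count(Gender, Product_Category_1) %>%
  group_by(Gender) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(gender_share)
```

In this toy example both genders put half of their purchases into category 1, even though the male count is twice the female count. Plotting share instead of n in the bar chart would show preference rather than volume.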

Average Consumer Price by Category

# Purchase distribution by category, ordered by mean purchase
# (violin drawn first so the narrower boxplot stays visible on top)
ggplot(data, aes(x = reorder(as.factor(Product_Category_1), Purchase),
                 y = Purchase)) +
  geom_violin(fill = "blue", size = .4) +
  geom_boxplot(fill = "pink", width = .3) +
  labs(title = "Average consumer price by category",
       x = "Product Category",
       y = "purchase")

The plot shows the purchase distribution for each category. Category 10 has the highest average purchase amount, and the distribution is markedly asymmetric.
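The category with the highest average purchase can also be confirmed with a small table instead of being read from the plot. The sketch below uses base R on made-up data; cat_mean, cat_median, and top_category are illustrative names.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  Product_Category_1 = c(1, 1, 10, 10, 5),
  Purchase           = c(12000, 14000, 19000, 20000, 6000)
)

# Mean and median purchase per category.
cat_mean   <- tapply(toy$Purchase, toy$Product_Category_1, mean)
cat_median <- tapply(toy$Purchase, toy$Product_Category_1, median)

# Categories sorted from highest to lowest mean purchase.
print(sort(cat_mean, decreasing = TRUE))

# A mean well away from the median hints at an asymmetric distribution.
print(cat_mean - cat_median)

top_category <- names(which.max(cat_mean))
print(top_category)
```

Applied to data, the same calls would list the real per-category means, and the gap between mean and median gives a rough numeric check on the asymmetry seen in the violins.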

Conclusion

Overall, the project analyzed a consumer purchase dataset from the 11.11 shopping day. For consumers, it examined how attributes such as age, gender, occupation, residence, and marital status relate to purchases. For products, the best-selling categories were identified, and several metrics for the 11.11 shopping day were computed, including average customer spending and purchase amounts across categories.

This analysis gives merchants a basis for preparing for the shopping festival and for deciding which customer groups to offer personalized services to in order to increase profit.

Reflection

In this analysis, the sample size is large, but the attributes are few and each is rather one-sided, so they cannot be combined into a more complete joint analysis. In addition, no model evaluation step was included, so the suitability of any final model cannot be judged.

The results are also not presented intuitively enough for merchants to find the information they need at a glance. A result dataset could therefore be built to match this information, with index columns tied to the corresponding results to make them easier to read.

It is very important for an engineering student to learn how to handle data. This was my first data science course, and because I had studied other programming languages, R did not feel entirely unfamiliar. However, R is a powerful language with many efficient libraries, and it takes a lot of study and practice to use it proficiently. Through this course, I gradually mastered some basic usage of R and tried analyzing a dataset. I ran into a lot of confusion and difficulty along the way, but in working through them I found that my data analysis ability improved greatly.

Bibliography

Wickham, H. & Grolemund, G. (n.d.). R for data science [eBook edition]. O’Reilly. https://r4ds.had.co.nz/index.html

Wickham, H., François, R., Henry, L., & Müller, K. (n.d.). Programming with dplyr. dplyr. https://dplyr.tidyverse.org/articles/programming.html

Kabacoff, R. (2020). Data visualizations with R. Quantitative Analysis Center, Wesleyan University. https://rkabacoff.github.io/datavis/index.html
