Shopping Festival Consumption Data Analysis
With the widespread adoption of information technology and the rapid development of the Internet, e-commerce has gradually become a new model for doing business. It lets users reach their chosen merchant through a computer or cell phone and complete transactions quickly and conveniently, while merchants can market goods and services nationwide through a website at fairly low cost and expand into global target markets.
E-commerce is an important part of strategic emerging industries and modern circulation, and an important vehicle for scientific, technological, and cultural innovation. Its vigorous development has helped promote healthy economic and social development, expand consumer demand, and strengthen cities’ pooling of high-end resources, industrial chain restructuring, and technological innovation. During the epidemic in particular, when the real economy suffered huge losses, the advantages of e-commerce became apparent: high operational efficiency, low transaction costs, wide audience coverage, and close integration with logistics and with technological, financial, and cultural innovation. In recent years, the volume of Internet transactions in China has continued to grow rapidly. As an important basic platform for China’s economic and social development, the Internet has further increased its influence on national economic growth and social life, and has become the most active area of China’s social development today.
E-commerce platforms serve both consumers and merchants. Merchants need to distinguish low-value consumers from high-value consumers through consumption data in order to gain more benefit, and consumers need more personalized services to improve their experience and satisfaction. The platform therefore develops personalized service plans for consumers of different value and formulates corresponding marketing strategies for preference-based recommendation and promotion. It focuses limited marketing resources on high-value consumers while serving different consumers in a personalized way, with the goal of maximizing profit for the platform and its merchants.
Based on this situation, this project analyzes the consumption data of the 11.11 shopping festival to achieve the following objectives.
    User_ID         Product_ID           Gender
 Min.   :1000001   Length:537577      Length:537577
 1st Qu.:1001495   Class :character   Class :character
 Median :1003031   Mode  :character   Mode  :character
 Mean   :1002992
 3rd Qu.:1004417
 Max.   :1006040
     Age              Occupation     City_Category
 Length:537577      Min.   : 0.000   Length:537577
 Class :character   1st Qu.: 2.000   Class :character
 Mode  :character   Median : 7.000   Mode  :character
                    Mean   : 8.083
                    3rd Qu.:14.000
                    Max.   :20.000
 Stay_In_Current_City_Years Marital_Status   Product_Category_1
 Length:537577              Min.   :0.0000   Min.   : 1.000
 Class :character           1st Qu.:0.0000   1st Qu.: 1.000
 Mode  :character           Median :0.0000   Median : 5.000
                            Mean   :0.4088   Mean   : 5.296
                            3rd Qu.:1.0000   3rd Qu.: 8.000
                            Max.   :1.0000   Max.   :18.000
 Product_Category_2 Product_Category_3    Purchase
 Min.   : 2.00      Min.   : 3.0       Min.   :  185
 1st Qu.: 5.00      1st Qu.: 9.0       1st Qu.: 5866
 Median : 9.00      Median :14.0       Median : 8062
 Mean   : 9.84      Mean   :12.7       Mean   : 9334
 3rd Qu.:15.00      3rd Qu.:16.0       3rd Qu.:12073
 Max.   :18.00      Max.   :18.0       Max.   :23961
 NA's   :166986     NA's   :373299
First, drop the Product_Category_3 column because it has too many missing values (373,299 of 537,577 rows, about 69%).
Second, fill in the missing values of Product_Category_2 (166,986 rows) with 0.
  Product_ID Gender   Age Occupation City_Category
1  P00069042      F  0-17         10             A
2  P00248942      F  0-17         10             A
3  P00087842      F  0-17         10             A
4  P00085442      F  0-17         10             A
5  P00285442      M   55+         16             C
6  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Purchase
1                  0     8370
2                  6    15200
3                  0     1422
4                 14     1057
5                  0     7969
6                  2    15227
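The two cleaning steps described above can be sketched as follows. This is only an illustration on a small made-up table: the name `raw`, its values, and the `na_share` check are assumptions, not the project's actual code or data.

```r
library(dplyr)

# Toy stand-in for the raw data (values are made up for illustration)
raw <- tibble(
  Product_ID         = c("P00069042", "P00248942", "P00285442"),
  Product_Category_2 = c(NA, 6, NA),
  Product_Category_3 = c(NA, 14, NA),
  Purchase           = c(8370, 15200, 7969)
)

# Share of missing values per column, to see which column is too sparse to keep
na_share <- sapply(raw, function(col) mean(is.na(col)))

cleaned <- raw %>%
  select(-Product_Category_3) %>%                               # step 1: drop the sparse column
  mutate(Product_Category_2 = coalesce(Product_Category_2, 0))  # step 2: fill NA with 0
```

Filling with 0 works here because 0 is not a real category code (the observed minimum of Product_Category_2 is 2), so it can act as a "no second category" marker.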
glimpse() prints the data transposed, with one variable per row, which makes a wide dataset easier to inspect.
glimpse(data)
Rows: 537,577
Columns: 10
$ Product_ID <chr> "P00069042", "P00248942", "P00087…
$ Gender <chr> "F", "F", "F", "F", "M", "M", "M"…
$ Age <chr> "0-17", "0-17", "0-17", "0-17", "…
$ Occupation <int> 10, 10, 10, 10, 16, 15, 7, 7, 7, …
$ City_Category <chr> "A", "A", "A", "A", "C", "A", "B"…
$ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2…
$ Marital_Status <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, …
$ Product_Category_1 <int> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5…
$ Product_Category_2 <dbl> 0, 6, 0, 14, 0, 2, 8, 15, 16, 0, …
$ Purchase <int> 8370, 15200, 1422, 1057, 7969, 15…
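The original dataset has 12 variables and 537,577 observations. The 12 variables are User_ID, Product_ID, Gender, Age, Occupation, City_Category, Stay_In_Current_City_Years, Marital_Status, Product_Category_1, Product_Category_2, Product_Category_3, and Purchase; the cleaned data shown above keeps 10 of them.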
# Purchase count and spending statistics by age group.
# na.rm = TRUE is passed to each statistic so missing purchases are ignored.
age_group <- group_by(data, Age) %>%
  summarise(n = n(),
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
age_group
# A tibble: 7 × 6
  Age        n Mea_Pur Med_Pur Sd_Pur   freq
  <chr>  <int>   <dbl>   <dbl>  <dbl>  <dbl>
1 0-17   14707   9020.    8009  5060. 0.0274
2 18-25  97634   9235.    8041  4996. 0.182
3 26-35 214690   9315.    8043  4974. 0.399
4 36-45 107499   9401.    8076  4978. 0.200
5 46-50  44526   9285.    8050  4921. 0.0828
6 51-55  37618   9621.    8172  5035. 0.0700
7 55+    20903   9454.    8127  4939. 0.0389
# Purchase count and spending statistics by city category.
# na.rm = TRUE is passed to each statistic so missing purchases are ignored.
city_group <- group_by(data, City_Category) %>%
  summarise(n = n(),
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
city_group
# A tibble: 3 × 6
  City_Category      n Mea_Pur Med_Pur Sd_Pur  freq
  <chr>          <int>   <dbl>   <dbl>  <dbl> <dbl>
1 A             144638   8958.    7941  4867. 0.269
2 B             226493   9199.    8015  4927. 0.421
3 C             166446   9844.    8618  5109. 0.310
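The age and city summaries above repeat the same steps with a different grouping column, so they could be factored into one function. The sketch below is one possible way to do that; `purchase_summary` is a hypothetical helper name and `toy` is made-up data, neither of which appears in the original project.

```r
library(dplyr)

# Hypothetical helper: summarise Purchase for any grouping column.
# {{ group_col }} forwards the unquoted column name to group_by().
purchase_summary <- function(df, group_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(
      n       = n(),
      Mea_Pur = mean(Purchase, na.rm = TRUE),
      Med_Pur = median(Purchase, na.rm = TRUE),
      Sd_Pur  = sd(Purchase, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(freq = n / sum(n))
}

# Toy data (made up) with the same column names as the cleaned dataset
toy <- tibble(
  Age           = c("0-17", "0-17", "26-35", "26-35", "26-35"),
  City_Category = c("A", "B", "A", "C", "C"),
  Purchase      = c(8000, 9000, 7000, 10000, 13000)
)

age_group  <- purchase_summary(toy, Age)
city_group <- purchase_summary(toy, City_Category)
```

The same call would work for other grouping columns such as Gender or Occupation, which keeps the statistics consistent across tables.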
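# Histogram of purchase amounts in 500-wide bins, coloured by age group,
# with one panel per gender (each panel has its own axis scales)
ggplot(data, aes(Purchase, fill = Age)) +
  geom_histogram(color = "black", binwidth = 500) +
  labs(title = "Purchase Distribution", x = "purchase", y = "count") +
  theme(axis.text = element_text(size = 7)) +
  facet_wrap(vars(Gender), scales = "free")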
I visualize two quantities: the purchase amount (x-axis) and the number of purchases at each amount (y-axis). The visualization shows how much people spend on the day of the shopping festival. I found that purchase amounts are roughly bell-shaped rather than strictly normal, with men making almost three times as many purchases as women. Regardless of gender, the most common purchase amount is around 7,500.
However, in this plot both men and women show an abnormally high count at about 15,000, with abnormally low counts on either side of it. The data alone do not reveal the reason for this pattern, so it remains unexplained.
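The share of purchases by gender and the most common purchase amount can also be computed directly instead of being read off the histogram. The following is a minimal sketch on made-up toy data; `toy`, `gender_share`, and `modal_bin` are illustrative names, and the bins start at multiples of 500, which may not match the histogram's exact bin edges.

```r
library(dplyr)

# Toy data (made up); the real analysis would use the cleaned dataset
toy <- tibble(
  Gender   = c("M", "M", "M", "F", "M", "F", "M", "M", "F"),
  Purchase = c(7600, 7900, 15200, 7700, 15300, 5300, 7800, 11900, 7600)
)

# Share of purchase records by gender
gender_share <- toy %>%
  count(Gender) %>%
  mutate(share = n / sum(n))

# Assign each purchase to a 500-wide bin and keep the busiest bin per gender
modal_bin <- toy %>%
  mutate(bin_start = floor(Purchase / 500) * 500) %>%
  count(Gender, bin_start) %>%
  group_by(Gender) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Sorting the binned counts instead of keeping only the top bin would also show whether the spike near 15,000 is confined to a single bin.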
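# Number of purchases per product category
PC_group <- data %>%
  group_by(Product_Category_1) %>%
  count()
# "orange" is a constant colour, so it is set on the geom rather than inside
# aes(), where it would be treated as a data value and produce a legend
ggplot(PC_group, aes(x = factor(Product_Category_1),
                     y = n)) +
  geom_col(fill = "orange") +
  labs(x = "Product Category",
       y = "Count",
       title = "Consumer Favorite Product Category")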
The plot shows that product category 1 is the consumers’ favorite, with a count clearly above all other categories. Categories 2, 5, and 8 follow at moderate levels, while categories 10, 11, 12, and 13 attract almost no purchases.
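A ranking like the one described above can be made explicit by sorting the counts and adding each category's share of all purchases. This is a small sketch on made-up data; `toy` and `category_rank` are illustrative names and the numbers do not come from the festival dataset.

```r
library(dplyr)

# Toy data (made up): one row per purchase
toy <- tibble(Product_Category_1 = c(1, 1, 1, 1, 5, 5, 8, 8, 8, 13))

category_rank <- toy %>%
  count(Product_Category_1, sort = TRUE) %>%   # most purchased category first
  mutate(share = n / sum(n),                   # fraction of all purchases
         rank  = row_number())
```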
# Number of purchases per gender and product category
G_P_group <- data %>%
  group_by(Gender, Product_Category_1) %>%
  count()
# Side-by-side bars: one bar per gender within each category
ggplot(G_P_group, aes(x = as.factor(Product_Category_1),
                      y = n,
                      fill = as.factor(Gender))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Product Category", y = "Count", fill = "gender",
       title = "Different gender's preference for various products")
Across product categories, male consumers generally make more purchases than female consumers, with the largest gap in category 1.
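The gap between genders can be quantified per category as a male-to-female ratio of purchase counts. The sketch below uses made-up data; `toy` and `gender_ratio` are illustrative names, and the ratio is just one possible measure of the gap.

```r
library(dplyr)

# Toy data (made up): one row per purchase
toy <- tibble(
  Gender             = c("M", "M", "M", "F", "M", "F", "F", "M"),
  Product_Category_1 = c(1, 1, 1, 1, 5, 5, 8, 8)
)

gender_ratio <- toy %>%
  count(Product_Category_1, Gender) %>%
  group_by(Product_Category_1) %>%
  summarise(
    male           = sum(n[Gender == "M"]),
    female         = sum(n[Gender == "F"]),
    male_to_female = male / female   # > 1 means more male than female purchases
  ) %>%
  arrange(desc(male_to_female))
```

Because men make more purchases overall, a ratio above 1 in every category is expected; comparing each category's ratio with the overall ratio would show where preferences actually differ.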
# Average consumer price by category (categories ordered by mean purchase)
ggplot(data, aes(x = reorder(as.factor(Product_Category_1), Purchase),
                 y = Purchase)) +
  # draw the violin first so the narrower boxplot stays visible on top of it
  geom_violin(fill = "blue", size = .4) +
  geom_boxplot(fill = "pink", width = .3) +
  labs(title = "Average consumer price by category",
       x = "Product Category",
       y = "purchase")
The plot shows the purchase amounts in each category: category 10 has the highest average purchase, and the distribution is markedly asymmetric.
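The average purchase per category and a rough indication of asymmetry can be tabulated alongside the plot. In this sketch on made-up data, the difference between mean and median is used as a simple skew indicator; that choice, and the names `toy` and `category_stats`, are assumptions for illustration only.

```r
library(dplyr)

# Toy data (made up): purchases in two categories
toy <- tibble(
  Product_Category_1 = c(10, 10, 10, 10, 1, 1, 1, 1),
  Purchase           = c(19000, 19500, 20000, 5500, 11000, 12000, 13000, 14000)
)

category_stats <- toy %>%
  group_by(Product_Category_1) %>%
  summarise(
    mean_purchase   = mean(Purchase),
    median_purchase = median(Purchase),
    # negative: longer tail towards low amounts; positive: towards high amounts
    mean_minus_median = mean_purchase - median_purchase
  ) %>%
  arrange(desc(mean_purchase))
```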
Overall, the project analyzed data from the 11.11 shopping day consumer purchase dataset. For the consumer data, the relationship between these attributes and consumer purchases was determined through a joint analysis of consumer age, gender, occupation, residence, and marital status. For products, “bestseller” data was identified, and various metrics were determined for the 11.11 shopping day, including average customer spending and total purchase amounts across multiple categories.
This analysis gives merchants a basis for preparing for the shopping festival and for deciding which customer groups should receive personalized service solutions, so that they can improve their profit.
In this predictive analysis case, the sample size is large, but the attributes are few and each is rather one-sided, so they cannot be combined into a more complete analysis. In addition, there is no model evaluation step, which makes it impossible to judge the suitability of the final model.
The results in this case are also not presented intuitively enough for merchants to see the information they need at a glance. A result dataset could therefore be built that matches each prediction to its corresponding record through index columns, which would make the prediction results easier to read.
It is very important for an engineering student to learn how to handle data. This was my first exposure to a data science-related course, and since I had studied other programming languages, R was not entirely unfamiliar to me. However, R is a powerful language with many efficient libraries, and it takes a long time of learning and practice to use it proficiently. Through this course, I gradually mastered some basic usage of R and tried to analyze a dataset. I encountered a lot of confusion and difficulties in this process, but in working through them I found that my data analysis ability improved greatly.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hanyu (2022, May 19). Data Analytics and Computational Social Science: Final Project. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902265/
BibTeX citation
@misc{hanyu2022final,
  author = {Hanyu, },
  title = {Data Analytics and Computational Social Science: Final Project},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902265/},
  year = {2022}
}