Draft of Final Report
    User_ID         Product_ID           Gender
 Min.   :1000001   Length:537577      Length:537577
 1st Qu.:1001495   Class :character   Class :character
 Median :1003031   Mode  :character   Mode  :character
 Mean   :1002992
 3rd Qu.:1004417
 Max.   :1006040
     Age              Occupation     City_Category
 Length:537577      Min.   : 0.000   Length:537577
 Class :character   1st Qu.: 2.000   Class :character
 Mode  :character   Median : 7.000   Mode  :character
                    Mean   : 8.083
                    3rd Qu.:14.000
                    Max.   :20.000
 Stay_In_Current_City_Years Marital_Status   Product_Category_1
 Length:537577              Min.   :0.0000   Min.   : 1.000
 Class :character           1st Qu.:0.0000   1st Qu.: 1.000
 Mode  :character           Median :0.0000   Median : 5.000
                            Mean   :0.4088   Mean   : 5.296
                            3rd Qu.:1.0000   3rd Qu.: 8.000
                            Max.   :1.0000   Max.   :18.000
 Product_Category_2 Product_Category_3    Purchase
 Min.   : 2.00      Min.   : 3.0       Min.   :  185
 1st Qu.: 5.00      1st Qu.: 9.0       1st Qu.: 5866
 Median : 9.00      Median :14.0       Median : 8062
 Mean   : 9.84      Mean   :12.7       Mean   : 9334
 3rd Qu.:15.00      3rd Qu.:16.0       3rd Qu.:12073
 Max.   :18.00      Max.   :18.0       Max.   :23961
 NA's   :166986     NA's   :373299
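#clean data
data <- data[, -11] #Drop column 11 (Product_Category_3): about 69% of its values are missing (373299 of 537577)
data[is.na(data)] <- 0 #Fill the remaining missing values (only Product_Category_2 has any) with 0
data <- data[, -1] #Drop column 1 (User_ID), which is not used in the analysis below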
head(data)
  Product_ID Gender   Age Occupation City_Category
1  P00069042      F  0-17         10             A
2  P00248942      F  0-17         10             A
3  P00087842      F  0-17         10             A
4  P00085442      F  0-17         10             A
5  P00285442      M   55+         16             C
6  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Purchase
1                  0     8370
2                  6    15200
3                  0     1422
4                 14     1057
5                  0     7969
6                  2    15227
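The cleaning step above drops columns by position, which silently removes the wrong column if the column order ever changes. A minimal sketch of the same steps by column name follows, using base R only. The data frame `toy` and its values are made up for illustration; only the column names come from the data set, and treating 0 as a placeholder for "no second category" is an assumption.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  User_ID            = c(1000001, 1000002, 1000003),
  Product_ID         = c("P001", "P002", "P003"),
  Product_Category_1 = c(3, 1, 12),
  Product_Category_2 = c(NA, 6, NA),
  Product_Category_3 = c(NA, 14, NA),
  Purchase           = c(8370, 15200, 1422),
  stringsAsFactors   = FALSE
)

# Drop the sparse column and the identifier by name rather than by index.
drop_cols <- c("Product_Category_3", "User_ID")
toy_clean <- toy[, !(names(toy) %in% drop_cols)]

# Fill missing values only in Product_Category_2.
# Assumption: 0 is a placeholder meaning "no second category".
missing_cat2 <- is.na(toy_clean$Product_Category_2)
toy_clean$Product_Category_2[missing_cat2] <- 0

toy_clean
```

Filling one named column, instead of every `NA` in the data frame, also avoids writing zeros into other columns by accident if they later contain missing values.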
##AGE & PURCHASE##
#na.rm = TRUE belongs inside each summary function; passed to summarise() directly it only creates a column named na.rm
age_group <- group_by(data, Age) %>%
  summarise(n = n(),
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
age_group
# A tibble: 7 × 6
  Age        n Mea_Pur Med_Pur Sd_Pur   freq
  <chr>  <int>   <dbl>   <dbl>  <dbl>  <dbl>
1 0-17   14707   9020.    8009  5060. 0.0274
2 18-25  97634   9235.    8041  4996. 0.182
3 26-35 214690   9315.    8043  4974. 0.399
4 36-45 107499   9401.    8076  4978. 0.200
5 46-50  44526   9285.    8050  4921. 0.0828
6 51-55  37618   9621.    8172  5035. 0.0700
7 55+    20903   9454.    8127  4939. 0.0389
#na.rm = TRUE belongs inside each summary function; passed to summarise() directly it only creates a column named na.rm
city_group <- group_by(data, City_Category) %>%
  summarise(n = n(),
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
city_group
# A tibble: 3 × 6
  City_Category      n Mea_Pur Med_Pur Sd_Pur  freq
  <chr>          <int>   <dbl>   <dbl>  <dbl> <dbl>
1 A             144638   8958.    7941  4867. 0.269
2 B             226493   9199.    8015  4927. 0.421
3 C             166446   9844.    8618  5109. 0.310
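The age and city summaries repeat the same computation with a different grouping column. One way to avoid the repetition is a small helper, sketched here in base R so that it runs without extra packages. The function name `summarise_purchase()`, the output column `group`, and the toy data are all hypothetical and not part of the original analysis.

```r
# Hypothetical helper: summarise Purchase by any grouping column (given by name).
summarise_purchase <- function(df, group_col) {
  groups <- split(df$Purchase, df[[group_col]])
  out <- data.frame(
    group     = names(groups),
    n         = vapply(groups, length, integer(1)),
    Mea_Pur   = vapply(groups, mean, numeric(1), na.rm = TRUE),
    Med_Pur   = vapply(groups, median, numeric(1), na.rm = TRUE),
    Sd_Pur    = vapply(groups, sd, numeric(1), na.rm = TRUE),
    row.names = NULL
  )
  # Share of all rows that falls into each group.
  out$freq <- out$n / sum(out$n)
  out
}

# Invented toy data, only to show the call.
toy <- data.frame(
  Age           = c("0-17", "0-17", "26-35", "26-35"),
  City_Category = c("A", "B", "A", "B"),
  Purchase      = c(100, 300, 200, 400)
)

age_summary  <- summarise_purchase(toy, "Age")
city_summary <- summarise_purchase(toy, "City_Category")
age_summary
```

On the real data the calls would be `summarise_purchase(data, "Age")` and `summarise_purchase(data, "City_Category")`, assuming `data` has been cleaned as described earlier.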
#Purchase Amount Distribution
ggplot(data, aes(Purchase, fill = Age)) +
  geom_histogram(color = "black", binwidth = 500) +
  labs(title = "Purchase Distribution", x = "purchase", y = "count") +
  theme(axis.text = element_text(size = 7)) +
  facet_wrap(vars(Gender), scales = "free")
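I plot two quantities: the purchase amount (x-axis) and the number of purchases at each amount (y-axis), split by gender and colored by age group. The plot lets us look for patterns in how much people spend on the day of the shopping festival. I found that purchase amounts are roughly bell-shaped but right-skewed (the mean of 9334 exceeds the median of 8062), so they follow a normal distribution only approximately. Because the two panels use free scales, the counts have to be read from each panel's own y-axis; doing so, men make almost three times as many purchases as women, which agrees with the counts by gender and product category later in this report. Regardless of gender, the most common purchase amount is around 7500.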
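Claims about the gender split and the shape of the distribution can be checked numerically instead of being read off the histogram. The sketch below shows one way to do this in base R. The data frame `toy` is invented for illustration, and the `skewness()` helper is a hypothetical function written here (base R has no built-in skewness function).

```r
# Invented toy data: the values are illustrative, not taken from the data set.
toy <- data.frame(
  Gender   = c("M", "M", "M", "F"),
  Purchase = c(5000, 7500, 8000, 20000)
)

# Number of purchases per gender, and the male-to-female ratio.
gender_counts <- table(toy$Gender)
mf_ratio <- as.numeric(gender_counts["M"] / gender_counts["F"])

# Sample skewness (third standardised moment): about 0 for a symmetric
# distribution, positive when the right tail is longer.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

purchase_skew <- skewness(toy$Purchase)

# A mean above the median is another quick sign of right skew.
mean_minus_median <- mean(toy$Purchase) - median(toy$Purchase)
```

On the real data, `table(data$Gender)` and `skewness(data$Purchase)` would give the corresponding figures; a Q-Q plot via `qqnorm(data$Purchase)` is a common visual complement.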
#Average consumer price by category
#Categories are ordered by mean Purchase (the default summary used by reorder())
ggplot(data, aes(x = reorder(as.factor(Product_Category_1), Purchase),
                 y = Purchase)) +
  geom_violin(fill = "blue", size = .4) + #Draw the violin first so it does not cover the boxplot
  geom_boxplot(fill = "pink", width = .3) +
  labs(title = "Average consumer price by category",
       x = "Product Category",
       y = "purchase")
I plot the purchase amount for each product category. The figure suggests that category 1 is the most preferred by consumers, well ahead of the others. Categories 2, 5, and 8 follow at more moderate levels, while categories 10, 11, 12, and 13 are rarely purchased. In terms of average purchase amount, category 10 is the highest, and its distribution is markedly asymmetric.
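Popularity (how often a category is bought) and price level (how much is spent per purchase) are different quantities, and a violin plot of `Purchase` mainly shows the second. A minimal base R sketch that tabulates both is given below. The toy values are invented for illustration and say nothing about the real ranking.

```r
# Invented toy data, only to show the two rankings.
toy <- data.frame(
  Product_Category_1 = c(1, 1, 1, 5, 5, 10),
  Purchase           = c(12000, 14000, 13000, 6000, 7000, 19000)
)

# Popularity: number of purchases per category, most frequent first.
purchases_per_category <- sort(table(toy$Product_Category_1),
                               decreasing = TRUE)

# Price level: mean purchase amount per category, highest first.
mean_per_category <- sort(tapply(toy$Purchase, toy$Product_Category_1, mean),
                          decreasing = TRUE)

purchases_per_category
mean_per_category
```

Applied to `data`, the first table would back up statements about which categories are preferred, and the second would back up statements about which category has the highest average purchase.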
G_P_group <- data %>%
  group_by(Gender, Product_Category_1) %>%
  count()
ggplot(G_P_group, aes(x = as.factor(Product_Category_1),
                      y = n,
                      fill = as.factor(Gender))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Product Category", y = "Count", fill = "gender",
       title = "Different gender's preference for various products")
Across product categories, male consumers generally make more purchases than female consumers, and the gap is largest for category 1.
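Raw counts mix two effects: how much each gender likes a category, and how many purchases each gender makes overall. One way to separate them is to look at the gender share within each category rather than the counts. The sketch below uses base R on invented toy data; the object names are hypothetical.

```r
# Invented toy data, only to show the computation.
toy <- data.frame(
  Gender             = c("M", "M", "M", "F", "M", "F"),
  Product_Category_1 = c(1, 1, 1, 1, 5, 5)
)

# Cross-tabulate category (rows) by gender (columns).
counts <- table(toy$Product_Category_1, toy$Gender)

# Row-wise proportions: within each category, the shares sum to 1.
share <- prop.table(counts, margin = 1)
share
```

A category whose male share is well above the overall male share of purchases is one that men favor disproportionately; a category at the overall share simply reflects that men make more purchases in total.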
The final report still lacks a background section on the problem, a detailed analysis of the data, and a summary.
In the remaining time, I hope to complete these missing parts and to organize and refine the report as a whole.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hanyu (2022, May 19). Data Analytics and Computational Social Science: Homework6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902262/