Data Visualization
    User_ID         Product_ID           Gender
 Min.   :1000001   Length:537577      Length:537577
 1st Qu.:1001495   Class :character   Class :character
 Median :1003031   Mode  :character   Mode  :character
 Mean   :1002992
 3rd Qu.:1004417
 Max.   :1006040
     Age             Occupation      City_Category
 Length:537577      Min.   : 0.000   Length:537577
 Class :character   1st Qu.: 2.000   Class :character
 Mode  :character   Median : 7.000   Mode  :character
                    Mean   : 8.083
                    3rd Qu.:14.000
                    Max.   :20.000
 Stay_In_Current_City_Years Marital_Status   Product_Category_1
 Length:537577              Min.   :0.0000   Min.   : 1.000
 Class :character           1st Qu.:0.0000   1st Qu.: 1.000
 Mode  :character           Median :0.0000   Median : 5.000
                            Mean   :0.4088   Mean   : 5.296
                            3rd Qu.:1.0000   3rd Qu.: 8.000
                            Max.   :1.0000   Max.   :18.000
 Product_Category_2 Product_Category_3    Purchase
 Min.   : 2.00      Min.   : 3.0       Min.   :  185
 1st Qu.: 5.00      1st Qu.: 9.0       1st Qu.: 5866
 Median : 9.00      Median :14.0       Median : 8062
 Mean   : 9.84      Mean   :12.7       Mean   : 9334
 3rd Qu.:15.00      3rd Qu.:16.0       3rd Qu.:12073
 Max.   :18.00      Max.   :18.0       Max.   :23961
 NA's   :166986     NA's   :373299
# Clean data
data <- data[, -11]     # Drop Product_Category_3 (column 11): 373,299 of 537,577 values are missing
data[is.na(data)] <- 0  # Fill the remaining missing values (all in Product_Category_2) with 0
data <- data[, -1]      # Drop User_ID (column 1), which is not used in the analysis below
head(data)
  Product_ID Gender   Age Occupation City_Category
1  P00069042      F  0-17         10             A
2  P00248942      F  0-17         10             A
3  P00087842      F  0-17         10             A
4  P00085442      F  0-17         10             A
5  P00285442      M   55+         16             C
6  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Purchase
1                  0     8370
2                  6    15200
3                  0     1422
4                 14     1057
5                  0     7969
6                  2    15227
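The cleaning step above drops columns by position, which silently removes the wrong column if the column order ever changes. A minimal sketch of the same cleaning done by column name, run on a small made-up data frame (the `toy` values are illustrative and are not taken from the dataset):

```r
# Toy data with a subset of the real column names (values are made up).
toy <- data.frame(
  User_ID            = c(1000001, 1000002, 1000003),
  Product_ID         = c("P001", "P002", "P003"),
  Product_Category_2 = c(NA, 6, NA),
  Product_Category_3 = c(NA, NA, 14),
  Purchase           = c(8370, 15200, 1422),
  stringsAsFactors   = FALSE
)

# Drop columns by name instead of by index.
cleaned <- toy[, !(names(toy) %in% c("User_ID", "Product_Category_3"))]

# Fill missing values only in Product_Category_2, leaving other columns untouched.
cleaned$Product_Category_2[is.na(cleaned$Product_Category_2)] <- 0

cleaned
```

Filling only the named column also avoids writing 0 into any other column that might contain missing values in a different version of the data.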
age_group <- group_by(data, Age) %>%
  summarise(n = n(),
            # na.rm = TRUE is passed to each summary function so missing values are ignored
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
age_group
# A tibble: 7 × 6
  Age        n Mea_Pur Med_Pur Sd_Pur   freq
  <chr>  <int>   <dbl>   <dbl>  <dbl>  <dbl>
1 0-17   14707   9020.    8009  5060. 0.0274
2 18-25  97634   9235.    8041  4996. 0.182
3 26-35 214690   9315.    8043  4974. 0.399
4 36-45 107499   9401.    8076  4978. 0.200
5 46-50  44526   9285.    8050  4921. 0.0828
6 51-55  37618   9621.    8172  5035. 0.0700
7 55+    20903   9454.    8127  4939. 0.0389
city_group <- group_by(data, City_Category) %>%
  summarise(n = n(),
            # na.rm = TRUE is passed to each summary function so missing values are ignored
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
city_group
# A tibble: 3 × 6
  City_Category      n Mea_Pur Med_Pur Sd_Pur  freq
  <chr>          <int>   <dbl>   <dbl>  <dbl> <dbl>
1 A             144638   8958.    7941  4867. 0.269
2 B             226493   9199.    8015  4927. 0.421
3 C             166446   9844.    8618  5109. 0.310
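The two grouped summaries above repeat the same code with a different grouping column. A sketch of how they could share one function is shown here; `summarise_purchase` is a hypothetical helper name, the `toy` data is made up, and the example assumes dplyr 1.0 or later for the `{{ }}` syntax:

```r
library(dplyr)

# Hypothetical helper: summarise Purchase by any grouping column.
summarise_purchase <- function(df, group_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(
      n       = n(),
      Mea_Pur = mean(Purchase, na.rm = TRUE),
      Med_Pur = median(Purchase, na.rm = TRUE),
      Sd_Pur  = sd(Purchase, na.rm = TRUE)
    ) %>%
    mutate(freq = n / sum(n))
}

# Toy data (made-up values), including one missing purchase amount.
toy <- data.frame(
  Age           = c("0-17", "0-17", "26-35", "26-35", "26-35", "55+"),
  City_Category = c("A", "B", "A", "C", "C", "B"),
  Purchase      = c(100, 200, 300, NA, 500, 600)
)

age_group  <- summarise_purchase(toy, Age)
city_group <- summarise_purchase(toy, City_Category)
age_group
```

Note that `n` counts all rows in a group, including rows with a missing `Purchase`, while the mean, median, and standard deviation skip them.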
# Purchase amount distribution
ggplot(data, aes(Purchase, fill = Age)) +
  geom_histogram(bins = 30, color = "black") +  # 30 bins is the ggplot2 default, stated explicitly
  labs(title = "Purchase") +
  theme(axis.text = element_text(size = 7)) +
  facet_wrap(vars(Gender), scales = "free")
I visualize two variables: the purchase amount and the number of purchases at each amount. The histogram shows how much people spent on the day of the shopping festival. Spending amounts roughly follow a normal distribution, with women consuming almost three times as much as men. Regardless of gender, the most common purchase amount is around 7,500.
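Because the two histogram panels use free scales, bar heights are not directly comparable between genders. One way to check the comparison numerically is to tabulate the number of purchases and the total amount per gender, and to find the fullest histogram bin. A sketch on made-up values (the `toy` data and the bin width of 2,500 are assumptions for illustration only):

```r
library(dplyr)

# Toy data (made-up values).
toy <- data.frame(
  Gender   = c("F", "F", "F", "M", "M", "M", "M", "M"),
  Purchase = c(7200, 7800, 15100, 5300, 7400, 7600, 7100, 15200)
)

# Count, total, and mean purchase per gender, plus each gender's share.
gender_summary <- toy %>%
  group_by(Gender) %>%
  summarise(n = n(), total = sum(Purchase), mean_pur = mean(Purchase)) %>%
  mutate(share_n = n / sum(n), share_total = total / sum(total))
gender_summary

# Most common purchase range: bin the amounts and pick the fullest bin.
bins <- cut(toy$Purchase, breaks = seq(0, 25000, by = 2500), dig.lab = 5)
modal_bin <- names(which.max(table(bins)))
modal_bin
```

Run on the real `data`, `share_n` and `share_total` give the ratio between genders directly, without reading it off the plot.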
# Average consumer price by category
# reorder() sorts the categories on the x-axis by their mean purchase amount
ggplot(data, aes(x = reorder(as.factor(Product_Category_1), Purchase), y = Purchase)) +
  geom_boxplot() +
  geom_violin(scale = "count", fill = "lightblue", alpha = .3) +
  labs(title = "Average consumer price by category")
I visualize the product category against the purchase amount. The plot shows that category #1 is by far the most popular with consumers; categories #2, #5, and #8 follow at more moderate levels; and categories #10, #11, #12, and #13 are rarely purchased. In terms of average spending per category, category #10 has the highest purchase amounts, and its distribution is markedly asymmetric.
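The statements about category popularity and price level can be backed by a small table alongside the plot: the row count per category measures popularity, and a gap between the mean and median purchase hints at asymmetry. A sketch on made-up values (the `toy` data is illustrative, not taken from the dataset):

```r
library(dplyr)

# Toy data (made-up values).
toy <- data.frame(
  Product_Category_1 = c(1, 1, 1, 1, 5, 5, 10),
  Purchase           = c(12000, 15000, 11000, 19000, 5000, 7000, 20000)
)

# Count, mean, and median purchase per category, most popular first.
category_summary <- toy %>%
  group_by(Product_Category_1) %>%
  summarise(n = n(), mean_pur = mean(Purchase), median_pur = median(Purchase)) %>%
  arrange(desc(n))
category_summary
```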
In the first visualization, both males and females show an anomalously high count at 15,000 but anomalously low counts immediately around 15,000, for reasons I could not determine. The graph also does not show the relationship between marital status and spending amount. For the box plots, the lack of some necessary annotations may make them difficult to understand.
In subsequent work, I may add more annotations and more plots to better show the relationships in the data.
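As one possible follow-up for the marital status relationship noted above, a box plot of purchase amount by marital status, split by gender, could look like the sketch below. It runs on made-up values and assumes `Marital_Status` is coded 0/1 as in the summary output; the source does not say which code corresponds to which status, so the axis keeps the raw codes.

```r
library(ggplot2)

# Toy data (made-up values).
toy <- data.frame(
  Gender         = rep(c("F", "M"), each = 6),
  Marital_Status = rep(c(0, 1), times = 6),
  Purchase       = c(8200, 14900, 1500, 1100, 7900, 15300,
                     9100, 7600, 12000, 5400, 8800, 6900)
)

# One box per marital status code, with one panel per gender.
p <- ggplot(toy, aes(x = factor(Marital_Status), y = Purchase)) +
  geom_boxplot() +
  facet_wrap(vars(Gender)) +
  labs(title = "Purchase by marital status",
       x = "Marital_Status", y = "Purchase")
p
```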
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hanyu (2022, May 19). Data Analytics and Computational Social Science: Homework4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902258/
BibTeX citation
@misc{hanyu2022homework4, author = {Hanyu, }, title = {Data Analytics and Computational Social Science: Homework4}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902258/}, year = {2022} }