Data Analytics and Computational Social Science: Homework6

Hanyu

Research questions

Classify and analyze product sales using Taobao Mall consumer data.
Analyze the characteristics of different consumers and compare the consumption value of different consumer categories.
Study the consumption habits of different consumers in order to provide personalized services to each consumer category.

READ DATA & CLEAN DATA

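##Read data
library(dplyr)    # group_by(), summarise(), mutate(), %>%
library(ggplot2)  # ggplot() and the geoms used in the plots
data = read.csv(file = "business_data.csv")
summary(data)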

    User_ID         Product_ID           Gender         
 Min.   :1000001   Length:537577      Length:537577     
 1st Qu.:1001495   Class :character   Class :character  
 Median :1003031   Mode  :character   Mode  :character  
 Mean   :1002992                                        
 3rd Qu.:1004417                                        
 Max.   :1006040                                        
                                                        
     Age              Occupation     City_Category     
 Length:537577      Min.   : 0.000   Length:537577     
 Class :character   1st Qu.: 2.000   Class :character  
 Mode  :character   Median : 7.000   Mode  :character  
                    Mean   : 8.083                     
                    3rd Qu.:14.000                     
                    Max.   :20.000                     
                                                       
 Stay_In_Current_City_Years Marital_Status   Product_Category_1
 Length:537577              Min.   :0.0000   Min.   : 1.000    
 Class :character           1st Qu.:0.0000   1st Qu.: 1.000    
 Mode  :character           Median :0.0000   Median : 5.000    
                            Mean   :0.4088   Mean   : 5.296    
                            3rd Qu.:1.0000   3rd Qu.: 8.000    
                            Max.   :1.0000   Max.   :18.000    
                                                               
 Product_Category_2 Product_Category_3    Purchase    
 Min.   : 2.00      Min.   : 3.0       Min.   :  185  
 1st Qu.: 5.00      1st Qu.: 9.0       1st Qu.: 5866  
 Median : 9.00      Median :14.0       Median : 8062  
 Mean   : 9.84      Mean   :12.7       Mean   : 9334  
 3rd Qu.:15.00      3rd Qu.:16.0       3rd Qu.:12073  
 Max.   :18.00      Max.   :18.0       Max.   :23961  
 NA's   :166986     NA's   :373299

#clean data
data = data[,-11]       #Drop Product_Category_3 (column 11) because it has too many missing values
data[is.na(data)] <- 0  #Fill in the missing values of Product_Category_2 with 0
data = data[,-1]        #Drop User_ID (column 1), which is not used in the analysis
head(data)

  Product_ID Gender   Age Occupation City_Category
1  P00069042      F  0-17         10             A
2  P00248942      F  0-17         10             A
3  P00087842      F  0-17         10             A
4  P00085442      F  0-17         10             A
5  P00285442      M   55+         16             C
6  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Purchase
1                  0     8370
2                  6    15200
3                  0     1422
4                 14     1057
5                  0     7969
6                  2    15227
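
Before dropping or filling columns, it can help to count the missing values in each column. The following is a minimal sketch on a small made-up data frame (the `toy` values are hypothetical and only stand in for `business_data.csv`); it ends with the same fill-with-0 step used in the cleaning code.

```r
# Hypothetical toy data standing in for the real file
toy <- data.frame(
  Product_Category_1 = c(3, 1, 12, 8),
  Product_Category_2 = c(NA, 6, NA, 14),
  Purchase           = c(8370, 15200, 1422, 1057)
)

# Number and share of missing values in each column
na_count <- colSums(is.na(toy))
na_share <- na_count / nrow(toy)
print(na_count)
print(na_share)

# Fill the remaining missing values with 0, as in the cleaning step
toy[is.na(toy)] <- 0
```

After the fill, 0 in Product_Category_2 acts as a "no second category" code rather than a real category number, so it should probably be excluded from any summary of that column.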

Statistics for age

##AGE & PURCHASE##
age_group <- group_by(data, Age) %>%
  summarise(n = n(), 
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
age_group

# A tibble: 7 × 6
  Age        n Mea_Pur Med_Pur Sd_Pur   freq
  <chr>  <int>   <dbl>   <dbl>  <dbl>  <dbl>
1 0-17   14707   9020.    8009  5060. 0.0274
2 18-25  97634   9235.    8041  4996. 0.182 
3 26-35 214690   9315.    8043  4974. 0.399 
4 36-45 107499   9401.    8076  4978. 0.200 
5 46-50  44526   9285.    8050  4921. 0.0828
6 51-55  37618   9621.    8172  5035. 0.0700
7 55+    20903   9454.    8127  4939. 0.0389
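
The `na.rm` argument only has an effect when it is passed to the summary function itself. As a small sketch of the difference, assuming dplyr is installed and using a made-up `toy` data frame with one missing purchase value (hypothetical values, not from the real data):

```r
library(dplyr)

# Hypothetical toy data with one missing purchase value
toy <- data.frame(
  Age      = c("0-17", "0-17", "18-25", "18-25"),
  Purchase = c(100, NA, 200, 400)
)

# na.rm passed to summarise(): it becomes a new constant column named na.rm
wrong <- toy %>%
  group_by(Age) %>%
  summarise(Mea_Pur = mean(Purchase), na.rm = TRUE)

# na.rm passed to mean(): missing values are actually ignored
right <- toy %>%
  group_by(Age) %>%
  summarise(Mea_Pur = mean(Purchase, na.rm = TRUE))

print(wrong)
print(right)
```

In the first result the mean for the group with a missing value is `NA`; in the second it is computed from the remaining values.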

Statistics for city category

city_group <- group_by(data, City_Category) %>%
  summarise(n = n(), 
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
city_group

# A tibble: 3 × 6
  City_Category      n Mea_Pur Med_Pur Sd_Pur  freq
  <chr>          <int>   <dbl>   <dbl>  <dbl> <dbl>
1 A             144638   8958.    7941  4867. 0.269
2 B             226493   9199.    8015  4927. 0.421
3 C             166446   9844.    8618  5109. 0.310

Purchase Amount Distribution

#Purchase Amount Distribution
ggplot(data, aes(Purchase, fill = Age)) + 
  geom_histogram(color = "black", binwidth = 500)+
  labs(title = "Purchase Distribution",x = "purchase", y="count") + 
  theme(axis.text=element_text(size = 7)) +
  facet_wrap(vars(Gender), scales = "free")

Visualization explanation

This plot shows two quantities: the purchase amount on the x-axis and the number of purchases (count) on the y-axis, split by gender and colored by age group. It lets us see how much people spend on the day of the shopping festival. Purchase amounts are roughly bell-shaped around the center but right-skewed overall (the mean of 9334 is above the median of 8062), and men make almost three times as many purchases as women. Because the facets use free scales, the counts have to be read from each panel's own y-axis. Regardless of gender, the most common purchase amount is around 7500.
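
The peak and the skew read off the histogram can also be checked numerically. Below is a minimal base R sketch, assuming a hypothetical `purchase` vector in place of `data$Purchase`; it uses the same bin width of 500 as the plot, although the bin edges may differ from the ones ggplot2 picks.

```r
# Hypothetical purchase amounts standing in for data$Purchase
purchase <- c(1422, 5866, 7510, 7450, 7600, 7969, 8062, 8370, 15200, 23961)

# Bin with width 500; right = FALSE makes bins of the form [lower, upper)
breaks <- seq(0, 24000, by = 500)
bins   <- cut(purchase, breaks = breaks, right = FALSE, dig.lab = 5)
counts <- table(bins)

# Lower edge of the bin with the highest count (the mode of the histogram)
modal_lower <- breaks[which.max(counts)]
print(modal_lower)

# A mean above the median suggests a right-skewed distribution
skewed_right <- mean(purchase) > median(purchase)
print(skewed_right)
```

On the real data the same steps could be run separately for each gender, for example after splitting `data$Purchase` by `data$Gender`.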

Average consumer price by category

#Average consumer price by category
#The violin is drawn first so the narrower boxplot stays visible on top of it
ggplot(data, aes(x = reorder(as.factor(Product_Category_1), Purchase),
                 y = Purchase)) +
  geom_violin(fill = "blue", size = .4) +
  geom_boxplot(fill = "pink", width = .3) +
  labs(title = "Average consumer price by category",
       x = "Product Category",
       y = "purchase")

Visualization explanation

This plot relates the product category to the purchase amount. It shows that category #1 is the one consumers prefer most, clearly ahead of the other categories; #2, #5, and #8 follow at a more moderate level; and categories #10, #11, #12, and #13 are rarely purchased. In terms of average purchase amount per category, category 10 is the highest, and its distribution is markedly asymmetric.
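
The x-axis order in this plot comes from `reorder()`, which by default sorts the factor levels by the mean of the second argument. A small base R sketch of that ordering, using a made-up `toy` data frame (hypothetical values) in place of the cleaned data:

```r
# Hypothetical toy data standing in for the cleaned data frame
toy <- data.frame(
  Product_Category_1 = c(1, 1, 5, 5, 10, 10),
  Purchase           = c(15200, 15227, 7969, 6000, 19000, 20000)
)

# Mean purchase per category, sorted from lowest to highest
mean_by_cat <- sort(tapply(toy$Purchase, toy$Product_Category_1, mean))
print(mean_by_cat)

# reorder() uses the mean by default, so its level order matches the sort
cat_order <- levels(reorder(as.factor(toy$Product_Category_1), toy$Purchase))
print(cat_order)
```

Printing the sorted means next to the plot makes it easier to state which category has the highest average purchase without reading it off the violin shapes.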

Different gender’s preference for various products

G_P_group <- data %>%
  group_by(Gender,Product_Category_1) %>%
  count()

ggplot(G_P_group, aes(x = as.factor(Product_Category_1),
                      y = n,
                      fill = as.factor(Gender))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Product Category", y = "Count", fill = "gender",
       title = "Different gender's preference for various products")

Visualization explanation

For nearly every product category, male consumers account for more purchases than female consumers, and the gap is largest for category #1.
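
Raw counts mix two things: how many purchases each gender makes overall and which categories each gender favors. One possible way to separate them, sketched here on a made-up `toy` data frame (hypothetical values) and assuming dplyr is installed, is to convert the counts to a share within each gender:

```r
library(dplyr)

# Hypothetical toy data: more male than female rows overall
toy <- data.frame(
  Gender             = c("M", "M", "M", "M", "M", "M", "F", "F"),
  Product_Category_1 = c(1, 1, 1, 5, 5, 8, 1, 5)
)

# Count per gender and category, then convert to a share within each gender
share <- toy %>%
  count(Gender, Product_Category_1) %>%
  group_by(Gender) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(share)
```

In this toy example men have three times as many category 1 purchases as women, yet category 1 makes up the same share (one half) of each gender's purchases.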

Answer

The final report still lacks background on the problem, a detailed analysis of the data, and a summary.

I hope to complete all the missing parts in the remaining time and to organize and refine them in detail.
