Data Analytics and Computational Social Science: Homework5

Hanyu

READ DATA & CLEAN DATA

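##Load packages used below (dplyr for group_by/summarise/%>%, ggplot2 for the plots)
library(dplyr)
library(ggplot2)

##Read data
data = read.csv(file = "business_data.csv")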
summary(data)

    User_ID         Product_ID           Gender         
 Min.   :1000001   Length:537577      Length:537577     
 1st Qu.:1001495   Class :character   Class :character  
 Median :1003031   Mode  :character   Mode  :character  
 Mean   :1002992                                        
 3rd Qu.:1004417                                        
 Max.   :1006040                                        
                                                        
     Age              Occupation     City_Category     
 Length:537577      Min.   : 0.000   Length:537577     
 Class :character   1st Qu.: 2.000   Class :character  
 Mode  :character   Median : 7.000   Mode  :character  
                    Mean   : 8.083                     
                    3rd Qu.:14.000                     
                    Max.   :20.000                     
                                                       
 Stay_In_Current_City_Years Marital_Status   Product_Category_1
 Length:537577              Min.   :0.0000   Min.   : 1.000    
 Class :character           1st Qu.:0.0000   1st Qu.: 1.000    
 Mode  :character           Median :0.0000   Median : 5.000    
                            Mean   :0.4088   Mean   : 5.296    
                            3rd Qu.:1.0000   3rd Qu.: 8.000    
                            Max.   :1.0000   Max.   :18.000    
                                                               
 Product_Category_2 Product_Category_3    Purchase    
 Min.   : 2.00      Min.   : 3.0       Min.   :  185  
 1st Qu.: 5.00      1st Qu.: 9.0       1st Qu.: 5866  
 Median : 9.00      Median :14.0       Median : 8062  
 Mean   : 9.84      Mean   :12.7       Mean   : 9334  
 3rd Qu.:15.00      3rd Qu.:16.0       3rd Qu.:12073  
 Max.   :18.00      Max.   :18.0       Max.   :23961  
 NA's   :166986     NA's   :373299

#clean data
data = data[,-11]       #Drop Product_Category_3 (column 11): about 69% of its values are missing
data[is.na(data)] <- 0  #Fill the remaining missing values (all in Product_Category_2) with 0
data = data[,-1]        #Drop User_ID (column 1), which is not used in this analysis
head(data)

  Product_ID Gender   Age Occupation City_Category
1  P00069042      F  0-17         10             A
2  P00248942      F  0-17         10             A
3  P00087842      F  0-17         10             A
4  P00085442      F  0-17         10             A
5  P00285442      M   55+         16             C
6  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Purchase
1                  0     8370
2                  6    15200
3                  0     1422
4                 14     1057
5                  0     7969
6                  2    15227
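
The two cleaning choices above (dropping Product_Category_3, filling Product_Category_2 with 0) depend on how much of each column is missing. A minimal sketch of that check, meant to be run on the raw data before cleaning; the `toy` data frame and the 0.5 cutoff are assumptions for illustration, not values from business_data.csv:

```r
# Toy stand-in for the raw data (hypothetical values)
toy <- data.frame(
  Product_Category_2 = c(NA, 6, NA, 14),
  Product_Category_3 = c(NA, 14, NA, NA),
  Purchase           = c(8370, 15200, 1422, 1057)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))
print(round(na_share, 2))

# Columns whose missing share exceeds the assumed cutoff of 0.5
too_sparse <- names(na_share)[na_share > 0.5]
print(too_sparse)
```

On the real data, the same `colMeans(is.na(data))` call would be applied right after `read.csv()`.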

Statistics for age

##AGE & PURCHASE##
age_group <- group_by(data, Age) %>%
  summarise(n = n(), 
            Mea_Pur = mean(Purchase, na.rm = TRUE),    #na.rm belongs inside each statistic,
            Med_Pur = median(Purchase, na.rm = TRUE),  #not as a separate summarise() argument
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
age_group

# A tibble: 7 × 6
  Age        n Mea_Pur Med_Pur Sd_Pur   freq
  <chr>  <int>   <dbl>   <dbl>  <dbl>  <dbl>
1 0-17   14707   9020.    8009  5060. 0.0274
2 18-25  97634   9235.    8041  4996. 0.182 
3 26-35 214690   9315.    8043  4974. 0.399 
4 36-45 107499   9401.    8076  4978. 0.200 
5 46-50  44526   9285.    8050  4921. 0.0828
6 51-55  37618   9621.    8172  5035. 0.0700
7 55+    20903   9454.    8127  4939. 0.0389

Statistics for city category

city_group <- group_by(data, City_Category) %>%
  summarise(n = n(), 
            Mea_Pur = mean(Purchase, na.rm = TRUE),    #na.rm belongs inside each statistic,
            Med_Pur = median(Purchase, na.rm = TRUE),  #not as a separate summarise() argument
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
city_group

# A tibble: 3 × 6
  City_Category      n Mea_Pur Med_Pur Sd_Pur  freq
  <chr>          <int>   <dbl>   <dbl>  <dbl> <dbl>
1 A             144638   8958.    7941  4867. 0.269
2 B             226493   9199.    8015  4927. 0.421
3 C             166446   9844.    8618  5109. 0.310

Visualization for univariate

#Purchase Amount Distribution
ggplot(data, aes(Purchase, fill = Age)) + 
  geom_histogram(color = "black", binwidth = 500)+
  labs(title = "Purchase Distribution",x = "purchase", y="count") + 
  theme(axis.text=element_text(size = 7)) +
  facet_wrap(vars(Gender), scales = "free")

Visualization explain

This plot shows two variables: the purchase amount and the number of purchases at each amount (the count). It lets us see how much people spend on the day of the shopping festival. Purchase amounts roughly follow a normal distribution, and the counts for women are almost three times those for men. Regardless of gender, the most common purchase amount is around 7,500.
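
Because `facet_wrap(..., scales = "free")` gives each panel its own y-axis, the gender comparison is easier to confirm from a table than from bar heights. A minimal sketch of such a check in base R; the `toy` data frame holds hypothetical values and only stands in for `data`:

```r
# Toy stand-in for the cleaned data (hypothetical values)
toy <- data.frame(
  Gender   = c("F", "F", "M", "M", "M", "F"),
  Purchase = c(8370, 15200, 7969, 15227, 1422, 1057)
)

# Number of purchase records per gender
n_by_gender <- table(toy$Gender)

# Total and mean purchase amount per gender
total_by_gender <- tapply(toy$Purchase, toy$Gender, sum)
mean_by_gender  <- tapply(toy$Purchase, toy$Gender, mean)

print(n_by_gender)
print(total_by_gender)
print(mean_by_gender)

# Ratio of female to male record counts
ratio_f_to_m <- n_by_gender[["F"]] / n_by_gender[["M"]]
print(ratio_f_to_m)
```

Replacing `toy` with `data` would give the record counts and totals behind the two histogram panels.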

Visualization for bivariate

#Average consumer price by category (categories ordered by mean purchase)
ggplot(data,aes(x = reorder(as.factor(Product_Category_1),Purchase),
                y = Purchase))+  
  geom_violin(fill = "blue",size = .4)+     #draw the violin first
  geom_boxplot(fill = "pink", width = .3)+  #then the boxplot on top, so the violin does not hide it
  labs(title = "Average consumer price by category",
       x = "Product Category",
       y = "purchase")

Visualization explain

This plot relates the product category to the purchase amount. Category #1 is by far the most preferred by consumers; #2, #5, and #8 follow at a more average level; and #10, #11, #12, and #13 are the least popular. In terms of average purchase amount per category, category 10 is the highest, and its distribution is clearly asymmetric.
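
The popularity and average-price statements above can be backed by a small per-category table. A minimal sketch in base R, again on a hypothetical `toy` data frame rather than the real data:

```r
# Toy stand-in for the cleaned data (hypothetical values)
toy <- data.frame(
  Product_Category_1 = c(1, 1, 1, 5, 5, 10),
  Purchase           = c(15200, 15227, 11000, 7969, 5000, 19000)
)

# Number of purchases and mean purchase amount per category;
# table() and tapply() both return categories in sorted order, so they line up
n_by_cat    <- table(toy$Product_Category_1)
mean_by_cat <- tapply(toy$Purchase, toy$Product_Category_1, mean)

cat_summary <- data.frame(
  category = names(mean_by_cat),
  n        = as.vector(n_by_cat),
  mean_pur = as.vector(mean_by_cat)
)

# Most popular categories first
print(cat_summary[order(-cat_summary$n), ])
```

The `n` column answers which category is bought most often, while `mean_pur` is the quantity that `reorder()` uses by default to order the x-axis in the plot.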

Limitations of visualization

So far, my analysis does not show how much each gender prefers each type of product. I will add this in a later study.
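
One possible starting point for that later study is a gender-by-category table of shares. This is only a sketch under assumptions: the `toy` data frame is hypothetical, and row-wise shares are one of several reasonable ways to measure preference:

```r
# Toy stand-in for the cleaned data (hypothetical values)
toy <- data.frame(
  Gender             = c("F", "F", "F", "M", "M", "M", "M"),
  Product_Category_1 = c(1, 5, 5, 1, 1, 8, 1)
)

# Purchase counts for every gender/category combination
counts <- table(toy$Gender, toy$Product_Category_1)

# Share of each category within a gender (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Using shares within each gender, rather than raw counts, keeps the comparison fair when one gender has many more records than the other.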

Since the dataset describes what people spend during a shopping festival, the statistics should be familiar even to a naive reader. I have also labeled each graph in detail so that it is easy to read.

In the first visualization, both men and women show an abnormally high count at 15,000 but abnormally low counts just around 15,000. The data alone do not reveal the reason behind this, so the question is hard to answer.
