Data Analytics and Computational Social Science: Homework4

Hanyu

READ DATA & CLEAN DATA

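#Read data
library(dplyr)    #group_by(), summarise(), mutate() and %>% used below
library(ggplot2)  #ggplot() and the geoms used in the plots
data = read.csv(file = "business_data.csv")
summary(data)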

    User_ID         Product_ID           Gender         
 Min.   :1000001   Length:537577      Length:537577     
 1st Qu.:1001495   Class :character   Class :character  
 Median :1003031   Mode  :character   Mode  :character  
 Mean   :1002992                                        
 3rd Qu.:1004417                                        
 Max.   :1006040                                        
                                                        
     Age              Occupation     City_Category     
 Length:537577      Min.   : 0.000   Length:537577     
 Class :character   1st Qu.: 2.000   Class :character  
 Mode  :character   Median : 7.000   Mode  :character  
                    Mean   : 8.083                     
                    3rd Qu.:14.000                     
                    Max.   :20.000                     
                                                       
 Stay_In_Current_City_Years Marital_Status   Product_Category_1
 Length:537577              Min.   :0.0000   Min.   : 1.000    
 Class :character           1st Qu.:0.0000   1st Qu.: 1.000    
 Mode  :character           Median :0.0000   Median : 5.000    
                            Mean   :0.4088   Mean   : 5.296    
                            3rd Qu.:1.0000   3rd Qu.: 8.000    
                            Max.   :1.0000   Max.   :18.000    
                                                               
 Product_Category_2 Product_Category_3    Purchase    
 Min.   : 2.00      Min.   : 3.0       Min.   :  185  
 1st Qu.: 5.00      1st Qu.: 9.0       1st Qu.: 5866  
 Median : 9.00      Median :14.0       Median : 8062  
 Mean   : 9.84      Mean   :12.7       Mean   : 9334  
 3rd Qu.:15.00      3rd Qu.:16.0       3rd Qu.:12073  
 Max.   :18.00      Max.   :18.0       Max.   :23961  
 NA's   :166986     NA's   :373299

#Clean data
data = data[,-11]       #Drop Product_Category_3 (column 11): 373,299 of its 537,577 values are missing
data[is.na(data)] <- 0  #Fill the remaining missing values (only Product_Category_2 has any) with 0
data = data[,-1]        #Drop User_ID (column 1), which is not used in the analysis below
head(data)

  Product_ID Gender   Age Occupation City_Category
1  P00069042      F  0-17         10             A
2  P00248942      F  0-17         10             A
3  P00087842      F  0-17         10             A
4  P00085442      F  0-17         10             A
5  P00285442      M   55+         16             C
6  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Purchase
1                  0     8370
2                  6    15200
3                  0     1422
4                 14     1057
5                  0     7969
6                  2    15227
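Dropping columns by position (`-11`, `-1`) silently breaks if the column order of the CSV changes. As a minimal sketch of the same cleaning steps written by column name, the block below uses a small made-up data frame (`toy`, not the real `business_data.csv`) and fills missing values only in `Product_Category_2`:

```r
library(dplyr)

# Toy stand-in for business_data.csv; all values are made up for illustration
toy <- data.frame(
  User_ID            = c(1000001, 1000002, 1000003),
  Product_ID         = c("P001", "P002", "P003"),
  Product_Category_2 = c(NA, 6, 14),
  Product_Category_3 = c(NA, NA, 9),
  Purchase           = c(8370, 15200, 1057)
)

toy_clean <- toy %>%
  # Drop the identifier and the mostly-missing column by name
  select(-User_ID, -Product_Category_3) %>%
  # Replace NA with 0 in Product_Category_2 only
  mutate(Product_Category_2 = replace(Product_Category_2,
                                      is.na(Product_Category_2), 0))

toy_clean
```

Note that filling with 0 makes "no second category" look like a category code, so it may be worth treating 0 as a separate level in later analysis.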

Statistics for age

#na.rm = TRUE belongs inside each summary function; as a separate
#argument to summarise() it only creates a constant column named na.rm
age_group <- group_by(data, Age) %>%
  summarise(n = n(), 
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
age_group

# A tibble: 7 × 6
  Age        n Mea_Pur Med_Pur Sd_Pur   freq
  <chr>  <int>   <dbl>   <dbl>  <dbl>  <dbl>
1 0-17   14707   9020.    8009  5060. 0.0274
2 18-25  97634   9235.    8041  4996. 0.182 
3 26-35 214690   9315.    8043  4974. 0.399 
4 36-45 107499   9401.    8076  4978. 0.200 
5 46-50  44526   9285.    8050  4921. 0.0828
6 51-55  37618   9621.    8172  5035. 0.0700
7 55+    20903   9454.    8127  4939. 0.0389

Statistics for city category

#na.rm = TRUE belongs inside each summary function; as a separate
#argument to summarise() it only creates a constant column named na.rm
city_group <- group_by(data, City_Category) %>%
  summarise(n = n(), 
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))
city_group

# A tibble: 3 × 6
  City_Category      n Mea_Pur Med_Pur Sd_Pur  freq
  <chr>          <int>   <dbl>   <dbl>  <dbl> <dbl>
1 A             144638   8958.    7941  4867. 0.269
2 B             226493   9199.    8015  4927. 0.421
3 C             166446   9844.    8618  5109. 0.310
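The age and city summaries run the same pipeline with a different grouping column, so one option is to wrap it in a small helper. This is only a sketch: `summarise_purchase` is a hypothetical name, the toy data is made up, and the `{{ }}` syntax assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical helper (not in the original): summarise Purchase by any column
summarise_purchase <- function(df, group_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(
      n       = n(),
      Mea_Pur = mean(Purchase, na.rm = TRUE),
      Med_Pur = median(Purchase, na.rm = TRUE),
      Sd_Pur  = sd(Purchase, na.rm = TRUE)
    ) %>%
    mutate(freq = n / sum(n))  # share of rows in each group
}

# Toy data with made-up values
toy <- data.frame(
  Age           = c("0-17", "0-17", "26-35", "26-35"),
  City_Category = c("A", "B", "A", "B"),
  Purchase      = c(100, 300, 200, 400)
)

summarise_purchase(toy, Age)
summarise_purchase(toy, City_Category)
```

With the real data, the two summaries would become `summarise_purchase(data, Age)` and `summarise_purchase(data, City_Category)`.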

Visualization for univariate

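#Purchase amount distribution, one panel per gender, filled by age group
ggplot(data, aes(Purchase, fill = Age)) + 
  geom_histogram(color = "black", bins = 30) +   #30 bins is the ggplot2 default, set explicitly
  labs(title = "Purchase") + 
  theme(axis.text = element_text(size = 7)) +
  facet_wrap(vars(Gender), scales = "free")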

Explanation of the visualization

The histogram shows two quantities: the purchase amount on the x-axis and the number of purchases at each amount on the y-axis, with one panel per gender and bars filled by age group. It lets us see how much people spent on the day of the shopping festival. I found that spending amounts roughly follow a normal distribution, with women consuming almost three times as much as men. Because the panels use free scales, this comparison has to be read from each panel's own y-axis rather than from the bar heights. Regardless of gender, the most common purchase amount is around 7,500.
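The gender comparison and the most common purchase amount can be checked numerically instead of being read off the plot. The sketch below uses made-up toy data, and the bin width of 1,000 is an assumption chosen for illustration, not the binning used in the histogram.

```r
library(dplyr)

# Toy data with made-up values, standing in for the cleaned data frame
toy <- data.frame(
  Gender   = c("F", "F", "M", "M", "M", "M"),
  Purchase = c(7500, 8200, 7400, 7600, 15000, 5200)
)

# Number of purchases and total spend per gender
gender_totals <- toy %>%
  group_by(Gender) %>%
  summarise(n = n(), total = sum(Purchase))
gender_totals

# Most populated purchase bin (bins are right-closed, width 1000)
bins <- cut(toy$Purchase, breaks = seq(0, 24000, by = 1000), dig.lab = 5)
bin_counts <- table(bins)
modal_bin <- names(which.max(bin_counts))
modal_bin
```

Comparing `n` and `total` across the two rows of `gender_totals` shows whether a "three times" claim refers to the number of purchases, the total amount, or neither.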

Visualization for bivariate

#Average consumer price by category (reorder() sorts categories by mean Purchase)
ggplot(data, aes(x = reorder(as.factor(Product_Category_1), Purchase), y = Purchase)) +
  geom_boxplot() +
  geom_violin(scale = "count", fill = "lightblue", alpha = .3) +
  labs(title = "Average consumer price by category")

Explanation of the visualization

This plot relates the product category to the purchase amount. The violins, whose size is scaled by count, show that category 1 is by far the most popular with consumers. Categories 2, 5, and 8 follow at a more moderate level, while categories 10, 11, 12, and 13 attract very few purchases. In terms of average purchase amount, category 10 is the highest, and its distribution is clearly asymmetric.
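Popularity (how many purchases a category has) and price level (how much a typical purchase costs) are different quantities, and a per-category table separates them more clearly than the violin sizes do. A minimal sketch on made-up toy data; `category_summary` and its column names are illustrative, not from the original analysis:

```r
library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  Product_Category_1 = c(1, 1, 1, 5, 5, 10),
  Purchase           = c(12000, 14000, 13000, 6000, 7000, 19000)
)

category_summary <- toy %>%
  group_by(Product_Category_1) %>%
  summarise(
    n               = n(),              # popularity: number of purchases
    mean_purchase   = mean(Purchase),   # price level: average amount
    median_purchase = median(Purchase)
  ) %>%
  arrange(desc(n))                      # most popular category first

category_summary
```

In this toy example the most popular category is 1, while the highest mean purchase belongs to category 10, which is the same kind of contrast described above.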

Limitations of visualization

In the first visualization, both the male and female panels show an unusually high count at a purchase amount of about 15,000 and unusually low counts just around it, for reasons I could not determine. The graph also does not show the relationship between marital status and spending amount. For the box plots, the lack of some necessary annotations may make them difficult to understand.

In subsequent work, I may add more annotations and more plots to better show the relationships in the data.
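As one possible follow-up for the marital status gap noted above, a box plot of `Purchase` by `Marital_Status` could look like the sketch below. The toy values are made up, and since the meaning of the 0 and 1 codes is not stated in the data shown here, the axis label keeps the raw codes.

```r
library(ggplot2)

# Toy data with made-up values
toy <- data.frame(
  Marital_Status = c(0, 0, 0, 1, 1, 1),
  Purchase       = c(8370, 15200, 1422, 1057, 7969, 15227)
)

# Treat the 0/1 code as a category so each value gets its own box
p <- ggplot(toy, aes(x = factor(Marital_Status), y = Purchase)) +
  geom_boxplot() +
  labs(title = "Purchase by marital status",
       x = "Marital_Status (coded 0 or 1)",
       y = "Purchase")
p
```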
