##Read data
library(dplyr)    #group_by, summarise, mutate, %>%
library(ggplot2)  #plots
data <- read.csv(file = "business_data.csv")
#Clean data
data <- data[,-11]      #Drop column 11 (Product_Category_3) because it has too many missing values
data[is.na(data)] <- 0  #Replace every remaining missing value (Product_Category_2) with 0
data <- data[,-1]       #Drop the first column
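The cleaning step can be checked before any column is dropped. The following is a minimal sketch in base R, assuming a small toy data frame in place of `business_data.csv`; the column names follow the comments in the code, and all values are invented for illustration.

```r
# Toy stand-in for the real data; names and values are assumptions.
toy <- data.frame(
  User_ID            = 1:6,
  Product_Category_1 = c(1, 5, 8, 1, 2, 5),
  Product_Category_2 = c(2, NA, 14, NA, 4, 6),
  Product_Category_3 = c(NA, NA, NA, 12, NA, NA),
  Purchase           = c(8370, 15200, 1422, 7969, 1057, 15227)
)

# Share of missing values per column, to justify which column to drop
na_share <- colMeans(is.na(toy))
print(round(na_share, 2))

# Drop the sparse column by name instead of by position
toy$Product_Category_3 <- NULL

# Fill missing values in Product_Category_2 only
toy$Product_Category_2[is.na(toy$Product_Category_2)] <- 0
```

Dropping by name and filling a single column keeps the step correct even if the column order in the file changes.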
Statistics for age

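#na.rm belongs inside each summary function; as an argument to summarise() it would only create a column named na.rm
age_group <- group_by(data, Age) %>%
  summarise(n = n(),
            Mea_Pur = mean(Purchase, na.rm = TRUE),
            Med_Pur = median(Purchase, na.rm = TRUE),
            Sd_Pur  = sd(Purchase, na.rm = TRUE)) %>%
  mutate(freq = n / sum(n))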
Univariate visualization

#Purchase Amount Distribution
ggplot(data, aes(Purchase, fill = Age)) + 
  geom_histogram(color = "black", binwidth = 500)+
  labs(title = "Purchase Distribution",x = "purchase", y="count") + 
  theme(axis.text=element_text(size = 7)) +
  facet_wrap(vars(Gender), scales = "free")

Visualization explanation

I visualize two variables: the purchase amount and the number of purchases. The visualization allows us to find trends in how much people spend on the day of the shopping festival. I found that spending amounts roughly follow a normal distribution, with women spending almost three times as much as men. Regardless of gender, however, the most common purchase amount is around 7500.
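Readings taken from a histogram can be backed by numbers. Here is a minimal sketch in base R, assuming a toy data frame with the `Gender` and `Purchase` columns used in the plot; the values are invented, so the results say nothing about the real data.

```r
# Toy stand-in; values are invented for illustration only.
toy <- data.frame(
  Gender   = c("F", "F", "F", "M", "M", "M", "M", "M"),
  Purchase = c(7400, 7600, 15100, 7300, 7700, 7550, 2100, 15050)
)

# Number of purchases and total amount spent per gender
n_by_gender     <- table(toy$Gender)
total_by_gender <- tapply(toy$Purchase, toy$Gender, sum)

# Most populated 500-wide bin (same width as the histogram;
# the bin edges may differ from ggplot2's defaults)
bins       <- cut(toy$Purchase, breaks = seq(0, 16000, by = 500), dig.lab = 5)
bin_counts <- table(bins)
modal_bin  <- names(which.max(bin_counts))

print(n_by_gender)
print(total_by_gender)
print(modal_bin)
```

Comparing `n_by_gender` with `total_by_gender` separates "buys more often" from "spends more in total", which a faceted histogram with free scales does not show directly.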

Bivariate visualization

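#Average consumer price by category
#Categories are ordered by mean Purchase (the default of reorder)
ggplot(data, aes(x = reorder(as.factor(Product_Category_1), Purchase),
                 y = Purchase)) +
  #Draw the violin first so the narrower boxplot stays visible on top of it
  geom_violin(fill = "blue", size = .4) +
  geom_boxplot(fill = "pink", width = .3) +
  labs(title = "Average consumer price by category",
       x = "Product Category",
       y = "purchase")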

Visualization explanation

I plot the product category against the purchase amount. The plot shows that category 1 is the most popular with consumers, well ahead of the other categories. Categories 2, 5, and 8 follow at a more moderate level, while categories 10, 11, 12, and 13 are barely popular with consumers. In terms of average spending per category, category 10 is the highest, and its distribution is markedly asymmetric.
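Popularity (number of purchases) and price level (average purchase) are two different quantities, so it can help to tabulate them side by side. This is a minimal sketch in base R, assuming a toy data frame with the `Product_Category_1` and `Purchase` columns; the values are invented.

```r
# Toy stand-in; values are invented for illustration only.
toy <- data.frame(
  Product_Category_1 = c(1, 1, 1, 5, 5, 10, 8),
  Purchase           = c(12000, 15000, 11000, 5000, 7000, 19000, 8000)
)

# Purchases per category (popularity) and typical amount per category
n_by_cat      <- table(toy$Product_Category_1)
mean_by_cat   <- tapply(toy$Purchase, toy$Product_Category_1, mean)
median_by_cat <- tapply(toy$Purchase, toy$Product_Category_1, median)

# table() and tapply() both order by the sorted category values,
# so the vectors line up row by row
cat_summary <- data.frame(
  category = names(n_by_cat),
  n        = as.vector(n_by_cat),
  mean     = as.vector(mean_by_cat),
  median   = as.vector(median_by_cat)
)

# Highest average purchase first
cat_summary <- cat_summary[order(-cat_summary$mean), ]
print(cat_summary)
```

A large gap between `mean` and `median` within a category is a quick numeric hint of the asymmetry that the violin shape shows.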

Limitations of visualization

So far, my analysis does not cover how strongly each gender prefers each product category. I will add this in a later study.
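One possible starting point for that later study is a table of category shares within each gender. The sketch below uses base R and a toy data frame with assumed `Gender` and `Product_Category_1` columns; the values are invented, and this is only one of several ways to measure preference.

```r
# Toy stand-in; values are invented for illustration only.
toy <- data.frame(
  Gender             = c("F", "F", "F", "M", "M", "M", "M"),
  Product_Category_1 = c(1, 5, 5, 1, 1, 8, 5)
)

# Purchases per gender and category
counts <- table(toy$Gender, toy$Product_Category_1)

# Share of each category within a gender (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Row-wise shares make the genders comparable even when one gender has far more purchases than the other.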

Since the dataset describes what people spend during a shopping festival, the statistics should be familiar even to a naive reader. I have also labeled each graph in detail, so the graphs are easy to read.

In the first visualization, the counts for both men and women are abnormally high at 15000 but abnormally low just around 15000. The data alone do not reveal the reason behind this, so it is hard to explain.


