Data Analytics and Computational Social Science: HW6

Haoyan Xiang

Final project checkpoint report

Introduction

Halloween in America would not be the same without all of the chocolates and candies that drive kids crazy. With so many different kinds of sweets being given out and kids trading one for the other, it can make you wonder: Which one is the most popular candy? Which features make a piece of candy more desirable?

The dataset that I will be working with was made available on kaggle.com by Walt Hickey and FiveThirtyEight. It features several binary and continuous (percentile) variables for each of the listed candies, which describe the candy’s physical traits (e.g. taste, texture, ingredients) and its logistical traits (e.g. price percentile relative to other candies, whether or not it comes in a bag/box). The attribute that we will treat as the main response variable is the “win percent.” This win percent was formed via Hickey’s opinion poll, which presented viewers with two images of candy and asked them to select which of the two they would prefer to receive. Thus, the win percent is the percentage of matchups in which a particular candy was chosen over its opponent. More than 269 thousand votes were cast in this poll to create this response variable. Detailed information on the variables is given in the next section.
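
To make the definition of win percent concrete, here is a minimal sketch of how it could be computed from raw matchups. The `matchups` table and the candy labels A, B, and C are hypothetical toy data, not the real poll results, and the poll's actual processing may differ.

```r
# Hypothetical toy matchups (NOT the real poll data): each row is one vote.
matchups <- data.frame(
  winner = c("A", "A", "B", "C", "A", "B"),
  loser  = c("B", "C", "C", "A", "B", "A"),
  stringsAsFactors = FALSE
)

candies <- sort(unique(c(matchups$winner, matchups$loser)))

# Number of wins and number of appearances (wins + losses) per candy
wins        <- table(factor(matchups$winner, levels = candies))
appearances <- wins + table(factor(matchups$loser, levels = candies))

# Win percent = wins / matchups the candy appeared in, on a 0-100 scale
win_percent <- 100 * as.numeric(wins) / as.numeric(appearances)
names(win_percent) <- candies
win_percent
```

In this toy example, candy A wins 3 of its 5 matchups, so its win percent is 60.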

In this report, I am specifically interested in which groups of characteristics make a candy more popular than others. I am also interested in which attributes, both individually (e.g. the chocolate attribute) and jointly (e.g. a chocolate and caramel interaction variable), are statistically significant in predicting the popularity of a candy. To do this, we will construct several visualizations and statistical tests.

Data

Load the cleaned data.

# reading in the dataset from Kaggle 
knitr::opts_chunk$set(echo = TRUE)
setwd("~/Downloads")
library(ggplot2)
library(dplyr)
library(corrplot)
candy = read.csv("candy-data.csv")
head(candy)

  competitorname chocolate fruity caramel peanutyalmondy nougat
1      100 Grand         1      0       1              0      0
2   3 Musketeers         1      0       0              0      1
3       One dime         0      0       0              0      0
4    One quarter         0      0       0              0      0
5      Air Heads         0      1       0              0      0
6     Almond Joy         1      0       0              1      0
  crispedricewafer hard bar pluribus sugarpercent pricepercent
1                1    0   1        0        0.732        0.860
2                0    0   1        0        0.604        0.511
3                0    0   0        0        0.011        0.116
4                0    0   0        0        0.011        0.511
5                0    0   0        0        0.906        0.511
6                0    0   1        0        0.465        0.767
  winpercent
1   66.97173
2   67.60294
3   32.26109
4   46.11650
5   52.34146
6   50.34755

I am working with quite a small dataset here that contains certain attributes of the candies included in the internet-based survey. Our dataset contains 85 different candy competitors and 12 attributes, excluding the candy name. The good thing is that there are no missing values in our dataset. Among those 12 attributes, 9 are binary variables that take the value 1 (yes) or 0 (no). Those binary variables are: chocolate (does it contain chocolate?), fruity (is it fruit flavored?), caramel (is there caramel in the candy?), peanutyalmondy (does it contain peanuts, peanut butter or almonds?), nougat (does it contain nougat?), crispedricewafer (does it contain crisped rice, wafers, or a cookie component?), hard (is it a hard candy?), bar (is it a candy bar?) and pluribus (is it one of many candies in a bag or box?).

The other 3 variables are continuous: sugarpercent (the percentile of sugar the candy falls under within the dataset), pricepercent (the unit price percentile relative to the rest of the dataset) and winpercent (the overall win percentage according to the 269,000 matchups). I will mainly focus on the winpercent variable, as it reflects the popularity of the candies.
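
The claims about missing values and variable types can be checked directly. The sketch below is an illustration only: `candy_sample` holds just the six rows printed by `head(candy)` above, so that the example runs without the CSV file. The same calls can be applied to the full `candy` data frame.

```r
# First six rows of the dataset, copied from the head(candy) output above
candy_sample <- data.frame(
  competitorname   = c("100 Grand", "3 Musketeers", "One dime",
                       "One quarter", "Air Heads", "Almond Joy"),
  chocolate        = c(1, 1, 0, 0, 0, 1),
  fruity           = c(0, 0, 0, 0, 1, 0),
  caramel          = c(1, 0, 0, 0, 0, 0),
  peanutyalmondy   = c(0, 0, 0, 0, 0, 1),
  nougat           = c(0, 1, 0, 0, 0, 0),
  crispedricewafer = c(1, 0, 0, 0, 0, 0),
  hard             = c(0, 0, 0, 0, 0, 0),
  bar              = c(1, 1, 0, 0, 0, 1),
  pluribus         = c(0, 0, 0, 0, 0, 0),
  sugarpercent     = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  pricepercent     = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767),
  winpercent       = c(66.97173, 67.60294, 32.26109,
                       46.11650, 52.34146, 50.34755)
)

# Count missing values per column (all zeros means no missing data)
na_counts <- colSums(is.na(candy_sample))

# Flag the columns (excluding the name) whose values are all 0 or 1
is_binary <- sapply(candy_sample[, -1], function(x) all(x %in% c(0, 1)))

na_counts
names(is_binary)[is_binary]    # the binary attributes
names(is_binary)[!is_binary]   # the continuous attributes
```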

Visualization

I will now take a peek at the ranking of the candies by win percentage. Reese’s products dominate the top 10 by winpercent, and every candy in the top 10 contains chocolate. The chocolate component seems to have some impact on a candy’s popularity, so I will investigate chocolate’s influence further later on. The table below answers my first research question: which Halloween candy is the most popular?

# Getting the top 10 candies based on winpercent
knitr::opts_chunk$set(echo = TRUE)
top10 = head(candy[order(candy$winpercent,decreasing = TRUE),],n=10)
top10

                competitorname chocolate fruity caramel
53   Reese's Peanut Butter cup         1      0       0
52          Reese's Miniatures         1      0       0
80                        Twix         1      0       1
29                     Kit Kat         1      0       0
65                    Snickers         1      0       1
54              Reese's pieces         1      0       0
37                   Milky Way         1      0       1
55 Reese's stuffed with pieces         1      0       0
33         Peanut butter M&M's         1      0       0
43         Nestle Butterfinger         1      0       0
   peanutyalmondy nougat crispedricewafer hard bar pluribus
53              1      0                0    0   0        0
52              1      0                0    0   0        0
80              0      0                1    0   1        0
29              0      0                1    0   1        0
65              1      1                0    0   1        0
54              1      0                0    0   0        1
37              0      1                0    0   1        0
55              1      0                0    0   0        0
33              1      0                0    0   0        1
43              1      0                0    0   1        0
   sugarpercent pricepercent winpercent
53        0.720        0.651   84.18029
52        0.034        0.279   81.86626
80        0.546        0.906   81.64291
29        0.313        0.511   76.76860
65        0.546        0.651   76.67378
54        0.406        0.651   73.43499
37        0.604        0.651   73.09956
55        0.988        0.651   72.88790
33        0.825        0.651   71.46505
43        0.604        0.767   70.73564

Now, I will move on to the next question: which groups of candy characteristics make a candy more popular than others? First, I draw a correlation plot of all the variables, which shows how strongly each pair of variables is correlated. Focusing first on the relationship between the categorical variables and winpercent, I see that chocolate, peanutyalmondy, and bar are positively related to winpercent, whereas fruity and hard are negatively related. Next, I look at the correlations among the categorical variables themselves, and many of them are correlated. To name a few: chocolate is positively related to bar, chocolate is negatively related to fruity, and bar is positively related to nougat. This indicates that the attributes tend to occur together rather than independently. In other words, a chocolate candy is more likely to come in bar form and less likely to be fruit flavored.

The advantage of using a correlation plot at the beginning of an analysis is to give you an overall picture of the relationship between your variables. It helps you identify some important/interesting variables that you might look into later in the analysis.

knitr::opts_chunk$set(echo = TRUE)
# removing the first column, candy name
candydatacor<-cor(candy[,-1])
corrplot(candydatacor)
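
The correlation plot can be complemented by printing the correlations with winpercent as numbers. The following is a minimal sketch that uses only the six rows shown in the `head(candy)` output and a subset of columns (an assumption made so the example runs on its own), so the values it prints are not the full-dataset correlations. On the full data, the same indexing can be applied to `candydatacor`.

```r
# Six rows and a subset of columns copied from the head(candy) output above
candy_sample <- data.frame(
  chocolate    = c(1, 1, 0, 0, 0, 1),
  fruity       = c(0, 0, 0, 0, 1, 0),
  bar          = c(1, 1, 0, 0, 0, 1),
  sugarpercent = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  pricepercent = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767),
  winpercent   = c(66.97173, 67.60294, 32.26109,
                   46.11650, 52.34146, 50.34755)
)

cor_mat <- cor(candy_sample)

# Keep only the winpercent column, drop its self-correlation,
# and sort from most positive to most negative
predictors   <- setdiff(rownames(cor_mat), "winpercent")
cor_with_win <- sort(cor_mat[predictors, "winpercent"], decreasing = TRUE)
cor_with_win
```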

Now that I have a sense of the relationships between the variables, I will plot individual variables below.

First, I will start with chocolate, as it is strongly related to winpercent. This can be verified by the plot below, as the winpercent of chocolate-flavored candies is much higher than that of non-chocolate candies.

knitr::opts_chunk$set(echo = TRUE)
new_candy <- candy %>% 
      mutate_at(c(2:10), recode, `1`= "TRUE", `0`="FALSE")
ggplot(data = new_candy) +
    geom_bar(mapping = aes(x = chocolate, y = winpercent), stat = "identity", position = "dodge")+
    labs(title="Winpercent based on chocolate",
         x = "Is the candy chocolate?")
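
One caveat about reading this plot: with `stat = "identity"` and one row per candy, the bars within each category appear to be drawn on top of one another, so the visible height reflects the highest winpercent in the group rather than the average. A simple numeric check is to compare group means. The sketch below uses only the six rows from the `head(candy)` output as sample data (an assumption for self-containment). On the full data, replace `candy_sample` with `candy`.

```r
# Six rows copied from the head(candy) output above
candy_sample <- data.frame(
  chocolate  = c(1, 1, 0, 0, 0, 1),
  winpercent = c(66.97173, 67.60294, 32.26109,
                 46.11650, 52.34146, 50.34755)
)

# Mean winpercent for non-chocolate (0) and chocolate (1) candies
group_means <- aggregate(winpercent ~ chocolate, data = candy_sample, FUN = mean)
group_means
```

In ggplot2 (version 3.3.0 or later), passing `stat = "summary", fun = "mean"` to `geom_bar()` instead of `stat = "identity"` should draw these group means directly.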

I tried to identify the interaction between chocolate and fruity. The plot below suggests that candies with chocolate only are more popular than candies that are both chocolate and fruit flavored. One possible reason is that chocolate and fruit flavors do not taste good together in a single candy.

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = new_candy) +
    geom_bar(mapping = aes(x = chocolate, y = winpercent,fill = fruity), stat = "identity", position = "dodge")+
    facet_wrap(c("fruity")) +
    labs(title="Candies popularity based on chocolate and fruity",
         x = "Is the candy chocolate?")

I also wondered whether chocolate bar candies are more popular than others, so I plotted the graph below. Unfortunately, I do not see a clear indication of this, meaning that I do not have enough evidence to tell whether chocolate bar candies are more popular.

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = new_candy) +
    geom_bar(mapping = aes(x = chocolate, y = winpercent,fill = bar), stat = "identity", position = "dodge")+
    facet_wrap(c("bar")) +
    labs(title="Candies popularity based on chocolate and bar",
         x = "Is the candy chocolate?")

In the next few plots, I investigate the relationships between the numerical variables, where a scatterplot with a fitted line is useful. Below is a plot of pricepercent against winpercent. I can see that pricepercent is weakly positively related to winpercent. In other words, higher-priced candies tend to be slightly more popular.

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = candy, aes(x = pricepercent, y = winpercent,label = competitorname)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    geom_text(check_overlap = T, vjust = "bottom", nudge_y = 0.01, angle = 30, size = 2)
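
The line drawn by `geom_smooth(method = "lm")` is an ordinary least squares fit, so its slope and the strength of the relationship can also be read off numerically. This is a minimal sketch on the six rows from the `head(candy)` output (sample data only, so the estimates are not those of the full dataset). On the full data, use `candy` in place of `candy_sample`.

```r
# Six rows copied from the head(candy) output above
candy_sample <- data.frame(
  pricepercent = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767),
  winpercent   = c(66.97173, 67.60294, 32.26109,
                   46.11650, 52.34146, 50.34755)
)

# Same model that geom_smooth(method = "lm") fits for this plot
fit <- lm(winpercent ~ pricepercent, data = candy_sample)
coef(fit)   # intercept and slope of the fitted line

# Pearson correlation with a test of the null hypothesis of zero correlation
ct <- cor.test(candy_sample$pricepercent, candy_sample$winpercent)
ct$estimate
ct$p.value
```

The same two calls work for the sugarpercent plots by swapping the variable names.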

Here, I have sugarpercent versus winpercent. There appears to be little relationship between them; that is, adding more sugar to a candy does not necessarily make it more popular.

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = candy, aes(x = sugarpercent, y = winpercent,label = competitorname)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    geom_text(check_overlap = T, vjust = "bottom", nudge_y = 0.01, angle = 30, size = 2)

Lastly, I plot sugarpercent against pricepercent below. There appears to be a positive relationship between them. One plausible explanation is that adding more sugar to a candy increases its cost, which in turn increases the sale price.

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = candy, aes(x = sugarpercent, y = pricepercent,label = competitorname)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    geom_text(check_overlap = T, vjust = "bottom", nudge_y = 0.01, angle = 30, size = 2)

Now that I have plotted several visualizations of the variables, I would like a more rigorous way to identify which variables are actually related to winpercent, because some apparent relationships could have arisen by chance. After some research, I found that analysis of variance (ANOVA) examines the influence of categorical variables on one continuous dependent variable, which is winpercent in my case. Two-way ANOVA handles two categorical factors; the model below applies the same idea to all nine binary attributes as main effects, without interaction terms.

knitr::opts_chunk$set(echo = TRUE)
res.aov <- aov(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat +crispedricewafer + hard +
                 bar + pluribus, data = candy)
# Summary of the analysis
summary(res.aov)

                 Df Sum Sq Mean Sq F value   Pr(>F)    
chocolate         1   7369    7369  62.628 1.74e-11 ***
fruity            1    336     336   2.857  0.09514 .  
caramel           1    147     147   1.253  0.26664    
peanutyalmondy    1    861     861   7.319  0.00844 ** 
nougat            1      0       0   0.002  0.96898    
crispedricewafer  1    407     407   3.455  0.06698 .  
hard              1    240     240   2.043  0.15704    
bar               1      2       2   0.019  0.88962    
pluribus          1      0       0   0.003  0.95779    
Residuals        75   8824     118                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
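
In the table above, chocolate (p = 1.74e-11) and peanutyalmondy (p = 0.00844) fall below the 0.05 level. When reading it, keep in mind that `aov()` reports sequential (Type I) sums of squares: each term is adjusted only for the terms listed before it, so the order of the formula matters when the attributes are correlated. The sketch below illustrates this with chocolate and bar. It uses only the 16 rows printed earlier in this report (the six `head(candy)` rows plus the top 10), which is a biased sample chosen so the example runs on its own. The numbers it prints are not the full-dataset results.

```r
# 16 rows copied from the head(candy) and top10 outputs above
candy_sample <- data.frame(
  chocolate  = c(1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  bar        = c(1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1),
  winpercent = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755,
                 84.18029, 81.86626, 81.64291, 76.76860, 76.67378,
                 73.43499, 73.09956, 72.88790, 71.46505, 70.73564)
)

# Same two predictors, entered in opposite orders
tab_cb <- anova(lm(winpercent ~ chocolate + bar, data = candy_sample))
tab_bc <- anova(lm(winpercent ~ bar + chocolate, data = candy_sample))

tab_cb   # bar is adjusted for chocolate
tab_bc   # chocolate is adjusted for bar
```

The total sum of squares explained by the two terms is the same in both tables, but how it is split between them depends on the order. This may be part of why bar, which is correlated with chocolate, contributes so little once chocolate is already in the model.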

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = new_candy) +
    geom_bar(mapping = aes(x = peanutyalmondy, y = winpercent), stat = "identity", position = "dodge")+
    labs(title="Winpercent based on peanutyalmondy",
         x = "Does the candy contain peanuts, peanut butter, or almonds?")

Reflection

This is my first project using R Markdown and I have learned a lot from it. Picking a dataset that is interesting to me is important. I searched through many datasets on Kaggle and finally chose the Halloween candy dataset for my final project. After going over the dataset, I became curious about the properties of a popular candy, and this led to my research questions: which candies are the most popular, and which attributes make a candy popular for Halloween?

To answer the second question, I initially planned to plot winpercent against each of the categorical variables. For instance, are candies with chocolate flavor more popular than those without? Do people prefer hard candies or soft candies? How about candies in bar shape? However, there are 9 categorical variables plus 2 numeric ones, and making a separate plot for each of the 11 might not be ideal. Therefore, I used a correlation plot to filter out variables that are not closely related to popularity. For instance, adding more sugar to a candy does not make it more popular. Once I had a subset of variables that seemed to have some impact on a candy’s popularity, I drew bar plots to check my hypotheses. One of my hypotheses was that chocolate-flavored candies are more popular than non-chocolate candies. I did similar work for the numeric variables (sugarpercent and pricepercent), where I drew scatterplots to see whether there is a linear correlation with winpercent.

So far, I have examined the correlation between one independent variable and the dependent variable (winpercent), but I have not looked closely at interactions. I drew some bar plots with a second variable using facet_wrap, but found it challenging to identify interactions through graphs when they are not obvious.

Lastly, I performed a statistical test called analysis of variance (ANOVA), which tells me whether there is a statistically significant relationship between a categorical variable and a numerical variable. I wish I had known about this earlier, so that I would not have spent time trying to modify my dataset and visualizations to check for statistical significance.

For my next steps, I will treat these visualizations as an exploratory analysis and build statistical models to understand, in a more rigorous way, which variables make a candy popular.

Conclusion

In this report, I have found that Reese’s Peanut Butter cup is the most popular Halloween candy in the poll and that Reese’s holds multiple spots in the top 10. It seems that people love Reese’s candies very much. I also found that people prefer chocolate candy and do not seem to like fruit flavor added to a chocolate candy. I did not see evidence that chocolate bar candies are more popular. Lastly, more expensive candies tend to be somewhat more popular than cheaper ones.

One question that is still unanswered is which combinations of variables influence the popularity of a candy. Given that there are 11 predictor variables (9 binary and 2 continuous), there are 55 different pairwise combinations, and plotting 55 graphs and drawing conclusions from each does not seem like good practice. So one of my next steps will be to find a more systematic method to answer this question.
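
The count of 55 is the number of ways to choose 2 of the 11 predictors. As a small sketch, the pairs can be enumerated in R, and a model formula containing every pairwise interaction can be built without typing each term. The formula is shown only as one possible starting point (an assumption, not a method used in this report). With 85 candies, a model with this many terms would likely be too large to fit reliably.

```r
# The 11 predictor columns named in the Data section
predictors <- c("chocolate", "fruity", "caramel", "peanutyalmondy", "nougat",
                "crispedricewafer", "hard", "bar", "pluribus",
                "sugarpercent", "pricepercent")

# Number of distinct pairs: 11 choose 2
n_pairs <- choose(length(predictors), 2)
n_pairs

# Each column of `pairs` is one pair of predictors
pairs <- combn(predictors, 2)
pairs[, 1:3]

# Formula with all main effects plus all two-way interactions
f <- as.formula(paste("winpercent ~ (", paste(predictors, collapse = " + "), ")^2"))
n_terms <- length(attr(terms(f), "term.labels"))
n_terms   # 11 main effects + 55 interactions
```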

Bibliography

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
FiveThirtyEight. (2017, October 31). The ultimate halloween candy power ranking. Kaggle. Retrieved January 13, 2022, from https://www.kaggle.com/fivethirtyeight/the-ultimate-halloween-candy-power-ranking
Wikimedia Foundation. (2022, January 10). Two-way analysis of variance. Wikipedia. Retrieved January 13, 2022, from https://en.wikipedia.org/wiki/Two-way_analysis_of_variance
Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O'Reilly Media.
