Data Analytics and Computational Social Science: Final Paper

Haoyan Xiang

Final paper

Introduction

Halloween in America would not be the same without all of the chocolates and candies that drive kids crazy. With so many different kinds of sweets being given out and kids trading one for the other, it can make you wonder: Which one is the most popular candy? Which features make a piece of candy more desirable?

The dataset that I will be working with was made available on kaggle.com by Walt Hickey and FiveThirtyEight. It features several binary and continuous (percentile) variables for each of the listed candies, which describe the candy’s physical traits (e.g. taste, texture, ingredients) and its logistical traits (e.g. price percentile relative to other candies, whether or not it comes in a bag or box). The attribute that we will treat as the main response variable is the “win percent.” It was derived from Hickey’s opinion poll, which presented viewers with two images of candy and asked them to select which of the two they would prefer to receive. Thus, the win percent is the percentage of matchups in which a particular candy was chosen over its opponent. More than 269 thousand votes were cast in this poll to create this response variable. Detailed information on the variables is given in the next section.
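
As a rough illustration of how a win percent can be derived from pairwise matchups, the sketch below tallies wins and appearances for three placeholder candies. The vote log, the candy names "A", "B", "C" and the column names are invented for this example; the actual poll data is not part of the Kaggle file.

```r
# Hypothetical matchup log: each row is one vote between two candies.
# Names and outcomes are invented for illustration only.
votes <- data.frame(
  candy_a = c("A", "A", "B", "A", "B", "C"),
  candy_b = c("B", "C", "C", "B", "C", "A"),
  winner  = c("A", "A", "C", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Appearances: how often each candy was shown, on either side.
appearances <- table(c(votes$candy_a, votes$candy_b))

# Wins: how often each candy was picked (same ordering as appearances).
wins <- table(factor(votes$winner, levels = names(appearances)))

# Win percent = wins / appearances * 100.
win_percent <- 100 * as.numeric(wins) / as.numeric(appearances)
names(win_percent) <- names(appearances)
win_percent
```

In this toy log every candy appears four times, so the win percent is simply the number of wins divided by four, times 100.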

In this report, I am specifically interested in which group of candy characteristics makes a candy more popular than the others. I am also interested in which attributes, both individually (e.g. the chocolate attribute) and jointly (e.g. a chocolate and caramel interaction), are statistically significant in predicting the popularity of a candy. To do this, I will construct several visualizations and run a statistical test.

Data

Load the cleaned data.

knitr::opts_chunk$set(echo = TRUE)
setwd("~/Downloads")
library(ggplot2)
library(ggrepel)
library(dplyr)
library(corrplot)
# read in the dataset downloaded from Kaggle
candy = read.csv("candy-data.csv")
head(candy)

  competitorname chocolate fruity caramel peanutyalmondy nougat
1      100 Grand         1      0       1              0      0
2   3 Musketeers         1      0       0              0      1
3       One dime         0      0       0              0      0
4    One quarter         0      0       0              0      0
5      Air Heads         0      1       0              0      0
6     Almond Joy         1      0       0              1      0
  crispedricewafer hard bar pluribus sugarpercent pricepercent
1                1    0   1        0        0.732        0.860
2                0    0   1        0        0.604        0.511
3                0    0   0        0        0.011        0.116
4                0    0   0        0        0.011        0.511
5                0    0   0        0        0.906        0.511
6                0    0   1        0        0.465        0.767
  winpercent
1   66.97173
2   67.60294
3   32.26109
4   46.11650
5   52.34146
6   50.34755

I am working with quite a small dataset that describes the candies included in the internet-based survey. It contains 85 different candy competitors and 12 attributes, excluding the candy name. There are no missing values. Among those 12 attributes, 9 are binary variables that take the value 1 (yes) or 0 (no): chocolate (does it contain chocolate?), fruity (is it fruit flavored?), caramel (is there caramel in the candy?), peanutyalmondy (does it contain peanuts, peanut butter or almonds?), nougat (does it contain nougat?), crispedricewafer (does it contain crisped rice, wafers, or a cookie component?), hard (is it a hard candy?), bar (is it a candy bar?) and pluribus (is it one of many candies in a bag or box?).

The other 3 variables are continuous: sugarpercent (the percentile of sugar it falls under within the data set), pricepercent (the unit price percentile relative to the rest of the data set) and winpercent (the overall win percentage according to the 269,000 matchups). I will mainly focus on the winpercent variable, as it reflects the popularity of candies.
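
The claims about missing values and binary columns can be checked with a few lines of base R. The sketch below re-types the six rows printed by head(candy) so that it runs without the CSV file; the names candy_head, missing_per_column and binary_columns are introduced here for illustration. On the full data the same two checks would be applied to candy.

```r
# The six rows printed by head(candy), re-typed so the sketch is self-contained.
candy_head <- data.frame(
  competitorname   = c("100 Grand", "3 Musketeers", "One dime",
                       "One quarter", "Air Heads", "Almond Joy"),
  chocolate        = c(1, 1, 0, 0, 0, 1),
  fruity           = c(0, 0, 0, 0, 1, 0),
  caramel          = c(1, 0, 0, 0, 0, 0),
  peanutyalmondy   = c(0, 0, 0, 0, 0, 1),
  nougat           = c(0, 1, 0, 0, 0, 0),
  crispedricewafer = c(1, 0, 0, 0, 0, 0),
  hard             = c(0, 0, 0, 0, 0, 0),
  bar              = c(1, 1, 0, 0, 0, 1),
  pluribus         = c(0, 0, 0, 0, 0, 0),
  sugarpercent     = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  pricepercent     = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767),
  winpercent       = c(66.97173, 67.60294, 32.26109,
                       46.11650, 52.34146, 50.34755)
)

# Number of missing values per column (all zeros means nothing is missing).
missing_per_column <- colSums(is.na(candy_head))

# A column counts as binary here if every value is 0 or 1.
is_binary <- sapply(candy_head[, -1], function(x) all(x %in% c(0, 1)))
binary_columns <- names(is_binary)[is_binary]

missing_per_column
binary_columns
```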

Visualization

I will now take a peek at the ranking of the candies by win percentage. Reese’s dominates the top 10 by winpercent, and all ten of those candies contain chocolate. The chocolate component seems to have some impact on a candy’s popularity, so I will investigate chocolate’s influence later on. The table below answers my first research question: what is the most popular Halloween candy?

# Getting the top 10 candies based on winpercent
knitr::opts_chunk$set(echo = TRUE)
top10 = head(candy[order(candy$winpercent,decreasing = TRUE),],n=10)
top10

                competitorname chocolate fruity caramel
53   Reese's Peanut Butter cup         1      0       0
52          Reese's Miniatures         1      0       0
80                        Twix         1      0       1
29                     Kit Kat         1      0       0
65                    Snickers         1      0       1
54              Reese's pieces         1      0       0
37                   Milky Way         1      0       1
55 Reese's stuffed with pieces         1      0       0
33         Peanut butter M&M's         1      0       0
43         Nestle Butterfinger         1      0       0
   peanutyalmondy nougat crispedricewafer hard bar pluribus
53              1      0                0    0   0        0
52              1      0                0    0   0        0
80              0      0                1    0   1        0
29              0      0                1    0   1        0
65              1      1                0    0   1        0
54              1      0                0    0   0        1
37              0      1                0    0   1        0
55              1      0                0    0   0        0
33              1      0                0    0   0        1
43              1      0                0    0   1        0
   sugarpercent pricepercent winpercent
53        0.720        0.651   84.18029
52        0.034        0.279   81.86626
80        0.546        0.906   81.64291
29        0.313        0.511   76.76860
65        0.546        0.651   76.67378
54        0.406        0.651   73.43499
37        0.604        0.651   73.09956
55        0.988        0.651   72.88790
33        0.825        0.651   71.46505
43        0.604        0.767   70.73564

Now, I will move on to the next question: which group of candy characteristics makes a candy more popular than the others? First, I draw a correlation plot of all the variables, which shows how correlated the variables are with each other. Focusing first on the relationship between the categorical variables and winpercent, I see that chocolate, peanutyalmondy, and bar are positively related to winpercent, whereas fruity and hard are negatively related. Next, I look at the correlations among the categorical variables themselves. Many of them are correlated. To name a few, chocolate is positively related to bar, chocolate is negatively related to fruity, and bar is positively related to nougat. This indicates that some attributes tend to occur together. In other words, a chocolate candy is likely to be bar shaped and unlikely to be fruit flavored.

The advantage of using a correlation plot at the beginning of an analysis is that it gives an overall picture of the relationships between the variables. It helps identify important or interesting variables to look into later in the analysis.

knitr::opts_chunk$set(echo = TRUE)
# removing the first column, candy name
candydatacor<-cor(candy[,-1])
corrplot(candydatacor)
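
Besides reading the plot, the winpercent column of the correlation matrix can be pulled out and sorted. The sketch below does this on the six rows printed by head(candy) and a handful of columns only, so the numbers it produces are not the full-data correlations; candy_head, win_cor and ranked are names introduced for this example. On the full data, the same two steps would be applied to candydatacor.

```r
# Six rows and six columns re-typed from the head(candy) output, so the sketch
# runs without the CSV. Too few rows for meaningful correlations.
candy_head <- data.frame(
  chocolate    = c(1, 1, 0, 0, 0, 1),
  fruity       = c(0, 0, 0, 0, 1, 0),
  bar          = c(1, 1, 0, 0, 0, 1),
  sugarpercent = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  pricepercent = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767),
  winpercent   = c(66.97173, 67.60294, 32.26109,
                   46.11650, 52.34146, 50.34755)
)

cor_matrix <- cor(candy_head)

# Keep the correlations with winpercent, dropping winpercent's correlation
# with itself, and sort from most positive to most negative.
win_cor <- cor_matrix[, "winpercent"]
win_cor <- win_cor[names(win_cor) != "winpercent"]
ranked <- sort(win_cor, decreasing = TRUE)
ranked
```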

Now that I have a sense of the relationships between the variables, I will graph individual plots below.

First, I will start with chocolate, as it is strongly related to winpercent. This can be verified by the plot below, where the mean winpercent of chocolate flavored candies is much higher than that of non-chocolate flavored candies.

knitr::opts_chunk$set(echo = TRUE)
# rescale the percentile variables from 0-1 to 0-100
candy$pricepercent = candy$pricepercent * 100
candy$sugarpercent = candy$sugarpercent * 100

# relabel the 0/1 indicators for readable plot axes
new_candy <- candy %>% mutate(chocolate=recode(chocolate,
                                               `1`="Chocolate",
                                               `0`="Non-Chocolate"),
                              fruity=recode(fruity,
                                            `1`="Fruity",
                                            `0`="Non-Fruity"),
                              bar=recode(bar,
                                         `1`="Bar",
                                         `0`="Non-Bar"))

# stat = "summary" with fun = "mean" draws one bar per group at the group mean;
# stat = "identity" would overlay one bar per candy and show only the maximum
ggplot(data = new_candy) +
    geom_bar(mapping = aes(x = chocolate, y = winpercent), stat = "summary", fun = "mean")+
    labs(title="Mean winpercent based on chocolate",
         x = "Is the candy chocolate?")+
    scale_y_continuous(limits=c(0,100))
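
A numeric summary can complement the bar chart. The sketch below computes the mean winpercent per chocolate group with aggregate(), using only the six rows printed by head(candy) so that it runs on its own; candy_head and group_means are names introduced here, and the resulting means are for those six rows, not for all 85 candies.

```r
# Six rows re-typed from the head(candy) output (chocolate flag and winpercent).
candy_head <- data.frame(
  competitorname = c("100 Grand", "3 Musketeers", "One dime",
                     "One quarter", "Air Heads", "Almond Joy"),
  chocolate      = c(1, 1, 0, 0, 0, 1),
  winpercent     = c(66.97173, 67.60294, 32.26109,
                     46.11650, 52.34146, 50.34755)
)

# Mean winpercent for non-chocolate (0) and chocolate (1) candies.
group_means <- aggregate(winpercent ~ chocolate, data = candy_head, FUN = mean)
group_means
```

Replacing candy_head with the full candy data frame gives the values that a bar chart of group means displays.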

I tried to identify the interaction between chocolate and fruity. The plot below suggests that chocolate-only candies are more popular than candies that are both chocolate and fruit flavored. One possible explanation is that the two flavors do not combine well in a single candy.

knitr::opts_chunk$set(echo = TRUE)
# bars show the mean winpercent of each chocolate/fruity combination
ggplot(data = new_candy) +
    geom_bar(mapping = aes(x = chocolate, y = winpercent,fill = fruity), stat = "summary", fun = "mean")+
    facet_wrap(c("fruity")) +
    labs(title="Candies popularity based on chocolate and fruity",
         x = "Is the candy chocolate?") +
    scale_y_continuous(limits=c(0,100))

I also wondered whether chocolate bar candies are more popular than others, so I drew the plot below. Unfortunately, I do not see a clear indication of this, meaning that I do not have enough evidence to tell whether chocolate bar candies are more popular.

knitr::opts_chunk$set(echo = TRUE)
# bars show the mean winpercent of each chocolate/bar combination
ggplot(data = new_candy) +
    geom_bar(mapping = aes(x = chocolate, y = winpercent,fill = bar), stat = "summary", fun = "mean")+
    facet_wrap(c("bar")) +
    labs(title="Candies popularity based on chocolate and bar",
         x = "Is the candy chocolate?") +
    scale_y_continuous(limits=c(0,100))

In the next few plots, I investigate the relationships between numerical variables, for which a scatterplot with a fitted line is useful. Below is a plot of pricepercent against winpercent. I can see that pricepercent is weakly positively related to winpercent. In other words, more expensive candies tend to be somewhat more popular.

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = candy, aes(x = pricepercent, y = winpercent,label = competitorname)) +
    geom_point(alpha=0.6) + 
    geom_smooth(method = "lm") +
    scale_y_continuous(limits=c(0,100)) +
    coord_fixed() # add tables for top & bottom win/price

knitr::opts_chunk$set(echo = TRUE)
# Top 5 candies based on winpercent
win = head(candy[order(candy$winpercent,decreasing = TRUE),],n=5)
win[,c('competitorname','winpercent','pricepercent')]

              competitorname winpercent pricepercent
53 Reese's Peanut Butter cup   84.18029         65.1
52        Reese's Miniatures   81.86626         27.9
80                      Twix   81.64291         90.6
29                   Kit Kat   76.76860         51.1
65                  Snickers   76.67378         65.1

# Bottom 5 candies based on winpercent
win = head(candy[order(candy$winpercent,decreasing = FALSE),],n=5)
win[,c('competitorname','winpercent','pricepercent')]

       competitorname winpercent pricepercent
45          Nik L Nip   22.44534         97.6
8  Boston Baked Beans   23.41782         51.1
13           Chiclets   24.52499         32.5
73       Super Bubble   27.30386         11.6
27         Jawbusters   28.12744         51.1

# Top 5 candies based on pricepercent
win = head(candy[order(candy$pricepercent,decreasing = TRUE),],n=5)
win[,c('competitorname','winpercent','pricepercent')]

             competitorname winpercent pricepercent
45                Nik L Nip   22.44534         97.6
63          Nestle Smarties   37.88719         97.6
56                 Ring pop   35.29076         96.5
24        Hershey's Krackel   62.28448         91.8
25 Hershey's Milk Chocolate   56.49050         91.8

# Bottom 5 candies based on pricepercent
win = head(candy[order(candy$pricepercent,decreasing = FALSE),],n=5)
win[,c('competitorname','winpercent','pricepercent')]

         competitorname winpercent pricepercent
77 Tootsie Roll Midgies   45.73675          1.1
49         Pixie Sticks   37.72234          2.3
15             Dum Dums   39.46056          3.4
16          Fruit Chews   43.08892          3.4
70  Strawberry bon bons   34.57899          5.8
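
To quantify the strength of a linear relationship such as the one between pricepercent and winpercent, base R offers cor() and cor.test(). The sketch below applies them to the ten candies from the two pricepercent tables above (the five most and five least expensive), so that it runs without the CSV; because these rows are the price extremes, the result is only an illustration and not the full-data correlation. The names price_extremes, r and price_test are introduced for this example.

```r
# Ten rows re-typed from the "top 5" and "bottom 5 by pricepercent" tables.
price_extremes <- data.frame(
  competitorname = c("Nik L Nip", "Nestle Smarties", "Ring pop",
                     "Hershey's Krackel", "Hershey's Milk Chocolate",
                     "Tootsie Roll Midgies", "Pixie Sticks", "Dum Dums",
                     "Fruit Chews", "Strawberry bon bons"),
  winpercent     = c(22.44534, 37.88719, 35.29076, 62.28448, 56.49050,
                     45.73675, 37.72234, 39.46056, 43.08892, 34.57899),
  pricepercent   = c(97.6, 97.6, 96.5, 91.8, 91.8,
                     1.1, 2.3, 3.4, 3.4, 5.8)
)

# Pearson correlation coefficient.
r <- cor(price_extremes$pricepercent, price_extremes$winpercent)

# Test of the null hypothesis that the true correlation is zero.
price_test <- cor.test(price_extremes$pricepercent, price_extremes$winpercent)
price_test$estimate   # same value as r
price_test$p.value    # small values speak against "no linear relationship"
price_test$conf.int   # 95% confidence interval for the correlation
```

On the full data, the two calls would take candy$pricepercent and candy$winpercent as arguments.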

Here, I plot sugarpercent against winpercent. There is little relationship between them; that is, adding more sugar to a candy does not necessarily make it more popular.

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = candy, aes(x = sugarpercent, y = winpercent,label = competitorname)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    scale_y_continuous(limits=c(0,100)) +
    coord_fixed()

knitr::opts_chunk$set(echo = TRUE)
# Top 5 candies based on winpercent
win = head(candy[order(candy$winpercent,decreasing = TRUE),],n=5)
win[,c('competitorname','winpercent','sugarpercent')]

              competitorname winpercent sugarpercent
53 Reese's Peanut Butter cup   84.18029         72.0
52        Reese's Miniatures   81.86626          3.4
80                      Twix   81.64291         54.6
29                   Kit Kat   76.76860         31.3
65                  Snickers   76.67378         54.6

# Bottom 5 candies based on winpercent
win = head(candy[order(candy$winpercent,decreasing = FALSE),],n=5)
win[,c('competitorname','winpercent','sugarpercent')]

       competitorname winpercent sugarpercent
45          Nik L Nip   22.44534         19.7
8  Boston Baked Beans   23.41782         31.3
13           Chiclets   24.52499          4.6
73       Super Bubble   27.30386         16.2
27         Jawbusters   28.12744          9.3

# Top 5 candies based on sugarpercent
win = head(candy[order(candy$sugarpercent,decreasing = TRUE),],n=5)
win[,c('competitorname','winpercent','sugarpercent')]

                competitorname winpercent sugarpercent
55 Reese's stuffed with pieces   72.88790         98.8
39    Milky Way Simply Caramel   64.35334         96.5
71                Sugar Babies   33.43755         96.5
61           Skittles original   63.08514         94.1
62          Skittles wildberry   55.10370         94.1

# Bottom 5 candies based on sugarpercent
win = head(candy[order(candy$sugarpercent,decreasing = FALSE),],n=5)
win[,c('competitorname','winpercent','sugarpercent')]

       competitorname winpercent sugarpercent
3            One dime   32.26109          1.1
4         One quarter   46.11650          1.1
52 Reese's Miniatures   81.86626          3.4
13           Chiclets   24.52499          4.6
31          Lemonhead   39.14106          4.6

Lastly, I plot sugarpercent against pricepercent below. There appears to be a positive relationship between them. One possible explanation is that adding more sugar increases the cost of a candy, which in turn raises its sale price.

knitr::opts_chunk$set(echo = TRUE)
ggplot(data = candy, aes(x = sugarpercent, y = pricepercent,label = competitorname)) +
    geom_point() + 
    geom_smooth(method = "lm") +
    scale_y_continuous(limits=c(0,100)) +
    coord_fixed()

knitr::opts_chunk$set(echo = TRUE)
# Top 5 candies based on pricepercent
win = head(candy[order(candy$pricepercent,decreasing = TRUE),],n=5)
win[,c('competitorname','pricepercent','sugarpercent')]

             competitorname pricepercent sugarpercent
45                Nik L Nip         97.6         19.7
63          Nestle Smarties         97.6         26.7
56                 Ring pop         96.5         73.2
24        Hershey's Krackel         91.8         43.0
25 Hershey's Milk Chocolate         91.8         43.0

# Bottom 5 candies based on pricepercent
win = head(candy[order(candy$pricepercent,decreasing = FALSE),],n=5)
win[,c('competitorname','pricepercent','sugarpercent')]

         competitorname pricepercent sugarpercent
77 Tootsie Roll Midgies          1.1         17.4
49         Pixie Sticks          2.3          9.3
15             Dum Dums          3.4         73.2
16          Fruit Chews          3.4         12.7
70  Strawberry bon bons          5.8         56.9

# Top 5 candies based on sugarpercent
win = head(candy[order(candy$sugarpercent,decreasing = TRUE),],n=5)
win[,c('competitorname','pricepercent','sugarpercent')]

                competitorname pricepercent sugarpercent
55 Reese's stuffed with pieces         65.1         98.8
39    Milky Way Simply Caramel         86.0         96.5
71                Sugar Babies         76.7         96.5
61           Skittles original         22.0         94.1
62          Skittles wildberry         22.0         94.1

# Bottom 5 candies based on sugarpercent
win = head(candy[order(candy$sugarpercent,decreasing = FALSE),],n=5)
win[,c('competitorname','pricepercent','sugarpercent')]

       competitorname pricepercent sugarpercent
3            One dime         11.6          1.1
4         One quarter         51.1          1.1
52 Reese's Miniatures         27.9          3.4
13           Chiclets         32.5          4.6
31          Lemonhead         10.4          4.6

Now that I have plotted several visualizations of the variables, I start to wonder if there is a more rigorous approach to identifying the variables that are actually related to winpercent, because some apparent relationships could be due to chance. After some research, I found that analysis of variance (ANOVA) with several factors, of which two-way ANOVA is the two-factor case, examines the influence of multiple categorical variables on one continuous dependent variable, which is winpercent in my case. The model below includes three factors (chocolate, fruity and peanutyalmondy) as main effects only.

knitr::opts_chunk$set(echo = TRUE)
res.aov <- aov(winpercent ~ chocolate + fruity + peanutyalmondy, data = candy)
# Summary of the analysis
summary(res.aov)

               Df Sum Sq Mean Sq F value   Pr(>F)    
chocolate       1   7369    7369  61.603 1.49e-11 ***
fruity          1    336     336   2.810   0.0975 .  
peanutyalmondy  1    794     794   6.635   0.0118 *  
Residuals      81   9689     120                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
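
In the table above, chocolate and peanutyalmondy have p-values below 0.05, while fruity does not. Note that aov() reports sequential sums of squares, so with unbalanced data the order of the terms in the formula can affect each row. The model also contains main effects only. A minimal sketch of how an interaction term could be added is shown below; it uses simulated data, not the candy dataset, and the data frame sim, the column score and the effect sizes are all assumptions made for illustration.

```r
set.seed(1)  # make the simulated noise reproducible

# Simulated data, NOT the candy dataset: a balanced 2 x 2 design with
# 20 observations per cell and a built-in negative interaction.
sim <- expand.grid(rep = 1:20, chocolate = c(0, 1), fruity = c(0, 1))
sim$score <- 40 + 20 * sim$chocolate + 5 * sim$fruity -
  15 * sim$chocolate * sim$fruity + rnorm(nrow(sim), sd = 10)

# `chocolate * fruity` expands to chocolate + fruity + chocolate:fruity,
# so the ANOVA table gains a row for the interaction.
fit <- aov(score ~ chocolate * fruity, data = sim)
summary(fit)
```

Applied to the candy data, the analogous call would be aov(winpercent ~ chocolate * fruity, data = candy); an interaction can only be estimated reliably when every combination of the two attributes contains enough candies.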

Reflection

This is my first project using R Markdown and I have learned a lot from it. Picking a dataset that is interesting to me was important. I searched through many datasets on Kaggle and finally chose the Halloween candy data for my final project. After going over the dataset, I became curious about the properties of a popular candy, which led to my research questions: which candies are the most popular, and which attributes make a candy popular for Halloween?

To answer the second question, I initially planned to plot winpercent against each of the categorical variables. For instance, are candies with chocolate flavor more popular than those without? Do people prefer hard candies or soft candies? How about candies in bar shape? However, there are 11 predictor variables (9 binary and 2 continuous), and making all 11 plots might not be ideal. Therefore, I used a correlation plot to filter out variables that are not closely related to popularity. For instance, adding more sugar to a candy does not make it more popular. Once I had a subset of variables that seemed to have some impact on a candy’s popularity, I drew bar plots to check my hypotheses. One of my hypotheses was that chocolate flavored candies are more popular than non-chocolate ones. I did similar work for the numeric variables (sugarpercent and pricepercent), where I drew scatterplots to see if there is a linear correlation with winpercent.

So far, I have examined the correlation between one independent variable and the dependent variable (winpercent) but did not look closely at interactions. I drew some bar plots with a second variable using facet_wrap but found it challenging to identify interactions from graphs when they are not obvious.

Lastly, I performed a statistical test, analysis of variance (ANOVA), which tells me whether there is a statistically significant relationship between a categorical variable and a numerical variable. I wish I had known about it earlier, so that I would not have spent time trying to modify my dataset and visualizations to check for statistical significance.

For my next steps, I would treat these visualizations as an exploratory analysis and build statistical models to understand, in a more rigorous way, which variables make a candy popular.

Conclusion

In this report, I have found that Reese’s Peanut Butter cup is the most popular Halloween candy and that Reese’s holds multiple spots in the top 10. It seems that people love Reese’s candies very much. I also found that people prefer chocolate candy and do not like fruit flavor added to a chocolate candy. I did not find evidence that chocolate bar candies are more popular. Lastly, more expensive candies tend to be slightly more popular than cheaper ones.

One question that is still unanswered is which combinations of variables influence the popularity of a candy. Given that there are 11 predictor variables (9 binary and 2 continuous), there are 55 different pairwise combinations, and plotting 55 graphs and drawing conclusions from each does not seem like good practice. So one of my next steps will be to find a more systematic method to answer this question.
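
The count of 55 is the number of unordered pairs among the 11 predictor columns of the dataset, which can be verified and enumerated in base R. The vector predictors below lists the column names from the Data section; the names n_pairs, pairs and interaction_terms are introduced for this sketch.

```r
# The 11 predictor columns of the candy data (everything except the name
# and the response winpercent).
predictors <- c("chocolate", "fruity", "caramel", "peanutyalmondy", "nougat",
                "crispedricewafer", "hard", "bar", "pluribus",
                "sugarpercent", "pricepercent")

# Number of unordered pairs: choose(11, 2).
n_pairs <- choose(length(predictors), 2)

# Enumerate every pair and write each one as an R interaction term "a:b".
pairs <- combn(predictors, 2)
interaction_terms <- apply(pairs, 2, paste, collapse = ":")

n_pairs
head(interaction_terms)
```

Such a list could feed a model formula instead of 55 separate plots, although with only 85 candies a model containing every pairwise interaction would have very few observations per term.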

Bibliography

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
FiveThirtyEight. (2017, October 31). The ultimate halloween candy power ranking. Kaggle. Retrieved January 13, 2022, from https://www.kaggle.com/fivethirtyeight/the-ultimate-halloween-candy-power-ranking
Wikimedia Foundation. (2022, January 10). Two-way analysis of variance. Wikipedia. Retrieved January 13, 2022, from https://en.wikipedia.org/wiki/Two-way_analysis_of_variance
Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O'Reilly Media.
