HW5 for Haoyan
Loading the data as usual.
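knitr::opts_chunk$set(echo = TRUE)
# setwd() ties the script to one machine; a project-relative path is more portable
setwd("~/Downloads")
library(ggplot2)
library(dplyr)
# Read the candy ranking data and preview the first six rows
candy <- read.csv("candy-data.csv")
head(candy)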
  competitorname chocolate fruity caramel peanutyalmondy nougat
1      100 Grand         1      0       1              0      0
2   3 Musketeers         1      0       0              0      1
3       One dime         0      0       0              0      0
4    One quarter         0      0       0              0      0
5      Air Heads         0      1       0              0      0
6     Almond Joy         1      0       0              1      0
  crispedricewafer hard bar pluribus sugarpercent pricepercent
1                1    0   1        0        0.732        0.860
2                0    0   1        0        0.604        0.511
3                0    0   0        0        0.011        0.116
4                0    0   0        0        0.011        0.511
5                0    0   0        0        0.906        0.511
6                0    0   1        0        0.465        0.767
  winpercent
1   66.97173
2   67.60294
3   32.26109
4   46.11650
5   52.34146
6   50.34755
First, I plot the distribution of winpercent separately for candies with and without chocolate. The histograms suggest that candies containing chocolate tend to have a higher winpercent.
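library(tidyverse)
# One histogram of winpercent per chocolate group (0 = no chocolate, 1 = chocolate).
# A shared x-axis (scales = "free_y") keeps the two panels directly comparable.
ggplot(candy, aes(winpercent)) +
  geom_histogram(bins = 15) +
  labs(title = "Winpercent based on chocolate") +
  theme_bw() +
  facet_wrap(~chocolate, scales = "free_y")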
To check this hypothesis, I plotted the winpercent of candies by whether they are chocolate flavored. The graph below supports the idea that chocolate-flavored candies are more popular.
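# Re-code the nine 0/1 indicator columns (chocolate through pluribus) as "TRUE"/"FALSE"
new_candy <- candy %>%
  mutate(across(2:10, ~ recode(.x, `1` = "TRUE", `0` = "FALSE")))
# Plot the mean winpercent per group. With stat = "identity" the bars for all
# candies in a group are drawn on top of each other, so only the tallest shows.
ggplot(data = new_candy) +
  geom_bar(mapping = aes(x = chocolate, y = winpercent), stat = "summary", fun = "mean") +
  labs(title = "Mean winpercent based on chocolate")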
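The comparison in the bar chart can also be written down as numbers. The following is a minimal sketch in base R, not part of the original analysis: `candy_head` holds only the six rows printed by `head(candy)` above, and `summarise_by_chocolate` is a hypothetical helper name. On the real data the helper would be called on `candy` instead.

```r
# Toy subset: the six rows shown by head(candy); far too small for real conclusions
candy_head <- data.frame(
  competitorname = c("100 Grand", "3 Musketeers", "One dime",
                     "One quarter", "Air Heads", "Almond Joy"),
  chocolate  = c(1, 1, 0, 0, 0, 1),
  winpercent = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755)
)

# Hypothetical helper: count, mean and median winpercent per chocolate group
summarise_by_chocolate <- function(df) {
  out <- aggregate(
    winpercent ~ chocolate, data = df,
    FUN = function(x) c(n = length(x), mean = mean(x), median = median(x))
  )
  # aggregate() stores the three statistics in one matrix column;
  # flatten it into winpercent.n, winpercent.mean, winpercent.median
  do.call(data.frame, out)
}

choc_summary <- summarise_by_chocolate(candy_head)
print(choc_summary)
```

Reporting the group sizes next to the means makes it easier to judge how much weight the difference between the two bars can carry.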
Lastly, we can add another variable, caramel, which indicates whether a candy contains caramel. Note that this graph shows the number of candies, not winpercent as in the previous graph. From the graph below, it appears that caramel is added more frequently when chocolate is absent. One possible reason is that combining both could make a candy too sweet.
# Stacked bar chart: number of candies by chocolate, filled by caramel
ggplot(new_candy, aes(x = chocolate, fill = caramel)) +
  geom_bar() +
  labs(title = "Candy counts based on chocolate and caramel")
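Because the chocolate and non-chocolate groups differ in size, raw counts in a stacked bar chart can be hard to compare. A cross-tabulation with row-wise shares is one way to check the reading of the graph. This is only a sketch in base R, assuming the six rows from `head(candy)` as a stand-in (`candy_head` is a hypothetical name); on the real data the same two calls would use `candy$chocolate` and `candy$caramel`.

```r
# Toy subset: chocolate and caramel flags of the six rows shown by head(candy)
candy_head <- data.frame(
  chocolate = c(1, 1, 0, 0, 0, 1),
  caramel   = c(1, 0, 0, 0, 0, 0)
)

# Number of candies for each chocolate/caramel combination
counts <- table(chocolate = candy_head$chocolate, caramel = candy_head$caramel)

# Row-wise shares: within each chocolate group, the fraction with and without caramel
shares <- prop.table(counts, margin = 1)

print(counts)
print(shares)
```

Comparing the caramel share in each row, rather than the bar heights, shows whether caramel is really more common among candies without chocolate.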
Answer the following questions:
– What is missing (if anything) in your analysis process so far?
I am missing a correlation plot and a statistical test that could tell me which features make a candy popular. Relying on a statistical test would be more rigorous than drawing conclusions from graphs alone.
– What conclusions can you make about your research questions at this point?
So far, I can say that chocolate appears to be one feature that makes a candy popular, i.e. people tend to prefer chocolate-flavored candies.
– What do you think a naive reader would need to fully understand your graphs?
The graphs should be fairly self-explanatory to a naive reader.
– Is there anything you want to answer with your dataset, but can’t?
It would be good to know the characteristics of the people who voted for their favorite candies. For instance, there could be specific groups of people who enjoy one kind of flavor or texture: perhaps women under 20 prefer hard candies to soft candies, or men between 20 and 25 prefer candies in bar shape.
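Returning to the first answer, the missing statistical test and correlation check could look roughly like the sketch below. It assumes base R only and uses the six rows from `head(candy)` as a stand-in (`candy_head`, `choc_test` and `cors` are hypothetical names); six rows are far too few for inference, so on the real data the same calls would be run on `candy`. A Welch two-sample t-test is one possible choice of test here, not the only one.

```r
# Toy subset: selected columns of the six rows shown by head(candy)
candy_head <- data.frame(
  chocolate    = c(1, 1, 0, 0, 0, 1),
  sugarpercent = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  pricepercent = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767),
  winpercent   = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755)
)

# Welch two-sample t-test: does mean winpercent differ between
# candies without (0) and with (1) chocolate?
choc_test <- t.test(winpercent ~ chocolate, data = candy_head)

# Pearson correlation of each feature with winpercent (3 x 1 matrix)
cors <- cor(
  candy_head[, c("chocolate", "sugarpercent", "pricepercent")],
  candy_head$winpercent
)

print(choc_test$p.value)
print(cors)
```

On the full data set, the p-value and the correlations together would give a first indication of which features are associated with a higher winpercent.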
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Xiang (2022, Jan. 14). Data Analytics and Computational Social Science: HW5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjamesxiang11854838/
BibTeX citation
@misc{xiang2022hw5,
  author = {Xiang, Haoyan},
  title = {Data Analytics and Computational Social Science: HW5},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjamesxiang11854838/},
  year = {2022}
}