HW4

HW4 for Haoyan

Haoyan Xiang
2022-01-08

Halloween Candy Project - HW4

knitr::opts_chunk$set(echo = TRUE)   # show the code in the knitted output
setwd("~/Downloads")                 # folder that holds candy-data.csv
library(ggplot2)
library(dplyr)
candy <- read.csv("candy-data.csv")
head(candy)
  competitorname chocolate fruity caramel peanutyalmondy nougat
1      100 Grand         1      0       1              0      0
2   3 Musketeers         1      0       0              0      1
3       One dime         0      0       0              0      0
4    One quarter         0      0       0              0      0
5      Air Heads         0      1       0              0      0
6     Almond Joy         1      0       0              1      0
  crispedricewafer hard bar pluribus sugarpercent pricepercent
1                1    0   1        0        0.732        0.860
2                0    0   1        0        0.604        0.511
3                0    0   0        0        0.011        0.116
4                0    0   0        0        0.011        0.511
5                0    0   0        0        0.906        0.511
6                0    0   1        0        0.465        0.767
  winpercent
1   66.97173
2   67.60294
3   32.26109
4   46.11650
5   52.34146
6   50.34755

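Summary statistics (mean, median, and standard deviation) of the numerical features. sugarpercent and pricepercent are stored on a 0 to 1 scale, so they are multiplied by 100 to match the 0 to 100 scale of winpercent.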

summarize(candy,
          mean.sugar = 100 * mean(sugarpercent),
          mean.price = 100 * mean(pricepercent),
          mean.win   = mean(winpercent))
  mean.sugar mean.price mean.win
1   47.86471   46.88824 50.31676
summarize(candy,
          median.sugar = 100 * median(sugarpercent),
          median.price = 100 * median(pricepercent),
          median.win   = median(winpercent))
  median.sugar median.price median.win
1         46.5         46.5   47.82975
summarize(candy,
          sd.sugar = 100 * sd(sugarpercent),
          sd.price = 100 * sd(pricepercent),
          sd.win   = sd(winpercent))
  sd.sugar sd.price   sd.win
1 28.27779 28.57396 14.71436
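The three summaries above could also be produced in a single call with `across()`. The following is a minimal sketch, not the code used for this homework: it assumes dplyr 1.0.0 or later, and `candy_sample` is a hypothetical stand-in that holds only the six rows printed by `head(candy)`, so its numbers will differ from the full-data results above.

```r
library(dplyr)

# Hypothetical stand-in: the first six rows shown by head(candy).
# With the real data, use `candy` instead of `candy_sample`.
candy_sample <- data.frame(
  competitorname = c("100 Grand", "3 Musketeers", "One dime",
                     "One quarter", "Air Heads", "Almond Joy"),
  sugarpercent   = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  pricepercent   = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767),
  winpercent     = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755)
)

stats <- candy_sample %>%
  # Put the 0-1 percentile columns on the same 0-100 scale as winpercent.
  mutate(across(c(sugarpercent, pricepercent), ~ .x * 100)) %>%
  # Apply mean, median, and sd to each numeric column in one pass.
  summarize(across(
    c(sugarpercent, pricepercent, winpercent),
    list(mean = mean, median = median, sd = sd),
    .names = "{.fn}.{.col}"
  ))

print(stats)
```

The result is a one-row data frame with columns named like `mean.sugarpercent` and `sd.winpercent`, which keeps all nine statistics in one place.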

The medians below show that chocolate candies are considerably more popular than candies without chocolate: the median winpercent is 60.8 with chocolate versus 41.6 without.

candy %>%
  group_by(chocolate) %>%
  summarize(winpercent = median(winpercent, na.rm = TRUE))
# A tibble: 2 × 2
  chocolate winpercent
      <int>      <dbl>
1         0       41.6
2         1       60.8
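A group comparison like the one above is easier to judge when the group sizes are reported alongside the medians. The sketch below shows one way to do that. It is an illustration only: `candy_sample` is a hypothetical stand-in containing just the six rows printed by `head(candy)`, so its medians are not the full-data values reported above.

```r
library(dplyr)

# Hypothetical stand-in: the first six rows shown by head(candy).
candy_sample <- data.frame(
  competitorname = c("100 Grand", "3 Musketeers", "One dime",
                     "One quarter", "Air Heads", "Almond Joy"),
  chocolate      = c(1, 1, 0, 0, 0, 1),
  winpercent     = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755)
)

by_chocolate <- candy_sample %>%
  group_by(chocolate) %>%
  summarize(
    n          = n(),                 # number of candies in each group
    median_win = median(winpercent),  # same statistic as in the text
    mean_win   = mean(winpercent)     # mean, for comparison with the median
  )

print(by_chocolate)
```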

Histogram

- What variable(s) you are visualizing
The histogram shows pricepercent, the price percentile of each candy relative to the other candies in the data set.
- What question(s) you are attempting to answer with the visualization
How are candy prices distributed overall?
- What conclusions you can make from the visualization
The graph below suggests that prices are spread fairly evenly across the range, with only a few relatively expensive candies, which makes sense for candy.

- What questions are left unanswered with your visualizations
This histogram only describes the distribution of pricepercent. It says nothing about the distribution of the other numerical variables, such as winpercent.
- What about the visualizations may be unclear to a naive viewer
In my opinion, this visualization requires no domain knowledge, so any audience should be able to understand it. I do assume that readers have read the definition of each variable before looking at the plot.
- How could you improve the visualizations for the final project
If I were to use this visualization for the final project, I would change the bin width, because the current histogram has gaps between bars.

ggplot(candy, aes(pricepercent)) +
  geom_histogram()
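The bin-width change mentioned above could look something like the following sketch. The value `binwidth = 0.1` is an assumption chosen for illustration, not a tuned choice, and `candy_sample` is a hypothetical stand-in holding only the six pricepercent values printed by `head(candy)`. With the real data, pass `candy` instead.

```r
library(ggplot2)

# Hypothetical stand-in: pricepercent for the first six rows shown by head(candy).
candy_sample <- data.frame(
  pricepercent = c(0.860, 0.511, 0.116, 0.511, 0.511, 0.767)
)

p_hist <- ggplot(candy_sample, aes(pricepercent)) +
  # binwidth = 0.1 is an assumed value; boundary = 0 aligns the bin
  # edges with 0, 0.1, ..., 1 so that every bin covers one decile.
  geom_histogram(binwidth = 0.1, boundary = 0, colour = "white") +
  labs(
    title = "Distribution of candy price percentile",
    x = "Price percentile (0 to 1)",
    y = "Number of candies"
  )

print(p_hist)
```

Setting the bin width explicitly also silences the message that `geom_histogram()` emits when it falls back to its default of 30 bins, which is the likely cause of empty bins in a small data set.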

Scatterplot

- What variable(s) you are visualizing
The scatterplot shows sugarpercent on the x-axis and winpercent on the y-axis.

- What question(s) you are attempting to answer with the visualization
Is there a relationship between a candy's sugar content and its popularity?
- What conclusions you can make from the visualization
The graph below shows no clear relationship between sugar content and popularity. In other words, a sweeter candy is not necessarily a more popular one.

- What questions are left unanswered with your visualizations
Other variables remain unexplored. For example, hard is a binary variable (1 for a hard candy, 0 otherwise), and we could check whether it is related to winpercent.
- What about the visualizations may be unclear to a naive viewer
In my opinion, this visualization requires no domain knowledge, so any audience should be able to understand it.
- How could you improve the visualizations for the final project
If I were to use this visualization for the final project, I would add a title and a fitted line.

ggplot(candy, aes(sugarpercent, winpercent)) +
  geom_point()
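The title and fitted line proposed above could be added roughly as follows. This is a sketch under assumptions: a straight least-squares line (`method = "lm"`) is one possible choice of fit, the axis labels are my own wording, and `candy_sample` is a hypothetical stand-in with only the six rows printed by `head(candy)`. The Pearson correlation is included as a numeric companion to the visual impression.

```r
library(ggplot2)

# Hypothetical stand-in: the first six rows shown by head(candy).
candy_sample <- data.frame(
  sugarpercent = c(0.732, 0.604, 0.011, 0.011, 0.906, 0.465),
  winpercent   = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755)
)

p_scatter <- ggplot(candy_sample, aes(sugarpercent, winpercent)) +
  geom_point() +
  # Least-squares line; se = FALSE hides the confidence band.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Sugar percentile vs. win percentage",
    x = "Sugar percentile (0 to 1)",
    y = "Win percentage"
  )

print(p_scatter)

# Pearson correlation between the two plotted variables.
r <- cor(candy_sample$sugarpercent, candy_sample$winpercent)
print(r)
```

On the full data set, a correlation close to zero and a nearly flat line would support the conclusion that sugar content and popularity are not clearly related.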

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Xiang (2022, Jan. 8). Data Analytics and Computational Social Science: HW4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjamesxiang11854080/

BibTeX citation

@misc{xiang2022hw4,
  author = {Xiang, Haoyan},
  title = {Data Analytics and Computational Social Science: HW4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjamesxiang11854080/},
  year = {2022}
}