library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Final Project Assignment#1: Anirudh Lakkaraju
Important Formatting & Submission Notes:
Use this file as the template to work on: start your own writing from Section “Part.1”
Please make the following changes to the above YAML header:
Change the “title” to “Final Project Assignment#1: First Name Last Name”;
Change the “author” to your name;
Change the “date” to the current date in the “MM-DD-YYYY” format;
Submission:
- Delete the unnecessary sections (“Overview”, “Tasks”, “Special Note”, and “Evaluation”).
- In the posts folder of your local 601_Spring_2023 project, create a folder named “FirstNameLastName_FinalProjectData”, and save your final project dataset(s) in this folder. DO NOT save the dataset(s) to the _data folder which stores the dataset(s) for challenges.
- Render and submit the file to the blog post like a regular challenge.
Overview of the Final Project
The goal is to tell a coherent and focused story with your data, which answers a question (or questions) that a researcher, or current or future employer, might want to have answered. The goal might be to understand a source of covariance, make a recommendation, or understand change over time. We don’t expect you to reach a definitive conclusion in this analysis. Still, you are expected to tell a data-driven story using evidence to support the claims you are making on the basis of the exploratory analysis conducted over the past term.
In this final project, statistical analyses are not required, but any students who wish to include these may do so. However, your primary analysis should center around visualization rather than inferential statistics. Many scientists only compute statistics after a careful process of exploratory data analysis and data visualization. Statistics are a way to gauge your certainty in your results - NOT A WAY TO DISCOVER MEANINGFUL DATA PATTERNS. Do not run a multiple regression with numerous predictors and report which predictors are significant!!
Tasks of Assignment#1
This assignment is the first component of your final project. Together with the later assignments, it make up a short paper/report. In this assignment, you should introduce a dataset(s) and how you plan to present this dataset(s). This assignment should include the following components:
A clear description of the dataset(s) that you are using.
What “story” do you want to present to the audience? In other words, what “question(s)” do you like to answer with this dataset(s)?
The Plan for Further Analysis and Visualization.
We will have a special class meeting on April 12 to review and discuss students’ proposed datasets for the final project. If you want your project being discussed in the class, please submit this assignment before April 12.
Part 1. Introduction
In this part, you should introduce the dataset(s) and your research questions.
Dataset(s) Introduction:
The datset “pokemon.csv” is obtained from the Serebii website and hosted on Kaggle. As a long-time fan of Pokemon and someone who has played all the games, I am interested in exploring the idea of building the “best” team of six Pokemon. With over 800 Pokemon across seven generations, selecting the optimal team is a daunting task.
My project aims to understand how the moves, types, and stats of each of the 801 Pokemon determine their value in a playthrough. Specifically, I will investigate how a Pokemon’s base stats and types affect its strength. Each Pokemon type has its strengths and weaknesses against other types, so I plan to explore the idea of creating a “best” team with the least amount of weaknesses.
Furthermore, I will understand how a Pokemon’s base stats and types impact its strength, and whether its physical attributes, such as size and features, are consistent with its base stats. For instance, in the real world, a tiger’s bulk makes it stronger than a cheetah, which is more agile and lithe. I want to determine if a similar relationship exists in the Pokemon world.
In addition, I will explore the concept of Legendary Pokemon, which are rare and have the highest base stats in the game. I am interested in determining if it is possible to categorize a Pokemon as legendary based on its base stats and if there is a type that is most likely to be legendary.
What questions do you like to answer with this dataset(s)?
Which type is the strongest overall, and which is the weakest?
Can a Pokemon dream team be built? A team of six Pokemon that inflicts the most damage while remaining relatively impervious to any other team of six Pokemon.
How does a Pokemon’s height and weight correlate with its various base stats?
Which type is most likely to be a legendary Pokemon?
Is it feasible to construct a classifier that can identify legendary Pokemon?
Part 2. Describe the data set(s)
<- read_csv("AnirudhLakkaraju_FinalProjectData/pokemon.csv")
data view(data)
dim(data)
[1] 801 41
head(data)
summary(data)
abilities against_bug against_dark against_dragon
Length:801 Min. :0.2500 Min. :0.250 Min. :0.0000
Class :character 1st Qu.:0.5000 1st Qu.:1.000 1st Qu.:1.0000
Mode :character Median :1.0000 Median :1.000 Median :1.0000
Mean :0.9963 Mean :1.057 Mean :0.9688
3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000
Max. :4.0000 Max. :4.000 Max. :2.0000
against_electric against_fairy against_fight against_fire
Min. :0.000 Min. :0.250 Min. :0.000 Min. :0.250
1st Qu.:0.500 1st Qu.:1.000 1st Qu.:0.500 1st Qu.:0.500
Median :1.000 Median :1.000 Median :1.000 Median :1.000
Mean :1.074 Mean :1.069 Mean :1.066 Mean :1.135
3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2.000
Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
against_flying against_ghost against_grass against_ground
Min. :0.250 Min. :0.000 Min. :0.250 Min. :0.000
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.500 1st Qu.:1.000
Median :1.000 Median :1.000 Median :1.000 Median :1.000
Mean :1.193 Mean :0.985 Mean :1.034 Mean :1.098
3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
against_ice against_normal against_poison against_psychic
Min. :0.250 Min. :0.000 Min. :0.0000 Min. :0.000
1st Qu.:0.500 1st Qu.:1.000 1st Qu.:0.5000 1st Qu.:1.000
Median :1.000 Median :1.000 Median :1.0000 Median :1.000
Mean :1.208 Mean :0.887 Mean :0.9753 Mean :1.005
3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.000
Max. :4.000 Max. :1.000 Max. :4.0000 Max. :4.000
against_rock against_steel against_water attack
Min. :0.25 Min. :0.2500 Min. :0.250 Min. : 5.00
1st Qu.:1.00 1st Qu.:0.5000 1st Qu.:0.500 1st Qu.: 55.00
Median :1.00 Median :1.0000 Median :1.000 Median : 75.00
Mean :1.25 Mean :0.9835 Mean :1.058 Mean : 77.86
3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:100.00
Max. :4.00 Max. :4.0000 Max. :4.000 Max. :185.00
base_egg_steps base_happiness base_total capture_rate
Min. : 1280 Min. : 0.00 Min. :180.0 Length:801
1st Qu.: 5120 1st Qu.: 70.00 1st Qu.:320.0 Class :character
Median : 5120 Median : 70.00 Median :435.0 Mode :character
Mean : 7191 Mean : 65.36 Mean :428.4
3rd Qu.: 6400 3rd Qu.: 70.00 3rd Qu.:505.0
Max. :30720 Max. :140.00 Max. :780.0
classfication defense experience_growth height_m
Length:801 Min. : 5.00 Min. : 600000 Min. : 0.100
Class :character 1st Qu.: 50.00 1st Qu.:1000000 1st Qu.: 0.600
Mode :character Median : 70.00 Median :1000000 Median : 1.000
Mean : 73.01 Mean :1054996 Mean : 1.164
3rd Qu.: 90.00 3rd Qu.:1059860 3rd Qu.: 1.500
Max. :230.00 Max. :1640000 Max. :14.500
NA's :20
hp japanese_name name percentage_male
Min. : 1.00 Length:801 Length:801 Min. : 0.00
1st Qu.: 50.00 Class :character Class :character 1st Qu.: 50.00
Median : 65.00 Mode :character Mode :character Median : 50.00
Mean : 68.96 Mean : 55.16
3rd Qu.: 80.00 3rd Qu.: 50.00
Max. :255.00 Max. :100.00
NA's :98
pokedex_number sp_attack sp_defense speed
Min. : 1 Min. : 10.00 Min. : 20.00 Min. : 5.00
1st Qu.:201 1st Qu.: 45.00 1st Qu.: 50.00 1st Qu.: 45.00
Median :401 Median : 65.00 Median : 66.00 Median : 65.00
Mean :401 Mean : 71.31 Mean : 70.91 Mean : 66.33
3rd Qu.:601 3rd Qu.: 91.00 3rd Qu.: 90.00 3rd Qu.: 85.00
Max. :801 Max. :194.00 Max. :230.00 Max. :180.00
type1 type2 weight_kg generation
Length:801 Length:801 Min. : 0.10 Min. :1.00
Class :character Class :character 1st Qu.: 9.00 1st Qu.:2.00
Mode :character Mode :character Median : 27.30 Median :4.00
Mean : 61.38 Mean :3.69
3rd Qu.: 64.80 3rd Qu.:5.00
Max. :999.90 Max. :7.00
NA's :20
is_legendary
Min. :0.00000
1st Qu.:0.00000
Median :0.00000
Mean :0.08739
3rd Qu.:0.00000
Max. :1.00000
This dataset contains information on all 802 Pokemon from all Seven Generations of Pokemon. The information contained in this dataset include Base Stats, Performance against Other Types, Height, Weight, Classification, Egg Steps, Experience Points, Abilities, etc. The information was scraped from http://serebii.net/.
Each row contains the complete information of one Pokemon and the following are the meanings of each column -
name: The English name of the Pokemon
japanese_name: The Original Japanese name of the Pokemon
pokedex_number: The entry number of the Pokemon in the National Pokedex
percentage_male: The percentage of the species that are male. Blank if the Pokemon is genderless.
type1: The Primary Type of the Pokemon
type2: The Secondary Type of the Pokemon
classification: The Classification of the Pokemon as described by the Sun and Moon Pokedex
height_m: Height of the Pokemon in metres
weight_kg: The Weight of the Pokemon in kilograms
capture_rate: Capture Rate of the Pokemon
base_egg_steps: The number of steps required to hatch an egg of the Pokemon
abilities: A stringified list of abilities that the Pokemon is capable of having
experience_growth: The Experience Growth of the Pokemon
base_happiness: Base Happiness of the Pokemon
against_?: Eighteen features that denote the amount of damage taken against an attack of a particular type
hp: The Base HP of the Pokemon
attack: The Base Attack of the Pokemon
defense: The Base Defense of the Pokemon
sp_attack: The Base Special Attack of the Pokemon
sp_defense: The Base Special Defense of the Pokemon
speed: The Base Speed of the Pokemon
generation: The numbered generation which the Pokemon was first introduced
is_legendary: Denotes if the Pokemon is legendary.
3. The Tentative Plan for Visualization
Briefly describe what data analyses (please the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.
I will be plotting different histograms, box plots, scatter plots and linear graphs to understand the correlation between select variables that are relevant to the research questions I have posed. For example - I can plot a histogram of legendary status vs primary and secondary types to understand which type has the most likeliness of being legendary.
I will also be conducting correlation analysis to identify the strength of the relationships between variables such as height, weight, type match ups, etc.
Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?
Histogram: Histograms can be used to visualize the distributions of numerical variables. For this dataset, we can use it to identify which type is most likely to be a legendary Pokemon.
Scatter Plots: Scatter plots are useful in understanding and visualizing the relationship between two numerical values. We can use these plots to figure out how to create the best dream team of six Pokemon that are strongest and have the least amount of type weaknesses. This can be done by plotting a scatter plot using the type matchups data and the base stat data. We can also make teams that include legendary Pokemon and teams that don’t.
Box Plots: These are useful for visualizing distributions of a numerical variable across different attributes. In the case of Pokemon data, we can see which type is the strongest and understand the factors that affect the strength of a type.
If you plan to conduct specific data analyses and visualizations, describe how do you need to process and prepare the tidy data.
Drop some irrelevant columns like Japanese name, Pokedex id, base_happiness, etc as they are not that relevant to out study.
Create a new variable that determines the probability of a pokemon being legendary status as a factor of it’s base stats.
Introduce new columns like how many types the pokemon is weak/strong against.
Regarding missing data or NAs, we need to first identify the variables with missing values and the extent of the missingness. We can use functions like is.na() or summary() to identify the missing data.