library(plyr)library(tidyverse)library(readr)library(summarytools)library(psych)library(lattice)library(FSA)library(kableExtra)library(ggplot2)library(stargazer)library(MPV)# reading in the data setVideo_Game_Sales <-read_csv("_data/final_project/Video_Game_Sales_as_of_Jan_2017.csv")knitr::opts_chunk$set(echo = T)
Introduction
The current state of video games supports massive online and local communities, generating millions in revenue a year. No longer can the perception of video games continue to be that of a counter-culture media, but something more widely accepted, definitively present, and enjoyed among diverse communities.
Stated eloquently by Nielsen, Smith & Tosca in, Understanding Video Games: The Essential Introduction (2012), No cultural form exists in isolation; rather, it is integrated within a complex system of meanings shaped by society and its institutions. Compared to other cultural forms, such as literature, the medium of the video game is a new member of this fascinating ecology. It is certainly true that the history of cultural media shows an almost instinctive skepticism leveled at new media. It has been true of radio, it has been true of movies, and it has certainly been true of television, which has long fought against the perception that its role was to entertain, rather than to enlighten.
Now, over 10 years later, we can see that the skepticism once surrounding video games has waned, so much so that engaging with them has now been accepted as a common hobby and even profession. I believe that video games are here to stay and rather than reject the new norm, we should grow with it. Video games will only continue to evolve in their application, especially when taking into consideration recent developments in virtual reality technology. Understanding and utilizing video games as the unique source of cultural media that they are can provide us insight into the nature of popular media and the very real weight of their market. What makes a game good? Does a game have to be good to be “successful”? Does fighting among platform communities contribute to the success, or lack there of, of a video game? Do critic ratings play a role? How about user ratings? Or maybe, does it come down to the publisher and whether they would have a large enough budget to advertise their games to the masses? These are all questions I’d like to address during the scope of my project. However, the main objective I’ve tasked myself with is to address the following research question:
What impacts do Platform & Genre have on the commercial success of a video game?
Hypothesis
While there exists a diverse range of articles and blog posts related to console wars, game of the year announcements, loot box market structures, etc. there has been a noticeable oversight regarding the correlation between public critique and generated revenue. An often overlooked predicament of video game development occurs post-release. A game may be applauded for all the characteristics the gaming community has grown to love, but if that game’s sales aren’t comparable to the costs it generated, then what reward is there for the developers? Would that game still be considered successful? Would it be lucrative for developers to switch to a more widely accepted genre or platform to guarantee economic success?
These questions have led to the development of the following hypotheses:
H1: Platform and Genre significantly impact Global Sales.
H2: Nintendo and Sony consoles host the most financially successful video games.
H3: The Shooter genre is the most financially successful genre when compared to all others.
From personal knowledge, it’s known that popularity by platform may fluctuate over the course of a series’ lifespan. The Nintendo manufactured Gamecube and Wii were widely popular upon release and are still commonly used today. The same can be said for Sony’s Playstation 1 & 2 or Microsoft’s Xbox 360. Similarly, there are also platforms like the Wii U, Playstation 3 and Xbox One that had tumultuous receptions upon release and led to committed users switching to other platform series. A common example would be the transition from Playstation to Xbox and vice versa. With this in mind, I hypothesize that Nintendo and Sony based platforms have the highest impact on sales for video games based on prior knowledge concerning the success of select platforms like the Playstation 2, Gamecube and Wii.
When it comes to genre, the unrelenting success of the Call of Duty series within the time frame of this data serves as the core component of my belief that the genre Shooters will have the largest impact on sales among the list video game genres. More and more shooters continue to be made in attempts to emulate the success achieved with games like Call of Duty: Modern Warfare 2 and Black Ops. I also want to acknowledge other major successes such as Grand Theft Auto for example, that are major staples within their own given genres too (in the case of GTA, it’s the Role Playing genre). However, it’s known from personal experience that first person shooters revolutionized the gaming industry after the release of Call of Duty 4: Modern Warfare. I hypothesize that while the modern gaming market may be over-saturated with Shooters, they continue to play a large role in the commercial success of video games upon release.
Descriptive Statistics
Description and Summary of the Data
This data set was pulled from the Kaggle online database and its description reads as follows,
This data set contains a list of video games with sales greater than 100,000 copies along with critic and user ratings.
With this updated data set provided by the collector, we are given 15 variables and approximately 17,500 entries. The variables are as follows:
Name [game’s name]
Platform [platform of game release]
Year of Release [game’s release date]
Genre [genre of game]
Publisher [publisher of game]
NA Sales [sales in North America in millions of USD]
EU Sales [sales in Europe in millions of USD]
JPN Sales [sales in Japan in millions of USD]
Other Sales [sales in rest of the world in millions of USD]
Global Sales [total worldwide sales in millions of USD]
Critic Score [aggregate score compiled by Metacritic staff]
Critic Count [the number of critis used in creating the critic score]
User Score [score according to Metacritic subscribers]
User Count [number of users who gave the user score]
Referencing the data set’s description once again, it states that,
It is a combined web scrape from VGChartz and Metacritic along with manually entered year of release values for most games with a missing year of release.
The original code the collector utilized was created by Rush Kirubi, but it’s made apparent that the original set limited the data to only include a subset of video game platforms. Additionally, not all the listed video games have information on Metacritic, so there are a significant amount of missing values under the critic & user scores/counts variables.
This provides valuable context concerning Metacritic, the forum utilized by critics and users to rate their favorite games, and the numerous missing values within the data frame. Metacritic was established in 1999. As a result, all entries pre-dating early 2000 lack critic and user scores, as it had not been as well established at the time. These values will end up being filtered out to accommodate the controls I would like to use in the models I later create.
Code
# summarizing our datasummary(Video_Game_Sales)
Name Platform Year_of_Release Genre
Length:17416 Length:17416 Min. :1976 Length:17416
Class :character Class :character 1st Qu.:2003 Class :character
Mode :character Mode :character Median :2008 Mode :character
Mean :2007
3rd Qu.:2011
Max. :2017
NA's :8
Publisher NA_Sales EU_Sales JP_Sales
Length:17416 Min. : 0.0000 Min. : 0.0000 Min. : 0.00000
Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000
Mode :character Median : 0.0700 Median : 0.0200 Median : 0.00000
Mean : 0.2545 Mean : 0.1407 Mean : 0.07502
3rd Qu.: 0.2300 3rd Qu.: 0.1000 3rd Qu.: 0.03000
Max. :41.3600 Max. :28.9600 Max. :10.22000
Other_Sales Global_Sales Critic_Score Critic_Count
Min. : 0.00000 Min. : 0.0100 Min. :13.00 Min. : 3.00
1st Qu.: 0.00000 1st Qu.: 0.0500 1st Qu.:60.00 1st Qu.: 11.00
Median : 0.01000 Median : 0.1600 Median :71.00 Median : 21.00
Mean : 0.04591 Mean : 0.5165 Mean :68.91 Mean : 26.19
3rd Qu.: 0.03000 3rd Qu.: 0.4500 3rd Qu.:79.00 3rd Qu.: 36.00
Max. :10.57000 Max. :82.5400 Max. :98.00 Max. :113.00
NA's :9080 NA's :9080
User_Score User_Count Rating
Min. :0.000 Min. : 4.0 Length:17416
1st Qu.:6.400 1st Qu.: 10.0 Class :character
Median :7.500 Median : 25.0 Mode :character
Mean :7.117 Mean : 162.7
3rd Qu.:8.200 3rd Qu.: 81.0
Max. :9.700 Max. :10766.0
NA's :9618 NA's :9618
Summarizing the data shows that 9,080 entries lack critic scores and 9,618 entries lack user scores. Even with 9,618 entries omitted, there are still over 7,000 complete entries to analyze and I do not fear that the omission will negatively impact the analysis.
Variables of Interest
Of the 15 variables provided, 7 will be heavily utilized throughout the scope of this project. Those 7 have been classified below.
2 main independent variables:
Platform
Genre
4 confounding variables:
Publisher
Rating
Critic Scores
User Scores
1 dependent variable:
Global Sales
Modifying and Visualizing the Data
Modifications
My goal in this section is to acknowledge the steps I took to mold and form the data set I’ll be using for this project into something conducive to an analysis. To start, I want to take a glimpse at and summarize my data to get a visual representation of the numbers that I’ll be working with.
Code
head(Video_Game_Sales)
# A tibble: 6 × 15
Name Platform Year_of_Release Genre Publisher NA_Sales EU_Sales JP_Sales
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 Wii Sports Wii 2006 Spor… Nintendo 41.4 29.0 3.77
2 Super Mar… NES 1985 Plat… Nintendo 29.1 3.58 6.81
3 Mario Kar… Wii 2008 Raci… Nintendo 15.7 12.8 3.79
4 Wii Sport… Wii 2009 Spor… Nintendo 15.6 11.0 3.28
5 Pokemon R… G 1996 Role… Nintendo 11.3 8.89 10.2
6 Tetris G 1989 Puzz… Nintendo 23.2 2.26 4.22
# … with 7 more variables: Other_Sales <dbl>, Global_Sales <dbl>,
# Critic_Score <dbl>, Critic_Count <dbl>, User_Score <dbl>, User_Count <dbl>,
# Rating <chr>
Code
dfSummary(Video_Game_Sales)
Data Frame Summary
Video_Game_Sales
Dimensions: 17416 x 15
Duplicates: 0
----------------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ----------------- ------------------------------ --------------------- --------------------- ---------- ---------
1 Name 1. Need for Speed: Most Want 12 ( 0.1%) 17416 0
[character] 2. FIFA 14 9 ( 0.1%) (100.0%) (0.0%)
3. LEGO Marvel Super Heroes 9 ( 0.1%)
4. Madden NFL 07 9 ( 0.1%)
5. Madden NFL 08 9 ( 0.1%)
6. Ratatouille 9 ( 0.1%)
7. Angry Birds Star Wars 8 ( 0.0%)
8. Cars 8 ( 0.0%)
9. FIFA 15 8 ( 0.0%)
10. FIFA Soccer 13 8 ( 0.0%)
[ 12070 others ] 17327 (99.5%) IIIIIIIIIIIIIIIIIII
2 Platform 1. DS 2251 (12.9%) II 17416 0
[character] 2. PS2 2206 (12.7%) II (100.0%) (0.0%)
3. PS3 1362 ( 7.8%) I
4. Wii 1359 ( 7.8%) I
5. PSP 1304 ( 7.5%) I
6. X360 1298 ( 7.5%) I
7. PS 1200 ( 6.9%) I
8. PC 1128 ( 6.5%) I
9. GBA 844 ( 4.8%)
10. X 833 ( 4.8%)
[ 21 others ] 3631 (20.8%) IIII
3 Year_of_Release Mean (sd) : 2006.6 (5.9) 42 distinct values : : 17408 8
[numeric] min < med < max: : : (100.0%) (0.0%)
1976 < 2008 < 2017 : : : .
IQR (CV) : 8 (0) . : : : :
. : : : : :
4 Genre 1. Action 3503 (20.1%) IIII 17416 0
[character] 2. Sports 2408 (13.8%) II (100.0%) (0.0%)
3. Misc 1813 (10.4%) II
4. Role-Playing 1545 ( 8.9%) I
5. Adventure 1478 ( 8.5%) I
6. Shooter 1349 ( 7.7%) I
7. Racing 1282 ( 7.4%) I
8. Simulation 925 ( 5.3%) I
9. Platform 900 ( 5.2%) I
10. Fighting 864 ( 5.0%)
[ 2 others ] 1349 ( 7.7%) I
5 Publisher 1. Electronic Arts 1380 ( 7.9%) I 17416 0
[character] 2. Activision 1005 ( 5.8%) I (100.0%) (0.0%)
3. Namco Bandai Games 972 ( 5.6%) I
4. Ubisoft 970 ( 5.6%) I
5. Konami Digital Entertainm 865 ( 5.0%)
6. THQ 728 ( 4.2%)
7. Nintendo 722 ( 4.1%)
8. Sony Computer Entertainme 704 ( 4.0%)
9. Sega 660 ( 3.8%)
10. Take-Two Interactive 433 ( 2.5%)
[ 617 others ] 8977 (51.5%) IIIIIIIIII
6 NA_Sales Mean (sd) : 0.3 (0.8) 399 distinct values : 17416 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0.1 < 41.4 :
IQR (CV) : 0.2 (3.1) :
:
7 EU_Sales Mean (sd) : 0.1 (0.5) 306 distinct values : 17416 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 29 :
IQR (CV) : 0.1 (3.5) :
:
8 JP_Sales Mean (sd) : 0.1 (0.3) 245 distinct values : 17416 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 10.2 :
IQR (CV) : 0 (4) :
:
9 Other_Sales Mean (sd) : 0 (0.2) 157 distinct values : 17416 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 10.6 :
IQR (CV) : 0 (4) :
:
10 Global_Sales Mean (sd) : 0.5 (1.5) 627 distinct values : 17416 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0.2 < 82.5 :
IQR (CV) : 0.4 (3) :
:
11 Critic_Score Mean (sd) : 68.9 (14) 82 distinct values : 8336 9080
[numeric] min < med < max: . : : (47.9%) (52.1%)
13 < 71 < 98 : : : :
IQR (CV) : 19 (0.2) . : : : :
. : : : : : : :
12 Critic_Count Mean (sd) : 26.2 (19) 106 distinct values : 8336 9080
[numeric] min < med < max: : : (47.9%) (52.1%)
3 < 21 < 113 : : .
IQR (CV) : 25 (0.7) : : : .
: : : : : . .
13 User_Score Mean (sd) : 7.1 (1.5) 95 distinct values : 7798 9618
[numeric] min < med < max: : : (44.8%) (55.2%)
0 < 7.5 < 9.7 : :
IQR (CV) : 1.8 (0.2) . : : : .
. . : : : : :
14 User_Count Mean (sd) : 162.7 (562.8) 903 distinct values : 7798 9618
[numeric] min < med < max: : (44.8%) (55.2%)
4 < 25 < 10766 :
IQR (CV) : 71 (3.5) :
:
15 Rating 1. AO 1 ( 0.0%) 10252 7164
[character] 2. E 4120 (40.2%) IIIIIIII (58.9%) (41.1%)
3. E10+ 1473 (14.4%) II
4. EC 8 ( 0.1%)
5. K-A 3 ( 0.0%)
6. M 1599 (15.6%) III
7. RP 3 ( 0.0%)
8. T 3045 (29.7%) IIIII
----------------------------------------------------------------------------------------------------------------------
Looking at the data, I immediately know that there are a couple of adjustments I want to make. The first being a couple of changes to the Platform variable. First, I’d like to extract all unique values to get a complete list of platforms included in the data set.
Next, I’ll duplicate the Platform column and re-code the values under the new variable so that they pertain to their respective manufacturer. This will clean up the data a bit and make analysis easier in the future.
Code
# Creating new variable (Manufacturer)VGS$Manufacturer <- VGS$Platform# Re-coding values under Manufacturer to accommodate analysisVGS <- VGS %>%mutate(Manufacturer=recode(Manufacturer, 'PS4'='Sony','PS3'='Sony','PS2'='Sony','PS'='Sony','PSV'='Sony','PSP'='Sony','NES'='Nintendo','SNES'='Nintendo','N64'='Nintendo','GC'='Nintendo','DS'='Nintendo','Wii'='Nintendo','WiiU'='Nintendo','GBA'='Nintendo','3DS'='Nintendo','G'='Nintendo','GEN'='Sega','SCD'='Sega','GG'='Sega','SAT'='Sega','DC'='Sega','X'='Microsoft','X360'='Microsoft','XOne'='Microsoft','TG16'='NEC','PCFX'='NEC'))
Now, I’d like draw out a list of unique values for the rest of the variables at my disposal to see what other adjustments need to be made to the data set.
The next list I’d like to generate is that of the unique values under the Genre variable, taking into account its importance in my hypothesis.
It’s shown that there are a total of 12 different genres within the applicable variable. Like Platform & Manufacturer, Genre would also be considered a categorical variable. Unlike the aforementioned variables, I do not believe Genre requires any further adjustment.
The next list I’d like to define is the one for the Publisher variable.
[[1]]
# A tibble: 627 × 1
Publisher
<chr>
1 Nintendo
2 Microsoft Game Studios
3 Take-Two Interactive
4 Sony Computer Entertainment
5 Activision
6 Ubisoft
7 Electronic Arts
8 Bethesda Softworks
9 Sega
10 SquareSoft
# … with 617 more rows
It’s apparent that there are quite a few unique values under the Publisher variable, 627 to be exact. Simply put, there are too many unique values to efficiently conduct an analysis and control for the Publisher variable later on. In other words, the Publisher variable will over-complicate future models.
In order to accommodate this, I’ve decided to again create a new variable (Publisher_Code) and re-code the unique values to fit into 3 separate categories. Those 3 categories are defined by the size and scale of the respective publishing studio. The first level is equal to 1, defining single A (or independent) studios. The second is equal 2 for AA (or mid-size) studios and the last refers to AAA (or large-scale) studios, coded as 3.
[[1]]
# A tibble: 8 × 1
Rating
<chr>
1 E
2 M
3 T
4 E10+
5 K-A
6 AO
7 EC
8 RP
Shown here, there are a total of 8 different ratings for games within the data set that I’ll be utilizing. These values include ratings currently utilized among the ESRB (Entertainment Software Rating Board) and those that existed prior to its formation. Those ratings are defined as:
RP = Rating Pending
EC = Early Childhood
E = Everyone
K-A = Kids through Adults (Replaced by the E rating after the formation of the ESRB)
E10+ = Everyone age 10 and up
T = Teens
M = Mature
AO = Adults Only
While simple and categorical in nature, I’d like to re-code this variable to be ordinal so that it can reflect the progressive inclusion of mature content through the ratings. The new variable, Rating_Code, will output numeric values associated with each rating. 1 for RP, 2 for EC, 3 for E & K-A, 4 for E10+, 5 for T, 6 for M and 7 for AO. I decided to join the E & K-A rating, because they effectively define the same thing. This also results in a decrease from 8 to 7 unique values.
With all the categorical variables now re-coded into something appropriate for an analysis, all that remains are the Critic_Score and User_Score variables. Both are numerical, continuous variables that I believe require no further transformations.
Since there are no further adjustments I’d like to make, I want to generate the descriptive statistics once again, this time utilizing the newly created variables.
Next, I’d like to draw up my explanatory and control variables to visualize the distribution of each and see if there are any last-minute adjustments that I’d like to make. The first variable that I will be visualizing is the Manufacturer variable.
Code
# Simple Bar Plot for Manufacturer FrequencyM_counts <-table(VGS$Manufacturer)barplot(M_counts, main ="Manufacturer Distribution",xlab ="Manufacturer",ylab ="Frequency",ylim =c(0, 7000))
It looks like the existing data for the 2600, 3DO, NEC, NG, Sega and WS manufacturers is so small in comparison to the others that it’s practically negligible. I don’t want this to adversely affect my analysis so I’ll remove those rows from the data frame.
With the aforementioned values excluded, I see that a majority of games within my data set are released among Nintendo and Sony manufactured platforms. While this is supportive of my hypothesis (Nintendo and Sony hosting the most financially successful games), I’ll continue to visualize the remaining variables and refrain from making conclusions until my analysis is complete.
Now it’s time to draw up the Genre variable.
Code
G_counts <-table(VGS2$Genre)barplot(G_counts, main ="Genre Distribution",xlab ="Genre",ylab ="Frequency",ylim =c(0, 3500))
In this case, it seems all genres have adequate data and no further changes need to be made. However, I’d like to note that the Shooter genre is not one of the most frequently occurring within the data set. While this is not immediately indicative of financial success within the genre, like Manufacturer, I’ll refrain from making any direct conclusions until the analysis is over.
Next, I’d like to visualize the newly created Publisher_Code variable.
According to this distribution, it seems that games released by AAA publishers are the most frequently occurring within the data frame. With that in mind, there is now the potential that AAA publishers are more likely to release financially successful games when compared to other smaller publishers. However, again, definitive conclusions will be withheld until after the analysis is conducted.
I’d also like to visualize both the continuous variables that I’ll also be controlling for in future models, Critic_Score and User_Score.
I will also be omitting all NA values to remove any entries corresponding to games that were released prior to the incorporation of critic and user scores. This is so that I can accurately control for these variables in future models. After the omission, there are still 7,098 entries remaining within the data frame.
While the scale of each variable differs (Critic_Score rated on a scale of 0-100 & User_Score rated on a scale of 0-10), the distribution visualized above is very similar to that of Critic_Score. The User_Score visualization shows a right-skewed, normally distributed plot centered, this time, at approximately 8-8.5 (compared to the center being at approximately 70-75 for the Critic_Score variable).
It seems that with the omission of NA values in previous chunks, RP rated games have been completely eliminated from the data set. Furthermore, similar to Manufacturer, there are some irrelevant pieces of data that I think my analysis could do without. Under these circumstances, I’ll be eliminating any rows containing ratings of AO and EC.
I will also be re-coding the remaining values to retain the ordinal nature of the variable that I previously intended to utilize. The new order will be as follows:
With this last adjustment made, the final entry count for the data set I’ll be utilizing throughout the remainder of my project is 7,095 games.
It’s easy to see here that the most frequently occurring ratings within the data are E & K-A , as well as T.
With all the modifications to my data frame complete and all preliminary visualizations generated, I will be moving on to hypothesis testing.
Multi-Variable Visualizations
Within this sub-section are a series of visualizations pairing multiple independent variables together. While not within the scope of my project, I do believe they can help expand the description of my data and provide insight into what my results might look like later on in my analysis. Even so, I do not find it necessary to provide an interpretation for each graph. More so, I will be utilizing this section as a sort of appendix of extra visualizations, each providing an additional dimension to my data.
Code
table1 <-data.frame(with(VGS3, table(Manufacturer,Publisher_Code)))ggplot(table1, aes(x=Manufacturer,y=Freq, fill=Publisher_Code))+geom_bar(stat="identity",position="dodge")+scale_fill_discrete(name ="Publisher Code",labels =c("A", "AA", "AAA")) +ggtitle("Distribution of Manufacturer per Publisher Code")
Code
table2 <-data.frame(with(VGS3, table(Manufacturer,Rating_Code)))ggplot(table2, aes(x=Manufacturer,y=Freq, fill=Rating_Code))+geom_bar(stat="identity",position="dodge")+scale_fill_discrete(name ="Rating",labels =c("E & K-A", "E10+", "T", "M")) +ggtitle("Distribution of Manufacturer per Rating")
Code
table3 <-data.frame(with(VGS3, table(Genre,Manufacturer)))GM <-ggplot(table3, aes(x=Genre,y=Freq, fill=Manufacturer))+geom_bar(stat="identity",position="dodge")GM +ggtitle("Distribution of Genre per Manufacturer") +theme(axis.text.x =element_text(angle =45, vjust =1, hjust=1))
To start the hypothesis testing section I’d like to redefine each variables as being either an Explanatory, Response, or Control Variable.
Explanatory Variables
Genre
Manufacturer
Response Variable
Global Sales
Control Variables
Publisher Code
Critic Score
User Score
Rating Code
Models
My final results should include 7 total models. Each one will incorporate a variant number of control variables. My intention with 7th model, however, is to include some sort of interaction between my independent variables. The goal, ultimately, will be to identify the best fitting model that can accurately determine the success of a video game when controlled for a certain combination of variables.
Independence Testing
The purpose of this sub-section is to test different combinations of my variables against the response to determine whether or not they are independent of each other. Throughout this section, I will also be testing my explanatory variables against the controls to see if I should include interaction terms later on in the modelling section.
Before I begin testing, it’s important to note that 3 of the variables that I will be testing are categorical (Genre, Manufacturer & Publisher_Code) and 1 is ordinal (Rating_Code). The remaining 3 are continuous (Global_Sales, Critic_Score & User_Score).
The first test that I’ll be utilizing is the One-Way ANOVA, a test capable of determining independence between a single categorical and dependent continuous variable. In this sub-section I will be testing the aforementioned categorical and ordinal variables against Global_Sales.
The 2nd and last test I will be using is the Welch Two Sample t-test, a test capable of interpreting independence between 2 continuous variables (assuming unequal population variance from previous visualizations). This sub-section will see me testing my continuous control variables against Global_Sales.
ANOVA
First up, for the ANOVA tests, is the Manufacturer variable against my response, Global_Sales.
Code
M_aov <-aov(Global_Sales ~ Manufacturer, data = VGS3)summary(M_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Manufacturer 3 217 72.22 19.36 1.69e-12 ***
Residuals 7091 26448 3.73
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here I can see that my resulting Pr(>F) value is extremely small, suggesting I reject the null at a significance level of 0.001. This means the Manufacturer means are significantly different, establishing that these 2 variables are indeed independent of one another. With this new information, I’d like to visualize the data again, this time using a boxplot to outline the range, mean and any outliers present for each category.
I’ve generated 2 visualizations below. The first shows the entire scope of the data. Unfortunately, due to outliers, it’s impossible to make any decisive observations from this plot. However, I still wanted to include it to provide the full scope of the data and retain a “complete” visualization. I did, however, create a second graph, to mitigate the issue and visualize the ranges and means more clearly.
Note: I’ve done the same for future, applicable visualizations as well.
Code
ggplot(VGS3, mapping =aes(x=Manufacturer, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Manufacturer", y ="Global Sales (millions)")
Code
limit <-c(0, 1)ggplot(VGS3, mapping =aes(x=Manufacturer, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Manufacturer", y ="Global Sales (millions)") +scale_y_continuous(breaks =seq(from =0, to =1, by = .25),limits = limit)
The first graph shows the full extent of the data, revealing that Nintendo platforms host more games that sell well than any other platforms. Additionally, I’d like to acknowledge the one extreme outlier, for Nintendo, resting at about 85 million dollars in global sales. From a glance, I can also see that Microsoft and Sony are nearly comparable in their performance, with PC being the lowest performing.
The second graph is much easier to read and more telling than the first. Here I see that Sony actually has the largest range when compared to the other platforms and a higher mean of sales as well. From this visualization, I can also see that Microsoft and Nintendo are much more comparable than I had originally deemed them to be. Their means seem to be near equal, with the range of values for Nintendo games sitting just below that of Microsoft. Still, the range and mean for PC games sits far below any of the other platforms, strongly suggesting that PC games do not sell as well as those belonging to the aforementioned manufacturers.
Next I’ll be testing whether the population means among the category Genre are significantly different as well.
Code
G_aov <-aov(Global_Sales ~ Genre, data = VGS3)summary(G_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Genre 11 234 21.288 5.705 3.15e-09 ***
Residuals 7083 26431 3.732
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, I received an extremely small F-value (although not as small as the test including Manufacturer) telling me that the Genre means are significantly different as well.
I’d like to visualize these variables together like I did for the Manufacturer variable.
Code
G_plot1 <-ggplot(VGS3, mapping =aes(x=Genre, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Genre", y ="Global Sales (millions)")G_plot1 +theme(axis.text.x =element_text(angle =45, vjust =1, hjust=1))
Code
G_plot2 <-ggplot(VGS3, mapping =aes(x=Genre, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Genre", y ="Global Sales (millions)") +scale_y_continuous(breaks =seq(from =0, to =1, by = .25),limits = limit)G_plot2 +theme(axis.text.x =element_text(angle =45, vjust =1, hjust=1))
From the first graph I can see that the Action, Racing and Shooter genres seem to sell the best when not taking into consideration the outliers. This would be in direct support of my hypothesis (The Shooter genre is the most financially successful genre when compared to all others), but I would need to conduct further analysis before making any definitive claims. Otherwise, the worst selling genres seem to be Adventure, Puzzle and Strategy games. It is important to note, however, that the outliers for the Sports genre seem to sell extremely well and may affect the impact it has on models later on.
It’s immediately apparent in the second graph that the effect of outliers on the range and mean for the Sports genre has indeed been exacerbated. Both appear to be significantly larger/higher than that of any of the other genres and I will choose to ignore it in this context until I conduct further analysis and fit my models. With that aside, it seems that a majority of the genres are much more comparable than I thought. 8 of the 12 genres have very similar means and ranges, only varying by approximately a tenth of a million dollars in sales, making it difficult to suggest any preliminary conclusions. I’m interested to see what information my models bring forth later on.
From here, I will be moving on to testing my control variables against the response with the first being Publisher_Code.
Code
P_aov <-aov(Global_Sales ~ Publisher_Code, data = VGS3)summary(P_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Publisher_Code 1 705 704.8 192.6 <2e-16 ***
Residuals 7093 25960 3.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this test I’ve received the lowest resulting >F value thus far, again suggesting that the means are significantly different at p < 0.001.
Due to the nature of the Publisher_Code variable, I will be utilizing a column plot for my visualization instead of a boxplot.
Code
ggplot(VGS3, mapping =aes(x=Publisher_Code, y=Global_Sales))+geom_col() +labs(title ="Distribution of Global Sales per Publisher Code", y ="Global Sales (millions)", x ="Publisher Code")
As a reminder, the 1 value is equal to A tier publishers, 2 is equal to AA publishers and 3 is equal to AAA publishers.
It’s obvious from this plot that AAA publishers generate the most in global sales, with AA publishers coming in 2nd and A publishers generating the least. With that said, the difference in sales when comparing the 3 is staggering. Combining the sales of both the A and AA publishers wouldn’t equate to even half of the global sales generated by the AAA publishers. While not immediately significant to my hypothesis, this begs the question; do AAA publishers have an unfair advantage when compared to the others? While this isn’t something that I’ll be investigating through the scope of my project, the results here are interesting and it could be something worth eventually looking into.
The last ANOVA test involving Global_Sales will decide whether the Rating_Code means are significantly different as well.
Code
R_aov <-aov(Global_Sales ~ Rating_Code, data = VGS3)summary(R_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Rating_Code 3 254 84.67 22.73 1.21e-14 ***
Residuals 7091 26411 3.72
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yet again, it looks like I’ve received another small F-value telling me that the Rating_Code means are significantly different and that both variables are again independent of each other.
Now, I will proceed with the visualization, returning to my boxplot method.
Code
VGS3$Rating <-factor(VGS3$Rating, levels =c("E & K-A", "E10+", "T", "M"))ggplot(VGS3, mapping =aes(x=Rating, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Rating", y ="Global Sales (millions)")
Code
ggplot(VGS3, mapping =aes(x=Rating, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Rating", y ="Global Sales (millions)") +scale_y_continuous(breaks =seq(from =0, to =1, by = .25),limits = limit)
It’s visible in the first plot that E/K-A games seem to have the highest number of outliers exceeding 20 million dollars in global sales, with their range even being slightly visible even from this scale. However, unlike the Genre visualization, I do believe that the consistency of these outliers is indicative of success among the rating and will end up proving to be statistically significant during model analysis. I’d also like to bring attention to the fact that the range of M rated games is also slightly visible. As such, I believe that rated M games will also prove to have a significant impact on Global_Sales.
From the second graph, I can see that E/K-A games have the widest range of values out of any of the other genres. However, the range and mean for E/K-A games do still seem relatively comparable to the rest, in contrast to my initial predictions. I do not believe there are any further conclusions I can make from this plot, but I still suspect that the impact of E/K-A and Mature rated games on Global_Sales will be the most significant later on.
Welch Two Sample t-test
Continuing with my independence testing, I will be comparing continuous variable Critic_Score against Global_Sales, this time using a Welch Two Sample t-test.
Code
t.test(VGS3$Critic_Score, VGS3$Global_Sales)
Welch Two Sample t-test
data: VGS3$Critic_Score and VGS3$Global_Sales
t = 417.1, df = 7370.5, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
69.10270 69.75529
sample estimates:
mean of x mean of y
70.1952079 0.7662128
From this test, I received an extremely low p-value. Like the ANOVA, this tells me that the the means between both variables are significantly different and both variables are independent of each other. I’d also like to point out the “mean of x” value (70.20) provided here. This is seems to track with my previous prediction and I’d like to compare it to the value received when I test the User_Score variable. Now for the visualization.
For this graph, I’ll be utilizing a column plot which seems to be more effective for continuous variables.
Code
ggplot(VGS3, mapping =aes(x = Critic_Score, y = Global_Sales))+geom_col() +labs(title ="Distribution of Global Sales per Critic Score", y ="Global Sales (millions)", x ="Critic Score")
From this plot, I can see that the trend continues to appear normally distributed. Additionally, it does seem as though the relationship between Global_Sales & Critic_Score is positive up to a certain point (approximately 85-90). Afterwards, the higher the score gets, the more Global_Sales seemingly drops. I suspect the same thing to happen when testing User_Score.
I will now be testing the aforementioned User_Score variable against Global_Sales in an identical manner.
Code
t.test(VGS3$User_Score, VGS3$Global_Sales)
Welch Two Sample t-test
data: VGS3$User_Score and VGS3$Global_Sales
t = 223.28, df = 13111, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.351559 6.464063
sample estimates:
mean of x mean of y
7.1740240 0.7662128
For this test, the p-value I receive for User_Score is identical to that of Critic_Score, once again telling me that the means between the tested variables are significantly different. Noting the mean of x value again, it’s actually much lower than I expected, however, it’s nearly identical to that of the Critic_Score test which presents the significant potential of a trend concerning the impact of these controls on the response variable.
I will now be visualizing these variables together the same way I did for Critic_Score.
Code
ggplot(VGS3, mapping =aes(x = User_Score, y = Global_Sales))+geom_col() +labs(title ="Distribution of Global Sales per User Score", y ="Global Sales (millions)", x ="User Score")
The information provided by this graph is nearly identical to that which I received from the Critic_Score visualization (when scaled to match each other). The one major difference I notice is that the drop off point, from which the relationship between both variables turns negative, occurs at a much lower score than the previous visualization. Additionally, the rate at which the relationship increases and decreases before and after the peak seems to be much more sheer, hinting at an exponential relationship between the 2 variables tested here.
Model Comparisons
In this section I will begin fitting a series of models in order to determine which is best at predicting the effect a video game has on global sales. I’ve fit a total of 7 models. The 1st included only 1 of the explanatory variables (Manufacturer), with Global_Sales set as the response (or “y”). After the 1st, each additional model progressively adds a single variable to the equation. Therefore, the 2nd includes 2 variables (instead of 1) and the response, the 3rd contains 3 variables and the response, the 4th, 4 variables and the response… and so on.
The 7th model, however, is slightly different. In this model, I decided to include an interaction between an assortment of my variables. Based on the data and predictions I’ve made thus far, I decided to host an interaction between 5 of the 6 independent variables, those being Manufacturer, Genre, Publisher_Code, Critic_Score and Rating_Code. The main reason I decided to exclude User_Score from this interaction is ultimately because I do not believe the impact user ratings have on global sales is as significant as any of the other variables. From personal knowledge, I know that critics often provide their ratings for video games prior to their full release on the market. As you can imagine, this is often utilized as a strong marketing tactic, used to boost anticipation for upcoming releases. However, user ratings only come into play after the games full release and I would assume, at that point, enough people would have already purchased said video game to significantly impact global sales (regardless of public reception after release). Therefore, I excluded User_Score from the interaction and kept it as a control.
Now, with the models fitted, I will summarize them utilizing the stargazer function.
With my models visualized, I’m immediately drawn to the adjusted R squared values at the bottom. It seems that the values progressively improve as more control variables are introduced to the models. The strongest adjusted R squared value I received comes from the last interaction model. With an adjusted R squared of 0.112, this model seems to be the best fit thus far. However, before I make that claim, I would also like to evaluate my models using the AIC and BIC evaluation methods. This will help to firmly establish which model is better fitting of my analysis.
In the case of AIC/BIC evaluations, the lower the value received, the better the model fits. With that in mind, I can proceed with the evaluations.
Model 1
Code
AIC(fit1)
[1] 29480.33
Code
BIC(fit1)
[1] 29514.66
Model 2
Code
AIC(fit2)
[1] 29451.97
Code
BIC(fit2)
[1] 29561.85
Model 3
Code
AIC(fit3)
[1] 29283.46
Code
BIC(fit3)
[1] 29400.21
Model 4
Code
AIC(fit4)
[1] 28895.63
Code
BIC(fit4)
[1] 29019.24
Model 5
Code
AIC(fit5)
[1] 28857.14
Code
BIC(fit5)
[1] 28987.62
Model 6
Code
AIC(fit6)
[1] 28820.77
Code
BIC(fit6)
[1] 28971.85
Model 7
Code
AIC(fit7)
[1] 29269.12
Code
BIC(fit7)
[1] 33444.34
While the adjusted R squared value is smaller, according to my AIC and BIC evaluations, model 6 is actually the best fitting for the analysis. This could be due to a number of different reasons, but I suspect that the introduction of interaction terms in the last model over complicated the equation which made it less accurate.
With model 6 established as the best fitting, I’d like to summarize it on its own in order to clearly visualize my coefficients once more.
From model 6, I receive an extremely small p-value and Adjusted R-Squared. This means that the model holds statistical significance but that the reaction in the response variable cannot be fully explained by the independent variables that I’ve provided. Now, for the resulting coefficients.
For the Manufacturer variable, Nintendo is being used as the reference level and the values provided by my regression analysis show statistical significance at the 0.001 level. Aside from Nintendo, the analysis for Microsoft and PC platforms also prove to be statistically significant and to the same degree as well. The only manufacturer analysis proven not to be statistically significant is that for Sony. Therefore, according to my analysis, for every unit increase (1 million dollars) in global sales for Nintendo games, Microsoft and PC games are projected to make approximately 29% and 93% less respectively.
Moving on to the Genre variable, the reference level utilized was the Action genre. The only analyses proven to be statistically significant for this variable are that for the Adventure, Misc and Sports genres. Adventure and Misc are statistically significant at the 0.05 significance level, while Sports is statistically significant at 0.01. This provides me with enough information to construct the following statement. For every unit increase in global sales for Action games, Adventure and Sports games are projected to make approximately 26% less while Misc games are projected to make about 27% more.
Next up is the Publisher_Code variable, which by default establishes the 1st level (A publishers) as the reference level. Under this regression model, the variable did indeed prove to be statistically significant at the 0.001 level. Therefore, starting with single A studios, each subsequent level of publishers is predicted to make about 23% more in global sales than the last.
Now, it’s time to look at the Critic_Score variable. Again, the variable is proven to be significant at the 0.001 level and I’m provided with a coefficient of ~0.04. Because this a continuous variable, the resulting interpretation is slightly different. In this case, unit increase applies to the Critic_Score scale rather than an incremental increase in Global_Sales. This means, that for every 1 unit increase to a game’s critic score, global sales for that game increase by 0.04 million dollars.
The analysis and interpretation for the User_Ratings variable is near identical to that of Critic_Score. Statistically significant at the 0.001 level, this time I am provided with a coefficient of about -0.12. Thus, for every 1 unit increase to a game’s user score, global sales for that game decrease by 0.12 million dollars. This is interesting because I would have expected a net positive relationship between User_Score and Global_Sales. However, I’m not entirely surprised. I had previously observed that the relationship between user ratings and sales seemed a tad bit sheer or drastic. Having received the results I have now, I’m beginning to suspect that the scale of user ratings may be having a large impact on the analysis’ results. As compared to the 0-100 scale for Critic_Score, User_Score is based on a scale of 1-10. Therefore, the effect of a single unit increase would be much more drastic in comparison.
The last variable to interpret is Rating_Code. Back to using a categorical variable, the reference level here is set to the rating coded as 1 (Rated E & K-A games). Under these parameters, the only other ratings proven to be statistically significant are E10+ & T rated games, both at the 0.001 level. The coefficients provided for each are approximately -0.29 & -0.24, respectively. Therefore, for every 1 million dollars rated E/K-A games generate, E10+ and T rated games are projected to make approximately 29% and 24% less respectively.
Although not entirely what I expected to receive, I believe that I have interpreted the results of my regression analysis to the best of my ability. Having acknowledged all statistically significant variables, I can now confidently define the formula for model 6 as the following:
The last step in my analysis will be to create a series of diagnostic plots to confirm whether or not model 6 violates any of the plot-respective regression assumptions. I will be specifically drawing 4 diagnostic plots, including: the Residuals vs Fitted, Normal Q-Q, Scale-Location, and Cook’s Distance plots.
Code
par(mfrow =c(2, 2)); plot(fit6, which =1:4)
The first plot is the Residuals vs. Fitted. I see on the plot that the both the linearity and constant variance assumption are violated. This is portrayed by the points not being linearly or evenly distributed around the origin. Additionally, there are still some very notable outliers. Were these outliers to be removed, the model might actually hold up to the assumptions. However, considering the quantity of outliers at the tail that veer from the line, that would mean removing a significant amount of entries from our data.
The next plot I’ll be looking at is a Normal Q-Q which is easy to tell at a glance that a violation has been made. I can see that the plot starts off relatively linear but there is a skew as it reaches the end, meaning the Normality assumption has been violated. Again, this is most likely attributed to the outliers.
Next is the Scale-Location plot. From the graph, I can see that there is a lot of variation in distance from each point to the line. This would indicate a direct violation of the Constant Variance assumption.
The last drawing to interpret is the Cook’s Distance plot. It’s easy to see the violation here immediately. In the graph, it’s clear that there are values greater than 4/n, which is a violation of the Influential Observation assumption.
From these plots, I can’t help but wonder what my results would be like were the outliers to be excluded from the data. With that said, I believe the effect of doing so would ultimately be detrimental to my analysis, as it wouldn’t encapsulate the full scope of my project. Hopefully, with more games being released every year, this data will grow and continue to contribute to the research I’ve conducted within the scope of this project.
Conclusions
Unfortunately, it’s obvious at this point that I simply do not have enough data to create a model that can accurately and fully predict the effects of Genre & Platform on video game sales. I was, however, able to make progress toward confirming my hypotheses, even if I wasn’t able to prove any given hypothesis in its entirety.
For my 1st hypothesis, Platform and Genre significantly impact Global Sales, I was able to at least prove that certain manufacturers and genres were significantly more/less likely to impact the global sales of a given video game.
My 2nd hypothesis, Nintendo and Sony consoles, when compared to other platforms, host the most financially successful video games, had similar results. Unfortunately, I could not provide my model with enough data so that my results for Sony would be considered statistically significant. Still, I was able to prove, at the 0.001 level, that Nintendo does indeed host the most financially successful games when compared to the remaining 2 platform manufacturers.
Even with both these partial successes, my one regret is that my 3rd hypothesis, The Shooter genre is the most financially successful genre when compared to all others, had to remain unacknowledged. Similar to what happened with the Sony manufacturer, I couldn’t provide my model with enough data to give me statistically significant results for the Shooter genre.
Still, even after these pros and cons, the analysis I was able to conduct proved to be extremely interesting and engaging. This topic provides clear potential for future research. Were I to continue, I would further investigate the effects publishers and critic ratings have on global sales as well. Furthermore, I would apply the same methods I did here to regional sales as well, as the data set I used also provided variables such as NA_Sales, EU_Sales, JP_Sales and Other_Sales.
I’d like to extend a special thank you to Professor Pang for her help throughout the scope of my project. This concludes my study of the The Effects of Genre & Platform on Video Game Sales.
References
Egenfeldt-Nielsen, Simon, et al. Understanding Video Games : The Essential Introduction, Taylor & Francis Group, 2012. ProQuest Ebook Central, https://ebookcentral.proquest.com/lib/uma/detail.action?docID=1181119.
Etchells, Pete. Lost in a Good Game: Why We Play Video Games and What They Can Do for Us. Icon Books, 2019.
McCullough, Hayley. (2019). From Zelda to Stanley: Comparing the Integrative Complexity of Six Video Game Genres. Press Start. 5. 137-149.
Gillies, Kendall. “Video Game Sales and Ratings.” Kaggle, 25 Jan. 2017, https://www.kaggle.com/datasets/kendallgillies/video-game-sales-and-ratings?resource=download.
Source Code
---title: "The Effects of Genre & Platform on Video Game Sales"author: "Alexis Gamez"description: "DACSS 603 Final Project Submission"date: "05/25/2023"format: html: toc: true code-fold: true code-copy: true code-tools: truecategories: - finalpart3---```{r, warning = F, message = F}#| label: setup#| warning: falselibrary(plyr)library(tidyverse)library(readr)library(summarytools)library(psych)library(lattice)library(FSA)library(kableExtra)library(ggplot2)library(stargazer)library(MPV)# reading in the data setVideo_Game_Sales <-read_csv("_data/final_project/Video_Game_Sales_as_of_Jan_2017.csv")knitr::opts_chunk$set(echo = T)```# IntroductionThe current state of video games supports massive online and local communities, generating millions in revenue a year. No longer can the perception of video games continue to be that of a counter-culture media, but something more widely accepted, definitively present, and enjoyed among diverse communities.Stated eloquently by Nielsen, Smith & Tosca in, **Understanding Video Games: The Essential Introduction (2012)**, *No cultural form exists in isolation; rather, it is integrated within a complex system of meanings shaped by society and its institutions. Compared to other cultural forms, such as literature, the medium of the video game is a new member of this fascinating ecology. It is certainly true that the history of cultural media shows an almost instinctive skepticism leveled at new media. It has been true of radio, it has been true of movies, and it has certainly been true of television, which has long fought against the perception that its role was to entertain, rather than to enlighten*.Now, over 10 years later, we can see that the skepticism once surrounding video games has waned, so much so that engaging with them has now been accepted as a common hobby and even profession. I believe that video games are here to stay and rather than reject the new norm, we should grow with it. Video games will only continue to evolve in their application, especially when taking into consideration recent developments in virtual reality technology. Understanding and utilizing video games as the unique source of cultural media that they are can provide us insight into the nature of popular media and the very real weight of their market. What makes a game good? Does a game have to be good to be *"successful"*? Does fighting among platform communities contribute to the success, or lack there of, of a video game? Do critic ratings play a role? How about user ratings? Or maybe, does it come down to the publisher and whether they would have a large enough budget to advertise their games to the masses? These are all questions I'd like to address during the scope of my project. However, the main objective I've tasked myself with is to address the following research question:**What impacts do Platform & Genre have on the commercial success of a video game?**# HypothesisWhile there exists a diverse range of articles and blog posts related to console wars, game of the year announcements, loot box market structures, etc. there has been a noticeable oversight regarding the correlation between public critique and generated revenue. An often overlooked predicament of video game development occurs post-release. A game may be applauded for all the characteristics the gaming community has grown to love, but if that game's sales aren't comparable to the costs it generated, then what reward is there for the developers? Would that game still be considered successful? Would it be lucrative for developers to switch to a more widely accepted genre or platform to guarantee economic success? These questions have led to the development of the following hypotheses:**H1**: Platform and Genre significantly impact Global Sales.**H2**: Nintendo and Sony consoles host the most financially successful video games.**H3**: The Shooter genre is the most financially successful genre when compared to all others.From personal knowledge, it's known that popularity by platform may fluctuate over the course of a series' lifespan. The Nintendo manufactured Gamecube and Wii were widely popular upon release and are still commonly used today. The same can be said for Sony's Playstation 1 & 2 or Microsoft's Xbox 360. Similarly, there are also platforms like the Wii U, Playstation 3 and Xbox One that had tumultuous receptions upon release and led to committed users switching to other platform series. A common example would be the transition from Playstation to Xbox and vice versa. With this in mind, I hypothesize that Nintendo and Sony based platforms have the highest impact on sales for video games based on prior knowledge concerning the success of select platforms like the Playstation 2, Gamecube and Wii.When it comes to genre, the unrelenting success of the Call of Duty series within the time frame of this data serves as the core component of my belief that the genre *Shooters* will have the largest impact on sales among the list video game genres. More and more shooters continue to be made in attempts to emulate the success achieved with games like Call of Duty: Modern Warfare 2 and Black Ops. I also want to acknowledge other major successes such as Grand Theft Auto for example, that are major staples within their own given genres too (in the case of GTA, it's the Role Playing genre). However, it's known from personal experience that first person shooters revolutionized the gaming industry after the release of Call of Duty 4: Modern Warfare. I hypothesize that while the modern gaming market may be over-saturated with Shooters, they continue to play a large role in the commercial success of video games upon release.# Descriptive Statistics## Description and Summary of the Data This data set was pulled from the Kaggle online database and its description reads as follows,*This data set contains a list of video games with sales greater than 100,000 copies along with critic and user ratings*.With this updated data set provided by the collector, we are given 15 variables and approximately 17,500 entries. The variables are as follows:- Name [game's name]- Platform [platform of game release]- Year of Release [game's release date]- Genre [genre of game]- Publisher [publisher of game]- NA Sales [sales in North America in millions of USD]- EU Sales [sales in Europe in millions of USD]- JPN Sales [sales in Japan in millions of USD]- Other Sales [sales in rest of the world in millions of USD]- Global Sales [total worldwide sales in millions of USD]- Critic Score [aggregate score compiled by Metacritic staff]- Critic Count [the number of critis used in creating the critic score]- User Score [score according to Metacritic subscribers]- User Count [number of users who gave the user score]- Rating [ESRB rating for the game]```{r, warning = F, message = F}Video_Game_Sales_Select <- Video_Game_Sales %>%select(Genre, Platform, Rating, Publisher, Critic_Score, User_Score, Global_Sales)Video_Game_Sales_Desc <-describe(x=Video_Game_Sales_Select) %>%select(c(vars, n, mean, sd, median, min, max, range))kable(Video_Game_Sales_Desc) %>%kable_styling("striped")```## How was the data collected?Referencing the data set's description once again, it states that,*It is a combined web scrape from VGChartz and Metacritic along with manually entered year of release values for most games with a missing year of release*. The original code the collector utilized was created by Rush Kirubi, but it's made apparent that the original set limited the data to only include a subset of video game platforms. Additionally, not all the listed video games have information on Metacritic, so there are a significant amount of missing values under the critic & user scores/counts variables.This provides valuable context concerning Metacritic, the forum utilized by critics and users to rate their favorite games, and the numerous missing values within the data frame. Metacritic was established in 1999. As a result, all entries pre-dating early 2000 lack critic and user scores, as it had not been as well established at the time. These values will end up being filtered out to accommodate the controls I would like to use in the models I later create.```{r, warning = F, message = F}# summarizing our datasummary(Video_Game_Sales)```Summarizing the data shows that 9,080 entries lack critic scores and 9,618 entries lack user scores. Even with 9,618 entries omitted, there are still over 7,000 complete entries to analyze and I do not fear that the omission will negatively impact the analysis. ## Variables of InterestOf the 15 variables provided, 7 will be heavily utilized throughout the scope of this project. Those 7 have been classified below.2 main independent variables:- Platform- Genre4 confounding variables:- Publisher- Rating- Critic Scores- User Scores1 dependent variable:- Global Sales# Modifying and Visualizing the Data## ModificationsMy goal in this section is to acknowledge the steps I took to mold and form the data set I'll be using for this project into something conducive to an analysis. To start, I want to take a glimpse at and summarize my data to get a visual representation of the numbers that I'll be working with.```{r, warning = F, message = F}head(Video_Game_Sales)dfSummary(Video_Game_Sales)```Looking at the data, I immediately know that there are a couple of adjustments I want to make. The first being a couple of changes to the `Platform` variable. First, I'd like to extract all unique values to get a complete list of platforms included in the data set.```{r, warning = F, message = F}VGS <- Video_Game_SalesPlat_List <-unique(select(VGS, "Platform"))as.list(Plat_List, sorted = T)```Next, I'll duplicate the `Platform` column and re-code the values under the new variable so that they pertain to their respective manufacturer. This will clean up the data a bit and make analysis easier in the future.```{r, warning = F, message = F}# Creating new variable (Manufacturer)VGS$Manufacturer <- VGS$Platform# Re-coding values under Manufacturer to accommodate analysisVGS <- VGS %>%mutate(Manufacturer=recode(Manufacturer, 'PS4'='Sony','PS3'='Sony','PS2'='Sony','PS'='Sony','PSV'='Sony','PSP'='Sony','NES'='Nintendo','SNES'='Nintendo','N64'='Nintendo','GC'='Nintendo','DS'='Nintendo','Wii'='Nintendo','WiiU'='Nintendo','GBA'='Nintendo','3DS'='Nintendo','G'='Nintendo','GEN'='Sega','SCD'='Sega','GG'='Sega','SAT'='Sega','DC'='Sega','X'='Microsoft','X360'='Microsoft','XOne'='Microsoft','TG16'='NEC','PCFX'='NEC'))```Below, it can be seen that the re-coding worked! ```{r, warning = F, message = F}Manuf_List <-unique(select(VGS, "Manufacturer"))as.list(Manuf_List, sorted = T)```Now, I'd like draw out a list of unique values for the rest of the variables at my disposal to see what other adjustments need to be made to the data set.The next list I'd like to generate is that of the unique values under the `Genre` variable, taking into account its importance in my hypothesis.```{r, warning = F, message = F}Genre_List <-unique(select(VGS, "Genre"))list(Genre_List)```It's shown that there are a total of 12 different genres within the applicable variable. Like `Platform` & `Manufacturer`, `Genre` would also be considered a categorical variable. Unlike the aforementioned variables, I do not believe `Genre` requires any further adjustment.The next list I'd like to define is the one for the `Publisher` variable.```{r, warning = F, message = F}Publisher_List <-unique(select(VGS, "Publisher"))list(Publisher_List)```It's apparent that there are quite a few unique values under the `Publisher` variable, 627 to be exact. Simply put, there are too many unique values to efficiently conduct an analysis and control for the `Publisher` variable later on. In other words, the `Publisher` variable will over-complicate future models.In order to accommodate this, I've decided to again create a new variable (`Publisher_Code`) and re-code the unique values to fit into 3 separate categories. Those 3 categories are defined by the size and scale of the respective publishing studio. The first level is equal to 1, defining single A (or independent) studios. The second is equal 2 for AA (or mid-size) studios and the last refers to AAA (or large-scale) studios, coded as 3.```{r, warning = F, message = F}# Set values to 1 for independent publishersVGS$Publisher_Code <-1# Initialize all values to 0# Set values to 2 for AA publishersVGS$Publisher_Code[VGS$Publisher =="505 Games"| VGS$Publisher =="Aksys Games"| VGS$Publisher =="Annapurna Interactive"| VGS$Publisher =="Arc System Works"| VGS$Publisher =="Atari SA"| VGS$Publisher =="Devolver Digital"| VGS$Publisher =="Focus Entertainment"| VGS$Publisher =="Frontier Developments"| VGS$Publisher =="Humble Games"| VGS$Publisher =="Koei Tecmo"| VGS$Publisher =="Tecmo Koei"| VGS$Publisher =="Marvelous Inc"| VGS$Publisher =="Microids"| VGS$Publisher =="miHoYo"| VGS$Publisher =="Nacon"| VGS$Publisher =="NCSoft"| VGS$Publisher =="Nippon Ichi Software"| VGS$Publisher =="Paradox Interactive"| VGS$Publisher =="Raw Fury"| VGS$Publisher =="SNK"| VGS$Publisher =="SNK Playmore"| VGS$Publisher =="Spike Chunsoft"| VGS$Publisher =="ChunSoft"| VGS$Publisher =="Team17"| VGS$Publisher =="Team17 Software"| VGS$Publisher =="tinyBuild"| VGS$Publisher =="Thunderful Group"| VGS$Publisher =="Warner Bros. Interactive Entertainment"| VGS$Publisher =="Time Warner Interactive"| VGS$Publisher =="WayForward"| VGS$Publisher =="WayForward Technologies"| VGS$Publisher =="FromSoftware"| VGS$Publisher =="Infogrames"| VGS$Publisher =="Atari"| VGS$Publisher =="Deep Silver"| VGS$Publisher =="Insomniac Games"| VGS$Publisher =="LucasArts"| VGS$Publisher =="Midway Games"| VGS$Publisher =="Nordic Games"| VGS$Publisher =="THQ"] <-2# Set values to 3 for AAA publishersVGS$Publisher_Code[VGS$Publisher =="Sony Interactive Entertainment"| VGS$Publisher =="Sony Computer Entertainment"| VGS$Publisher =="Sony Music Entertainment"| VGS$Publisher =="Sony Online Entertainment"| VGS$Publisher =="Tencent Games"| VGS$Publisher =="Nintendo"| VGS$Publisher =="Microsoft Game Studios"| VGS$Publisher =="NetEase"| VGS$Publisher =="Activision Blizzard"| VGS$Publisher =="Activision"| VGS$Publisher =="Activision Value"| VGS$Publisher =="Bethesda Softworks"| VGS$Publisher =="Electronic Arts"| VGS$Publisher =="EA Games"| VGS$Publisher =="Take-Two Interactive"| VGS$Publisher =="Bandai Namco Entertainment"| VGS$Publisher =="Namco Bandai Games"| VGS$Publisher =="Square Enix"| VGS$Publisher =="Enix Corporation"| VGS$Publisher =="Nexon"| VGS$Publisher =="Netmarble"| VGS$Publisher =="Ubisoft"| VGS$Publisher =="Ubisoft Annecy"| VGS$Publisher =="Konami"| VGS$Publisher =="Konami Digital Entertainment"| VGS$Publisher =="Sega"| VGS$Publisher =="Capcom"| VGS$Publisher =="Embracer Group"| VGS$Publisher =="Gearbox Software"| VGS$Publisher =="Square"| VGS$Publisher =="Square EA"| VGS$Publisher =="SquareSoft"] <-3P_Code_List <-unique(select(VGS, "Publisher_Code"))list(P_Code_List)```The next list I want to generate is that for the `Rating` variable. ```{r, warning = F, message = F}Rating_List <-na.omit(unique(select(VGS, "Rating")))list(Rating_List)```Shown here, there are a total of 8 different ratings for games within the data set that I'll be utilizing. These values include ratings currently utilized among the ESRB (Entertainment Software Rating Board) and those that existed prior to its formation. Those ratings are defined as:1. RP = Rating Pending2. EC = Early Childhood3. E = Everyone4. K-A = Kids through Adults (Replaced by the E rating after the formation of the ESRB)5. E10+ = Everyone age 10 and up6. T = Teens7. M = Mature8. AO = Adults OnlyWhile simple and categorical in nature, I'd like to re-code this variable to be ordinal so that it can reflect the progressive inclusion of mature content through the ratings. The new variable, `Rating_Code`, will output numeric values associated with each rating. 1 for RP, 2 for EC, 3 for E & K-A, 4 for E10+, 5 for T, 6 for M and 7 for AO. I decided to join the E & K-A rating, because they effectively define the same thing. This also results in a decrease from 8 to 7 unique values.```{r, warning = F, message = F}VGS$Rating_Code <- VGS$RatingVGS <- VGS %>%mutate(Rating_Code =recode(Rating_Code,"RP"='1',"EC"='2',"E"='3',"K-A"='3',"E10+"='4',"T"='5',"M"='6',"AO"='7'))R_Code_List <-na.omit(unique(select(VGS, "Rating_Code")))list(R_Code_List)```With all the categorical variables now re-coded into something appropriate for an analysis, all that remains are the `Critic_Score` and `User_Score` variables. Both are numerical, continuous variables that I believe require no further transformations.Since there are no further adjustments I'd like to make, I want to generate the descriptive statistics once again, this time utilizing the newly created variables.```{r}VGS_select <- VGS %>%select(Genre, Manufacturer, Rating_Code, Publisher_Code, Critic_Score, User_Score, Global_Sales)VGS_Desc <-describe(x=VGS_select) %>%select(c(vars, n, mean, sd, median, min, max, range))kable(VGS_Desc) %>%kable_styling("striped")```The data is looking much more manageable! ## Single Variable VisualizationsNext, I'd like to draw up my explanatory and control variables to visualize the distribution of each and see if there are any last-minute adjustments that I'd like to make. The first variable that I will be visualizing is the `Manufacturer` variable.```{r, warning = F, message = F}# Simple Bar Plot for Manufacturer FrequencyM_counts <-table(VGS$Manufacturer)barplot(M_counts, main ="Manufacturer Distribution",xlab ="Manufacturer",ylab ="Frequency",ylim =c(0, 7000))```It looks like the existing data for the 2600, 3DO, NEC, NG, Sega and WS manufacturers is so small in comparison to the others that it's practically negligible. I don't want this to adversely affect my analysis so I'll remove those rows from the data frame.```{r, warning = F, message = F}VGS2 <- VGS[VGS$Manufacturer %in%c("Sony", "Microsoft", "Nintendo", "PC"),]M_counts2 <-table(VGS2$Manufacturer)barplot(M_counts2, main ="Manufacturer Distribution",xlab ="Manufacturer",ylab ="Frequency",ylim =c(0, 7000))```With the aforementioned values excluded, I see that a majority of games within my data set are released among Nintendo and Sony manufactured platforms. While this is supportive of my hypothesis (Nintendo and Sony hosting the most financially successful games), I'll continue to visualize the remaining variables and refrain from making conclusions until my analysis is complete.Now it's time to draw up the `Genre` variable.```{r, warning = F, message = F}G_counts <-table(VGS2$Genre)barplot(G_counts, main ="Genre Distribution",xlab ="Genre",ylab ="Frequency",ylim =c(0, 3500))```In this case, it seems all genres have adequate data and no further changes need to be made. However, I'd like to note that the Shooter genre is not one of the most frequently occurring within the data set. While this is not immediately indicative of financial success within the genre, like `Manufacturer`, I'll refrain from making any direct conclusions until the analysis is over. Next, I'd like to visualize the newly created `Publisher_Code` variable.```{r, warning = F, message = F}P_counts <-table(VGS2$Publisher_Code)barplot(P_counts, main ="Publisher Distribution",names.arg =c("A", "AA", "AAA"),horiz = T,xlab ="Frequency",ylab ="Publisher",xlim =c(0, 10000))```According to this distribution, it seems that games released by AAA publishers are the most frequently occurring within the data frame. With that in mind, there is now the potential that AAA publishers are more likely to release financially successful games when compared to other smaller publishers. However, again, definitive conclusions will be withheld until after the analysis is conducted.I'd also like to visualize both the continuous variables that I'll also be controlling for in future models, `Critic_Score` and `User_Score`. I will also be omitting all NA values to remove any entries corresponding to games that were released prior to the incorporation of critic and user scores. This is so that I can accurately control for these variables in future models. After the omission, there are still 7,098 entries remaining within the data frame.```{r, warning = F, message = F}VGS2 <-na.omit(VGS2)histogram(VGS2$Critic_Score,type =c("percent"),main ="Critic Score Distribution",xlab ="Critic Score",ylab ="Percentage")```The data visualized by the `Critic_Score` variable displays a normal, right-skewed distribution centered at an approximate value of 70-75.Now, I'd like to see if the `User_Score` distribution reflects the same thing.```{r, warning = F, message = F}histogram(VGS2$User_Score,type =c("percent"),main="User Score Distribution",xlab ="User Score",ylab ="Percentage")```While the scale of each variable differs (`Critic_Score` rated on a scale of 0-100 & `User_Score` rated on a scale of 0-10), the distribution visualized above is very similar to that of `Critic_Score`. The `User_Score` visualization shows a right-skewed, normally distributed plot centered, this time, at approximately 8-8.5 (compared to the center being at approximately 70-75 for the `Critic_Score` variable).The last variable I want to draw up is `Rating`.```{r, warning = F, message = F}R_counts <-table(VGS2$Rating_Code)barplot(R_counts, main ="Rating Distribution",names.arg =c("EC", "E & K-A", "E10+", "T", "M", "AO"),xlab ="Rating",ylab ="Frequency")```It seems that with the omission of NA values in previous chunks, RP rated games have been completely eliminated from the data set. Furthermore, similar to `Manufacturer`, there are some irrelevant pieces of data that I think my analysis could do without. Under these circumstances, I'll be eliminating any rows containing ratings of AO and EC. I will also be re-coding the remaining values to retain the ordinal nature of the variable that I previously intended to utilize. The new order will be as follows:1 = E & K-A2 = E10+3 = T4 = M```{r, warning = F, message = F}VGS3 <- VGS2[VGS2$Rating_Code %in%c("3", "4", "5", "6"),]VGS3 <- VGS3 %>%mutate(Rating =recode(Rating,"E"="E & K-A","K-A"="E & K-A"))VGS3 <- VGS3 %>%mutate(Rating_Code =recode(Rating_Code,'3'='1','4'='2','5'='3','6'='4'))R_counts2 <-table(VGS3$Rating_Code)barplot(R_counts2, main ="Rating Distribution",names.arg =c("E & K-A", "E10+", "T", "M"),xlab ="Rating",ylab ="Frequency",ylim =c(0, 2500))```With this last adjustment made, the final entry count for the data set I'll be utilizing throughout the remainder of my project is 7,095 games.It's easy to see here that the most frequently occurring ratings within the data are E & K-A , as well as T.With all the modifications to my data frame complete and all preliminary visualizations generated, I will be moving on to hypothesis testing.## Multi-Variable VisualizationsWithin this sub-section are a series of visualizations pairing multiple independent variables together. While not within the scope of my project, I do believe they can help expand the description of my data and provide insight into what my results might look like later on in my analysis. Even so, I do not find it necessary to provide an interpretation for each graph. More so, I will be utilizing this section as a sort of appendix of extra visualizations, each providing an additional dimension to my data.```{r, warning = F, message = F}table1 <-data.frame(with(VGS3, table(Manufacturer,Publisher_Code)))ggplot(table1, aes(x=Manufacturer,y=Freq, fill=Publisher_Code))+geom_bar(stat="identity",position="dodge")+scale_fill_discrete(name ="Publisher Code",labels =c("A", "AA", "AAA")) +ggtitle("Distribution of Manufacturer per Publisher Code")``````{r, warning = F, message = F}table2 <-data.frame(with(VGS3, table(Manufacturer,Rating_Code)))ggplot(table2, aes(x=Manufacturer,y=Freq, fill=Rating_Code))+geom_bar(stat="identity",position="dodge")+scale_fill_discrete(name ="Rating",labels =c("E & K-A", "E10+", "T", "M")) +ggtitle("Distribution of Manufacturer per Rating")``````{r, warning = F, message = F}table3 <-data.frame(with(VGS3, table(Genre,Manufacturer)))GM <-ggplot(table3, aes(x=Genre,y=Freq, fill=Manufacturer))+geom_bar(stat="identity",position="dodge")GM +ggtitle("Distribution of Genre per Manufacturer") +theme(axis.text.x =element_text(angle =45, vjust =1, hjust=1))``````{r, warning = F, message = F}table4 <-data.frame(with(VGS3, table(Genre,Publisher_Code)))GP <-ggplot(table4, aes(x=Genre,y=Freq, fill=Publisher_Code))+geom_bar(stat="identity",position="dodge")+scale_fill_discrete(name ="Publisher Code",labels =c("A", "AA", "AAA"))GP +ggtitle("Distribution of Genre per Publisher Code") +theme(axis.text.x =element_text(angle =45, vjust =1, hjust=1))``````{r, warning = F, message = F}table5 <-data.frame(with(VGS3, table(Genre,Rating_Code)))GR <-ggplot(table5, aes(x=Genre,y=Freq, fill=Rating_Code)) +geom_bar(stat="identity",position="dodge") +scale_fill_discrete(name ="Rating",labels =c("E & K-A", "E10+", "T", "M"))GR +ggtitle("Distribution of Genre per Rating") +theme(axis.text.x =element_text(angle =45, vjust =1, hjust=1))```# Hypothesis TestingTo start the hypothesis testing section I'd like to redefine each variables as being either an Explanatory, Response, or Control Variable.## Explanatory Variables1. Genre2. Manufacturer## Response Variable1. Global Sales## Control Variables1. Publisher Code2. Critic Score3. User Score4. Rating Code## ModelsMy final results should include 7 total models. Each one will incorporate a variant number of control variables. My intention with 7th model, however, is to include some sort of interaction between my independent variables. The goal, ultimately, will be to identify the best fitting model that can accurately determine the success of a video game when controlled for a certain combination of variables.## Independence TestingThe purpose of this sub-section is to test different combinations of my variables against the response to determine whether or not they are independent of each other. Throughout this section, I will also be testing my explanatory variables against the controls to see if I should include interaction terms later on in the modelling section.Before I begin testing, it's important to note that 3 of the variables that I will be testing are categorical (`Genre`, `Manufacturer` & `Publisher_Code`) and 1 is ordinal (`Rating_Code`). The remaining 3 are continuous (`Global_Sales`, `Critic_Score` & `User_Score`). The first test that I'll be utilizing is the One-Way ANOVA, a test capable of determining independence between a single categorical and dependent continuous variable. In this sub-section I will be testing the aforementioned categorical and ordinal variables against `Global_Sales`.The 2nd and last test I will be using is the Welch Two Sample t-test, a test capable of interpreting independence between 2 continuous variables (assuming unequal population variance from previous visualizations). This sub-section will see me testing my continuous control variables against `Global_Sales`.### ANOVAFirst up, for the ANOVA tests, is the `Manufacturer` variable against my response, `Global_Sales`.```{r, warning = F, message = F}M_aov <-aov(Global_Sales ~ Manufacturer, data = VGS3)summary(M_aov)```Here I can see that my resulting Pr(>F) value is extremely small, suggesting I reject the null at a significance level of 0.001. This means the `Manufacturer` means are significantly different, establishing that these 2 variables are indeed independent of one another. With this new information, I'd like to visualize the data again, this time using a boxplot to outline the range, mean and any outliers present for each category.I've generated 2 visualizations below. The first shows the entire scope of the data. Unfortunately, due to outliers, it's impossible to make any decisive observations from this plot. However, I still wanted to include it to provide the full scope of the data and retain a "complete" visualization. I did, however, create a second graph, to mitigate the issue and visualize the ranges and means more clearly.**Note**: I've done the same for future, applicable visualizations as well. ```{r, warning = F, message = F}ggplot(VGS3, mapping =aes(x=Manufacturer, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Manufacturer", y ="Global Sales (millions)") limit <-c(0, 1)ggplot(VGS3, mapping =aes(x=Manufacturer, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Manufacturer", y ="Global Sales (millions)") +scale_y_continuous(breaks =seq(from =0, to =1, by = .25),limits = limit)```The first graph shows the full extent of the data, revealing that Nintendo platforms host more games that sell well than any other platforms. Additionally, I'd like to acknowledge the one extreme outlier, for Nintendo, resting at about 85 million dollars in global sales. From a glance, I can also see that Microsoft and Sony are nearly comparable in their performance, with PC being the lowest performing.The second graph is much easier to read and more telling than the first. Here I see that Sony actually has the largest range when compared to the other platforms and a higher mean of sales as well. From this visualization, I can also see that Microsoft and Nintendo are much more comparable than I had originally deemed them to be. Their means seem to be near equal, with the range of values for Nintendo games sitting just below that of Microsoft. Still, the range and mean for PC games sits far below any of the other platforms, strongly suggesting that PC games do not sell as well as those belonging to the aforementioned manufacturers.Next I'll be testing whether the population means among the category `Genre` are significantly different as well.```{r, warning = F, message = F}G_aov <-aov(Global_Sales ~ Genre, data = VGS3)summary(G_aov)```Again, I received an extremely small F-value (although not as small as the test including `Manufacturer`) telling me that the `Genre` means are significantly different as well.I'd like to visualize these variables together like I did for the `Manufacturer` variable.```{r, warning = F, message = F}G_plot1 <-ggplot(VGS3, mapping =aes(x=Genre, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Genre", y ="Global Sales (millions)")G_plot1 +theme(axis.text.x =element_text(angle =45, vjust =1, hjust=1))G_plot2 <-ggplot(VGS3, mapping =aes(x=Genre, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Genre", y ="Global Sales (millions)") +scale_y_continuous(breaks =seq(from =0, to =1, by = .25),limits = limit)G_plot2 +theme(axis.text.x =element_text(angle =45, vjust =1, hjust=1))```From the first graph I can see that the Action, Racing and Shooter genres seem to sell the best when not taking into consideration the outliers. This would be in direct support of my hypothesis (The Shooter genre is the most financially successful genre when compared to all others), but I would need to conduct further analysis before making any definitive claims. Otherwise, the worst selling genres seem to be Adventure, Puzzle and Strategy games. It is important to note, however, that the outliers for the Sports genre seem to sell extremely well and may affect the impact it has on models later on.It's immediately apparent in the second graph that the effect of outliers on the range and mean for the Sports genre has indeed been exacerbated. Both appear to be significantly larger/higher than that of any of the other genres and I will choose to ignore it in this context until I conduct further analysis and fit my models. With that aside, it seems that a majority of the genres are much more comparable than I thought. 8 of the 12 genres have very similar means and ranges, only varying by approximately a tenth of a million dollars in sales, making it difficult to suggest any preliminary conclusions. I'm interested to see what information my models bring forth later on.From here, I will be moving on to testing my control variables against the response with the first being `Publisher_Code`.```{r, warning = F, message = F}P_aov <-aov(Global_Sales ~ Publisher_Code, data = VGS3)summary(P_aov)```From this test I've received the lowest resulting >F value thus far, again suggesting that the means are significantly different at p < 0.001. Due to the nature of the `Publisher_Code` variable, I will be utilizing a column plot for my visualization instead of a boxplot.```{r, warning = F, message = F}ggplot(VGS3, mapping =aes(x=Publisher_Code, y=Global_Sales))+geom_col() +labs(title ="Distribution of Global Sales per Publisher Code", y ="Global Sales (millions)", x ="Publisher Code")```As a reminder, the 1 value is equal to A tier publishers, 2 is equal to AA publishers and 3 is equal to AAA publishers. It's obvious from this plot that AAA publishers generate the most in global sales, with AA publishers coming in 2nd and A publishers generating the least. With that said, the difference in sales when comparing the 3 is staggering. Combining the sales of both the A and AA publishers wouldn't equate to even half of the global sales generated by the AAA publishers. While not immediately significant to my hypothesis, this begs the question; do AAA publishers have an unfair advantage when compared to the others? While this isn't something that I'll be investigating through the scope of my project, the results here are interesting and it could be something worth eventually looking into.The last ANOVA test involving `Global_Sales` will decide whether the `Rating_Code` means are significantly different as well.```{r, warning = F, message = F}R_aov <-aov(Global_Sales ~ Rating_Code, data = VGS3)summary(R_aov)```Yet again, it looks like I've received another small F-value telling me that the `Rating_Code` means are significantly different and that both variables are again independent of each other.Now, I will proceed with the visualization, returning to my boxplot method.```{r, warning = F, message = F}VGS3$Rating <-factor(VGS3$Rating, levels =c("E & K-A", "E10+", "T", "M"))ggplot(VGS3, mapping =aes(x=Rating, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Rating", y ="Global Sales (millions)")ggplot(VGS3, mapping =aes(x=Rating, y=Global_Sales))+geom_boxplot() +labs(title ="Distribution of Global Sales per Rating", y ="Global Sales (millions)") +scale_y_continuous(breaks =seq(from =0, to =1, by = .25),limits = limit)```It's visible in the first plot that E/K-A games seem to have the highest number of outliers exceeding 20 million dollars in global sales, with their range even being slightly visible even from this scale. However, unlike the `Genre` visualization, I do believe that the consistency of these outliers is indicative of success among the rating and will end up proving to be statistically significant during model analysis. I'd also like to bring attention to the fact that the range of M rated games is also slightly visible. As such, I believe that rated M games will also prove to have a significant impact on `Global_Sales`.From the second graph, I can see that E/K-A games have the widest range of values out of any of the other genres. However, the range and mean for E/K-A games do still seem relatively comparable to the rest, in contrast to my initial predictions. I do not believe there are any further conclusions I can make from this plot, but I still suspect that the impact of E/K-A and Mature rated games on `Global_Sales` will be the most significant later on.### Welch Two Sample t-testContinuing with my independence testing, I will be comparing continuous variable `Critic_Score` against `Global_Sales`, this time using a Welch Two Sample t-test.```{r, warning = F, message = F}t.test(VGS3$Critic_Score, VGS3$Global_Sales)```From this test, I received an extremely low p-value. Like the ANOVA, this tells me that the the means between both variables are significantly different and both variables are independent of each other. I'd also like to point out the "mean of x" value (70.20) provided here. This is seems to track with my previous prediction and I'd like to compare it to the value received when I test the `User_Score` variable. Now for the visualization.For this graph, I'll be utilizing a column plot which seems to be more effective for continuous variables.```{r, warning = F, message = F}ggplot(VGS3, mapping =aes(x = Critic_Score, y = Global_Sales))+geom_col() +labs(title ="Distribution of Global Sales per Critic Score", y ="Global Sales (millions)", x ="Critic Score")```From this plot, I can see that the trend continues to appear normally distributed. Additionally, it does seem as though the relationship between `Global_Sales` & `Critic_Score` is positive up to a certain point (approximately 85-90). Afterwards, the higher the score gets, the more `Global_Sales` seemingly drops. I suspect the same thing to happen when testing `User_Score`.I will now be testing the aforementioned `User_Score` variable against `Global_Sales` in an identical manner.```{r, warning = F, message = F}t.test(VGS3$User_Score, VGS3$Global_Sales)```For this test, the p-value I receive for `User_Score` is identical to that of `Critic_Score`, once again telling me that the means between the tested variables are significantly different. Noting the mean of x value again, it's actually much lower than I expected, however, it's nearly identical to that of the `Critic_Score` test which presents the significant potential of a trend concerning the impact of these controls on the response variable.I will now be visualizing these variables together the same way I did for `Critic_Score`.```{r, warning = F, message = F}ggplot(VGS3, mapping =aes(x = User_Score, y = Global_Sales))+geom_col() +labs(title ="Distribution of Global Sales per User Score", y ="Global Sales (millions)", x ="User Score")```The information provided by this graph is nearly identical to that which I received from the `Critic_Score` visualization (when scaled to match each other). The one major difference I notice is that the drop off point, from which the relationship between both variables turns negative, occurs at a much lower score than the previous visualization. Additionally, the rate at which the relationship increases and decreases before and after the peak seems to be much more sheer, hinting at an exponential relationship between the 2 variables tested here.# Model ComparisonsIn this section I will begin fitting a series of models in order to determine which is best at predicting the effect a video game has on global sales. I've fit a total of 7 models. The 1st included only 1 of the explanatory variables (`Manufacturer`), with `Global_Sales` set as the response (or "y"). After the 1st, each additional model progressively adds a single variable to the equation. Therefore, the 2nd includes 2 variables (instead of 1) and the response, the 3rd contains 3 variables and the response, the 4th, 4 variables and the response... and so on. The 7th model, however, is slightly different. In this model, I decided to include an interaction between an assortment of my variables. Based on the data and predictions I've made thus far, I decided to host an interaction between 5 of the 6 independent variables, those being `Manufacturer`, `Genre`, `Publisher_Code`, `Critic_Score` and `Rating_Code`. The main reason I decided to exclude `User_Score` from this interaction is ultimately because I do not believe the impact user ratings have on global sales is as significant as any of the other variables. From personal knowledge, I know that critics often provide their ratings for video games prior to their full release on the market. As you can imagine, this is often utilized as a strong marketing tactic, used to boost anticipation for upcoming releases. However, user ratings only come into play after the games full release and I would assume, at that point, enough people would have already purchased said video game to significantly impact global sales (regardless of public reception after release). Therefore, I excluded `User_Score` from the interaction and kept it as a control.Now, with the models fitted, I will summarize them utilizing the stargazer function.<details><summary> View Code</summary>```{r, warning = F, message = F}VGS3$Manufacturer2 <-as.factor(VGS3$Manufacturer)VGS3$Manufacturer2 <-relevel(VGS3$Manufacturer2, ref ='Nintendo')fit1 <-lm(Global_Sales ~ Manufacturer2, data = VGS3)fit2 <-lm(Global_Sales ~ Manufacturer2 + Genre, data = VGS3)fit3 <-lm(Global_Sales ~ Manufacturer2 + Genre + Publisher_Code, data = VGS3)fit4 <-lm(Global_Sales ~ Manufacturer2 + Genre + Publisher_Code + Critic_Score, data = VGS3)fit5 <-lm(Global_Sales ~ Manufacturer2 + Genre + Publisher_Code + Critic_Score + User_Score, data = VGS3)fit6 <-lm(Global_Sales ~ Manufacturer2 + Genre + Publisher_Code + Critic_Score + User_Score + Rating_Code, data = VGS3)fit7 <-lm(Global_Sales ~ Manufacturer2*Genre*Publisher_Code*Critic_Score*Rating_Code + User_Score, data = VGS3)stargazer(fit1, fit2, fit3, fit4, fit5, fit6, fit7, type ='text')```</details>With my models visualized, I'm immediately drawn to the adjusted R squared values at the bottom. It seems that the values progressively improve as more control variables are introduced to the models. The strongest adjusted R squared value I received comes from the last interaction model. With an adjusted R squared of 0.112, this model seems to be the best fit thus far. However, before I make that claim, I would also like to evaluate my models using the AIC and BIC evaluation methods. This will help to firmly establish which model is better fitting of my analysis.In the case of AIC/BIC evaluations, the lower the value received, the better the model fits. With that in mind, I can proceed with the evaluations.Model 1```{r}AIC(fit1)BIC(fit1)```Model 2```{r}AIC(fit2)BIC(fit2)```Model 3```{r}AIC(fit3)BIC(fit3)```Model 4```{r}AIC(fit4)BIC(fit4)```Model 5```{r}AIC(fit5)BIC(fit5)```Model 6```{r}AIC(fit6)BIC(fit6)```Model 7```{r}AIC(fit7)BIC(fit7)```While the adjusted R squared value is smaller, according to my AIC and BIC evaluations, model 6 is actually the best fitting for the analysis. This could be due to a number of different reasons, but I suspect that the introduction of interaction terms in the last model over complicated the equation which made it less accurate.With model 6 established as the best fitting, I'd like to summarize it on its own in order to clearly visualize my coefficients once more.```{r, warning = F, message = F}summary(fit6)```From model 6, I receive an extremely small p-value and Adjusted R-Squared. This means that the model holds statistical significance but that the reaction in the response variable cannot be fully explained by the independent variables that I've provided. Now, for the resulting coefficients.For the `Manufacturer` variable, Nintendo is being used as the reference level and the values provided by my regression analysis show statistical significance at the 0.001 level. Aside from Nintendo, the analysis for Microsoft and PC platforms also prove to be statistically significant and to the same degree as well. The only manufacturer analysis proven not to be statistically significant is that for Sony. **Therefore, according to my analysis, for every unit increase (1 million dollars) in global sales for Nintendo games, Microsoft and PC games are projected to make approximately 29% and 93% less respectively.**Moving on to the `Genre` variable, the reference level utilized was the Action genre. The only analyses proven to be statistically significant for this variable are that for the Adventure, Misc and Sports genres. Adventure and Misc are statistically significant at the 0.05 significance level, while Sports is statistically significant at 0.01. This provides me with enough information to construct the following statement. **For every unit increase in global sales for Action games, Adventure and Sports games are projected to make approximately 26% less while Misc games are projected to make about 27% more.**Next up is the `Publisher_Code` variable, which by default establishes the 1st level (A publishers) as the reference level. Under this regression model, the variable did indeed prove to be statistically significant at the 0.001 level. **Therefore, starting with single A studios, each subsequent level of publishers is predicted to make about 23% more in global sales than the last.**Now, it's time to look at the `Critic_Score` variable. Again, the variable is proven to be significant at the 0.001 level and I'm provided with a coefficient of ~0.04. Because this a continuous variable, the resulting interpretation is slightly different. In this case, unit increase applies to the `Critic_Score` scale rather than an incremental increase in `Global_Sales`. **This means, that for every 1 unit increase to a game's critic score, global sales for that game increase by 0.04 million dollars.**The analysis and interpretation for the `User_Ratings` variable is near identical to that of `Critic_Score`. Statistically significant at the 0.001 level, this time I am provided with a coefficient of about -0.12. **Thus, for every 1 unit increase to a game's user score, global sales for that game decrease by 0.12 million dollars.** This is interesting because I would have expected a net positive relationship between `User_Score` and `Global_Sales`. However, I'm not entirely surprised. I had previously observed that the relationship between user ratings and sales seemed a tad bit sheer or drastic. Having received the results I have now, I'm beginning to suspect that the scale of user ratings may be having a large impact on the analysis' results. As compared to the 0-100 scale for `Critic_Score`, `User_Score` is based on a scale of 1-10. Therefore, the effect of a single unit increase would be much more drastic in comparison.The last variable to interpret is `Rating_Code`. Back to using a categorical variable, the reference level here is set to the rating coded as 1 (Rated E & K-A games). Under these parameters, the only other ratings proven to be statistically significant are E10+ & T rated games, both at the 0.001 level. The coefficients provided for each are approximately -0.29 & -0.24, respectively. **Therefore, for every 1 million dollars rated E/K-A games generate, E10+ and T rated games are projected to make approximately 29% and 24% less respectively.**Although not entirely what I expected to receive, I believe that I have interpreted the results of my regression analysis to the best of my ability. Having acknowledged all statistically significant variables, I can now confidently define the formula for model 6 as the following:`Global_Sales` = -1.419041 + (-0.288588 * Microsoft) + (-0.926065 * PC) + (-0.258751 * Adventure) + (0.265853 * Misc) + (-0.263520 * Sports) + (0.232663 * `Publisher_Code`) + (0.040525 * `Critic_Score`) + (-0.121010 * `User_Score`) + (-0.293211 * Rating_Code2) + (-0.237841 * Rating_Code3)# DiagnosticsThe last step in my analysis will be to create a series of diagnostic plots to confirm whether or not model 6 violates any of the plot-respective regression assumptions. I will be specifically drawing 4 diagnostic plots, including: the Residuals vs Fitted, Normal Q-Q, Scale-Location, and Cook's Distance plots.```{r, warning = F, message = F}par(mfrow =c(2, 2)); plot(fit6, which =1:4)```The first plot is the Residuals vs. Fitted. I see on the plot that the both the linearity and constant variance assumption are violated. This is portrayed by the points not being linearly or evenly distributed around the origin. Additionally, there are still some very notable outliers. Were these outliers to be removed, the model might actually hold up to the assumptions. However, considering the quantity of outliers at the tail that veer from the line, that would mean removing a significant amount of entries from our data. The next plot I'll be looking at is a Normal Q-Q which is easy to tell at a glance that a violation has been made. I can see that the plot starts off relatively linear but there is a skew as it reaches the end, meaning the Normality assumption has been violated. Again, this is most likely attributed to the outliers.Next is the Scale-Location plot. From the graph, I can see that there is a lot of variation in distance from each point to the line. This would indicate a direct violation of the Constant Variance assumption.The last drawing to interpret is the Cook's Distance plot. It's easy to see the violation here immediately. In the graph, it's clear that there are values greater than 4/n, which is a violation of the Influential Observation assumption.From these plots, I can't help but wonder what my results would be like were the outliers to be excluded from the data. With that said, I believe the effect of doing so would ultimately be detrimental to my analysis, as it wouldn't encapsulate the full scope of my project. Hopefully, with more games being released every year, this data will grow and continue to contribute to the research I've conducted within the scope of this project.# ConclusionsUnfortunately, it's obvious at this point that I simply do not have enough data to create a model that can accurately and fully predict the effects of Genre & Platform on video game sales. I was, however, able to make progress toward confirming my hypotheses, even if I wasn't able to prove any given hypothesis in its entirety. For my 1st hypothesis, *Platform and Genre significantly impact Global Sales*, I was able to at least prove that certain manufacturers and genres were significantly more/less likely to impact the global sales of a given video game.My 2nd hypothesis, *Nintendo and Sony consoles, when compared to other platforms, host the most financially successful video games*, had similar results. Unfortunately, I could not provide my model with enough data so that my results for Sony would be considered statistically significant. Still, I was able to prove, at the 0.001 level, that Nintendo does indeed host the most financially successful games when compared to the remaining 2 platform manufacturers.Even with both these partial successes, my one regret is that my 3rd hypothesis, *The Shooter genre is the most financially successful genre when compared to all others*, had to remain unacknowledged. Similar to what happened with the Sony manufacturer, I couldn't provide my model with enough data to give me statistically significant results for the Shooter genre.Still, even after these pros and cons, the analysis I was able to conduct proved to be extremely interesting and engaging. This topic provides clear potential for future research. Were I to continue, I would further investigate the effects publishers and critic ratings have on global sales as well. Furthermore, I would apply the same methods I did here to regional sales as well, as the data set I used also provided variables such as `NA_Sales`, `EU_Sales`, `JP_Sales` and `Other_Sales`.I'd like to extend a special thank you to Professor Pang for her help throughout the scope of my project. This concludes my study of the **The Effects of Genre & Platform on Video Game Sales**.### ReferencesEgenfeldt-Nielsen, Simon, et al. Understanding Video Games : The Essential Introduction, Taylor & Francis Group, 2012. ProQuest Ebook Central, https://ebookcentral.proquest.com/lib/uma/detail.action?docID=1181119.Etchells, Pete. Lost in a Good Game: Why We Play Video Games and What They Can Do for Us. Icon Books, 2019. McCullough, Hayley. (2019). From Zelda to Stanley: Comparing the Integrative Complexity of Six Video Game Genres. Press Start. 5. 137-149.Gillies, Kendall. “Video Game Sales and Ratings.” Kaggle, 25 Jan. 2017, https://www.kaggle.com/datasets/kendallgillies/video-game-sales-and-ratings?resource=download.