DACSS 603 Project Analysis

Alexis Gamez


April 12, 2023


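knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(psych)      # describe()
library(knitr)      # kable()
# reading in our data set
Video_Game_Sales <- read_csv("_data/final_project/Video_Game_Sales_as_of_Jan_2017.csv")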
# A tibble: 6 × 15
  Name       Platform Year_of_Release Genre Publisher NA_Sales EU_Sales JP_Sales
  <chr>      <chr>              <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
1 Wii Sports Wii                 2006 Spor… Nintendo      41.4    29.0      3.77
2 Super Mar… NES                 1985 Plat… Nintendo      29.1     3.58     6.81
3 Mario Kar… Wii                 2008 Raci… Nintendo      15.7    12.8      3.79
4 Wii Sport… Wii                 2009 Spor… Nintendo      15.6    11.0      3.28
5 Pokemon R… G                   1996 Role… Nintendo      11.3     8.89    10.2 
6 Tetris     G                   1989 Puzz… Nintendo      23.2     2.26     4.22
# … with 7 more variables: Other_Sales <dbl>, Global_Sales <dbl>,
#   Critic_Score <dbl>, Critic_Count <dbl>, User_Score <dbl>, User_Count <dbl>,
#   Rating <chr>

Feedback from Part 1

From my previous project proposal post, I received valuable feedback from Professor Pang. That feedback applies to my selection of variables and their classifications, and therefore to my hypothesis. Acknowledging the feedback, I’d like to address some changes to my analysis from my previous post.

The following list of variables was shown in my previous post along with an acknowledgment that this would be the list of variables that I was to work with throughout the duration of my project, and that remains true.

The variables are as follows:

  • Name [game’s name]
  • Platform [platform of game release]
  • Year of Release [game’s release date]
  • Genre [genre of game]
  • Publisher [publisher of game]
  • NA Sales [sales in North America in millions]
  • EU Sales [sales in Europe in millions]
  • JPN Sales [sales in Japan in millions]
  • Other Sales [sales in rest of the world in millions]
  • Global Sales [total worldwide sales in millions]
  • Critic Score [aggregate score compiled by Metacritic staff]
  • Critic Count [the number of critics used in creating the critic score]
  • User Score [score according to Metacritic subscribers]
  • User Count [number of users who gave the user score]
  • Rating [ESRB rating for the game]

However, the feedback I received applied more directly to my classification of independent and dependent variables. I originally stated that I would be using 6 independent and 5 dependent variables throughout my analysis. The original classifications are listed below.

The 6 independent variables are as follows:

  • Platform
  • Genre
  • Publisher
  • Rating
  • Critic Scores
  • User Scores

The 5 dependent variables are:

  • NA Sales
  • EU Sales
  • JPN Sales
  • Other Sales
  • Global Sales

While not entirely untrue, this classification isn’t quite right. The nature of this project requires the analysis of explanatory and response variables using a combination of regression and alternative analysis models, and an oversaturation of explanatory (independent) variables could lead to an overly complex model. Knowing this now, I’ve reclassified the variables appropriately below.

The 2 main independent variables are as follows:

  • Platform
  • Genre

The 4 confounding variables are as follows:

  • Publisher
  • Rating
  • Critic Scores
  • User Scores

The 5 dependent variables are:

  • NA Sales
  • EU Sales
  • JPN Sales
  • Other Sales
  • Global Sales

A confounding variable is a variable that influences both the dependent and independent variable, which causes a spurious association within the model. It must meet 2 conditions in order to be considered a true confounder.

  1. It must be correlated to the independent variable. We’ll be testing this at the beginning of the analysis.

  2. It must be causally related to the dependent variable. Again, to be tested at the start of my analysis.
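The two conditions can be screened with the same kinds of tests used later in this post. Below is a minimal sketch on simulated data. The data frame toy and its columns are invented for illustration and are not taken from the sales data. Note that a significant test only shows association; it cannot establish the causal link that the second condition asks for.

```r
# Simulated stand-in data: `toy`, `genre`, `rating` and `sales` are hypothetical.
set.seed(1)
n <- 600
genre <- sample(c("Action", "Puzzle", "Sports"), n, replace = TRUE)
# Rating depends on genre, so the two are associated by construction.
rating <- ifelse(genre == "Action",
                 sample(c("M", "T"), n, replace = TRUE, prob = c(0.7, 0.3)),
                 sample(c("E", "T"), n, replace = TRUE, prob = c(0.8, 0.2)))
# Sales depend on rating, so rating is related to the response.
sales <- 0.5 + 0.4 * (rating == "M") + rnorm(n, sd = 0.3)
toy <- data.frame(genre, rating, sales)

# Condition 1: is the candidate confounder associated with the explanatory variable?
cond1 <- chisq.test(table(toy$genre, toy$rating))

# Condition 2: is the candidate confounder related to the response?
cond2 <- summary(aov(sales ~ rating, data = toy))[[1]]

cond1$p.value
cond2[["Pr(>F)"]][1]
```

Small p-values on both checks would make rating a plausible confounder of the genre and sales relationship in this toy setting.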

Also, note that our list of dependent variables did not change. That is because my goal for this project is to test my hypothesis within each individual region and compare the models side by side to see whether my results hold across regions. My final results should include 5 total models, one for each dependent variable. Thus, each model will contain 1 dependent variable, 2 independent variables and a varying number of confounders.
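One way to fit the same specification for each region is to loop over the response names. This is a rough sketch on simulated data, assuming the column names used in this post and using Rating as a stand-in confounder; the toy data frame is hypothetical.

```r
# Simulated stand-in for the cleaned data; column names mirror the post.
set.seed(2)
n <- 200
toy <- data.frame(
  Manufacturer = sample(c("Microsoft", "Nintendo", "Sony"), n, replace = TRUE),
  Genre        = sample(c("Action", "Sports", "Puzzle"), n, replace = TRUE),
  Rating       = sample(c("E", "M", "T"), n, replace = TRUE),
  NA_Sales = rexp(n), EU_Sales = rexp(n), JP_Sales = rexp(n), Other_Sales = rexp(n)
)
toy$Global_Sales <- with(toy, NA_Sales + EU_Sales + JP_Sales + Other_Sales)

responses <- c("Global_Sales", "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")

# reformulate() builds e.g. Global_Sales ~ Manufacturer + Genre + Rating
models <- lapply(responses, function(y) {
  lm(reformulate(c("Manufacturer", "Genre", "Rating"), response = y), data = toy)
})
names(models) <- responses

# Adjusted R-squared of each regional model, side by side
sapply(models, function(m) summary(m)$adj.r.squared)
```

Keeping the models in a named list makes it straightforward to pull the same statistic from each one when comparing regions.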

My last mention of feedback for this post acknowledges a shift in my original hypothesis, for which I wrote, “As of 2017, independent variables “Platform” and “Genre” will have the most significant impact on Global sales”. While I believe what I wrote was representative of my hypothesis, I don’t believe it was easy to analyze or to follow for those reading this. As a result, I’ve decided to reorganize my hypothesis into a format that is a bit easier to follow. I will also be rewording my research question to better reflect the changes made to my project’s structure.

H1: Platform and Genre significantly impact Global Sales

H2: Platform and Genre significantly impact NA and EU Sales

H3: Platform and Genre will not significantly impact JPN sales

Of the two selected independent variables (Platform & Genre), what impact does each have on the commercial success of a video game and does that impact vary by region?

With the corrections made to my hypothesis and research question, I believe I’ve acknowledged all the feedback I received from my first post.

Expanding the Description of my Data

Picking up where I left off from the last post, my goal in this section is to acknowledge the steps I didn’t get the opportunity to execute before. To start, I want to summarize my data and get a visual representation of the numbers to refresh my memory of what I’m working with.

Data Frame Summary  
Dimensions: 17416 x 15  
Duplicates: 0  

No   Variable          Stats / Values                 Freqs (% of Valid)    Graph                 Valid      Missing  
---- ----------------- ------------------------------ --------------------- --------------------- ---------- ---------
1    Name              1. Need for Speed: Most Want      12 ( 0.1%)                               17416      0        
     [character]       2. FIFA 14                         9 ( 0.1%)                               (100.0%)   (0.0%)   
                       3. LEGO Marvel Super Heroes        9 ( 0.1%)                                                   
                       4. Madden NFL 07                   9 ( 0.1%)                                                   
                       5. Madden NFL 08                   9 ( 0.1%)                                                   
                       6. Ratatouille                     9 ( 0.1%)                                                   
                       7. Angry Birds Star Wars           8 ( 0.0%)                                                   
                       8. Cars                            8 ( 0.0%)                                                   
                       9. FIFA 15                         8 ( 0.0%)                                                   
                       10. FIFA Soccer 13                 8 ( 0.0%)                                                   
                       [ 12070 others ]               17327 (99.5%)         IIIIIIIIIIIIIIIIIII                       

2    Platform          1. DS                          2251 (12.9%)          II                    17416      0        
     [character]       2. PS2                         2206 (12.7%)          II                    (100.0%)   (0.0%)   
                       3. PS3                         1362 ( 7.8%)          I                                         
                       4. Wii                         1359 ( 7.8%)          I                                         
                       5. PSP                         1304 ( 7.5%)          I                                         
                       6. X360                        1298 ( 7.5%)          I                                         
                       7. PS                          1200 ( 6.9%)          I                                         
                       8. PC                          1128 ( 6.5%)          I                                         
                       9. GBA                          844 ( 4.8%)                                                    
                       10. X                           833 ( 4.8%)                                                    
                       [ 21 others ]                  3631 (20.8%)          IIII                                      

3    Year_of_Release   Mean (sd) : 2006.6 (5.9)       42 distinct values                  : :     17408      8        
     [numeric]         min < med < max:                                                   : :     (100.0%)   (0.0%)   
                       1976 < 2008 < 2017                                               : : : .                       
                       IQR (CV) : 8 (0)                                               . : : : :                       
                                                                                    . : : : : :                       

4    Genre             1. Action                      3503 (20.1%)          IIII                  17416      0        
     [character]       2. Sports                      2408 (13.8%)          II                    (100.0%)   (0.0%)   
                       3. Misc                        1813 (10.4%)          II                                        
                       4. Role-Playing                1545 ( 8.9%)          I                                         
                       5. Adventure                   1478 ( 8.5%)          I                                         
                       6. Shooter                     1349 ( 7.7%)          I                                         
                       7. Racing                      1282 ( 7.4%)          I                                         
                       8. Simulation                   925 ( 5.3%)          I                                         
                       9. Platform                     900 ( 5.2%)          I                                         
                       10. Fighting                    864 ( 5.0%)                                                    
                       [ 2 others ]                   1349 ( 7.7%)          I                                         

5    Publisher         1. Electronic Arts             1380 ( 7.9%)          I                     17416      0        
     [character]       2. Activision                  1005 ( 5.8%)          I                     (100.0%)   (0.0%)   
                       3. Namco Bandai Games           972 ( 5.6%)          I                                         
                       4. Ubisoft                      970 ( 5.6%)          I                                         
                       5. Konami Digital Entertainm    865 ( 5.0%)                                                    
                       6. THQ                          728 ( 4.2%)                                                    
                       7. Nintendo                     722 ( 4.1%)                                                    
                       8. Sony Computer Entertainme    704 ( 4.0%)                                                    
                       9. Sega                         660 ( 3.8%)                                                    
                       10. Take-Two Interactive        433 ( 2.5%)                                                    
                       [ 617 others ]                 8977 (51.5%)          IIIIIIIIII                                

6    NA_Sales          Mean (sd) : 0.3 (0.8)          399 distinct values   :                     17416      0        
     [numeric]         min < med < max:                                     :                     (100.0%)   (0.0%)   
                       0 < 0.1 < 41.4                                       :                                         
                       IQR (CV) : 0.2 (3.1)                                 :                                         

7    EU_Sales          Mean (sd) : 0.1 (0.5)          306 distinct values   :                     17416      0        
     [numeric]         min < med < max:                                     :                     (100.0%)   (0.0%)   
                       0 < 0 < 29                                           :                                         
                       IQR (CV) : 0.1 (3.5)                                 :                                         

8    JP_Sales          Mean (sd) : 0.1 (0.3)          245 distinct values   :                     17416      0        
     [numeric]         min < med < max:                                     :                     (100.0%)   (0.0%)   
                       0 < 0 < 10.2                                         :                                         
                       IQR (CV) : 0 (4)                                     :                                         

9    Other_Sales       Mean (sd) : 0 (0.2)            157 distinct values   :                     17416      0        
     [numeric]         min < med < max:                                     :                     (100.0%)   (0.0%)   
                       0 < 0 < 10.6                                         :                                         
                       IQR (CV) : 0 (4)                                     :                                         

10   Global_Sales      Mean (sd) : 0.5 (1.5)          627 distinct values   :                     17416      0        
     [numeric]         min < med < max:                                     :                     (100.0%)   (0.0%)   
                       0 < 0.2 < 82.5                                       :                                         
                       IQR (CV) : 0.4 (3)                                   :                                         

11   Critic_Score      Mean (sd) : 68.9 (14)          82 distinct values                  :       8336       9080     
     [numeric]         min < med < max:                                               . : :       (47.9%)    (52.1%)  
                       13 < 71 < 98                                                   : : : :                         
                       IQR (CV) : 19 (0.2)                                          . : : : :                         
                                                                                . : : : : : : :                       

12   Critic_Count      Mean (sd) : 26.2 (19)          106 distinct values   :                     8336       9080     
     [numeric]         min < med < max:                                     : :                   (47.9%)    (52.1%)  
                       3 < 21 < 113                                         : : .                                     
                       IQR (CV) : 25 (0.7)                                  : : : .                                   
                                                                            : : : : : . .                             

13   User_Score        Mean (sd) : 7.1 (1.5)          95 distinct values                    :     7798       9618     
     [numeric]         min < med < max:                                                   : :     (44.8%)    (55.2%)  
                       0 < 7.5 < 9.7                                                      : :                         
                       IQR (CV) : 1.8 (0.2)                                           . : : : .                       
                                                                                  . . : : : : :                       

14   User_Count        Mean (sd) : 162.7 (562.8)      903 distinct values   :                     7798       9618     
     [numeric]         min < med < max:                                     :                     (44.8%)    (55.2%)  
                       4 < 25 < 10766                                       :                                         
                       IQR (CV) : 71 (3.5)                                  :                                         

15   Rating            1. AO                             1 ( 0.0%)                                10252      7164     
     [character]       2. E                           4120 (40.2%)          IIIIIIII              (58.9%)    (41.1%)  
                       3. E10+                        1473 (14.4%)          II                                        
                       4. EC                             8 ( 0.1%)                                                    
                       5. K-A                            3 ( 0.0%)                                                    
                       6. M                           1599 (15.6%)          III                                       
                       7. RP                             3 ( 0.0%)                                                    
                       8. T                           3045 (29.7%)          IIIII                                     

Looking at the data, I know there were a couple of adjustments I wanted to make following my last final project post, starting with the Platform variable. First, I’m going to extract all unique platform values to get a complete list of included platforms.

Plat_List <- unique(select(Video_Game_Sales, "Platform"))
as.list(Plat_List)  # as.list() has no `sorted` argument, so the order is first appearance
 [1] "Wii"  "NES"  "G"    "DS"   "X360" "PS3"  "PS2"  "SNES" "GBA"  "3DS" 
[11] "PS4"  "N64"  "PS"   "X"    "PC"   "2600" "PSP"  "XOne" "WiiU" "GC"  
[21] "GEN"  "DC"   "PSV"  "SAT"  "SCD"  "WS"   "NG"   "TG16" "3DO"  "GG"  
[31] "PCFX"

Next, I’ll duplicate the Platform column and re-code the values so that they pertain to their respective manufacturer. This will clean up the data a bit and make analysis easier in the future.

VGS <- Video_Game_Sales
VGS$Manufacturer <- VGS$Platform

VGS <- VGS %>% 
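The body of the re-coding pipeline is cut off in this rendering of the post. One way such a mapping could be written is sketched below in base R. The groupings are an assumption inferred from the platform list above and the manufacturer list that follows, and the names platform_to_manufacturer and to_manufacturer are hypothetical.

```r
# Hypothetical lookup table: platform code -> manufacturer.
# Groupings are inferred from the lists in this post, not copied from the original code.
platform_to_manufacturer <- c(
  Wii = "Nintendo", NES = "Nintendo", G = "Nintendo", DS = "Nintendo",
  SNES = "Nintendo", GBA = "Nintendo", `3DS` = "Nintendo", N64 = "Nintendo",
  WiiU = "Nintendo", GC = "Nintendo",
  X360 = "Microsoft", X = "Microsoft", XOne = "Microsoft",
  PS = "Sony", PS2 = "Sony", PS3 = "Sony", PS4 = "Sony", PSP = "Sony", PSV = "Sony",
  GEN = "Sega", DC = "Sega", SAT = "Sega", SCD = "Sega", GG = "Sega",
  TG16 = "NEC", PCFX = "NEC"
)

to_manufacturer <- function(platform) {
  out <- unname(platform_to_manufacturer[platform])
  # Platforms without an entry (PC, 2600, WS, NG, 3DO) keep their own label.
  ifelse(is.na(out), platform, out)
}

result <- to_manufacturer(c("Wii", "PS2", "X360", "PC", "TG16", "3DO"))
result
```

With this helper, the new column would be filled with something like VGS$Manufacturer <- to_manufacturer(VGS$Platform).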

Below, it can be seen that the re-coding worked! Now, we’ll draw out a list of unique values for the rest of the categorical variables we have at our disposal.

Manuf_List <- unique(select(VGS, "Manufacturer"))
as.list(Manuf_List)  # as.list() has no `sorted` argument, so the order is first appearance
 [1] "Nintendo"  "Microsoft" "Sony"      "PC"        "2600"      "Sega"     
 [7] "WS"        "NG"        "NEC"       "3DO"      
Genre_List <- unique(select(VGS, "Genre"))
# A tibble: 12 × 1
 1 Sports      
 2 Platform    
 3 Racing      
 4 Role-Playing
 5 Puzzle      
 6 Misc        
 7 Shooter     
 8 Simulation  
 9 Action      
10 Fighting    
11 Adventure   
12 Strategy    
Rating_List <- na.omit(unique(select(VGS, "Rating")))
# A tibble: 8 × 1
1 E     
2 M     
3 T     
4 E10+  
5 K-A   
6 AO    
7 EC    
8 RP    

I’d like to better visualize my descriptive statistics from last time, so again I’m going to do some re-coding. This time, I will be re-coding the Manufacturer, Genre and Rating variables so that each unique value corresponds to a number. Again, this makes it easier to draw up our numbers.
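As a point of reference, dplyr's recode() takes old = new pairs, so turning labels into numbers needs the label on the left-hand side. The same label-to-number mapping can be sketched in base R as follows. The toy vector and the particular codes are illustrative assumptions.

```r
# Toy vector standing in for VGS$Rating (labels taken from the list above).
rating <- c("E", "M", "T", "E10+", "E", NA, "M")

# Option 1: an explicit lookup vector, label -> code.
rating_codes <- c("E" = 1, "M" = 2, "T" = 3, "E10+" = 4)
rating_num <- unname(rating_codes[rating])

# Option 2: let factor() assign the codes, in the order given by `levels`.
rating_fct <- as.integer(factor(rating, levels = names(rating_codes)))

rating_num
rating_fct
```

Storing the codes in a new column, rather than overwriting the labels, keeps later filters and model output readable.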

VGS <- VGS %>% 
  mutate(Manufacturer = recode(Manufacturer,
                            `1` = "Sony",
                            `2` = "Nintendo",
                            `3` = "Sega",
                            `4` = "Microsoft",
                            `5` = "PC",
                            `6` = "Neo Geo",
                            `7` = "WS",
                            `8` = "3DO",
                            `9` = "NEC",
                            `10` = "2600"))

VGS <- VGS %>% 
  mutate(Genre = recode(Genre,
                            `1` = "Sports",
                            `2` = "Platform",
                            `3` = "Racing",
                            `4` = "Role-Playing",
                            `5` = "Puzzle",
                            `6` = "Misc",
                            `7` = "Shooter",
                            `8` = "Simulation",
                            `9` = "Action",
                            `10` = "Fighting",
                            `11` = "Adventure",
                            `12` = "Strategy"))

VGS <- VGS %>% 
  mutate(Rating = recode(Rating,
                            # a bare 1 on the left is a syntax error, so every key is backticked
                            `1` = "E",
                            `2` = "M",
                            `3` = "T",
                            `4` = "E10+",
                            `5` = "K-A",
                            `6` = "AO",
                            `7` = "EC",
                            `8` = "RP"))

VGS_select <- VGS %>% 
  select(Genre, Manufacturer, Rating, Global_Sales, NA_Sales, EU_Sales, JP_Sales, Other_Sales)

VGS_Desc <- describe(x=VGS_select) %>% 
  select(c(vars, n, mean, sd, median, min, max, range))

kable(VGS_Desc)

Our data is looking good! Next, I’d like to draw up our explanatory and control variables to see if there’s any further cleaning we can do.

x_Manufacturer <- xtabs(~Manufacturer, data = VGS)

kable(x_Manufacturer) %>% 
Manufacturer   Freq
2600            133
3DO               3
Microsoft      2395
NG               12
Nintendo       6477
PC             1128
Sega            259
Sony           6999
WS                7
        xlab = "Manufacturer",
        ylab = "Frequency")

It looks like the data I have for 2600, 3DO, NEC, NG and WS is so small in comparison to the other manufacturers that it’s practically negligible. I don’t want this to adversely affect my analysis, so I’m removing those rows from the data frame.

VGS2 <- VGS[VGS$Manufacturer %in% c("Sony", "Microsoft", "Nintendo", "PC", "Sega"),]

Now it’s time to draw up the Genre variable.

x_Genre <- xtabs(~Genre, data = VGS)

kable(x_Genre) %>% 
Genre          Freq
Action         3503
Adventure      1478
Fighting        864
Misc           1813
Platform        900
Puzzle          615
Racing         1282
Role-Playing   1545
Shooter        1349
Simulation      925
Sports         2408
Strategy        734
        xlab = "Genre",
        ylab = "Frequency")

In this case, it seems all genres have adequate data and no changes need to be made.

The last variable I want to draw up is Rating.

x_Rating <- xtabs(~Rating, data = VGS)

kable(x_Rating) %>% 
Rating   Freq
AO          1
E        4120
E10+     1473
EC          8
K-A         3
M        1599
RP          3
T        3045
        xlab = "Rating",
        ylab = "Frequency")

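Similar to Manufacturer, some ratings have too few games to support any analysis. In this situation, I’ll be eliminating any rows containing ratings AO, EC, K-A and RP. Because %in% returns FALSE for missing values, the filter below also drops the games that have no rating at all, which leaves 10,237 rows.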

VGS3 <- VGS2[VGS2$Rating %in% c("E", "E10+", "M", "T"),]
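The way %in% treats missing values is easy to check on a toy vector. The values below are illustrative and are not rows from the data set.

```r
# Toy ratings, including missing values and two sparse categories.
rating <- c("E", "T", NA, "AO", "M", "E10+", NA, "RP")
keep <- c("E", "E10+", "M", "T")

in_set <- rating %in% keep        # NA counts as "not in the set", giving FALSE
kept <- rating[in_set]            # the missing ratings go along with AO and RP

# To keep unrated games while still dropping the sparse ratings:
kept_with_na <- rating[rating %in% keep | is.na(rating)]

kept
kept_with_na
```

Which variant is appropriate depends on whether unrated games should stay in models that do not use Rating.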

Hypothesis Testing

To start the hypothesis testing section I’d like to introduce my list of Explanatory, Response and Control Variables.

Explanatory Variables

  1. Genre
  2. Platform/Manufacturer

Response Variables

  1. Global Sales
  2. NA Sales
  3. JPN Sales
  4. EU Sales
  5. Other Sales

Control Variables

  1. Publisher
  2. Year of Release
  3. Name
  4. Critic Score
  5. Critic Count
  6. User Score
  7. User Count
  8. Rating


In this section we’ll be testing the explanatory variables, Genre & Manufacturer, against the main response variable, Global Sales. We’re also going to test a control variable, Rating, against sales to see if there is any significance. It’s important to note that all 3 variables we will be testing are categorical, whereas Global Sales is continuous, which is why one-way ANOVA is used.

First up is the Manufacturer variable.

M_aov <- aov(Global_Sales ~ Manufacturer, data = VGS3)
summary(M_aov)

                Df Sum Sq Mean Sq F value   Pr(>F)    
Manufacturer     4    194   48.53   17.99 9.83e-15 ***
Residuals    10232  27599    2.70                     
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that our Pr(>F) value is very small, allowing us to reject the null hypothesis of equal means at a significance level of 0.001. This means at least one manufacturer’s mean Global Sales differs significantly from the others.
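The F value and its p-value can be reproduced by hand from the sums of squares in the table. The sketch below uses the rounded figures as printed, so results match only up to rounding. The share of variation (eta-squared) is an addition for context and is not part of the original output.

```r
# Values read off the ANOVA table above.
ss_between <- 194;   df_between <- 4
ss_within  <- 27599; df_within  <- 10232

ms_between <- ss_between / df_between   # about 48.5
ms_within  <- ss_within  / df_within    # about 2.70
f_value    <- ms_between / ms_within    # about 18, matching the table up to rounding

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Share of the total variation attributed to Manufacturer (eta-squared).
eta_sq <- ss_between / (ss_between + ss_within)   # about 0.007

c(F = f_value, p = p_value, eta_sq = eta_sq)
```

A tiny p-value alongside a share of variation under 1% suggests the group means differ reliably, but only by a small amount relative to the spread of sales.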

ggplot(VGS3, mapping = aes(x=Manufacturer, y=Global_Sales, color=Manufacturer))+
    geom_boxplot() +
    labs(title = "Distribution of Global Sales per Manufacturer", y = "Global Sales (millions)") 

limit <- c(0, 1)

ggplot(VGS3, mapping = aes(x=Manufacturer, y=Global_Sales, color=Manufacturer))+
    geom_boxplot() +
    labs(title = "Distribution of Global Sales per Manufacturer", y = "Total Worldwide Sales (millions)") +
    scale_y_continuous(breaks = seq(from =0, to = 1, by = .25),
                       limits = limit)
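One caveat about the zoomed plot: in ggplot2, values outside a scale's limits are dropped before the boxplot statistics are computed, so the boxes describe only games with sales of at most 1 million. Using coord_cartesian(ylim = limit) instead zooms without dropping data. The size of the difference can be sketched in base R on simulated, right-skewed sales. These are not the real data.

```r
set.seed(3)
sales <- rexp(1000, rate = 1.5)   # hypothetical right-skewed sales figures

# Five-number summaries: all games vs. only the games at or below 1 million.
all_games   <- boxplot.stats(sales)$stats
capped_only <- boxplot.stats(sales[sales <= 1])$stats

rbind(all_games, capped_only)
```

The median and upper hinge of the capped subset sit lower than those of the full data, which is the same shift a limits-based zoom introduces into the boxes.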

Next we’ll be testing whether the population means among the category Genre are significantly different as well.

G_aov <- aov(Global_Sales ~ Genre, data = VGS3)
summary(G_aov)

               Df Sum Sq Mean Sq F value   Pr(>F)    
Genre          11    212  19.245   7.134 3.27e-12 ***
Residuals   10225  27581   2.697                     
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Once again, we receive an extremely small p-value (with an F value of 7.134), telling us that the Genre means are not all equal.

ggplot(VGS3, mapping = aes(x=Genre, y=Global_Sales, color=Genre))+
    geom_boxplot() +
    labs(title = "Distribution of Global Sales per Genre", y = "Global Sales (millions)") 

ggplot(VGS3, mapping = aes(x=Genre, y=Global_Sales, color=Genre))+
    geom_boxplot() +
    labs(title = "Distribution of Global Sales per Genre", y = "Total Worldwide Sales (millions)") +
    scale_y_continuous(breaks = seq(from =0, to = 1, by = .25),
                       limits = limit)

Finally, the last test will decide whether the Rating means are significantly different as well.

R_aov <- aov(Global_Sales ~ Rating, data = VGS3)
summary(R_aov)

               Df Sum Sq Mean Sq F value Pr(>F)    
Rating          3    245   81.60   30.31 <2e-16 ***
Residuals   10233  27548    2.69                   
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And again we receive a very small p-value (with an F value of 30.31), telling us that the Rating means are not all equal.

ggplot(VGS3, mapping = aes(x=Rating, y=Global_Sales, color=Rating))+
    geom_boxplot() +
    labs(title = "Distribution of Global Sales per Rating", y = "Global Sales (millions)") 

ggplot(VGS3, mapping = aes(x=Rating, y=Global_Sales, color=Rating))+
    geom_boxplot() +
    labs(title = "Distribution of Global Sales per Rating", y = "Total Worldwide Sales (millions)") +
    scale_y_continuous(breaks = seq(from =0, to = 1, by = .25),
                       limits = limit)

Chi-Square Test

The point of a Chi-Square test is to determine whether there is an association between two categorical variables. As such, I’ll be testing my explanatory variables against each other and, individually, against the control variable Rating. All variables tested must be categorical in the case of Chi-Square tests.

First, I’ll test Genre against the control.

chisq.test(VGS3$Genre, VGS3$Rating, correct = F)

    Pearson's Chi-squared test

data:  VGS3$Genre and VGS3$Rating
X-squared = 5263.6, df = 33, p-value < 2.2e-16

Here we receive a very low p-value, telling me that I can reject the null and that Genre is not independent of Rating.

table1 <- data.frame(with(VGS3, table(Genre,Rating)))

ggplot(table1, aes(x=Genre,y=Freq, fill=Rating))+
  geom_col() +  # a geom is needed to draw the bars
  scale_fill_discrete(name = "Rating")  # labels come from the 4 remaining ratings

Next, I’ll be testing Manufacturer against Rating.

chisq.test(VGS3$Manufacturer, VGS3$Rating, correct = F)

    Pearson's Chi-squared test

data:  VGS3$Manufacturer and VGS3$Rating
X-squared = 1691.1, df = 12, p-value < 2.2e-16

Yet again, I received a low p-value, so I know that these 2 variables are also not independent of each other.

table2 <- data.frame(with(VGS3, table(Manufacturer,Rating)))

ggplot(table2, aes(x=Manufacturer,y=Freq, fill=Rating))+
  geom_col() +  # a geom is needed to draw the bars
  scale_fill_discrete(name = "Rating")  # labels come from the 4 remaining ratings

Lastly, I will be testing our explanatory variables against each other.

chisq.test(VGS3$Genre, VGS3$Manufacturer, correct = F)

    Pearson's Chi-squared test

data:  VGS3$Genre and VGS3$Manufacturer
X-squared = 1955.3, df = 44, p-value < 2.2e-16

Even with my last test, I again receive a small p-value, telling me that Genre and Manufacturer are not independent of each other either.

table3 <- data.frame(with(VGS3, table(Genre,Manufacturer)))

ggplot(table3, aes(x=Genre,y=Freq, fill=Manufacturer))+
  geom_col() +  # a geom is needed to draw the bars
  scale_fill_discrete(name = "Manufacturer")  # legend title now matches the fill variable
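With more than 10,000 rows, even weak associations produce tiny p-values, so it can help to check the degrees of freedom and look at an effect size. The sketch below recomputes the degrees of freedom from the number of levels and adds Cramér's V. Cramér's V is an addition here, not part of the original analysis, and the row count is inferred from the ANOVA degrees of freedom (10232 + 4 + 1).

```r
n <- 10237   # rows in VGS3, inferred from the ANOVA degrees of freedom

tests <- data.frame(
  pair   = c("Genre x Rating", "Manufacturer x Rating", "Genre x Manufacturer"),
  chi_sq = c(5263.6, 1691.1, 1955.3),
  rows   = c(12, 5, 12),   # levels of the first variable
  cols   = c(4, 4, 5)      # levels of the second variable
)

# Degrees of freedom: (rows - 1) * (cols - 1)
tests$df <- (tests$rows - 1) * (tests$cols - 1)

# Cramer's V: sqrt(chi_sq / (n * min(rows - 1, cols - 1))), from 0 (none) to 1 (perfect)
tests$cramers_v <- sqrt(tests$chi_sq / (n * pmin(tests$rows - 1, tests$cols - 1)))

tests
```

The degrees of freedom match the three test outputs (33, 12 and 44). On this rough measure, the Genre and Rating association is the strongest of the three, while the other two are weaker.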

Model Comparisons

The first model I’ll be fitting takes into consideration the Global Sales and Manufacturer variables.

M_fit <- lm(Global_Sales ~ Manufacturer, data = VGS3)
summary(M_fit)

Call:
lm(formula = Global_Sales ~ Manufacturer, data = VGS3)

Residuals:
   Min     1Q Median     3Q    Max 
-0.721 -0.501 -0.351 -0.054 82.009 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.64427    0.03661  17.600  < 2e-16 ***
ManufacturerNintendo -0.11353    0.04550  -2.495   0.0126 *  
ManufacturerPC       -0.39851    0.06682  -5.964 2.55e-09 ***
ManufacturerSega     -0.31927    0.44046  -0.725   0.4686    
ManufacturerSony      0.08715    0.04559   1.912   0.0559 .  
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.642 on 10232 degrees of freedom
Multiple R-squared:  0.006984,  Adjusted R-squared:  0.006596 
F-statistic: 17.99 on 4 and 10232 DF,  p-value: 9.827e-15

From this model I receive a very small overall p-value, giving it statistical significance. However, the Adjusted R-squared value of 0.0066 is close to 0, which indicates that Manufacturer explains less than 1% of the variation in Global Sales. The coefficients are differences from the reference level, Microsoft, whose mean is the intercept (0.644 million). Relative to Microsoft titles, Nintendo titles average 0.114 million fewer global sales and PC titles 0.399 million fewer, the largest gap in the model. The Sega and Sony differences are not significant at the 0.05 level.
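The way lm() reports a categorical predictor can be checked on a small example. In this sketch the data frame toy and its values are invented for illustration. The intercept equals the mean of the first factor level, and every other coefficient is a difference from that mean.

```r
# Toy data: three groups with known means (values invented for illustration).
toy <- data.frame(
  maker = factor(rep(c("Microsoft", "Nintendo", "PC"), each = 4)),
  sales = c(0.6, 0.7, 0.6, 0.7,   # Microsoft, mean 0.65
            0.5, 0.6, 0.5, 0.6,   # Nintendo,  mean 0.55
            0.2, 0.3, 0.2, 0.3)   # PC,        mean 0.25
)

fit <- lm(sales ~ maker, data = toy)
coef(fit)
# (Intercept) is the Microsoft mean; makerNintendo and makerPC are
# differences from that mean, not effects measured from zero.

group_means <- tapply(toy$sales, toy$maker, mean)

# relevel() changes the baseline if a different comparison is wanted.
toy$maker <- relevel(toy$maker, ref = "PC")
fit_pc <- lm(sales ~ maker, data = toy)
coef(fit_pc)
```

Changing the reference level changes which comparisons appear in the coefficient table, but not the fitted group means.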

The next model I create takes into consideration the Global Sales, Manufacturer & Rating variables. In this case, Rating serves as a control variable. From the model, I again get a very low overall p-value along with a low Adjusted R-squared (0.0155). Even so, I can still gather that Rating is statistically significant. With E as the reference rating and Manufacturer held constant, E10+ games average 0.142 million fewer Global Sales and T games 0.102 million fewer. Rated M games go the other way and tend to sell better globally, averaging 0.358 million more in Global Sales.

MR_fit <- lm(Global_Sales ~ Manufacturer + Rating, data = VGS3)
summary(MR_fit)

Call:
lm(formula = Global_Sales ~ Manufacturer + Rating, data = VGS3)

Residuals:
   Min     1Q Median     3Q    Max 
-1.069 -0.496 -0.334 -0.016 81.976 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.60115    0.04477  13.429  < 2e-16 ***
ManufacturerNintendo -0.03711    0.04781  -0.776  0.43760    
ManufacturerPC       -0.41329    0.06662  -6.204 5.72e-10 ***
ManufacturerSega     -0.22873    0.43872  -0.521  0.60214    
ManufacturerSony      0.11971    0.04554   2.629  0.00858 ** 
RatingE10+           -0.14225    0.04973  -2.861  0.00423 ** 
RatingM               0.35798    0.05153   6.946 3.97e-12 ***
RatingT              -0.10220    0.04096  -2.495  0.01262 *  
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.635 on 10229 degrees of freedom
Multiple R-squared:  0.01617,   Adjusted R-squared:  0.0155 
F-statistic: 24.02 on 7 and 10229 DF,  p-value: < 2.2e-16

The last model I’ll fit involving the Manufacturer variable is similar to the previous one in that it includes the Rating variable. This time, instead of using Rating as a control variable, I’d like to see how it interacts with Manufacturer in the model. The Chi-Square test above showed that the two variables are not independent, so this model includes their interaction (product) terms.

M_R_fit <- lm(Global_Sales ~ Manufacturer*Rating, data = VGS3)

lm(formula = Global_Sales ~ Manufacturer * Rating, data = VGS3)

   Min     1Q Median     3Q    Max 
-1.158 -0.497 -0.296 -0.021 81.933 

Coefficients: (1 not defined because of singularities)
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      0.52147    0.06953   7.500 6.93e-14 ***
ManufacturerNintendo             0.08559    0.07761   1.103   0.2701    
ManufacturerPC                  -0.37403    0.14519  -2.576   0.0100 *  
ManufacturerSega                -0.07480    0.94490  -0.079   0.9369    
ManufacturerSony                 0.15962    0.08445   1.890   0.0588 .  
RatingE10+                      -0.02235    0.12660  -0.177   0.8599    
RatingM                          0.58546    0.09911   5.907 3.60e-09 ***
RatingT                         -0.08714    0.09325  -0.934   0.3501    
ManufacturerNintendo:RatingE10+ -0.10606    0.14520  -0.730   0.4651    
ManufacturerPC:RatingE10+       -0.01041    0.24620  -0.042   0.9663    
ManufacturerSega:RatingE10+           NA         NA      NA       NA    
ManufacturerSony:RatingE10+     -0.20234    0.15562  -1.300   0.1936    
ManufacturerNintendo:RatingM    -0.90965    0.18487  -4.920 8.77e-07 ***
ManufacturerPC:RatingM          -0.42672    0.18937  -2.253   0.0243 *  
ManufacturerSega:RatingM        -0.98213    1.88728  -0.520   0.6028    
ManufacturerSony:RatingM        -0.09842    0.12671  -0.777   0.4373    
ManufacturerNintendo:RatingT    -0.15411    0.11833  -1.302   0.1928    
ManufacturerPC:RatingT           0.22153    0.18149   1.221   0.2223    
ManufacturerSega:RatingT        -0.04353    1.07847  -0.040   0.9678    
ManufacturerSony:RatingT         0.05471    0.11377   0.481   0.6306    
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.632 on 10218 degrees of freedom
Multiple R-squared:  0.02058,   Adjusted R-squared:  0.01886 
F-statistic: 11.93 on 18 and 10218 DF,  p-value: < 2.2e-16

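Once again, I receive a small p-value and a low Adjusted R-squared, meaning the model is statistically significant overall but the predictors explain very little of the variation in our response variable. According to the numbers, rated M Nintendo & PC games are the least likely to sell well globally among the statistically significant interaction terms, Nintendo’s decrease being the harshest at 0.90965 (million). The rated M Sega term is larger in magnitude (-0.98213) but is not statistically significant (p = 0.6028).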
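Because interaction terms are read together with the main effects, it can help to add the pieces up by hand. The short sketch below copies four estimates from the output above and computes the fitted mean Global Sales for a rated M game on the reference manufacturer and for a rated M Nintendo game. The vector and its element names are just illustrative labels, not objects from the analysis.

```r
# Estimates copied from the Manufacturer * Rating output above.
# The names are hypothetical labels chosen for this sketch.
b <- c(
  intercept    = 0.52147,
  nintendo     = 0.08559,
  rating_m     = 0.58546,
  nintendo_x_m = -0.90965
)

# Fitted mean for a rated M game on the reference manufacturer:
# intercept plus the Rating M main effect.
ref_m <- b[["intercept"]] + b[["rating_m"]]

# Fitted mean for a rated M Nintendo game:
# intercept, both main effects, and the interaction term.
nintendo_m <- b[["intercept"]] + b[["nintendo"]] +
  b[["rating_m"]] + b[["nintendo_x_m"]]

print(round(c(reference_M = ref_m, Nintendo_M = nintendo_m), 5))
```

Read this way, the negative interaction does not mean rated M Nintendo games have negative sales; it means the boost that the M rating gives on the reference manufacturer is mostly cancelled out for Nintendo.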

The next set of models I’ll be fitting is built on the Global Sales and Genre variables. Eventually, I’ll be working in the Rating variable as I did with the previous models.

G_fit <- lm(Global_Sales ~ Genre, data = VGS3)

lm(formula = Global_Sales ~ Genre, data = VGS3)

   Min     1Q Median     3Q    Max 
-0.826 -0.526 -0.336 -0.041 81.885 

                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.610694   0.034640  17.630  < 2e-16 ***
GenreAdventure    -0.326162   0.083301  -3.915 9.08e-05 ***
GenreFighting     -0.021453   0.084976  -0.252 0.800693    
GenreMisc         -0.003596   0.064888  -0.055 0.955809    
GenrePlatform      0.111836   0.076437   1.463 0.143468    
GenrePuzzle       -0.298688   0.092792  -3.219 0.001291 ** 
GenreRacing       -0.019836   0.064862  -0.306 0.759755    
GenreRole-Playing  0.045047   0.067740   0.665 0.506061    
GenreShooter       0.225707   0.061552   3.667 0.000247 ***
GenreSimulation   -0.126948   0.075973  -1.671 0.094757 .  
GenreSports        0.044037   0.054317   0.811 0.417538    
GenreStrategy     -0.389677   0.092792  -4.199 2.70e-05 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.642 on 10225 degrees of freedom
Multiple R-squared:  0.007617,  Adjusted R-squared:  0.006549 
F-statistic: 7.134 on 11 and 10225 DF,  p-value: 3.265e-12

Like previous models, this one has a small p-value and a low Adjusted R-squared. Again the model is statistically significant overall, but Genre explains very little of the variation in Global Sales. I can still see that the Shooter genre is the most likely to sell well globally, with an increase of 0.225707 (million) relative to the reference genre, while the opposite holds for the Adventure, Puzzle and Strategy genres. Strategy games are the least likely to sell well, with a decrease of 0.389677 (million).
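Every Genre estimate above is a difference from whichever genre R treats as the reference level, which is the one missing from the coefficient table. The sketch below uses a small invented data frame (not VGS3) to show how to check the reference level and to confirm that, with a single categorical predictor, the intercept is the reference group's mean and each coefficient is a gap between group means. The genre names and sales values in `toy` are assumptions made purely for illustration.

```r
# Toy data standing in for VGS3; the sales figures are invented.
toy <- data.frame(
  Global_Sales = c(0.6, 0.7, 0.5, 0.9, 0.8, 1.0, 0.2, 0.3, 0.1),
  Genre = factor(rep(c("Action", "Shooter", "Strategy"), each = 3))
)

# By default R uses the first factor level as the reference category.
reference <- levels(toy$Genre)[1]

fit <- lm(Global_Sales ~ Genre, data = toy)
coefs <- coef(fit)

# Group means computed directly, for comparison with the coefficients.
group_means <- tapply(toy$Global_Sales, toy$Genre, mean)
shooter_gap <- as.numeric(group_means["Shooter"] - group_means["Action"])

print(reference)
print(round(coefs, 3))
print(round(shooter_gap, 3))
```

If a different comparison group is more natural, `relevel(toy$Genre, ref = "Shooter")` changes the reference level before fitting.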

Like with the Manufacturer models, I will be introducing Rating as a control variable here. Again, I receive a low p-value and a low Adjusted R-squared, so the predictors explain very little of the response variable. It’s important to note that the resulting Rating estimates are nearly identical to those from the model that included Manufacturer as an explanatory variable.

GR_fit <- lm(Global_Sales ~ Genre + Rating, data = VGS3)

lm(formula = Global_Sales ~ Genre + Rating, data = VGS3)

   Min     1Q Median     3Q    Max 
-1.014 -0.505 -0.322 -0.015 81.867 

                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.57723    0.04800  12.025  < 2e-16 ***
GenreAdventure    -0.32065    0.08332  -3.848  0.00012 ***
GenreFighting      0.05950    0.08703   0.684  0.49417    
GenreMisc          0.07582    0.06690   1.133  0.25713    
GenrePlatform      0.18728    0.07911   2.367  0.01794 *  
GenrePuzzle       -0.24564    0.09657  -2.544  0.01098 *  
GenreRacing        0.03942    0.06810   0.579  0.56276    
GenreRole-Playing  0.07739    0.06807   1.137  0.25559    
GenreShooter       0.11574    0.06344   1.824  0.06811 .  
GenreSimulation   -0.05352    0.07807  -0.686  0.49301    
GenreSports        0.09558    0.05985   1.597  0.11029    
GenreStrategy     -0.30272    0.09309  -3.252  0.00115 ** 
RatingE10+        -0.13038    0.05267  -2.476  0.01332 *  
RatingM            0.33113    0.05966   5.550 2.93e-08 ***
RatingT           -0.09415    0.04577  -2.057  0.03972 *  
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.637 on 10222 degrees of freedom
Multiple R-squared:  0.01466,   Adjusted R-squared:  0.01332 
F-statistic: 10.87 on 14 and 10222 DF,  p-value: < 2.2e-16

The last model now introduces the interaction between Genre and Rating. Even on this last model, I get a low p-value and a low Adjusted R-squared, giving statistical significance but very little explanatory power.

G_R_fit <- lm(Global_Sales ~ Genre*Rating, data = VGS3)

lm(formula = Global_Sales ~ Genre * Rating, data = VGS3)

   Min     1Q Median     3Q    Max 
-1.095 -0.492 -0.300 -0.003 81.837 

Coefficients: (1 not defined because of singularities)
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   0.46021    0.07825   5.882 4.19e-09 ***
GenreAdventure               -0.12195    0.14723  -0.828  0.40753    
GenreFighting                -0.32396    0.58357  -0.555  0.57882    
GenreMisc                     0.16268    0.10886   1.494  0.13511    
GenrePlatform                 0.45218    0.11599   3.899 9.74e-05 ***
GenrePuzzle                  -0.17066    0.12193  -1.400  0.16162    
GenreRacing                   0.15580    0.10272   1.517  0.12936    
GenreRole-Playing             0.36945    0.19296   1.915  0.05557 .  
GenreShooter                 -0.37341    0.24420  -1.529  0.12627    
GenreSimulation               0.03965    0.11858   0.334  0.73814    
GenreSports                   0.24319    0.09125   2.665  0.00771 ** 
GenreStrategy                -0.17750    0.20562  -0.863  0.38801    
RatingE10+                   -0.01025    0.10737  -0.095  0.92397    
RatingM                       0.52678    0.10217   5.156 2.57e-07 ***
RatingT                       0.02409    0.09983   0.241  0.80932    
GenreAdventure:RatingE10+    -0.16259    0.25346  -0.641  0.52122    
GenreFighting:RatingE10+      0.12031    0.69770   0.172  0.86309    
GenreMisc:RatingE10+          0.06948    0.18065   0.385  0.70052    
GenrePlatform:RatingE10+     -0.48472    0.19251  -2.518  0.01182 *  
GenrePuzzle:RatingE10+        0.20788    0.28006   0.742  0.45795    
GenreRacing:RatingE10+       -0.04266    0.20667  -0.206  0.83646    
GenreRole-Playing:RatingE10+ -0.37379    0.25711  -1.454  0.14603    
GenreShooter:RatingE10+       0.28896    0.33342   0.867  0.38614    
GenreSimulation:RatingE10+   -0.06826    0.26631  -0.256  0.79771    
GenreSports:RatingE10+       -0.18810    0.19681  -0.956  0.33923    
GenreStrategy:RatingE10+     -0.05212    0.27883  -0.187  0.85172    
GenreAdventure:RatingM       -0.43610    0.22797  -1.913  0.05577 .  
GenreFighting:RatingM        -0.07180    0.63204  -0.114  0.90955    
GenreMisc:RatingM            -0.87505    0.47114  -1.857  0.06330 .  
GenrePlatform:RatingM        -0.90583    0.95373  -0.950  0.34225    
GenrePuzzle:RatingM                NA         NA      NA       NA    
GenreRacing:RatingM          -0.88068    0.39457  -2.232  0.02564 *  
GenreRole-Playing:RatingM    -0.43619    0.23954  -1.821  0.06864 .  
GenreShooter:RatingM          0.49127    0.26190   1.876  0.06072 .  
GenreSimulation:RatingM      -0.76263    0.74396  -1.025  0.30534    
GenreSports:RatingM          -0.98580    0.42410  -2.324  0.02012 *  
GenreStrategy:RatingM        -0.60140    0.38665  -1.555  0.11988    
GenreAdventure:RatingT       -0.20413    0.21743  -0.939  0.34782    
GenreFighting:RatingT         0.45590    0.59296   0.769  0.44200    
GenreMisc:RatingT            -0.10644    0.16356  -0.651  0.51519    
GenrePlatform:RatingT        -0.57572    0.24049  -2.394  0.01669 *  
GenrePuzzle:RatingT          -0.10030    0.49160  -0.204  0.83834    
GenreRacing:RatingT          -0.08470    0.17255  -0.491  0.62353    
GenreRole-Playing:RatingT    -0.28161    0.21758  -1.294  0.19561    
GenreShooter:RatingT          0.47458    0.26636   1.782  0.07483 .  
GenreSimulation:RatingT      -0.04573    0.17764  -0.257  0.79686    
GenreSports:RatingT          -0.25165    0.15901  -1.583  0.11355    
GenreStrategy:RatingT        -0.10959    0.24781  -0.442  0.65832    
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.636 on 10190 degrees of freedom
Multiple R-squared:  0.01905,   Adjusted R-squared:  0.01462 
F-statistic: 4.303 on 46 and 10190 DF,  p-value: < 2.2e-16

According to the data, Platform games rated E10+ & T along with rated M Racing & Sports games have the only statistically significant interaction terms, all of them negative, making these combinations the least likely to sell well globally. Rated M Sports games show the largest decrease at 0.98580 (million).

Although all models presented similar results, the one I will be using to perform my diagnostics will be the one including the interaction between Manufacturer and Rating. This model has the highest Adjusted R-squared of the models above (0.01886) and presented some interesting data. Therefore, the regression model being used will be lm(Global_Sales ~ Manufacturer*Rating, data=VGS3).
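One way to line the candidate models up side by side is to pull the Adjusted R-squared out of each fit and, for nested models, run an F-test on whether the interaction terms add anything. The sketch below does this on simulated data, so the numbers it prints say nothing about VGS3; the data frame `toy`, the level names and the list `models` are all assumptions made for illustration.

```r
# Simulated stand-in for VGS3: sales are random, so no real effects exist here.
set.seed(603)
toy <- data.frame(
  Manufacturer = factor(sample(c("Microsoft", "Nintendo", "Sony"), 200, replace = TRUE)),
  Rating = factor(sample(c("E", "M", "T"), 200, replace = TRUE))
)
toy$Global_Sales <- rexp(200, rate = 2)

# Fit the additive and the interaction model on the same data.
models <- list(
  additive    = lm(Global_Sales ~ Manufacturer + Rating, data = toy),
  interaction = lm(Global_Sales ~ Manufacturer * Rating, data = toy)
)

# Collect the Adjusted R-squared of each model into one named vector.
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))

# F-test for the nested pair: do the interaction terms improve the fit?
cmp <- anova(models$additive, models$interaction)
print(cmp)
```

With the real fits, the same two calls could be applied to a list such as `list(MR_fit, M_R_fit)`.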


The first plot I’ll draw out is the Residuals vs. Fitted plot. In it we can see that the linearity assumption holds up, but the constant variance assumption is violated. This is shown by the points not being evenly spread around the horizontal line at 0. Additionally, there are still some very notable outliers.

plot(lm(Global_Sales ~ Manufacturer*Rating, data = VGS3), which = 1)

The next plot I’ll be looking at is a Normal Q-Q plot, where it is easy to tell at a glance that a violation has occurred. I can see that the plot starts off relatively linear but the points skew away from the line as it reaches the end, meaning the Normality assumption has been violated.

plot(lm(Global_Sales ~ Manufacturer*Rating, data = VGS3), which = 2)

The next plot is a Scale-Location plot. From the results I can see that there is an increasing trend, a direct violation of the Constant Variance assumption.

plot(lm(Global_Sales ~ Manufacturer*Rating, data = VGS3), which = 3)

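With the last plot that I’ll draw up, a Cook’s distance plot, it’s easy to see the violation immediately. In the graph, it’s clear that we have observations with a Cook’s distance greater than 4/n, where n is the number of observations. These are influential observations, which violates the assumption that no single observation has an outsized effect on the fit.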

plot(lm(Global_Sales ~ Manufacturer*Rating, data = VGS3), which = 4)
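The 4/n rule of thumb can also be applied numerically rather than read off the plot. The sketch below builds a small invented data set with one deliberately extreme sales value, then lists the rows whose Cook's distance exceeds 4/n. The data frame `toy` and its values are assumptions for illustration and are not taken from VGS3.

```r
# Invented data: 60 games across three ratings with modest sales figures.
toy <- data.frame(
  Rating = factor(rep(c("E", "M", "T"), each = 20)),
  Global_Sales = rep(c(0.3, 0.5, 0.7, 0.4), times = 15)
)
toy$Global_Sales[60] <- 25  # one artificial blockbuster-sized outlier

fit <- lm(Global_Sales ~ Rating, data = toy)

# Cook's distance for every observation, compared against the 4/n cutoff.
cooks_d <- cooks.distance(fit)
threshold <- 4 / nrow(toy)
influential <- which(cooks_d > threshold)

print(threshold)
print(influential)
```

On the real model, the same three lines after the fit would give the row numbers of the games driving the pattern in the plot, which makes it easier to look them up by Name.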

Next Steps

Unfortunately, I wasn’t able to introduce all the control variables I would’ve liked to before this assignment was due. I had trouble re-coding the Publisher control variable, as there are so many different publishers and the process would’ve been overly tedious for the time that I had. If possible, I’d also like to introduce the critic and user score control variables.
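One possible shortcut for a variable with this many levels is to keep only the most frequent publishers and group the rest under a single label. The sketch below shows the idea in base R on a short invented vector; the publisher values, the cutoff of two and the "Other" label are all assumptions, and the right cutoff for the real data would need to be chosen by looking at the publisher counts.

```r
# Invented publisher column; the real one would be Video_Game_Sales$Publisher.
publisher <- c("Nintendo", "Nintendo", "Nintendo", "Electronic Arts",
               "Electronic Arts", "Activision", "Ubisoft", "Sega",
               "Nintendo", "Electronic Arts")

# Count games per publisher and keep the names of the most frequent ones.
counts <- sort(table(publisher), decreasing = TRUE)
n_keep <- 2  # hypothetical cutoff for this toy example
top <- names(counts)[seq_len(n_keep)]

# Everything outside the top group is collapsed into "Other".
publisher_lumped <- factor(ifelse(publisher %in% top, publisher, "Other"))

print(table(publisher_lumped))
```

The lumped factor has only a handful of levels, so it could be added to a model formula in the same way as Manufacturer or Rating.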

Additionally, I’d like to introduce regional sales as alternative response variables as opposed to global sales to see if any significance can be drawn from that data as well.
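A compact way to try several response variables is to build each formula from the column name and fit the same model in a loop. The sketch below does this with `reformulate()` on a small invented data frame; the sales values are made up, and only the column names NA_Sales, EU_Sales and JP_Sales mirror the data set.

```r
# Invented regional sales for eight games; not taken from the real data.
toy <- data.frame(
  Rating   = factor(rep(c("E", "M"), each = 4)),
  NA_Sales = c(0.4, 0.5, 0.3, 0.6, 0.9, 1.1, 0.8, 1.0),
  EU_Sales = c(0.2, 0.3, 0.2, 0.4, 0.5, 0.6, 0.4, 0.7),
  JP_Sales = c(0.3, 0.2, 0.4, 0.3, 0.1, 0.2, 0.1, 0.1)
)

responses <- c("NA_Sales", "EU_Sales", "JP_Sales")

# Fit the same right-hand side against each regional response in turn.
fits <- lapply(responses, function(resp) {
  lm(reformulate("Rating", response = resp), data = toy)
})
names(fits) <- responses

# Adjusted R-squared per region, for a quick side-by-side look.
regional_adj_r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
print(round(regional_adj_r2, 3))
```

Swapping `"Rating"` for a vector of term labels such as `c("Manufacturer", "Rating")` would extend the same loop to the larger models.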


Egenfeldt-Nielsen, Simon, et al. Understanding Video Games: The Essential Introduction. Taylor & Francis Group, 2012. ProQuest Ebook Central.

Etchells, Pete. Lost in a Good Game: Why We Play Video Games and What They Can Do for Us. Icon Books, 2019.

McCullough, Hayley. (2019). From Zelda to Stanley: Comparing the Integrative Complexity of Six Video Game Genres. Press Start. 5. 137-149.

Gillies, Kendall. “Video Game Sales and Ratings.” Kaggle, 25 Jan. 2017.