finalproject
Final project qmd file for 603
Author

Jerin Jacob

Published

March 10, 2023

Now a days, movies are a well marketed entertainment product. Just like any other products in the market, movies are also having an allocated a marketing budget and promotional activities are done in scale. This often result in the opening weekend’s gross ticketing volume to rise. But are the pre-release promotional activities helping the movie to collect more or is it just creating a hype initially? Or does the movie’s gross collection is not at all dependant on pre release promotions? This dataset has 200 highest grossing movies of 2022. It has both the opening week’s gross as well as the total gross collection of the movies, along with other variables. Assuming that opening week’s collection is depending on the pre-release promotion, by looking on the relationship between opening week’s gross and total gross, I am trying to see how the pre-release activities help the producers earn more in boxoffice.

Research Question: To what extent does the success of a movie depend on its opening week’s collection?

Hypothesis: Opening week’s collection is positively correlated with the Box Office total collection.

Loading all the packages required for the project.

Code
library(readxl)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(lubridate)
library(dplyr)
library(stringr)

Reading the data

Code
df <- read_excel("_data/project_data.xlsx") |>
  as.data.frame()
head(df)
  Rank                                     Release      Budget     Box Office
1    1                           Top Gun: Maverick 177 million $1.493 billion
2    2                    Avatar: The Way of Water 460 million $2.319 billion
3    3              Black Panther: Wakanda Forever 250 million $859.1 million
4    4 Doctor Strange in the Multiverse of Madness 200 million $955.8 million
5    5                    Jurassic World: Dominion 185 million $1.004 billion
6    6                    Minions: The Rise of Gru 100 million $940.5 million
      Gross max_th   Opening perc_tot_gr open_th  Open Close
1 718732821   4751 126707459        17.6    4735 44708 44911
2 636955746   4340 134100226        21.1    4202 44911    NA
3 453474324   4396 181339761        40.0    4396 44876    NA
4 411331607   4534 187420998        45.6    4534 44687    NA
5 376851080   4697 145075625        38.5    4676 44722 44827
6 369695210   4427 107010140        28.9    4391 44743    NA
                          Distributor  int_gross world_gross
1                  Paramount Pictures  770000000  1488732821
2                20th Century Studios 1539273359  2176229105
3 Walt Disney Studios Motion Pictures  389276658   842750982
4 Walt Disney Studios Motion Pictures  544444197   955775804
5                  Universal Pictures  625127000  1001978080
6                  Universal Pictures  569933000   939628210

There are 14 variables with 200 rows.

COLUMN DESCRIPTION

‘Rank’: rank of the movie ‘Release’: release date of the movie ‘Budget’: The budget of the movie production ‘Box Office’: The total Box Office collection ‘Gross’: domestic gross of the movie ‘max_th’: maximum number of theaters the movie was released in ‘Opening’: gross on opening weekend ‘perc_tot_gr’: domestic percentage of the total gross ‘open_th’: number of theaters the movie opened in ‘Open’: opening date ‘Close’: closing date ‘Distributor’: name of the distributor ‘int_gross’: international gross ‘world_gross’: worldwide gross

  • ‘Release’: release date of the movie

  • ‘Distributor’: name of the distributor

  • ‘Small_Dist’: Whether a small distributor or not

  • ‘Open_date’: Date of release

  • ‘season’: The season in which the movie was released

  • ‘Opening’: gross on opening weekend

  • ‘open_th’: number of theaters the movie opened in

  • ‘max_th’: maximum number of theaters the movie was released in

Cleaning the Data

For our purpose of analysis, we need to clean and transform the data a bit.

First, the Budget and Boxoffice columns are cleaned using stringr function so that the column values are numeric. We are getting rid of the character part in those values including ‘$’ and the value unit. Also, there are certain values which are in Indian Rupees and South Korean won. So we need to convert thoses values to US dollars. The data for columns Budget and Box office were taken from Wikipedia.

Code
df <- df |> 
  mutate(Budget = gsub("\\$", "", Budget)) |>
  mutate(Budget = sub(".*-", "", Budget)) |>
  mutate(`Box Office` = gsub("\\$", "", `Box Office`)) 

df[c('Budget', 'Unit')] <- str_split_fixed(df$Budget, ' ', 2)

df$Budget <- as.numeric(df$Budget)
Warning: NAs introduced by coercion
Code
df$Budget <- ifelse(df$Unit == "million", df$Budget * 1000000,
                    ifelse(df$Unit == "billion", df$Budget * 1000000000,
                           ifelse(df$Unit == "Kmillion", df$Budget * 1000000*0.00074,
                                  ifelse(df$Unit == "Kbillion", df$Budget * 1000000000*0.00074,
                                         ifelse(df$Unit == "crore", df$Budget * 10000000*80, df$Budget)))))

df[c('Box Office', 'BXUnit')] <- str_split_fixed(df$`Box Office`, ' ', 2)

df$`Box Office` <- as.numeric(df$`Box Office`)
Warning: NAs introduced by coercion
Code
df$`Box Office` <- ifelse(df$BXUnit == "million", df$`Box Office` * 1000000,
                    ifelse(df$BXUnit == "billion", df$`Box Office` * 1000000000,
                           ifelse(df$BXUnit == "Kmillion", df$`Box Office` * 1000000*0.00074,
                                  ifelse(df$BXUnit == "Kbillion", df$`Box Office` * 1000000000*0.00074,
                                         ifelse(df$BXUnit == "crore", df$`Box Office` * 10000000*80, df$`Box Office`)))))

Since the original dataset from Kaggle had a column named world_gross, we can compare both variables and assume that the highest value in either of the column can be considerd as the final world_gross.

Code
df$`Box Office` <- ifelse(df$`Box Office` < df$world_gross, df$world_gross, df$`Box Office`)

We can count the number of movies in the list for each distributor and any distributor who don’t have more than 3 movies in their name can be considered as a smaller distributor and thus assuming that they won’t have cash rich promotional campaigns that would lead to an audience pull to the theatre in the initial week.

Code
df <- df %>% group_by(Distributor) %>% mutate(Count=n_distinct(`Box Office`)) 

df$Small_Dist <- ifelse(df$Count <= 3, 1, 0)

We can convert the dbl to date format and set the reference date so that the dates are correct. After that, from the Open_date, we can categorize thsoe dates to the season so that it can be used as a confounder. Seasons might have some effect on the theatre footfall and thereby, box office collections.

Code
df$Open_date <- as.Date(df$Open, origin = "1899-12-30")

# Create a new column with the season for each date
df <- df %>% 
  mutate(season = case_when(
    between(month(Open_date), 3, 5) ~ "Spring",
    between(month(Open_date), 6, 8) ~ "Summer",
    between(month(Open_date), 9, 11) ~ "Fall",
    TRUE ~ "Winter"
  ))
Code
head(df)
# A tibble: 6 × 20
# Groups:   Distributor [4]
   Rank Release    Budget `Box Office`  Gross max_th Opening perc_tot_gr open_th
  <dbl> <chr>       <dbl>        <dbl>  <dbl>  <dbl>   <dbl>       <dbl>   <dbl>
1     1 Top Gun: … 1.77e8   1493000000 7.19e8   4751  1.27e8        17.6    4735
2     2 Avatar: T… 4.6 e8   2319000000 6.37e8   4340  1.34e8        21.1    4202
3     3 Black Pan… 2.5 e8    859100000 4.53e8   4396  1.81e8        40      4396
4     4 Doctor St… 2   e8    955800000 4.11e8   4534  1.87e8        45.6    4534
5     5 Jurassic … 1.85e8   1004000000 3.77e8   4697  1.45e8        38.5    4676
6     6 Minions: … 1   e8    940500000 3.70e8   4427  1.07e8        28.9    4391
# ℹ 11 more variables: Open <dbl>, Close <dbl>, Distributor <chr>,
#   int_gross <dbl>, world_gross <dbl>, Unit <chr>, BXUnit <chr>, Count <int>,
#   Small_Dist <dbl>, Open_date <date>, season <chr>

Checking for NA values in each variable. There are 90 NA values in Budget variable, 38 in Box Office, 155 in Close date variable and 3 in int_gross. All othe variables seems to be good in terms of NA values.

Code
colSums(is.na(df))
       Rank     Release      Budget  Box Office       Gross      max_th 
          0           0          90          38           0           0 
    Opening perc_tot_gr     open_th        Open       Close Distributor 
          0           0           0           0         155           0 
  int_gross world_gross        Unit      BXUnit       Count  Small_Dist 
          3           0           0           0           0           0 
  Open_date      season 
          0           0 
Code
table(df$Small_Dist)

  0   1 
140  60 

Since 155 values of Close are NAs, it is better not to include that variable in the analysis. Most of the NA values are for the movies by small distributors, which need to be noted.

Code
df <- subset(df, select = -Close)

df <- na.omit(df)
# Checking for NA's
colSums(is.na(df))
       Rank     Release      Budget  Box Office       Gross      max_th 
          0           0           0           0           0           0 
    Opening perc_tot_gr     open_th        Open Distributor   int_gross 
          0           0           0           0           0           0 
world_gross        Unit      BXUnit       Count  Small_Dist   Open_date 
          0           0           0           0           0           0 
     season 
          0 

Have an idea about the structure of the dataset.

Code
str(df)
gropd_df [104 × 19] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ Rank       : num [1:104] 1 2 3 4 5 6 7 8 9 10 ...
 $ Release    : chr [1:104] "Top Gun: Maverick" "Avatar: The Way of Water" "Black Panther: Wakanda Forever" "Doctor Strange in the Multiverse of Madness" ...
 $ Budget     : num [1:104] 1.77e+08 4.60e+08 2.50e+08 2.00e+08 1.85e+08 1.00e+08 2.00e+08 2.50e+08 1.10e+08 2.60e+08 ...
 $ Box Office : num [1:104] 1.49e+09 2.32e+09 8.59e+08 9.56e+08 1.00e+09 ...
 $ Gross      : num [1:104] 7.19e+08 6.37e+08 4.53e+08 4.11e+08 3.77e+08 ...
 $ max_th     : num [1:104] 4751 4340 4396 4534 4697 ...
 $ Opening    : num [1:104] 1.27e+08 1.34e+08 1.81e+08 1.87e+08 1.45e+08 ...
 $ perc_tot_gr: num [1:104] 17.6 21.1 40 45.6 38.5 28.9 36.3 42 37.8 39.8 ...
 $ open_th    : num [1:104] 4735 4202 4396 4534 4676 ...
 $ Open       : num [1:104] 44708 44911 44876 44687 44722 ...
 $ Distributor: chr [1:104] "Paramount Pictures" "20th Century Studios" "Walt Disney Studios Motion Pictures" "Walt Disney Studios Motion Pictures" ...
 $ int_gross  : num [1:104] 7.70e+08 1.54e+09 3.89e+08 5.44e+08 6.25e+08 ...
 $ world_gross: num [1:104] 1.49e+09 2.18e+09 8.43e+08 9.56e+08 1.00e+09 ...
 $ Unit       : chr [1:104] "million" "million" "million" "million" ...
 $ BXUnit     : chr [1:104] "billion" "billion" "million" "million" ...
 $ Count      : int [1:104] 12 4 9 9 19 19 6 9 12 6 ...
 $ Small_Dist : num [1:104] 0 0 0 0 0 0 0 0 0 0 ...
 $ Open_date  : Date[1:104], format: "2022-05-27" "2022-12-16" ...
 $ season     : chr [1:104] "Spring" "Winter" "Fall" "Spring" ...
 - attr(*, "groups")= tibble [32 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ Distributor: chr [1:32] "-" "20th Century Studios" "A24" "Blue Fox Entertainment" ...
  ..$ .rows      : list<int> [1:32] 
  .. ..$ : int [1:3] 87 94 95
  .. ..$ : int [1:4] 2 35 38 42
  .. ..$ : int [1:5] 26 54 60 67 102
  .. ..$ : int 100
  .. ..$ : int 66
  .. ..$ : int 97
  .. ..$ : int 62
  .. ..$ : int 103
  .. ..$ : int [1:3] 25 32 49
  .. ..$ : int 76
  .. ..$ : int [1:6] 37 41 63 74 80 86
  .. ..$ : int 82
  .. ..$ : int 89
  .. ..$ : int [1:3] 85 98 104
  .. ..$ : int [1:3] 50 51 73
  .. ..$ : int [1:3] 77 81 83
  .. ..$ : int 58
  .. ..$ : int [1:10] 1 9 16 17 24 31 52 55 79 99
  .. ..$ : int 88
  .. ..$ : int 90
  .. ..$ : int 84
  .. ..$ : int [1:2] 39 64
  .. ..$ : int [1:6] 13 18 22 34 44 48
  .. ..$ : int 93
  .. ..$ : int [1:2] 28 45
  .. ..$ : int 91
  .. ..$ : int [1:5] 30 69 70 72 78
  .. ..$ : int [1:19] 5 6 11 14 19 23 27 29 33 43 ...
  .. ..$ : int [1:7] 3 4 8 15 40 56 71
  .. ..$ : int [1:6] 7 10 12 20 21 36
  .. ..$ : int [1:2] 96 101
  .. ..$ : int 92
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE
 - attr(*, "na.action")= 'omit' Named int [1:96] 40 42 47 54 61 63 66 69 75 77 ...
  ..- attr(*, "names")= chr [1:96] "40" "42" "47" "54" ...

Using the glimpse() function, let’s have a look at how our data would look like!

Code
glimpse(df )
Rows: 104
Columns: 19
Groups: Distributor [32]
$ Rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ Release      <chr> "Top Gun: Maverick", "Avatar: The Way of Water", "Black P…
$ Budget       <dbl> 1.77e+08, 4.60e+08, 2.50e+08, 2.00e+08, 1.85e+08, 1.00e+0…
$ `Box Office` <dbl> 1493000000, 2319000000, 859100000, 955800000, 1004000000,…
$ Gross        <dbl> 718732821, 636955746, 453474324, 411331607, 376851080, 36…
$ max_th       <dbl> 4751, 4340, 4396, 4534, 4697, 4427, 4417, 4375, 4258, 440…
$ Opening      <dbl> 126707459, 134100226, 181339761, 187420998, 145075625, 10…
$ perc_tot_gr  <dbl> 17.6, 21.1, 40.0, 45.6, 38.5, 28.9, 36.3, 42.0, 37.8, 39.…
$ open_th      <dbl> 4735, 4202, 4396, 4534, 4676, 4391, 4417, 4375, 4234, 440…
$ Open         <dbl> 44708, 44911, 44876, 44687, 44722, 44743, 44624, 44750, 4…
$ Distributor  <chr> "Paramount Pictures", "20th Century Studios", "Walt Disne…
$ int_gross    <dbl> 770000000, 1539273359, 389276658, 544444197, 625127000, 5…
$ world_gross  <dbl> 1488732821, 2176229105, 842750982, 955775804, 1001978080,…
$ Unit         <chr> "million", "million", "million", "million", "million", "m…
$ BXUnit       <chr> "billion", "billion", "million", "million", "billion", "m…
$ Count        <int> 12, 4, 9, 9, 19, 19, 6, 9, 12, 6, 19, 6, 9, 19, 9, 12, 12…
$ Small_Dist   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Open_date    <date> 2022-05-27, 2022-12-16, 2022-11-11, 2022-05-06, 2022-06-…
$ season       <chr> "Spring", "Winter", "Fall", "Spring", "Summer", "Summer",…

Summary of each variables

Code
summary(df)
      Rank          Release              Budget            Box Office       
 Min.   :  1.00   Length:104         Min.   :1.500e+05   Min.   :3.250e+05  
 1st Qu.: 26.75   Class :character   1st Qu.:1.665e+07   1st Qu.:2.170e+07  
 Median : 56.50   Mode  :character   Median :3.550e+07   Median :6.535e+07  
 Mean   : 67.90                      Mean   :1.059e+10   Mean   :1.467e+10  
 3rd Qu.:102.50                      3rd Qu.:9.000e+07   3rd Qu.:1.966e+08  
 Max.   :196.00                      Max.   :3.200e+11   Max.   :4.000e+11  
     Gross               max_th        Opening           perc_tot_gr   
 Min.   :   325042   Min.   :   5   Min.   :     8416   Min.   : 0.10  
 1st Qu.:  3755174   1st Qu.:1005   1st Qu.:   825579   1st Qu.:21.25  
 Median : 17247468   Median :3131   Median :  5128384   Median :31.85  
 Mean   : 67815629   Mean   :2584   Mean   : 20890734   Mean   :29.89  
 3rd Qu.: 69210756   3rd Qu.:3848   3rd Qu.: 19126885   3rd Qu.:39.85  
 Max.   :718732821   Max.   :4751   Max.   :187420998   Max.   :62.90  
    open_th            Open       Distributor          int_gross        
 Min.   :   2.0   Min.   :44568   Length:104         Min.   :6.752e+04  
 1st Qu.: 661.5   1st Qu.:44673   Class :character   1st Qu.:2.542e+06  
 Median :3075.0   Median :44768   Mode  :character   Median :2.467e+07  
 Mean   :2400.9   Mean   :44757                      Mean   :8.879e+07  
 3rd Qu.:3770.0   3rd Qu.:44841                      3rd Qu.:6.163e+07  
 Max.   :4735.0   Max.   :44925                      Max.   :1.539e+09  
  world_gross            Unit              BXUnit              Count       
 Min.   :8.416e+03   Length:104         Length:104         Min.   : 1.000  
 1st Qu.:9.576e+06   Class :character   Class :character   1st Qu.: 4.750  
 Median :4.348e+07   Mode  :character   Mode  :character   Median : 9.000  
 Mean   :1.563e+08                                         Mean   : 8.721  
 3rd Qu.:1.445e+08                                         3rd Qu.:12.000  
 Max.   :2.176e+09                                         Max.   :19.000  
   Small_Dist       Open_date             season         
 Min.   :0.0000   Min.   :2022-01-07   Length:104        
 1st Qu.:0.0000   1st Qu.:2022-04-22   Class :character  
 Median :0.0000   Median :2022-07-25   Mode  :character  
 Mean   :0.2115   Mean   :2022-07-15                     
 3rd Qu.:0.0000   3rd Qu.:2022-10-07                     
 Max.   :1.0000   Max.   :2022-12-30                     

Let’s look on the distribution of data

Code
hist(df$`Box Office`)

Code
hist(df$Opening)

Code
# Plot histogram for "Box Office" variable with colors
hist(df$`Box Office`, col = "blue", main = "Box Office Distribution", xlab = "Box Office")

Code
# Plot histogram for "Opening" variable with colors
hist(df$Opening, col = "red", main = "Opening Distribution", xlab = "Opening")

Check correlation between Box Office and Opening

Code
cor.test(df$Opening, df$`Box Office`)

    Pearson's product-moment correlation

data:  df$Opening and df$`Box Office`
t = -1.0944, df = 102, p-value = 0.2764
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.29421354  0.08665663
sample estimates:
       cor 
-0.1077295 
Code
cor.test(log(df$Opening), log(df$`Box Office`))

    Pearson's product-moment correlation

data:  log(df$Opening) and log(df$`Box Office`)
t = 4.8371, df = 102, p-value = 4.681e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2610906 0.5765810
sample estimates:
      cor 
0.4319587 

The data appears to be skewed. Therefore we can try some data transformation. Log transformation would be the first option. Let’s visualize the correlation of variables in the dataset.

Code
# Subset the desired variables
pairs_df <- df[, c("Budget", "Box Office", "Opening", "open_th", "max_th")]

# Create a scatterplot matrix with colors
pairs(pairs_df, col = rainbow(length(pairs_df)), pch = 19)

From the visualization, we can conclude that all the variables are correlated by open_th and max_th are moving exactly the same in the graph which means we should drop one of them to avoid the multicollinearity.

Code
# Plot histogram for the logarithm of "Box Office" variable with colors
hist(log(df$`Box Office`), col = "blue", main = "Box Office Distribution (Log)", xlab = "Log(Box Office)")

Code
# Plot histogram for the logarithm of "Opening" variable with colors
hist(log(df$Opening), col = "red", main = "Opening Distribution (Log)", xlab = "Log(Opening)")

The log transformation made the data look like a normal distribution.

Running different models to get the best fit

Log transformed model with only the predictor and the dependant variable

Code
# Fit the multiple regression model using the log-transformed data
model1 <- lm(log(`Box Office`) ~ log(Opening), data = df)

# Print the model summary
#summary(model1)
#plot(model1)

Log transformation for all continous variables.

Code
model2 <- lm(log(`Box Office`) ~ log(Opening) + log(Budget) + log(open_th) + Small_Dist + season, data = df)
#summary(model2)
#plot(model2)

The R-square has significantly improved for this model. Let us try more models.

Log transformation for only Box Office, Opening and Budget.

Code
model3 <- lm(log(`Box Office`) ~ log(Opening) + log(Budget) + open_th + Small_Dist + season, data = df)
#summary(model3)
#plot(model3)

Log transformation for only Box Office, Opening and Budget. Removing season confounder.

Code
model4 <- lm(log(`Box Office`) ~ log(Opening) + log(Budget) + open_th + Small_Dist, data = df)
#summary(model4)
#plot(model4)

R square values are further below the previous model. Let us try the log transformed model with only the Budget and Small_Distributor as confounders.

Code
model5 <- lm(log(`Box Office`) ~ log(Opening) + log(Budget) + open_th, data = df)
#summary(model5)
#plot(model5)

Now, we can run a log transformed model with only the Opening and Budget.

Code
model6 <- lm(log(`Box Office`) ~ log(Opening) + log(Budget), data = df)
#summary(model6)
#plot(model6)

Let us run models after quadratic transformation

Code
model_quad1 <- lm(log(`Box Office`) ~ poly(Opening, 2, raw=TRUE) + log(Budget) + open_th + Small_Dist + season, data = df)
#summary(model_quad1)
#plot(model_quad1)

Selecting a model and Interpretation of the result

Now that we have multiple models, we can select the best model out of them using following criteria; a) R-squared b) Adjusted R-squared c) PRESS d) AIC (Akaike Information Criterion) e) BIC (Bayesian Information Criterion)

In the model selection process, we can follow the rule of thumb as, for R-squared and Adjusted R-square, higher is better while for PRESS, AIC and BIC, lower is better.

Create a functions to get R-squared, Adjusted R-squared & PRESS

Code
rsquared <- function(fit) summary(fit)$r.squared
adj_rsquared <- function(fit) summary(fit)$adj.r.squared
PRESS <- function(fit) {
  pr <- residuals(fit)/(1-lm.influence(fit)$hat)
  sum(pr^2)
}

For AIC and BIC, the functions AIC() and BIC() can be used

Now, applying the functions to model objects

Code
models <- list(model1, model2, model3, model4, model5, model6, model_quad1)
model_comparison <- data.frame(models = c('model1', 'model2', 'model3', 'model4', 'model5', 'model6', 'model_quad1'),
           rSquared = sapply(models, rsquared),
           adj_rSquared = sapply(models, adj_rsquared),
           PRESS = sapply(models, PRESS),
           AIC = sapply(models, AIC),
           BIC = sapply(models, BIC)) |>
  print()
       models  rSquared adj_rSquared    PRESS      AIC      BIC
1      model1 0.1865883    0.1786137 637.0491 486.5987 494.5319
2      model2 0.7330391    0.7135732 257.5371 382.7287 406.5282
3      model3 0.7348150    0.7154786 253.7251 382.0345 405.8341
4      model4 0.7185101    0.7071368 253.9816 382.2401 398.1064
5      model5 0.7104267    0.7017395 253.7536 383.1845 396.4065
6      model6 0.6883896    0.6822191 269.4781 388.8124 399.3900
7 model_quad1 0.6951725    0.6695029 289.9369 398.5236 424.9675
Code
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
Code
stargazer(model1, model2, model3, model4, model5, model6, model_quad1, type = 'text')

==================================================================================================================================================================================================
                                                                                                      Dependent variable:                                                                         
                              --------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                                       log(`Box Office`)                                                                          
                                        (1)                    (2)                    (3)                    (4)                     (5)                     (6)                     (7)          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
log(Opening)                         0.518***                0.631***               0.792***               0.790***               0.750***                 0.294***                               
                                      (0.107)                (0.133)                (0.178)                (0.178)                 (0.178)                 (0.069)                                
                                                                                                                                                                                                  
poly(Opening, 2, raw = TRUE)1                                                                                                                                                      0.00000*       
                                                                                                                                                                                  (0.00000)       
                                                                                                                                                                                                  
poly(Opening, 2, raw = TRUE)2                                                                                                                                                       -0.000        
                                                                                                                                                                                   (0.000)        
                                                                                                                                                                                                  
log(Budget)                                                  0.710***               0.642***               0.657***               0.665***                 0.746***                0.758***       
                                                             (0.056)                (0.063)                (0.063)                 (0.064)                 (0.059)                 (0.060)        
                                                                                                                                                                                                  
log(open_th)                                                -0.337***                                                                                                                             
                                                             (0.121)                                                                                                                              
                                                                                                                                                                                                  
open_th                                                                            -0.001***              -0.001***               -0.001***                                         0.0001        
                                                                                    (0.0002)               (0.0002)               (0.0002)                                         (0.0001)       
                                                                                                                                                                                                  
Small_Dist                                                   0.795**                 0.649*                 0.634*                                                                  0.467         
                                                             (0.377)                (0.372)                (0.376)                                                                 (0.398)        
                                                                                                                                                                                                  
seasonSpring                                                 0.804**                0.880**                                                                                         0.800*        
                                                             (0.405)                (0.404)                                                                                        (0.437)        
                                                                                                                                                                                                  
seasonSummer                                                  0.048                  0.005                                                                                          -0.104        
                                                             (0.379)                (0.377)                                                                                        (0.411)        
                                                                                                                                                                                                  
seasonWinter                                                  0.086                  0.271                                                                                          0.072         
                                                             (0.410)                (0.414)                                                                                        (0.443)        
                                                                                                                                                                                                  
Constant                             10.425***                -1.967                -3.845**               -3.907**                -3.308*                  0.606                  4.037***       
                                      (1.640)                (1.518)                (1.915)                (1.899)                 (1.883)                 (1.278)                 (1.116)        
                                                                                                                                                                                                  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations                            104                    104                    104                    104                     104                     104                     104          
R2                                     0.187                  0.733                  0.735                  0.719                   0.710                   0.688                   0.695         
Adjusted R2                            0.179                  0.714                  0.715                  0.707                   0.702                   0.682                   0.670         
Residual Std. Error              2.463 (df = 102)        1.454 (df = 96)        1.450 (df = 96)        1.471 (df = 99)        1.484 (df = 100)         1.532 (df = 101)        1.562 (df = 95)    
F Statistic                   23.398*** (df = 1; 102) 37.658*** (df = 7; 96) 38.002*** (df = 7; 96) 63.175*** (df = 4; 99) 81.779*** (df = 3; 100) 111.561*** (df = 2; 101) 27.081*** (df = 8; 95)
==================================================================================================================================================================================================
Note:                                                                                                                                                                  *p<0.1; **p<0.05; ***p<0.01

Out of all the models we ran, model3 looks as the best fit.

The linear regression model predicts the logarithm of Box Office collections based on the logarithm of Opening weekend collections, logarithm of Budget, number of theaters the movie opened in, Small_Dist (a binary variable indicating whether the distributor is small or not), and the season in which the movie was released (Spring, Summer, or Winter).

The coefficients of the independent variables show the direction and magnitude of their effect on the dependent variable. The p-value associated with each coefficient indicates whether the coefficient is statistically significant or not.

The intercept coefficient is -3.845, which means that if all independent variables are zero, the model predicts that the logarithm of Box Office collections is -3.845. However, since all the independent variables are not zero in practice, this value is not meaningful.

The coefficient of the logarithm of Opening weekend collections is 0.792, which means that a one percent increase in Opening weekend collections is associated with a 0.792 percent increase in Box Office collections.

The coefficient of the logarithm of Budget is 0.642, which means that a one percent increase in Budget is associated with a 0.642 percent increase in Box Office collections.

The coefficient of the number of theaters the movie opened in (open_th) is negative (-0.00068), which means that as the number of theaters increases by one, the predicted logarithm of Box Office collections decreases by 0.00068. This is something to be studied further, as it goes against our logic and expectation.

The coefficient of Small_Dist is 0.649, which means that the Box Office collections of movies released by small distributors are 0.649 times higher than those of movies released by large distributors, holding other variables constant. This also need more detailed study.

The coefficients associated with season indicate the difference in Box Office collections between movies released in that season and movies released in Fall (omitted reference level). For example, the coefficient of season Spring is 0.88, which means that the predicted Box Office collections of movies released in Spring are 0.88 times higher than those of movies released in Fall.

The adjusted R-squared of the model is 0.7155, which means that 71.55% of the variation in the logarithm of Box Office collections can be explained by the independent variables in the model.

Visualising the regression model.

Code
library(ggplot2)

# Set the color palette
colors <- c("red", "blue", "green", "orange", "purple", "pink")

# Create a blank plot with the desired layout
par(mfrow=c(2,3))

# Loop through the diagnostic plots and assign colors
for (i in 1:6) {
  plot(model3, which = i, col = colors[i])
}

The Scale-Location plot suggests some level of heteroskedacity. So, it is better to check if there is a serious problem of heteroskedacity,we can run Breusch-Pagan test.

Code
library(lmtest)
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Code
bptest(model3)

    studentized Breusch-Pagan test

data:  model3
BP = 16.322, df = 7, p-value = 0.02233

The p-value for Breusch-Pagan test is 0.02233. This is possibly a heteroskedacity. We need to look more into this.

Conclusion:

The model that we build shows that the Box Office collection of a movie is affected positively by the Opening week collection. 1 percentage of increase in Opening week collection results in 0.792 percent increase in Box office collection after taking into account the intercept(-3.845). But the Breusch-Pagan test for heteroskedacity gives a low p-value which suggests that the model is showing heteroskedacity.

The dataset has only 200 rows out of which 96 rows were dropped as they have NA values. Also, many movies in this dataset had Box office collections much lesser than their budget. But that does not necessarily mean that those movies made a loss. In the age of online streaming platforms, many movies are making money not from Box Office only. But our study was focused only on the revenue from theatre collection. Also, in the future study, the variable transformation to address the heteroskedacity should be done.

References:

  1. Nasir, Suphan & Öcal, Figen. (2016). Film Marketing: The Impact of Publicity Activities on Demand Generation. 10.4018/978-1-5225-0143-5.ch019.

  2. Elizabeth Cooper-Martin (1991) ,“Consumers and Movies: Some Findings on Experiential Products”, in NA - Advances in Consumer Research Volume 18, eds. Rebecca H. Holman and Michael R. Solomon, Provo, UT : Association for Consumer Research, Pages: 372-378.

  3. Kaggle.com (Original dataset)

  4. Wikipedia.com (Movie pages to get budget and box office collections)