Data Analytics and Computational Social Science: Final Project : Samhith Barlaya

Samhith Barlaya

Introduction

In the world of movies, a film’s IMDB rating often reflects popular opinion of it, and a high rating often prompts new viewers to watch it. It is therefore interesting to analyse how factors such as duration, director, actors and social media presence relate to a movie’s rating. For this project, I analysed the “IMDB 5000 Movie Dataset” to answer the following research questions:

Does the duration of a movie impact its popularity?
Does it matter to a movie if its cast is popular among Facebook users?
What country do the most popular movies belong to?
Does the presence of a particular actor boost a movie’s ratings?

Data

This dataset contains data on roughly 5,000 movies (5,043 rows) and their IMDB ratings. The following code imports the dataset for analysis:

movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")

Here is a quick glimpse of the dataset :

head(movie_data)

  color     director_name num_critic_for_reviews duration
1 Color     James Cameron                    723      178
2 Color    Gore Verbinski                    302      169
3 Color        Sam Mendes                    602      148
4 Color Christopher Nolan                    813      164
5             Doug Walker                     NA       NA
6 Color    Andrew Stanton                    462      132
  director_facebook_likes actor_3_facebook_likes     actor_2_name
1                       0                    855 Joel David Moore
2                     563                   1000    Orlando Bloom
3                       0                    161     Rory Kinnear
4                   22000                  23000   Christian Bale
5                     131                     NA       Rob Walker
6                     475                    530  Samantha Morton
  actor_1_facebook_likes     gross                          genres
1                   1000 760505847 Action|Adventure|Fantasy|Sci-Fi
2                  40000 309404152        Action|Adventure|Fantasy
3                  11000 200074175       Action|Adventure|Thriller
4                  27000 448130642                 Action|Thriller
5                    131        NA                     Documentary
6                    640  73058679         Action|Adventure|Sci-Fi
     actor_1_name
1     CCH Pounder
2     Johnny Depp
3 Christoph Waltz
4       Tom Hardy
5     Doug Walker
6    Daryl Sabara
                                 movie_title
1                                     Avatar
2   Pirates of the Caribbean: At World's End
3                                    Spectre
4                      The Dark Knight Rises
5 Star Wars: Episode VII - The Force Awakens
6                                John Carter
  num_voted_users cast_total_facebook_likes         actor_3_name
1          886204                      4834            Wes Studi
2          471220                     48350       Jack Davenport
3          275868                     11700     Stephanie Sigman
4         1144337                    106759 Joseph Gordon-Levitt
5               8                       143                     
6          212204                      1873         Polly Walker
  facenumber_in_poster
1                    0
2                    0
3                    1
4                    0
5                    0
6                    1
                                                     plot_keywords
1                           avatar|future|marine|native|paraplegic
2     goddess|marriage ceremony|marriage proposal|pirate|singapore
3                              bomb|espionage|sequel|spy|terrorist
4 deception|imprisonment|lawlessness|police officer|terrorist plot
5                                                                 
6               alien|american civil war|male nipple|mars|princess
                                       movie_imdb_link
1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1
6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
  num_user_for_reviews language country content_rating    budget
1                 3054  English     USA          PG-13 237000000
2                 1238  English     USA          PG-13 300000000
3                  994  English      UK          PG-13 245000000
4                 2701  English     USA          PG-13 250000000
5                   NA                                        NA
6                  738  English     USA          PG-13 263700000
  title_year actor_2_facebook_likes imdb_score aspect_ratio
1       2009                    936        7.9         1.78
2       2007                   5000        7.1         2.35
3       2015                    393        6.8         2.35
4       2012                  23000        8.5         2.35
5         NA                     12        7.1           NA
6       2012                    632        6.6         2.35
  movie_facebook_likes
1                33000
2                    0
3                85000
4               164000
5                    0
6                24000
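
The preview shows two things worth cleaning before analysis: some text fields are empty strings rather than NA (row 5), and movie titles can carry a stray trailing character. A minimal sketch of one way to handle both, assuming the stray character is a UTF-8 non-breaking space and using a small made-up data frame (`toy`, hypothetical rows) in place of `movie_data`:

```r
# Toy stand-in for movie_data; the rows are invented for illustration.
toy <- data.frame(
  movie_title = c("Avatar\u00a0", "Spectre\u00a0", "Untitled\u00a0"),
  language    = c("English", "English", ""),
  stringsAsFactors = FALSE
)

# Assumption: the trailing debris is a non-breaking space (U+00A0).
# Replace it with a normal space, then trim surrounding whitespace.
toy$movie_title <- trimws(gsub("\u00a0", " ", toy$movie_title, fixed = TRUE))

# Treat empty strings as missing values.
toy$language[toy$language == ""] <- NA

print(toy)
```

If the real file is not read as UTF-8, the debris may take a different form, so the pattern passed to `gsub()` would need adjusting.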

To understand the dataset further, the following are the variables in the dataset and their respective types (as given by column ‘Mode’):

summary.default(movie_data)

                          Length Class  Mode     
color                     5043   -none- character
director_name             5043   -none- character
num_critic_for_reviews    5043   -none- numeric  
duration                  5043   -none- numeric  
director_facebook_likes   5043   -none- numeric  
actor_3_facebook_likes    5043   -none- numeric  
actor_2_name              5043   -none- character
actor_1_facebook_likes    5043   -none- numeric  
gross                     5043   -none- numeric  
genres                    5043   -none- character
actor_1_name              5043   -none- character
movie_title               5043   -none- character
num_voted_users           5043   -none- numeric  
cast_total_facebook_likes 5043   -none- numeric  
actor_3_name              5043   -none- character
facenumber_in_poster      5043   -none- numeric  
plot_keywords             5043   -none- character
movie_imdb_link           5043   -none- character
num_user_for_reviews      5043   -none- numeric  
language                  5043   -none- character
country                   5043   -none- character
content_rating            5043   -none- character
budget                    5043   -none- numeric  
title_year                5043   -none- numeric  
actor_2_facebook_likes    5043   -none- numeric  
imdb_score                5043   -none- numeric  
aspect_ratio              5043   -none- numeric  
movie_facebook_likes      5043   -none- numeric

The following table lists the mean, median and standard deviation of each numeric column (the column-name suffixes _mean, _median and _sd indicate the statistic):

library(dplyr)
movie_data %>%
  summarise_if(is.numeric, list(mean = mean, median = median, sd = sd), na.rm = TRUE)

  num_critic_for_reviews_mean duration_mean
1                    140.1943      107.2011
  director_facebook_likes_mean actor_3_facebook_likes_mean
1                     686.5092                    645.0098
  actor_1_facebook_likes_mean gross_mean num_voted_users_mean
1                    6560.047   48468408             83668.16
  cast_total_facebook_likes_mean facenumber_in_poster_mean
1                       9699.064                  1.371173
  num_user_for_reviews_mean budget_mean title_year_mean
1                  272.7708    39752620        2002.471
  actor_2_facebook_likes_mean imdb_score_mean aspect_ratio_mean
1                    1651.754        6.442138          2.220403
  movie_facebook_likes_mean num_critic_for_reviews_median
1                  7525.965                           110
  duration_median director_facebook_likes_median
1             103                             49
  actor_3_facebook_likes_median actor_1_facebook_likes_median
1                         371.5                           988
  gross_median num_voted_users_median
1     25517500                  34359
  cast_total_facebook_likes_median facenumber_in_poster_median
1                             3090                           1
  num_user_for_reviews_median budget_median title_year_median
1                         156         2e+07              2005
  actor_2_facebook_likes_median imdb_score_median aspect_ratio_median
1                           595               6.6                2.35
  movie_facebook_likes_median num_critic_for_reviews_sd duration_sd
1                         166                  121.6017    25.19744
  director_facebook_likes_sd actor_3_facebook_likes_sd
1                   2813.329                  1665.042
  actor_1_facebook_likes_sd gross_sd num_voted_users_sd
1                  15020.76 68452990           138485.3
  cast_total_facebook_likes_sd facenumber_in_poster_sd
1                      18163.8                2.013576
  num_user_for_reviews_sd budget_sd title_year_sd
1                377.9829 206114898       12.4746
  actor_2_facebook_likes_sd imdb_score_sd aspect_ratio_sd
1                  4042.439      1.125116        1.385113
  movie_facebook_likes_sd
1                19320.45
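
In dplyr 1.0.0 and later, `summarise_if()` is superseded by `across()`. A minimal sketch of an equivalent call, shown on a made-up data frame (`toy`, hypothetical values) rather than the real dataset. Note that `across()` orders the output by column rather than by statistic:

```r
library(dplyr)

# Toy stand-in for movie_data; values are invented for illustration.
toy <- data.frame(
  title      = c("a", "b", "c"),
  duration   = c(100, 120, NA),
  imdb_score = c(6, 7, 8),
  stringsAsFactors = FALSE
)

# Apply mean, median and sd to every numeric column, ignoring NAs.
stats <- toy %>%
  summarise(across(
    where(is.numeric),
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE)
    )
  ))

print(stats)
```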

The following tables list the frequency of each value in three of the non-numeric columns (language, country and content_rating); the other non-numeric columns are skipped because they have too many distinct values.

movie_data %>%   count(language, sort = TRUE)

     language    n
1     English 4704
2      French   73
3     Spanish   40
4       Hindi   28
5    Mandarin   26
6      German   19
7    Japanese   18
8               12
9   Cantonese   11
10    Italian   11
11    Russian   11
12     Korean    8
13 Portuguese    8
14     Arabic    5
15     Danish    5
16     Hebrew    5
17    Swedish    5
18      Dutch    4
19  Norwegian    4
20    Persian    4
21     Polish    4
22    Chinese    3
23       Thai    3
24 Aboriginal    2
25       Dari    2
26  Icelandic    2
27 Indonesian    2
28       None    2
29   Romanian    2
30       Zulu    2
31    Aramaic    1
32    Bosnian    1
33      Czech    1
34   Dzongkha    1
35   Filipino    1
36      Greek    1
37  Hungarian    1
38    Kannada    1
39     Kazakh    1
40       Maya    1
41  Mongolian    1
42    Panjabi    1
43  Slovenian    1
44    Swahili    1
45      Tamil    1
46     Telugu    1
47       Urdu    1
48 Vietnamese    1

movie_data %>%   count(country,sort = TRUE)

                country    n
1                   USA 3807
2                    UK  448
3                France  154
4                Canada  126
5               Germany   97
6             Australia   55
7                 India   34
8                 Spain   33
9                 China   30
10                Italy   23
11                Japan   23
12            Hong Kong   17
13               Mexico   17
14          New Zealand   15
15          South Korea   14
16              Ireland   12
17              Denmark   11
18               Russia   11
19               Brazil    8
20               Norway    8
21         South Africa    8
22               Sweden    6
23                         5
24          Netherlands    5
25               Poland    5
26             Thailand    5
27            Argentina    4
28              Belgium    4
29                 Iran    4
30               Israel    4
31              Romania    4
32       Czech Republic    3
33              Iceland    3
34          Switzerland    3
35         West Germany    3
36               Greece    2
37              Hungary    2
38               Taiwan    2
39          Afghanistan    1
40                Aruba    1
41              Bahamas    1
42             Bulgaria    1
43             Cambodia    1
44             Cameroon    1
45                Chile    1
46             Colombia    1
47   Dominican Republic    1
48                Egypt    1
49              Finland    1
50              Georgia    1
51            Indonesia    1
52                Kenya    1
53           Kyrgyzstan    1
54                Libya    1
55             New Line    1
56              Nigeria    1
57        Official site    1
58             Pakistan    1
59               Panama    1
60                 Peru    1
61          Philippines    1
62             Slovakia    1
63             Slovenia    1
64         Soviet Union    1
65               Turkey    1
66 United Arab Emirates    1

movie_data %>%   count(content_rating,sort = TRUE)

   content_rating    n
1               R 2118
2           PG-13 1461
3              PG  701
4                  303
5       Not Rated  116
6               G  112
7         Unrated   62
8        Approved   55
9           TV-14   30
10          TV-MA   20
11          TV-PG   13
12              X   13
13           TV-G   10
14         Passed    9
15          NC-17    7
16             GP    6
17              M    5
18           TV-Y    1
19          TV-Y7    1

Visualization

The following bar graph shows how the movies in this dataset are distributed across release years. The variable used is ‘title_year’, the year in which the movie was released. The visualization suggests that a large share of the movies in the dataset were released between 2000 and 2010.

library(ggplot2)
ggplot(data = movie_data,aes(x = title_year)) + geom_bar()

The following plot shows how a movie’s IMDB score varies with the number of likes on its Facebook page. The plot suggests that movies with more Facebook likes tend to have higher IMDB ratings, although relatively few movies have a high like count.

ggplot(data = movie_data, aes(x = movie_facebook_likes, y = imdb_score)) +   geom_point(alpha = 0.5)
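
The summary statistics earlier show that `movie_facebook_likes` is heavily right-skewed (median 166, mean about 7,526), so most points crowd against the left edge of this plot. One possible remedy, sketched here on made-up data (`toy`, hypothetical values), is to plot `log1p()` of the like count, which also copes with movies that have zero likes:

```r
library(ggplot2)

# Toy stand-in for movie_data; values are invented for illustration.
toy <- data.frame(
  movie_facebook_likes = c(0, 10, 166, 5000, 164000),
  imdb_score           = c(5.9, 6.2, 6.6, 7.4, 8.5)
)

# log1p(x) = log(1 + x), so movies with zero likes map to 0 instead of -Inf.
toy$log_likes <- log1p(toy$movie_facebook_likes)

p <- ggplot(toy, aes(x = log_likes, y = imdb_score)) +
  geom_point(alpha = 0.5) +
  labs(x = "log(1 + movie Facebook likes)", y = "IMDB score")
```

Printing `p` draws the scatter plot with the like counts spread more evenly along the x-axis.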

In an attempt to answer one of my research questions, “What country do the most popular movies belong to?”, I plot a bar graph with countries on the x-axis and their median IMDB score on the y-axis.

library(dplyr)
library(ggplot2)

movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")

# Median, spread and movie count per country; keep the 10 highest medians
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),  # NA for countries with a single movie
            n_ = n()) %>%
  top_n(10, median_rating)

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  theme(axis.text.x = element_text(angle = 90))

barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country")

Many of the countries listed here have very few movies in the dataset (sometimes just one), which skews the result. Instead, I choose the 10 countries with the most movies in the dataset and plot the same graph, which gives a more realistic result.

# Same summary, but keep the 10 countries with the most movies
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(10, n_)

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2)

barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country") +
  theme(axis.text.x = element_text(angle = 90))
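
An alternative to taking the top 10 by count is to keep every country with at least some minimum number of movies, and to draw the error bars from quartiles, which pair more naturally with a median than a standard deviation does. A minimal sketch under those assumptions; the threshold `min_movies` and the data frame `toy` are hypothetical and not part of the original analysis:

```r
library(dplyr)

# Toy stand-in for movie_data; values are invented for illustration.
toy <- data.frame(
  country    = c("USA", "USA", "USA", "UK", "UK", "Peru"),
  imdb_score = c(6, 7, 8, 5, 7, 9),
  stringsAsFactors = FALSE
)

min_movies <- 2  # hypothetical threshold; tune for the real data

country_summary <- toy %>%
  group_by(country) %>%
  summarise(
    median_rating = median(imdb_score),
    q25 = quantile(imdb_score, 0.25, names = FALSE),
    q75 = quantile(imdb_score, 0.75, names = FALSE),
    n_movies = n()
  ) %>%
  filter(n_movies >= min_movies) %>%   # drop countries with too few movies
  arrange(desc(median_rating))

print(country_summary)
```

The quartile columns could then be passed to `geom_errorbar(aes(ymin = q25, ymax = q75))` in place of the standard-deviation bounds.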

To answer my research question “Does the duration of a movie impact its popularity?”, I plot movie duration against IMDB rating as a scatter plot with a smoothed trend line.

ggplot(data=movie_data, aes(x=duration, y=imdb_score, group=1)) + geom_smooth() + geom_point(alpha = 0.3)

There are too many points here to draw any conclusion. Hence, I split the movies by language.

ggplot(subset(movie_data, language %in% c('English', 'Cantonese', 'French', 'German', 'Japanese', 'Italian', 'Mandarin', 'Spanish')),
       aes(x = duration, y = imdb_score, group = 1)) +
  geom_smooth() +
  facet_wrap(vars(language)) +
  geom_point(alpha = 0.3)
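
Another way to cope with the dense scatter is to bin the durations and compare the median rating per bin. The sketch below uses made-up data (`toy`) and arbitrary, hypothetical bin edges of 90, 120 and 150 minutes:

```r
# Toy stand-in for movie_data; values are invented for illustration.
toy <- data.frame(
  duration   = c(80, 85, 100, 110, 130, 160),
  imdb_score = c(5, 6, 6, 7, 7, 8)
)

# Hypothetical bin edges in minutes; intervals are closed on the right,
# so a 90-minute movie falls in the "<=90" bin.
toy$duration_bin <- cut(
  toy$duration,
  breaks = c(0, 90, 120, 150, Inf),
  labels = c("<=90", "91-120", "121-150", ">150")
)

# Median IMDB score within each duration bin.
median_by_bin <- tapply(toy$imdb_score, toy$duration_bin, median)
print(median_by_bin)
```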

I also look at the share of each content rating among the movies released each year. The x-axis is restricted to 1990 onward to highlight the most relevant part of the graph.

ggplot(data = movie_data, aes(x = title_year, fill = content_rating)) +
  geom_bar() +
  xlim(c(1990, NA)) +
  ggtitle("Plot of number of movies per year and content rating share per year") +
  xlab("Years") + ylab("No of movies")

To address the question ‘Does it matter to a movie if its cast is popular among Facebook users?’, the following three plots show IMDB rating against the Facebook likes of Actor 1, Actor 2 and Actor 3. The first two plots show an interesting similarity: the IMDB rating varies in a similar way as either Actor 1’s or Actor 2’s Facebook likes increase.

ggplot(data=movie_data, aes(x=actor_1_facebook_likes, y=imdb_score, group=1)) + geom_smooth()

ggplot(data=movie_data, aes(x=actor_2_facebook_likes, y=imdb_score, group=1)) + geom_smooth()

ggplot(data=movie_data, aes(x=actor_3_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
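
To put a number on how similar these trends are, one option is a rank (Spearman) correlation between each actor's like count and the IMDB score. This is a sketch on made-up data (`toy`, hypothetical values), not a result from the real dataset:

```r
# Toy stand-in for movie_data; values are invented for illustration.
toy <- data.frame(
  actor_1_facebook_likes = c(100, 1000, 11000, 27000, 40000),
  actor_2_facebook_likes = c(50, 400, 900, 5000, 23000),
  actor_3_facebook_likes = c(900, 20, 500, NA, 100),
  imdb_score             = c(5.5, 6.1, 6.8, 7.9, 8.5)
)

like_cols <- c("actor_1_facebook_likes",
               "actor_2_facebook_likes",
               "actor_3_facebook_likes")

# Spearman correlation works on ranks, so it is less sensitive to the
# heavy skew in like counts. use = "complete.obs" drops rows with NAs.
rho <- sapply(like_cols, function(col) {
  cor(toy[[col]], toy$imdb_score, method = "spearman", use = "complete.obs")
})

print(round(rho, 2))
```

Values near 1 or -1 indicate a strong monotonic relationship, and values near 0 indicate little or none.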

Next, we look at some plots to understand whether the presence of a particular actor boosts a movie’s rating. To answer this, I take the 15 actors who appear most often as Actor 1, Actor 2 or Actor 3 and plot the median IMDB rating of their movies. Some names, such as Morgan Freeman, Steve Buscemi and Bruce Willis, appear on multiple plots, which seems to indicate that their presence in a movie has some effect on its rating.

# The 15 actors who appear most often as Actor 1
actor_summary <- movie_data %>%
  group_by(actor_1_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)

ggplot(actor_summary, aes(reorder(actor_1_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Median IMDB rating of the 15 most frequent actors (as Actor 1)") +
  xlab("Actors") + ylab("IMDB Median Rating")

# The 15 actors who appear most often as Actor 2
actor_summary <- movie_data %>%
  group_by(actor_2_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)

ggplot(actor_summary, aes(reorder(actor_2_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Median IMDB rating of the 15 most frequent actors (as Actor 2)") +
  xlab("Actors") + ylab("IMDB Median Rating")

# The 15 actors who appear most often as Actor 3
actor_summary <- movie_data %>%
  group_by(actor_3_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)

ggplot(actor_summary, aes(reorder(actor_3_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Median IMDB rating of the 15 most frequent actors (as Actor 3)") +
  xlab("Actors") + ylab("IMDB Median Rating")
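
Because the same actor can appear as Actor 1, 2 or 3, the three name columns can also be stacked so that each actor is summarised once across all billing positions. A minimal sketch on made-up data; the actor names and the `min_movies` threshold are hypothetical:

```r
library(dplyr)

# Toy stand-in for movie_data; names and scores are invented for illustration.
toy <- data.frame(
  actor_1_name = c("Actor A", "Actor A", "Actor B", "Actor A"),
  actor_2_name = c("Actor B", "",        "Actor C", ""),
  actor_3_name = c("Actor C", "Actor B", "Actor D", ""),
  imdb_score   = c(8, 7, 5, 9),
  stringsAsFactors = FALSE
)

# Stack the three name columns; each movie's score is repeated once per position.
actors_long <- data.frame(
  actor      = c(toy$actor_1_name, toy$actor_2_name, toy$actor_3_name),
  imdb_score = rep(toy$imdb_score, times = 3),
  stringsAsFactors = FALSE
)

min_movies <- 2  # hypothetical threshold

actor_summary <- actors_long %>%
  filter(actor != "") %>%               # drop empty actor slots
  group_by(actor) %>%
  summarise(median_rating = median(imdb_score), n_movies = n()) %>%
  filter(n_movies >= min_movies) %>%    # ignore actors with too few movies
  arrange(desc(median_rating))

print(actor_summary)
```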

Reflection

As a frequent user of IMDB ratings, I found this research very interesting. The IMDB 5000 Movie Dataset is relatively clean, with only a few missing values. In this project, I used a series of plots to try to understand how factors such as a movie’s director, actors, Facebook likes and duration relate to popular opinion of the movie (as indicated by its IMDB score). Although the large number of rows, and of distinct values in the categorical variables, hindered some of the visualizations, I reduced the scope of each plot to a few values to make it easier to read. As this was my first time using R, it was a great learning experience.

One of the challenging parts of the project was identifying which countries the most popular movies belong to. Because the country variable has many distinct values, the median IMDB rating per country was skewed: some of the countries with the highest median ratings had only one or two movies in the dataset. To tackle this, I first selected the 10 countries with the most movies and then plotted them by median IMDB rating.

I feel that bar plots, facet wraps and smoothed trend lines made the plots easier to understand. However, the analysis could be improved further by sampling the dataset and narrowing its scope. One thing I would have loved to analyse is the probability of a director succeeding in a genre given the IMDB ratings of their earlier movies. Unfortunately, this could not be done because the genre variable has a huge number of distinct values. These are possible next steps for the project. I am also curious to dig deeper into the impact that the presence of a particular actor has on a movie’s rating, and the impact of an actor’s Facebook likes on their movies.
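
One way to make the genre variable more tractable is to split its pipe-separated values into individual genres, which leaves a much smaller set of distinct labels. A minimal base-R sketch, using three genre strings of the form shown in the dataset preview:

```r
# Genre strings in the same pipe-separated form as the dataset.
genres <- c(
  "Action|Adventure|Fantasy|Sci-Fi",
  "Action|Adventure|Fantasy",
  "Documentary"
)

# fixed = TRUE treats "|" as a literal character rather than regex alternation.
genre_list <- strsplit(genres, "|", fixed = TRUE)

# Count how many movies carry each individual genre.
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)
print(genre_counts)
```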

Conclusion

From the above plots, the duration of a movie does not appear to have much impact on its popularity for English and Italian movies. For the other languages plotted, the rating seems to increase with duration. Among the 10 countries with the most movies in the dataset, Japan seems to have the highest median IMDB rating, followed closely by Spain, the UK and India. Most movies have a content rating of either R or PG-13.

The impact of the number of Facebook likes on a movie’s page on its IMDB rating is unclear, because too few movies have a high like count. Similarly, the impact of an actor’s Facebook like count is not very clear. There is an interesting similarity between the plots of IMDB rating against Actor 1’s and Actor 2’s Facebook likes that might be worth exploring further. Although the presence of an actor seems to have some impact on movie rating, this cannot be conclusively shown from the plots.

Bibliography

Yueming (2017), ‘IMDB 5000 Movie Dataset’, https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O’Reilly Media.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
