Submission for Final Project
In the world of movies, the IMDB rating often reflects popular opinion of a movie, and a high rating often prompts new viewers to watch it. It is therefore interesting to analyse the impact of factors such as movie duration, director, actors and social media on a movie’s rating. For this project, I have analysed the “IMDB 5000 Movie Dataset” to answer a set of research questions, each of which is stated alongside the corresponding plots below.
This dataset contains data on roughly 5000 movies (5043 rows) and their respective IMDB ratings. The following code imports the dataset for analysis:
movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")
Here is a quick glimpse of the dataset:
head(movie_data)
color director_name num_critic_for_reviews duration
1 Color James Cameron 723 178
2 Color Gore Verbinski 302 169
3 Color Sam Mendes 602 148
4 Color Christopher Nolan 813 164
5 Doug Walker NA NA
6 Color Andrew Stanton 462 132
director_facebook_likes actor_3_facebook_likes actor_2_name
1 0 855 Joel David Moore
2 563 1000 Orlando Bloom
3 0 161 Rory Kinnear
4 22000 23000 Christian Bale
5 131 NA Rob Walker
6 475 530 Samantha Morton
actor_1_facebook_likes gross genres
1 1000 760505847 Action|Adventure|Fantasy|Sci-Fi
2 40000 309404152 Action|Adventure|Fantasy
3 11000 200074175 Action|Adventure|Thriller
4 27000 448130642 Action|Thriller
5 131 NA Documentary
6 640 73058679 Action|Adventure|Sci-Fi
actor_1_name
1 CCH Pounder
2 Johnny Depp
3 Christoph Waltz
4 Tom Hardy
5 Doug Walker
6 Daryl Sabara
movie_title
1 Avatar
2 Pirates of the Caribbean: At World's End
3 Spectre
4 The Dark Knight Rises
5 Star Wars: Episode VII - The Force Awakens
6 John Carter
num_voted_users cast_total_facebook_likes actor_3_name
1 886204 4834 Wes Studi
2 471220 48350 Jack Davenport
3 275868 11700 Stephanie Sigman
4 1144337 106759 Joseph Gordon-Levitt
5 8 143
6 212204 1873 Polly Walker
facenumber_in_poster
1 0
2 0
3 1
4 0
5 0
6 1
plot_keywords
1 avatar|future|marine|native|paraplegic
2 goddess|marriage ceremony|marriage proposal|pirate|singapore
3 bomb|espionage|sequel|spy|terrorist
4 deception|imprisonment|lawlessness|police officer|terrorist plot
5
6 alien|american civil war|male nipple|mars|princess
movie_imdb_link
1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1
6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
num_user_for_reviews language country content_rating budget
1 3054 English USA PG-13 237000000
2 1238 English USA PG-13 300000000
3 994 English UK PG-13 245000000
4 2701 English USA PG-13 250000000
5 NA NA
6 738 English USA PG-13 263700000
title_year actor_2_facebook_likes imdb_score aspect_ratio
1 2009 936 7.9 1.78
2 2007 5000 7.1 2.35
3 2015 393 6.8 2.35
4 2012 23000 8.5 2.35
5 NA 12 7.1 NA
6 2012 632 6.6 2.35
movie_facebook_likes
1 33000
2 0
3 85000
4 164000
5 0
6 24000
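Row 5 of the preview above contains both NA values and blank strings, so it can help to count missing entries per column before analysing further. The following is a minimal sketch on a small made-up data frame (`toy` and its values are assumptions standing in for `movie_data`); it is not part of the original analysis.

```r
# Toy stand-in for movie_data (hypothetical values) so the sketch runs on its own.
toy <- data.frame(
  color = c("Color", "", "Color"),
  duration = c(178, NA, 132),
  gross = c(760505847, NA, 73058679),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))

# Number of blank strings in each character column (0 for numeric columns).
blank_counts <- sapply(toy, function(x) {
  if (is.character(x)) sum(!is.na(x) & trimws(x) == "") else 0L
})

print(data.frame(na = na_counts, blank = blank_counts))
```

Replacing `toy` with `movie_data` should give the same kind of per-column overview for the full dataset.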
To understand the dataset further, the following are the variables in the dataset and their respective types (as given by column ‘Mode’):
summary.default(movie_data)
Length Class Mode
color 5043 -none- character
director_name 5043 -none- character
num_critic_for_reviews 5043 -none- numeric
duration 5043 -none- numeric
director_facebook_likes 5043 -none- numeric
actor_3_facebook_likes 5043 -none- numeric
actor_2_name 5043 -none- character
actor_1_facebook_likes 5043 -none- numeric
gross 5043 -none- numeric
genres 5043 -none- character
actor_1_name 5043 -none- character
movie_title 5043 -none- character
num_voted_users 5043 -none- numeric
cast_total_facebook_likes 5043 -none- numeric
actor_3_name 5043 -none- character
facenumber_in_poster 5043 -none- numeric
plot_keywords 5043 -none- character
movie_imdb_link 5043 -none- character
num_user_for_reviews 5043 -none- numeric
language 5043 -none- character
country 5043 -none- character
content_rating 5043 -none- character
budget 5043 -none- numeric
title_year 5043 -none- numeric
actor_2_facebook_likes 5043 -none- numeric
imdb_score 5043 -none- numeric
aspect_ratio 5043 -none- numeric
movie_facebook_likes 5043 -none- numeric
The following table lists the median, mean and sd for each of the numeric columns (with column name suffixes _mean, _median and _sd respectively representing corresponding statistic):
library(dplyr)
movie_data %>% summarise_if(is.numeric, list(mean = mean,median = median,sd = sd), na.rm = TRUE) %>% head()
num_critic_for_reviews_mean duration_mean
1 140.1943 107.2011
director_facebook_likes_mean actor_3_facebook_likes_mean
1 686.5092 645.0098
actor_1_facebook_likes_mean gross_mean num_voted_users_mean
1 6560.047 48468408 83668.16
cast_total_facebook_likes_mean facenumber_in_poster_mean
1 9699.064 1.371173
num_user_for_reviews_mean budget_mean title_year_mean
1 272.7708 39752620 2002.471
actor_2_facebook_likes_mean imdb_score_mean aspect_ratio_mean
1 1651.754 6.442138 2.220403
movie_facebook_likes_mean num_critic_for_reviews_median
1 7525.965 110
duration_median director_facebook_likes_median
1 103 49
actor_3_facebook_likes_median actor_1_facebook_likes_median
1 371.5 988
gross_median num_voted_users_median
1 25517500 34359
cast_total_facebook_likes_median facenumber_in_poster_median
1 3090 1
num_user_for_reviews_median budget_median title_year_median
1 156 2e+07 2005
actor_2_facebook_likes_median imdb_score_median aspect_ratio_median
1 595 6.6 2.35
movie_facebook_likes_median num_critic_for_reviews_sd duration_sd
1 166 121.6017 25.19744
director_facebook_likes_sd actor_3_facebook_likes_sd
1 2813.329 1665.042
actor_1_facebook_likes_sd gross_sd num_voted_users_sd
1 15020.76 68452990 138485.3
cast_total_facebook_likes_sd facenumber_in_poster_sd
1 18163.8 2.013576
num_user_for_reviews_sd budget_sd title_year_sd
1 377.9829 206114898 12.4746
actor_2_facebook_likes_sd imdb_score_sd aspect_ratio_sd
1 4042.439 1.125116 1.385113
movie_facebook_likes_sd
1 19320.45
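The wide one-row output above is hard to scan. As an alternative layout, the same three statistics can be arranged with one row per variable. This is a sketch in base R on made-up data (`toy` is an assumption standing in for `movie_data`), not the code used to produce the table above.

```r
# Toy stand-in for movie_data (hypothetical values).
toy <- data.frame(
  duration = c(178, 169, NA, 132),
  imdb_score = c(7.9, 7.1, 7.1, 6.6),
  language = c("English", "English", "", "English"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns.
numeric_cols <- toy[sapply(toy, is.numeric)]

# One row per variable, one column per statistic; NA values are ignored.
stats <- t(sapply(numeric_cols, function(x) {
  c(mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE))
}))

print(round(stats, 3))
```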
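The following tables list the frequency of each class for the non-numeric columns language, country and content_rating (other non-numeric columns are skipped because they have too many distinct values). A blank label in the first column indicates movies for which the value is missing: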
language n
1 English 4704
2 French 73
3 Spanish 40
4 Hindi 28
5 Mandarin 26
6 German 19
7 Japanese 18
8 12
9 Cantonese 11
10 Italian 11
11 Russian 11
12 Korean 8
13 Portuguese 8
14 Arabic 5
15 Danish 5
16 Hebrew 5
17 Swedish 5
18 Dutch 4
19 Norwegian 4
20 Persian 4
21 Polish 4
22 Chinese 3
23 Thai 3
24 Aboriginal 2
25 Dari 2
26 Icelandic 2
27 Indonesian 2
28 None 2
29 Romanian 2
30 Zulu 2
31 Aramaic 1
32 Bosnian 1
33 Czech 1
34 Dzongkha 1
35 Filipino 1
36 Greek 1
37 Hungarian 1
38 Kannada 1
39 Kazakh 1
40 Maya 1
41 Mongolian 1
42 Panjabi 1
43 Slovenian 1
44 Swahili 1
45 Tamil 1
46 Telugu 1
47 Urdu 1
48 Vietnamese 1
country n
1 USA 3807
2 UK 448
3 France 154
4 Canada 126
5 Germany 97
6 Australia 55
7 India 34
8 Spain 33
9 China 30
10 Italy 23
11 Japan 23
12 Hong Kong 17
13 Mexico 17
14 New Zealand 15
15 South Korea 14
16 Ireland 12
17 Denmark 11
18 Russia 11
19 Brazil 8
20 Norway 8
21 South Africa 8
22 Sweden 6
23 5
24 Netherlands 5
25 Poland 5
26 Thailand 5
27 Argentina 4
28 Belgium 4
29 Iran 4
30 Israel 4
31 Romania 4
32 Czech Republic 3
33 Iceland 3
34 Switzerland 3
35 West Germany 3
36 Greece 2
37 Hungary 2
38 Taiwan 2
39 Afghanistan 1
40 Aruba 1
41 Bahamas 1
42 Bulgaria 1
43 Cambodia 1
44 Cameroon 1
45 Chile 1
46 Colombia 1
47 Dominican Republic 1
48 Egypt 1
49 Finland 1
50 Georgia 1
51 Indonesia 1
52 Kenya 1
53 Kyrgyzstan 1
54 Libya 1
55 New Line 1
56 Nigeria 1
57 Official site 1
58 Pakistan 1
59 Panama 1
60 Peru 1
61 Philippines 1
62 Slovakia 1
63 Slovenia 1
64 Soviet Union 1
65 Turkey 1
66 United Arab Emirates 1
content_rating n
1 R 2118
2 PG-13 1461
3 PG 701
4 303
5 Not Rated 116
6 G 112
7 Unrated 62
8 Approved 55
9 TV-14 30
10 TV-MA 20
11 TV-PG 13
12 X 13
13 TV-G 10
14 Passed 9
15 NC-17 7
16 GP 6
17 M 5
18 TV-Y 1
19 TV-Y7 1
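The code that produced these frequency tables is not shown. One way to obtain tables of this shape, assuming dplyr is available, is `count()` with `sort = TRUE`. The sketch below runs on a made-up vector of languages (`toy` is hypothetical), so the counts are illustrative only.

```r
library(dplyr)

# Toy stand-in for movie_data (hypothetical values); "" marks a missing language.
toy <- data.frame(
  language = c("English", "English", "French", "", "English", "French", "Hindi"),
  stringsAsFactors = FALSE
)

# Frequency of each language, most common first.
language_counts <- toy %>% count(language, sort = TRUE)

print(language_counts)
```

The same call with `country` or `content_rating` in place of `language` would give the other two tables.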
The following bar graph shows how the movies in this dataset are distributed across release years. The variable I am using is ‘title_year’, which is the year in which the movie was released. From the visualization, it is clear that the majority of the movies in the dataset were released between 2000 and 2010.
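The code for this bar graph is not included in the write-up. A plausible sketch, assuming ggplot2 and using a made-up `toy` data frame in place of `movie_data`, counts movies per release year with `geom_bar()` and also computes the share released between 2000 and 2010.

```r
library(ggplot2)

# Toy stand-in for movie_data (hypothetical values).
toy <- data.frame(title_year = c(1999, 2004, 2004, 2009, 2009, 2009, NA, 2015))

# One bar per release year; rows with a missing year are dropped first.
year_plot <- ggplot(subset(toy, !is.na(title_year)), aes(x = title_year)) +
  geom_bar() +
  labs(x = "Release year", y = "Number of movies")
print(year_plot)

# Share of movies (with a known year) released between 2000 and 2010.
share_2000s <- mean(toy$title_year >= 2000 & toy$title_year <= 2010, na.rm = TRUE)
print(share_2000s)
```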
The following plot shows how a movie’s IMDB score varies with the number of likes on the movie’s Facebook page. The plot suggests that movies with more Facebook likes tend to have higher IMDB ratings, although only a few movies have very high like counts.
library(ggplot2)
ggplot(data = movie_data, aes(x = movie_facebook_likes, y = imdb_score)) + geom_point(alpha = 0.5)
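A scatter plot with a long right tail is hard to judge by eye, so a rank correlation can complement it. The sketch below computes Spearman's correlation with base R's `cor()`. This measure is my addition rather than something the original analysis used, and `toy` is a made-up stand-in for `movie_data`.

```r
# Toy stand-in for movie_data (hypothetical values).
toy <- data.frame(
  movie_facebook_likes = c(0, 1000, 24000, 33000, 85000, 164000),
  imdb_score = c(6.1, 6.6, 6.8, 7.9, 7.1, 8.5)
)

# Spearman's rank correlation; rows with missing values are ignored.
rho <- cor(toy$movie_facebook_likes, toy$imdb_score,
           method = "spearman", use = "complete.obs")
print(rho)
```

A value near 1 would indicate that ratings rise consistently with likes, while a value near 0 would indicate no monotonic relationship.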
To answer one of my research questions, “What country do the most popular movies belong to?”, I plot a bar graph with countries on the x-axis and their median IMDB score on the y-axis.
library(dplyr)
library(ggplot2)
movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")
# Median and standard deviation of IMDB score per country; keep the 10 highest medians
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(10, median_rating)
# Bars show the median; error bars show +/- one standard deviation
barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  theme(axis.text.x = element_text(angle = 90))
barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country")
Many of the countries listed here have very few movies in the dataset (sometimes just one), which skews the result. Instead, I choose the 10 countries with the most movies in the dataset and plot the same graph for them, which gives a more realistic result.
# Same summary, but keep the 10 countries with the most movies (n_)
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(10, n_)
barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2)
barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country") + theme(axis.text.x = element_text(angle = 90))
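The two-step logic (keep the countries with the most movies, then order them by median rating) can be checked on a small example. The sketch below uses made-up data (`toy` is hypothetical) and dplyr's `slice_max()`, the current replacement for `top_n()`. It keeps 2 countries rather than 10 only because the toy data is tiny.

```r
library(dplyr)

# Toy stand-in for movie_data (hypothetical values): one country has a single,
# highly rated movie and would top a ranking based on the median alone.
toy <- data.frame(
  country = c("USA", "USA", "USA", "UK", "UK", "Kyrgyzstan"),
  imdb_score = c(6.0, 7.0, 8.0, 7.0, 7.4, 8.7)
)

country_top <- toy %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score), n_movies = n()) %>%
  slice_max(n_movies, n = 2) %>%    # keep the countries with the most movies
  arrange(desc(median_rating))      # then rank those by median rating

print(country_top)
```

The single-movie country drops out before the ranking step, which is the effect described above.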
To answer my research question “Does the duration of a movie impact its popularity?”, I plot movie duration against IMDB rating as a scatter plot with a smoothed trend line.
ggplot(data=movie_data, aes(x=duration, y=imdb_score, group=1)) + geom_smooth() + geom_point(alpha = 0.3)
There are too many points here to draw any conclusions. Hence, I split the movies by language and plot one panel per language.
ggplot(subset(movie_data, language %in% c('English', 'Cantonese', 'French', 'German',
                                          'Japanese', 'Italian', 'Mandarin', 'Spanish')),
       aes(x = duration, y = imdb_score, group = 1)) +
  geom_smooth() +
  facet_wrap(vars(language)) +
  geom_point(alpha = 0.3)
I also look at the proportion of each content rating among the movies released in each year, and I scale the graph so that the relevant parts stand out to the reader.
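The code for the content-rating plot is not included. One way such a plot might be built, assuming dplyr and ggplot2, is to compute each rating's share per year and zoom with `coord_cartesian()`. Everything in this sketch (the `toy` data, the chosen axis range, the stacked-bar design) is an assumption and not the original figure.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for movie_data (hypothetical values).
toy <- data.frame(
  title_year = c(2000, 2000, 2000, 2001, 2001, 2001, 2001),
  content_rating = c("R", "R", "PG-13", "R", "PG-13", "PG-13", "PG")
)

# Share of each content rating within each release year.
rating_share <- toy %>%
  count(title_year, content_rating) %>%
  group_by(title_year) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Stacked bars per year; coord_cartesian() zooms without dropping any rows.
share_plot <- ggplot(rating_share,
                     aes(x = title_year, y = share, fill = content_rating)) +
  geom_col() +
  coord_cartesian(xlim = c(1999.5, 2001.5)) +
  labs(x = "Release year", y = "Share of movies", fill = "Content rating")
print(share_plot)
```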
To answer the question ‘Does it matter to a movie if its cast is popular among Facebook users?’, the following three plots show IMDB rating against the Facebook likes of Actor 1, Actor 2 and Actor 3 respectively. There is an interesting similarity in trend between the first and second plots: the IMDB rating varies in a similar way as either Actor 1’s or Actor 2’s Facebook likes increase.
ggplot(data=movie_data, aes(x=actor_1_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
ggplot(data=movie_data, aes(x=actor_2_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
ggplot(data=movie_data, aes(x=actor_3_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
Next, I use some plots to understand whether the presence of a particular actor boosts a movie’s rating. To answer this, I take the 15 actors who appear most often in the dataset as Actor 1, Actor 2 or Actor 3 and plot the median IMDB rating of their movies. Some names, such as Morgan Freeman, Steve Buscemi and Bruce Willis, appear on multiple plots, which seems to indicate that their presence in a movie has some effect on its rating.
# Rating summary for the 15 actors who appear most often as Actor 1
actor_summary <- movie_data %>%
  group_by(actor_1_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)
ggplot(actor_summary, aes(reorder(actor_1_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of top 15 actors (as Actor 1) by median IMDB rating of their movies") +
  xlab("Actors") + ylab("IMDB Median Rating")
# Rating summary for the 15 actors who appear most often as Actor 2
actor_summary <- movie_data %>%
  group_by(actor_2_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)
ggplot(actor_summary, aes(reorder(actor_2_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of top 15 actors (as Actor 2) by median IMDB rating of their movies") +
  xlab("Actors") + ylab("IMDB Median Rating")
# Rating summary for the 15 actors who appear most often as Actor 3
actor_summary <- movie_data %>%
  group_by(actor_3_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)
ggplot(actor_summary, aes(reorder(actor_3_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of top 15 actors (as Actor 3) by median IMDB rating of their movies") +
  xlab("Actors") + ylab("IMDB Median Rating")
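The three summaries above differ only in the actor column, so they could be produced by one small helper. The function `summarise_actor()` below is hypothetical (it does not appear in the original analysis), and `toy` is made-up data. It is a sketch assuming dplyr 1.0 or later for `slice_max()` and the `.data` pronoun.

```r
library(dplyr)

# Hypothetical helper: rating summary for the n_top most frequent names
# in the actor column named by actor_col.
summarise_actor <- function(data, actor_col, n_top = 15) {
  data %>%
    group_by(actor = .data[[actor_col]]) %>%
    summarise(median_rating = median(imdb_score),
              sd_rating = sd(imdb_score),
              n_movies = n()) %>%
    slice_max(n_movies, n = n_top)
}

# Toy stand-in for movie_data (hypothetical values).
toy <- data.frame(
  actor_1_name = c("A", "A", "A", "B", "B", "C"),
  imdb_score = c(7, 8, 9, 6, 6.4, 9.5),
  stringsAsFactors = FALSE
)

top2 <- summarise_actor(toy, "actor_1_name", n_top = 2)
print(top2)
```

With the real data, calling the helper with "actor_1_name", "actor_2_name" and "actor_3_name" would return the three summaries, and `intersect()` on the resulting `actor` columns would list the names that appear in more than one of them.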
As a frequent user of IMDB ratings, I found this research quite interesting. The IMDB 5000 Movie dataset is a relatively clean dataset with a few missing values. In this paper, I use a few plots to understand the impact of factors such as a movie’s director, its actors, the number of Facebook likes on its page and its duration on popular opinion of the movie (indicated by the IMDB score). Although the large number of rows and of distinct values in the categorical variables hindered some of the visualizations, I have restricted the plots to a few variable values to make them easier for the reader to understand. As this was my first time using R, it was a great learning experience.
One of the challenging parts of the project was identifying which countries the most popular movies belonged to. Since the country variable has many distinct values, the median IMDB rating per country was skewed: some of the countries with the highest median ratings had only one or two movies in the dataset. To tackle this, I first selected the 10 countries with the most movies and then plotted them by their median IMDB rating.
I feel that the use of bar plots, facet wraps and smoothed plots made the plots easier to understand. However, I could have improved the analysis further by sampling the dataset and narrowing its scope. One thing I would have loved to analyse is the probability of a director’s success in a genre given the IMDB ratings of their earlier movies. Unfortunately, this could not be done because the genre variable has a very large number of distinct values. These could be possible next steps in the project. I am also curious to dig deeper into the impact that the presence of a particular actor has on a movie’s rating, and into the impact of an actor’s number of Facebook likes on the ratings of their movies.
From the above plots, the duration of a movie does not appear to have much impact on its popularity for English and Italian movies. For the other languages, the rating seems to increase with duration. Among the 10 countries with the most movies in the dataset, Japan seems to have the highest median IMDB rating, followed closely by Spain, the UK and India. Most movies have a content rating of either R or PG-13.
The impact of the number of Facebook likes on a movie’s page on its IMDB rating is unclear, because too few movies have a high like count. Similarly, the impact of an actor’s Facebook like count is not very clear. There is an interesting similarity between the plots of IMDB rating against the Facebook like counts of Actor 1 and of Actor 2 that might be worth exploring further. Although the presence of an actor seems to have some impact on a movie’s rating, this cannot be conclusively shown from the plots.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Barlaya (2022, Jan. 25). Data Analytics and Computational Social Science: Final Project : Samhith Barlaya. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayafinal/
BibTeX citation
@misc{barlaya2022final,
  author = {Barlaya, Samhith},
  title = {Data Analytics and Computational Social Science: Final Project : Samhith Barlaya},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayafinal/},
  year = {2022}
}