Data Analytics and Computational Social Science: Homework 6 : Samhith Barlaya

Samhith Barlaya

The following are my research questions and their respective plots:

To answer my first research question, “What country do the most popular movies belong to?”, I plot a bar graph with countries on the x-axis and their median IMDB score on the y-axis. The error bars show one standard deviation of the scores around each median.

library(dplyr)
library(ggplot2)

movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")

country_summary <- movie_data %>% 
  group_by(country) %>%   
  summarise(median_rating = median(imdb_score),  
            sd_rating = sd(imdb_score), 
            n_ = n()) %>% top_n(10, median_rating) 

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) + 
                   geom_col() +  
                   geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width=0.2)

barPlot + labs(y="Movie IMDB Rating, with uncertainty", x = "Country")

Many of the countries listed here have very few movies in the dataset (sometimes just one), which skews the result. Instead, I keep the 10 countries with the most movies and redraw the graph, which gives a more realistic picture.

country_summary <- movie_data %>% 
  group_by(country) %>%   
  summarise(median_rating = median(imdb_score),  
            sd_rating = sd(imdb_score),
            n_ = n()) %>% top_n(10, n_) 

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) + 
                   geom_col() +  
                   geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width=0.2)

barPlot + labs(y="Movie IMDB Rating, with uncertainty", x = "Country")
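
A variation on the same idea, sketched on made-up data: rather than keeping the 10 largest countries, one could drop every country below a minimum movie count and then rank the rest by median rating. The data frame `toy_movies` and the cutoff `min_movies` are assumptions for illustration, not values taken from the dataset.

```r
library(dplyr)

# Toy stand-in for movie_data; country names and scores are invented.
toy_movies <- data.frame(
  country    = c("A", "A", "A", "B", "B", "C"),
  imdb_score = c(6.0, 7.0, 8.0, 6.5, 7.0, 8.7)
)

min_movies <- 2  # hypothetical cutoff; would need tuning on the real data

filtered_summary <- toy_movies %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),   # sd() of a single value is NA
            n_ = n()) %>%
  filter(n_ >= min_movies) %>%            # drop countries with too few movies
  arrange(desc(median_rating))            # highest median first

print(filtered_summary)
```

Country "C" has the highest score in the toy data but only one movie, so it is filtered out before the ranking.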

To answer my research question “Does the duration of a movie impact its popularity?”, I plot a smoothed trend line of IMDB rating against movie duration.

ggplot(data=movie_data, aes(x=duration, y=imdb_score, group=1)) + geom_smooth()

There are too many movies here to draw any conclusion from a single trend line. Hence, I split the movies by language, with one panel per language.

ggplot(subset(movie_data, language %in% c('English', 'Cantonese', 'French','German', 'Japanese', 'Italian', 'Mandarin', 'Spanish')), aes(x=duration, y=imdb_score, group=1)) + geom_smooth() + facet_wrap(vars(language))
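
Another way to cope with the volume of data, sketched here on invented values, is to group movies into duration bins and compare the median rating per bin. The bin edges and labels below are assumptions (they treat duration as whole minutes) and are not derived from the dataset.

```r
library(dplyr)

# Toy stand-in for movie_data; durations and scores are invented.
toy_movies <- data.frame(
  duration   = c(80, 85, 95, 110, 118, 125, 140, 170),
  imdb_score = c(5.0, 6.0, 6.2, 6.8, 7.0, 7.1, 7.5, 8.0)
)

duration_summary <- toy_movies %>%
  # cut() uses right-closed intervals by default: (0,90], (90,120], ...
  mutate(duration_bin = cut(duration,
                            breaks = c(0, 90, 120, 150, Inf),
                            labels = c("<=90", "91-120", "121-150", ">150"))) %>%
  group_by(duration_bin) %>%
  summarise(median_rating = median(imdb_score),
            n_ = n())

print(duration_summary)
```

Keeping `n_` next to each median makes it easy to spot bins that contain too few movies to be trusted.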

I also look at how the content ratings are distributed among the movies released each year. I restrict the x-axis to 1990 onwards to highlight the most relevant part of the graph.

ggplot(data = movie_data, aes(x = title_year, fill = content_rating)) + geom_bar() + xlim(c(1990, NA)) + ggtitle("Number of movies per year and content rating share per year") +
  xlab("Years") + ylab("Number of movies")
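
The stacked bars show counts, so the share of each rating has to be read off by eye. As a rough sketch on invented data, the shares can also be computed directly; `toy_movies` and the name `rating_share` are assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for movie_data; years and ratings are invented.
toy_movies <- data.frame(
  title_year     = c(2000, 2000, 2000, 2000, 2001, 2001),
  content_rating = c("PG-13", "PG-13", "R", "R", "R", "PG")
)

rating_share <- toy_movies %>%
  count(title_year, content_rating) %>%   # movies per year and rating
  group_by(title_year) %>%
  mutate(share = n / sum(n)) %>%          # fraction within each year
  ungroup()

print(rating_share)
```

In ggplot2, `geom_bar(position = "fill")` draws the same per-year proportions, at the cost of hiding how many movies each year contains.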

Does it matter to a movie if its cast is popular among Facebook users?

We see an interesting similarity between the first and second plots below: the IMDB rating varies in a similar way as either Actor 1’s or Actor 2’s Facebook likes increase.

ggplot(data=movie_data, aes(x=actor_1_facebook_likes, y=imdb_score, group=1)) + geom_smooth()

ggplot(data=movie_data, aes(x=actor_2_facebook_likes, y=imdb_score, group=1)) + geom_smooth()

ggplot(data=movie_data, aes(x=actor_3_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
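
To put a number on the trends in these three plots, one option is a rank correlation between each like count and the rating. The sketch below uses invented values, and the choice of Spearman's method is an assumption on my part (it only looks at the ordering, so very large like counts do not dominate the result).

```r
# Toy stand-in for movie_data; like counts and scores are invented.
toy_movies <- data.frame(
  actor_1_facebook_likes = c(100, 500, 1000, 20000, 40000),
  actor_2_facebook_likes = c(50, 300, 800, 900, 11000),
  actor_3_facebook_likes = c(400, 10, 700, 200, 30),
  imdb_score             = c(5.0, 6.0, 6.5, 7.0, 8.0)
)

like_cols <- c("actor_1_facebook_likes",
               "actor_2_facebook_likes",
               "actor_3_facebook_likes")

# Spearman correlation of each like count with the rating;
# rows with missing values are dropped pairwise.
rho <- sapply(like_cols, function(col) {
  cor(toy_movies[[col]], toy_movies$imdb_score,
      method = "spearman", use = "complete.obs")
})

print(rho)
```

Similar coefficients for Actor 1 and Actor 2 would support the visual impression of a shared trend, although a correlation alone does not show that likes cause higher ratings.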

Does the presence of an actor boost a movie’s ratings?

To answer this, I take the 15 actors who appear most often as Actor 1, Actor 2 or Actor 3 and plot the median rating of their movies. Some names, like Morgan Freeman, Steve Buscemi and Bruce Willis, appear on multiple plots, which seems to indicate that their presence in a movie has some effect on its rating.

# Keep the 15 actors who appear most often as Actor 1
actor_1_summary <- movie_data %>% 
  group_by(actor_1_name) %>%   
  summarise(median_rating = median(imdb_score),  
            sd_rating = sd(imdb_score),
            n_ = n()) %>% top_n(15, n_) 

ggplot(actor_1_summary, aes(reorder(actor_1_name, -median_rating), median_rating)) + 
  geom_col() + theme(axis.text.x = element_text(angle = 90)) + 
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Median IMDB rating for the 15 most frequent actors listed as Actor 1") +
  xlab("Actors") + ylab("IMDB Median Rating")

# Keep the 15 actors who appear most often as Actor 2
actor_2_summary <- movie_data %>% 
  group_by(actor_2_name) %>%   
  summarise(median_rating = median(imdb_score),  
            sd_rating = sd(imdb_score),
            n_ = n()) %>% top_n(15, n_) 

ggplot(actor_2_summary, aes(reorder(actor_2_name, -median_rating), median_rating)) + 
  geom_col() + theme(axis.text.x = element_text(angle = 90)) + 
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Median IMDB rating for the 15 most frequent actors listed as Actor 2") +
  xlab("Actors") + ylab("IMDB Median Rating")

# Keep the 15 actors who appear most often as Actor 3
actor_3_summary <- movie_data %>% 
  group_by(actor_3_name) %>%   
  summarise(median_rating = median(imdb_score),  
            sd_rating = sd(imdb_score),
            n_ = n()) %>% top_n(15, n_) 

ggplot(actor_3_summary, aes(reorder(actor_3_name, -median_rating), median_rating)) + 
  geom_col() + theme(axis.text.x = element_text(angle = 90)) + 
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Median IMDB rating for the 15 most frequent actors listed as Actor 3") +
  xlab("Actors") + ylab("IMDB Median Rating")
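
A more direct check, sketched on invented data, would be to compare the ratings of movies that feature a given actor in any of the three slots against the ratings of all other movies. The helper `has_actor()` and the placeholder names are hypothetical and exist only for this illustration.

```r
# Toy stand-in for movie_data; actor names and scores are invented.
toy_movies <- data.frame(
  actor_1_name = c("Actor A", "Actor B", "Actor C", "Actor A", "Actor D", "Actor E"),
  actor_2_name = c("Actor B", "Actor A", "Actor D", "Actor C", "Actor E", "Actor C"),
  actor_3_name = c("Actor C", "Actor D", "Actor A", "Actor E", "Actor B", "Actor D"),
  imdb_score   = c(7.5, 8.0, 7.0, 7.8, 5.5, 6.0)
)

# Hypothetical helper: TRUE when the actor is credited in any of the
# three slots. Assumes the name columns contain no missing values.
has_actor <- function(df, actor) {
  df$actor_1_name == actor |
    df$actor_2_name == actor |
    df$actor_3_name == actor
}

with_actor     <- has_actor(toy_movies, "Actor A")
median_with    <- median(toy_movies$imdb_score[with_actor])
median_without <- median(toy_movies$imdb_score[!with_actor])

print(c(with = median_with, without = median_without))
```

On the real data, the two groups could then be compared with a test such as `wilcox.test()`, keeping in mind that a difference between the groups still would not prove that the actor caused it.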

What is missing from your final project?

The research question ‘What genre is a director expected to succeed in, given his past ratings?’ is missing. Both the director and genre variables have too many distinct values, which makes them difficult to analyse.

What do you hope to accomplish between now and submission time?

Try to answer the research question ‘What genre is a director expected to succeed in, given his past ratings?’ using a limited subset of the data.
Find a better way to answer the question ‘Does the presence of an actor boost a movie’s ratings?’. The current visualization is somewhat confusing and does not make clear whether an actor’s presence affects a movie’s popularity.
Dig a little deeper into the interesting correlation between Actor 1’s and Actor 2’s Facebook like counts and their impact on movie rating.
