Homework 6 : Samhith Barlaya

Submission for Homework 6

Samhith Barlaya

The following are my research questions and their respective plots:


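library(dplyr)    # provides %>%, group_by(), summarise(), top_n()
library(ggplot2)  # provides ggplot() and the geom_*() layers

movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")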

# Median and spread of IMDB scores per country, keeping the 10 highest medians
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),  # NA for a country with a single movie
            n_ = n()) %>%
  top_n(10, median_rating)

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2)

barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country")

Many of the countries listed here have very few movies in the data set (sometimes just one), which skews the result. Instead, I choose the 10 countries with the most movies and redraw the plot above, which gives a more realistic result.

# Same summary, but keeping the 10 countries with the most movies
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(10, n_)

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2)

barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country")
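As a possible variation on the two summaries above, one could rank by median rating while first dropping countries with too few movies. The sketch below uses a small made-up data frame (`toy_movies`) rather than the real file, and the threshold `min_movies` is a hypothetical value chosen only for illustration.

```r
library(dplyr)

# Made-up stand-in for movie_data; the scores are illustrative only
toy_movies <- data.frame(
  country    = c("A", "A", "A", "B", "B", "B", "C"),
  imdb_score = c(6.0, 7.0, 8.0, 5.0, 5.5, 6.0, 9.5)
)

# A single observation has no standard deviation, so the error bar would be missing
single_movie_sd <- sd(9.5)

# Hypothetical minimum number of movies a country needs to be ranked
min_movies <- 3

toy_summary <- toy_movies %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  filter(n_ >= min_movies) %>%     # drop countries with too few movies
  arrange(desc(median_rating))     # then rank by median rating

print(toy_summary)
```

Country "C" has the highest score but only one movie, so it is removed before ranking instead of topping the list.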

To answer my research question “Does the duration of a movie impact its popularity?”, I plot a smoothed trend line of movie duration against IMDB rating.

ggplot(data=movie_data, aes(x=duration, y=imdb_score, group=1)) + geom_smooth()

There are too many points here to draw any conclusions. Hence, I split the movies by language.

ggplot(subset(movie_data, language %in% c('English', 'Cantonese', 'French', 'German',
                                          'Japanese', 'Italian', 'Mandarin', 'Spanish')),
       aes(x = duration, y = imdb_score, group = 1)) +
  geom_smooth() +
  facet_wrap(vars(language))
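A numeric summary can complement the smoothed curves. The sketch below computes a Spearman rank correlation between duration and rating for each language using base R. It runs on a small made-up data frame (`toy_movies`), not the real data, and the choice of Spearman correlation is an assumption of this example, not something the plots above rely on.

```r
# Made-up stand-in for movie_data; the values are illustrative only
toy_movies <- data.frame(
  language   = c("English", "English", "English", "English",
                 "French", "French", "French"),
  duration   = c(90, 100, NA, 120, 80, 95, 130),
  imdb_score = c(6.0, 6.5, 5.0, 7.0, 7.5, 7.0, 6.0)
)

# One Spearman correlation per language, ignoring rows with missing values
rho_by_language <- sapply(
  split(toy_movies, toy_movies$language),
  function(d) cor(d$duration, d$imdb_score,
                  method = "spearman", use = "complete.obs")
)

print(rho_by_language)
```

A value near 1 or -1 indicates a strong monotonic relationship within that language, while a value near 0 suggests duration tells us little about the rating.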

I also look at the share of each content rating among the movies released each year. I restrict the x-axis to 1990 onwards to highlight the most relevant part of the graph.

ggplot(data = movie_data, aes(x = title_year, fill = content_rating)) +
  geom_bar() +
  xlim(c(1990, NA)) +
  ggtitle("Plot of number of movies per year and content rating share per year") +
  xlab("Years") + ylab("Number of movies")
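The stacked bars above show counts, so the share of each rating has to be read off visually. As a rough sketch, the per-year shares could also be computed explicitly. The data frame `toy_movies` below is made up for illustration and only mimics the `title_year` and `content_rating` columns.

```r
library(dplyr)

# Made-up stand-in for movie_data; the values are illustrative only
toy_movies <- data.frame(
  title_year     = c(2000, 2000, 2000, 2000, 2001, 2001),
  content_rating = c("PG", "PG", "R", "R", "R", "PG-13")
)

rating_share <- toy_movies %>%
  count(title_year, content_rating) %>%   # movies per year and rating, in column n
  group_by(title_year) %>%
  mutate(share = n / sum(n)) %>%          # fraction of that year's movies
  ungroup()

print(rating_share)
```

In ggplot2, passing `position = "fill"` to `geom_bar()` draws the same proportions directly, at the cost of hiding the total number of movies per year.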

The first and second plots show an interesting similarity in trend: the IMDB rating varies in a similar way as either Actor 1’s or Actor 2’s Facebook likes increase.

ggplot(data=movie_data, aes(x=actor_1_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
ggplot(data=movie_data, aes(x=actor_2_facebook_likes, y=imdb_score, group=1)) + geom_smooth()
ggplot(data=movie_data, aes(x=actor_3_facebook_likes, y=imdb_score, group=1)) + geom_smooth()

Does the presence of an actor boost a movie’s rating?

To answer this, I take the 15 actors who appear most often as Actor 1, Actor 2 or Actor 3 and plot the median rating of their movies. Some names, such as Morgan Freeman, Steve Buscemi and Bruce Willis, appear on multiple plots, which seems to indicate that their presence in a movie has some effect on its rating.

# Actor 1: the 15 actors with the most movies, with median and spread of their ratings
actor_summary <- movie_data %>%
  group_by(actor_1_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)

ggplot(actor_summary, aes(reorder(actor_1_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of top 15 actors by median IMDB rating of their movies") +
  xlab("Actors") + ylab("IMDB Median Rating")

# Actor 2: same summary and plot
actor_summary <- movie_data %>%
  group_by(actor_2_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)

ggplot(actor_summary, aes(reorder(actor_2_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of top 15 actors by median IMDB rating of their movies") +
  xlab("Actors") + ylab("IMDB Median Rating")

# Actor 3: same summary and plot
actor_summary <- movie_data %>%
  group_by(actor_3_name) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(15, n_)

ggplot(actor_summary, aes(reorder(actor_3_name, -median_rating), median_rating)) +
  geom_col() + theme(axis.text.x = element_text(angle = 90)) +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2) +
  ggtitle("Plot of top 15 actors by median IMDB rating of their movies") +
  xlab("Actors") + ylab("IMDB Median Rating")
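Since the three summaries above differ only in the grouping column, the repeated code could be folded into one helper. The following is a minimal sketch of that idea. The function name `summarise_ratings` and the made-up `toy_movies` data frame are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: summarise imdb_score by any grouping column and
# keep the `top` groups with the most movies
summarise_ratings <- function(data, group_col, top = 15) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarise(median_rating = median(imdb_score),
              sd_rating = sd(imdb_score),
              n_ = n()) %>%
    top_n(top, n_)
}

# Made-up stand-in for movie_data; the values are illustrative only
toy_movies <- data.frame(
  actor_1_name = c("X", "X", "X", "Y", "Y", "Z"),
  imdb_score   = c(7, 8, 9, 5, 6, 4)
)

toy_actor_summary <- summarise_ratings(toy_movies, "actor_1_name", top = 2)
print(toy_actor_summary)
```

With the real data, the same call could be made with `"actor_2_name"`, `"actor_3_name"` or `"country"` as the grouping column.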

What is missing from your final project?

What do you hope to accomplish between now and submission time?


Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".


For attribution, please cite this work as

Barlaya (2022, Jan. 20). Data Analytics and Computational Social Science: Homework 6 : Samhith Barlaya. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayahw6/

BibTeX citation

  author = {Barlaya, Samhith},
  title = {Data Analytics and Computational Social Science: Homework 6 : Samhith Barlaya},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayahw6/},
  year = {2022}