Homework 5 : Samhith Barlaya

Submission for Homework 5

Samhith Barlaya
2022-01-13

1. Error bars to visualize uncertainty around the estimate

To answer one of my research questions, “What country do the most popular movies belong to?”, I plot a bar graph with countries on the x-axis and their median IMDB score on the y-axis. The error bars show one standard deviation of the scores on either side of the median.

library(dplyr)
library(ggplot2)

movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")

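# Median rating, spread, and number of movies per country;
# keep the 10 countries with the highest median rating
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),  # NA when a country has a single movie
            n_ = n()) %>%
  top_n(10, median_rating)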

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) + 
                   geom_col() +  
                   geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width=0.2)

barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country")

Many of the countries listed here have very few movies in the data set (sometimes just one), which skews the result. Instead, I choose the 10 countries with the most movies and redraw the graph, which gives a more realistic result.

# Same summary, but keep the 10 countries with the most movies (n_)
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(10, n_)

barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) + 
                   geom_col() +  
                   geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width=0.2)

barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country")
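
The standard deviation measures spread around the mean, so pairing it with the median can give bars that are hard to interpret, and it is NA for a country with a single movie. One possible alternative, sketched here on made-up data, is to draw the bars from the 25th to the 75th percentile and to drop countries below a minimum number of movies. The `toy_movies` data frame, the `min_movies` cut-off, and the column names `q25`, `q75` and `n_movies` are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for movie_data (country, imdb_score)
toy_movies <- data.frame(
  country    = c("USA", "USA", "USA", "USA", "UK", "UK", "UK", "France"),
  imdb_score = c(6.0, 7.0, 8.0, 5.0, 7.5, 6.5, 8.5, 9.0)
)

min_movies <- 3  # assumed cut-off; would need tuning on the real data

# Median and quartiles per country, keeping only countries with enough movies
country_iqr <- toy_movies %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            q25 = quantile(imdb_score, 0.25, names = FALSE),
            q75 = quantile(imdb_score, 0.75, names = FALSE),
            n_movies = n()) %>%
  filter(n_movies >= min_movies)

# Error bars span the interquartile range instead of median +/- SD
iqr_plot <- ggplot(country_iqr,
                   aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = q25, ymax = q75), width = 0.2) +
  labs(x = "Country", y = "Median IMDB rating (bars: 25th to 75th percentile)")
```

With this toy data, France is dropped because it has only one movie, which is the situation that skewed the first graph.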

2. Facet wraps, fill aesthetic and labels/titles

To answer my research question “Does the duration of a movie impact its popularity?”, I plot a line graph of movie duration against IMDB rating.

ggplot(data = movie_data, aes(x = duration, y = imdb_score, group = 1)) + geom_line() + geom_point()

There are too many points here to draw any conclusions. Hence, I split the movies into panels by language.

ggplot(data = movie_data, aes(x = duration, y = imdb_score, group = 1)) + geom_line() + geom_point() + facet_wrap(vars(language))
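
As an alternative to faceting, overplotting can sometimes be reduced by dropping the connecting line, making the points semi-transparent, and adding a trend line, with a rank correlation as a one-number summary. The sketch below uses a made-up `toy_movies` data frame; the values, the assumption that `duration` is in minutes, and the choice of a linear smoother and Spearman correlation are illustrative, not results from the movie data.

```r
library(ggplot2)

# Hypothetical toy data standing in for movie_data (duration, imdb_score)
toy_movies <- data.frame(
  duration   = c(85, 90, 95, 100, 110, 120, 135, 150, 165, 180),
  imdb_score = c(5.8, 6.1, 5.5, 6.4, 6.9, 6.6, 7.2, 7.0, 7.8, 7.5)
)

# Rank-based correlation: summarises how monotonic the relationship is
rho <- cor(toy_movies$duration, toy_movies$imdb_score,
           method = "spearman", use = "complete.obs")

# Semi-transparent points plus a linear trend line instead of geom_line()
trend_plot <- ggplot(toy_movies, aes(x = duration, y = imdb_score)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Duration (assumed minutes)", y = "IMDB rating")
```

On the real data, a value of `rho` near zero would support the reading that duration has no direct impact, while the trend line makes any overall slope visible even when individual points overlap.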

Revisiting my plot of the number of movies per year from Homework 4, I look at the share of each content rating among the movies released each year. I also limit the x-axis to 1990 onwards to highlight the most relevant part of the graph.

ggplot(data = movie_data, aes(x = title_year, fill = content_rating)) +
  geom_bar() + xlim(c(1990, NA)) +
  ggtitle("Number of movies per year and content rating share per year") +
  xlab("Year") + ylab("Number of movies")
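
A stacked count chart shows shares only indirectly, because the bar heights differ from year to year. A minimal sketch of computing the share explicitly is given here on made-up data; `toy_movies` and the column names `n_movies` and `share` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for movie_data (title_year, content_rating)
toy_movies <- data.frame(
  title_year     = c(2000, 2000, 2000, 2000, 2001, 2001),
  content_rating = c("R", "R", "PG-13", "PG", "R", "PG-13")
)

# Count movies per year and rating, then divide by the yearly total
rating_share <- toy_movies %>%
  count(title_year, content_rating, name = "n_movies") %>%
  group_by(title_year) %>%
  mutate(share = n_movies / sum(n_movies)) %>%
  ungroup()
```

The same proportions can be drawn directly by using `geom_bar(position = "fill")`, which rescales every bar to a height of 1.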

3. Answer the following questions

What is missing (if anything) in your analysis process so far?

I need a way to handle categorical variables that have too many distinct values in the data set, since they make the visualizations hard to read.
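
One common way to handle this, sketched here on made-up values, is to keep the most frequent categories and lump the rest into a single "Other" level before plotting. The `toy_language` vector and the `top_k` cut-off are assumptions for illustration; `forcats::fct_lump_n()` offers a packaged version of the same idea.

```r
# Hypothetical values standing in for a column such as movie_data$language
toy_language <- c("English", "English", "English", "English",
                  "French", "French", "French",
                  "Hindi", "Hindi", "Korean", "Danish")

top_k <- 2  # assumed number of categories to keep

# Most frequent categories first
counts <- sort(table(toy_language), decreasing = TRUE)
keep   <- names(counts)[seq_len(min(top_k, length(counts)))]

# Everything outside the top_k categories (including NA) becomes "Other"
language_lumped <- ifelse(toy_language %in% keep, toy_language, "Other")
```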

What conclusions can you make about your research questions at this point?

  1. The visualizations in this homework give some insight into which countries the most popular movies belong to.
  2. On whether the duration of a movie affects its popularity, the initial analysis doesn’t show any direct impact.
  3. Most movies released in a given year are rated R or PG-13.

What do you think a naive reader would need to fully understand your graphs?

  1. From my end: the plots may need to show less information to be easier to understand (fewer distinct values on the x-axis, or fewer points on a scatter plot).
  2. The reader also needs some knowledge of measures of central tendency, bar graphs and line plots.

Is there anything you want to answer with your dataset, but can’t?

  1. Since the categorical variables have many distinct values, I will be unable to include all of these categories in future analysis. Including them all would likely make the graphs and plots unreadable.
  2. Because there are too many rows of data, I am also unable to compute the similarity between movies (I initially thought I could use the genre and plot keyword information to achieve this).
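
For reference, one simple way to score similarity from genre information is the Jaccard index, the size of the intersection of two genre sets divided by the size of their union. The sketch below uses made-up movies and assumes genres are stored as `|`-separated strings; the names `toy_genres`, `genre_sets` and `jaccard` are hypothetical and this is not the method used in the homework.

```r
# Hypothetical genre strings; the "|" separator is an assumption
toy_genres <- c(
  "Movie A" = "Action|Adventure|Sci-Fi",
  "Movie B" = "Action|Sci-Fi|Thriller",
  "Movie C" = "Comedy|Romance"
)

# One character vector of genres per movie (names are preserved)
genre_sets <- strsplit(toy_genres, "|", fixed = TRUE)

# Jaccard index: shared genres divided by all genres in either movie
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Pairwise similarity matrix; its size grows with the square of the row count
n <- length(genre_sets)
similarity <- matrix(0, n, n,
                     dimnames = list(names(genre_sets), names(genre_sets)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    similarity[i, j] <- jaccard(genre_sets[[i]], genre_sets[[j]])
  }
}
```

Because the full matrix grows quadratically, comparing a single movie of interest against all others, rather than every pair, may be a more practical starting point on a large table.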

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Barlaya (2022, Jan. 14). Data Analytics and Computational Social Science: Homework 5 : Samhith Barlaya. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayahw5/

BibTeX citation

@misc{barlaya2022homework,
  author = {Barlaya, Samhith},
  title = {Data Analytics and Computational Social Science: Homework 5 : Samhith Barlaya},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayahw5/},
  year = {2022}
}