Submission for Homework 5
To answer one of my research questions, “What country do the most popular movies belong to?”, I plot a bar graph with countries on the x-axis and their median IMDB score on the y-axis.
library(dplyr)
library(ggplot2)
movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")
# Median and standard deviation of the IMDB score per country,
# keeping the 10 countries with the highest median (ties included)
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(10, median_rating)
# Bars ordered by descending median, with error bars of one standard deviation
barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2)
barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country")
Many of the countries listed here have very few entries in the data set (sometimes just one), which skews the result. Instead, I choose the 10 countries with the highest entry count and redraw the graph, which gives a more realistic result.
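To illustrate why a single entry is a problem, here is a minimal sketch on made-up scores (the countries "A" and "B" and their values are hypothetical, not taken from movie_metadata.csv). A country with one movie gets that movie's score as its median, and its standard deviation is undefined, so no error bar can be drawn for it.

```r
library(dplyr)

# Made-up scores for illustration only; not taken from movie_metadata.csv
toy <- data.frame(
  country    = c("A", "A", "A", "B"),
  imdb_score = c(6.0, 7.0, 8.0, 9.0)
)

# Same summary as above, applied to the toy data
toy_summary <- toy %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n())

# Country "B" has a single movie: it ranks first by median,
# but sd() of a single value is NA
print(toy_summary)
```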
# Same summary, now keeping the 10 countries with the most entries (ties included)
country_summary <- movie_data %>%
  group_by(country) %>%
  summarise(median_rating = median(imdb_score),
            sd_rating = sd(imdb_score),
            n_ = n()) %>%
  top_n(10, n_)
# Bars ordered by descending median, with error bars of one standard deviation
barPlot <- ggplot(country_summary, aes(reorder(country, -median_rating), median_rating)) +
  geom_col() +
  geom_errorbar(aes(ymin = median_rating - sd_rating, ymax = median_rating + sd_rating), width = 0.2)
barPlot + labs(y = "Movie IMDB Rating, with uncertainty", x = "Country")
To answer my research question “Does the duration of a movie impact its popularity?”, I plot a line graph of movie duration against IMDB rating.
ggplot(data = movie_data, aes(x = duration, y = imdb_score, group = 1)) +
  geom_line() + geom_point()
There are too many points here to draw any conclusions. Hence, I try splitting the movies by their language.
ggplot(data = movie_data, aes(x = duration, y = imdb_score, group = 1)) +
  geom_line() + geom_point() + facet_wrap(vars(language))
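Another way to cut down the number of points, not used in the plots above, might be to bin the durations and plot one median rating per bin. The sketch below assumes 20-minute bins between 60 and 200 minutes and runs on randomly generated data (toy_movies is hypothetical), so it only shows the mechanics, not a result.

```r
library(dplyr)
library(ggplot2)

# Randomly generated stand-in data; not the real movie_metadata.csv
set.seed(1)
toy_movies <- data.frame(
  duration   = round(runif(500, 60, 200)),
  imdb_score = round(runif(500, 1, 10), 1)
)

# Assumed bin width of 20 minutes; include.lowest keeps durations of exactly 60
binned <- toy_movies %>%
  mutate(duration_bin = cut(duration,
                            breaks = seq(60, 200, by = 20),
                            include.lowest = TRUE)) %>%
  group_by(duration_bin) %>%
  summarise(median_rating = median(imdb_score),
            n_ = n())

# One bar per duration bin instead of one point per movie
binnedPlot <- ggplot(binned, aes(duration_bin, median_rating)) +
  geom_col() +
  labs(x = "Duration (minutes, binned)", y = "Median IMDB rating")
binnedPlot
```

On the real data, rows with a missing duration would fall into an NA bin and would need to be filtered out first.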
Revisiting my plot of the count of movies per year from Homework 4, I try to find the share of each content rating among the movies of each year. I also scale the graph to highlight its relevant parts for the reader.
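A minimal sketch of how such per-year shares could be computed and drawn as stacked bars. The column names title_year and content_rating are assumptions about the data set, and the rows below are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up rows; title_year and content_rating are assumed column names
toy_years <- data.frame(
  title_year     = c(2000, 2000, 2000, 2001, 2001, 2001, 2001),
  content_rating = c("PG", "R", "R", "PG", "PG-13", "R", "R")
)

# Count movies per year and rating, then convert counts to within-year shares
rating_share <- toy_years %>%
  count(title_year, content_rating, name = "n_") %>%
  group_by(title_year) %>%
  mutate(share = n_ / sum(n_)) %>%
  ungroup()

# Stacked bars: each year's bar adds up to 1
sharePlot <- ggplot(rating_share,
                    aes(factor(title_year), share, fill = content_rating)) +
  geom_col() +
  labs(x = "Year", y = "Share of movies", fill = "Content rating")
sharePlot
```

To focus on the relevant years, the rows could be filtered on title_year before counting.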
I still need to find a way to handle categorical variables that have too many distinct values in the data set, as this is hindering the visualization.
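One possible approach, sketched here under assumptions rather than applied to the real data, is to keep the most frequent values of a categorical variable and lump the rest into an "Other" group. The data frame toy_lang, the cutoff of two levels, and the label "Other" are all hypothetical.

```r
library(dplyr)

# Made-up language column for illustration only
toy_lang <- data.frame(
  language = c(rep("English", 6), rep("French", 3), "Hindi", "Dari"),
  stringsAsFactors = FALSE
)

# The two most frequent languages (cutoff chosen arbitrarily for the sketch)
top_levels <- toy_lang %>%
  count(language, sort = TRUE) %>%
  head(2) %>%
  pull(language)

# Every other language is relabelled "Other"
lumped <- toy_lang %>%
  mutate(language_group = if_else(language %in% top_levels, language, "Other"))

print(count(lumped, language_group))
```

The new language_group column could then be passed to facet_wrap() in place of language, which would give a handful of panels instead of one per language.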
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Barlaya (2022, Jan. 14). Data Analytics and Computational Social Science: Homework 5 : Samhith Barlaya. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayahw5/
BibTeX citation
@misc{barlaya2022homework,
  author = {Barlaya, Samhith},
  title = {Data Analytics and Computational Social Science: Homework 5 : Samhith Barlaya},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayahw5/},
  year = {2022}
}