Homework2 Erika Nagai

hw2

movie

gender

Introduction to Visualization

Author

Erika Nagai

Published

October 12, 2022

Challenge Overview

Read in a dataset from the _data folder in the course blog repository, or choose your own data. If you decide to use one of the datasets we have provided, please use a challenging dataset - check with us if you are not sure.
Clean the data as needed using dplyr and related tidyverse packages.
Provide a narrative about the data set (look it up if you aren’t sure what you have got) and the variables in your dataset, including what type of data each variable is. The goal of this step is to communicate in a visually appealing way to non-experts - not to replicate r-code.
Identify potential research questions that your dataset can help answer.

# load libraries

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

Warning: package 'stringr' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(summarytools)


Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view

Read in the data

movie = read_csv("_data/movies_metadata.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 45466 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
dbl   (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
lgl   (2): adult, video
date  (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This movie dataset was generated by MovieLens, a non-profit movie review website (https://movielens.org/), and was obtained from Kaggle (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv).

The dataset contains 45,466 movies with release dates between December 9, 1874 and December 16, 2020.

For each movie, the data includes information such as genres, revenue, runtime, languages, and status (released, in production, etc.).

This dataset includes the following columns.

colnames(movie)

 [1] "adult"                 "belongs_to_collection" "budget"               
 [4] "genres"                "homepage"              "id"                   
 [7] "imdb_id"               "original_language"     "original_title"       
[10] "overview"              "popularity"            "poster_path"          
[13] "production_companies"  "production_countries"  "release_date"         
[16] "revenue"               "runtime"               "spoken_languages"     
[19] "status"                "tagline"               "title"                
[22] "video"                 "vote_average"          "vote_count"

The data type of each column is as follows.

str(movie)

spc_tbl_ [45,466 × 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ adult                : logi [1:45466] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ belongs_to_collection: chr [1:45466] "{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path"| __truncated__ NA "{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdro"| __truncated__ NA ...
 $ budget               : num [1:45466] 3.0e+07 6.5e+07 0.0 1.6e+07 0.0 6.0e+07 5.8e+07 0.0 3.5e+07 5.8e+07 ...
 $ genres               : chr [1:45466] "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]" "[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]" "[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]" "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]" ...
 $ homepage             : chr [1:45466] "http://toystory.disney.com/toy-story" NA NA NA ...
 $ id                   : num [1:45466] 862 8844 15602 31357 11862 ...
 $ imdb_id              : chr [1:45466] "tt0114709" "tt0113497" "tt0113228" "tt0114885" ...
 $ original_language    : chr [1:45466] "en" "en" "en" "en" ...
 $ original_title       : chr [1:45466] "Toy Story" "Jumanji" "Grumpier Old Men" "Waiting to Exhale" ...
 $ overview             : chr [1:45466] "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. "| __truncated__ "When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwi"| __truncated__ "A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanw"| __truncated__ "Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive \"good man\""| __truncated__ ...
 $ popularity           : num [1:45466] 21.95 17.02 11.71 3.86 8.39 ...
 $ poster_path          : chr [1:45466] "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg" "/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg" "/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg" "/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg" ...
 $ production_companies : chr [1:45466] "[{'name': 'Pixar Animation Studios', 'id': 3}]" "[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communicat"| __truncated__ "[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]" "[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]" ...
 $ production_countries : chr [1:45466] "[{'iso_3166_1': 'US', 'name': 'United States of America'}]" "[{'iso_3166_1': 'US', 'name': 'United States of America'}]" "[{'iso_3166_1': 'US', 'name': 'United States of America'}]" "[{'iso_3166_1': 'US', 'name': 'United States of America'}]" ...
 $ release_date         : Date[1:45466], format: "1995-10-30" "1995-12-15" ...
 $ revenue              : num [1:45466] 3.74e+08 2.63e+08 0.00 8.15e+07 7.66e+07 ...
 $ runtime              : num [1:45466] 81 104 101 127 106 170 127 97 106 130 ...
 $ spoken_languages     : chr [1:45466] "[{'iso_639_1': 'en', 'name': 'English'}]" "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]" "[{'iso_639_1': 'en', 'name': 'English'}]" "[{'iso_639_1': 'en', 'name': 'English'}]" ...
 $ status               : chr [1:45466] "Released" "Released" "Released" "Released" ...
 $ tagline              : chr [1:45466] NA "Roll the dice and unleash the excitement!" "Still Yelling. Still Fighting. Still Ready for Love." "Friends are the people who let you be yourself... and never let you forget it." ...
 $ title                : chr [1:45466] "Toy Story" "Jumanji" "Grumpier Old Men" "Waiting to Exhale" ...
 $ video                : logi [1:45466] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ vote_average         : num [1:45466] 7.7 6.9 6.5 6.1 5.7 7.7 6.2 5.4 5.5 6.6 ...
 $ vote_count           : num [1:45466] 5415 2413 92 34 173 ...
 - attr(*, "spec")=
  .. cols(
  ..   adult = col_logical(),
  ..   belongs_to_collection = col_character(),
  ..   budget = col_double(),
  ..   genres = col_character(),
  ..   homepage = col_character(),
  ..   id = col_double(),
  ..   imdb_id = col_character(),
  ..   original_language = col_character(),
  ..   original_title = col_character(),
  ..   overview = col_character(),
  ..   popularity = col_double(),
  ..   poster_path = col_character(),
  ..   production_companies = col_character(),
  ..   production_countries = col_character(),
  ..   release_date = col_date(format = ""),
  ..   revenue = col_double(),
  ..   runtime = col_double(),
  ..   spoken_languages = col_character(),
  ..   status = col_character(),
  ..   tagline = col_character(),
  ..   title = col_character(),
  ..   video = col_logical(),
  ..   vote_average = col_double(),
  ..   vote_count = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

movieSummary = dfSummary(movie)

Tidy data

The values in certain columns, namely “belongs_to_collection”, “genres”, “production_companies”, “production_countries”, and “spoken_languages”, are stored as character strings in a list-like format.

movie %>% select(c("belongs_to_collection", "genres", "production_companies", "production_countries", "spoken_languages"))

# A tibble: 45,466 × 5
   belongs_to_collection                          genres produ…¹ produ…² spoke…³
   <chr>                                          <chr>  <chr>   <chr>   <chr>  
 1 {'id': 10194, 'name': 'Toy Story Collection',… [{'id… [{'nam… [{'iso… [{'iso…
 2 <NA>                                           [{'id… [{'nam… [{'iso… [{'iso…
 3 {'id': 119050, 'name': 'Grumpy Old Men Collec… [{'id… [{'nam… [{'iso… [{'iso…
 4 <NA>                                           [{'id… [{'nam… [{'iso… [{'iso…
 5 {'id': 96871, 'name': 'Father of the Bride Co… [{'id… [{'nam… [{'iso… [{'iso…
 6 <NA>                                           [{'id… [{'nam… [{'iso… [{'iso…
 7 <NA>                                           [{'id… [{'nam… [{'iso… [{'iso…
 8 <NA>                                           [{'id… [{'nam… [{'iso… [{'iso…
 9 <NA>                                           [{'id… [{'nam… [{'iso… [{'iso…
10 {'id': 645, 'name': 'James Bond Collection', … [{'id… [{'nam… [{'iso… [{'iso…
# … with 45,456 more rows, and abbreviated variable names
#   ¹production_companies, ²production_countries, ³spoken_languages

Genre

First I need to delete “[” and “]”. I used “qdapRegex”, a package that can remove brackets of any shape (round, square, or curly).

head(movie$genres)

[1] "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"                        
[2] "[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]"                       
[3] "[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]"                                                        
[4] "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]"                           
[5] "[{'id': 35, 'name': 'Comedy'}]"                                                                                          
[6] "[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]"

Question: I used this package to remove the square brackets, but I originally wanted to do so with str_extract or str_replace. That didn’t work because “[ ]” have a special meaning in regex. I would appreciate it if you could show me how I could have done it with the str_ functions.
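One possible way to do this with stringr alone is sketched below on two toy strings in the same shape as movie$genres. The object names (genres_toy, no_brackets, and so on) are made up for illustration. In an R string, a literal bracket in a regex is written with a double backslash, or the pattern can be wrapped in fixed() so that it is not treated as a regex at all.

```r
library(stringr)

# Toy input in the same shape as movie$genres (hypothetical object name)
genres_toy <- c(
  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
  "[{'id': 35, 'name': 'Comedy'}]"
)

# Option 1: escape each bracket with a double backslash inside the R string
no_brackets <- str_remove_all(genres_toy, "\\[|\\]")

# Option 2: treat the bracket as a literal string with fixed()
step1 <- str_replace_all(genres_toy, fixed("["), "")
no_brackets_fixed <- str_replace_all(step1, fixed("]"), "")

# Option 3: strip only the outer brackets by anchoring to the start and end
no_outer <- str_remove_all(genres_toy, "^\\[|\\]$")

print(no_brackets)
```

All three options give the same result here because the toy strings contain no brackets other than the outer pair.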

library(qdapRegex)


Attaching package: 'qdapRegex'

The following object is masked from 'package:dplyr':

    explain

The following object is masked from 'package:ggplot2':

    %+%

movie$clean_genres <- rm_square(movie$genres, extract = TRUE)

head(movie$clean_genres)

[[1]]
[1] "{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}"

[[2]]
[1] "{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}"

[[3]]
[1] "{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}"

[[4]]
[1] "{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}"

[[5]]
[1] "{'id': 35, 'name': 'Comedy'}"

[[6]]
[1] "{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}"

I counted the number of times “id” appears in the “genres” column to see how many genres each movie has.

The maximum number of genres is 8 and there are movies that do NOT have a genre assigned.

movie$num_genre <- str_count(movie$genres, "id")
summary(movie$num_genre)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.004   3.000   8.000

Since the single column “clean_genres” contains several genres per movie, let’s split the string so that each new column holds only one genre.

#str_split(movie$clean_genres, "\\},")

movie <- movie %>% 
  separate(clean_genres, c("genre1", "genre2", "genre3", "genre4", "genre5", "genre6", "genre7", "genre8"), "\\},")

Warning: Expected 8 pieces. Missing pieces filled with `NA` in 45463 rows [1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

The values in the genre1 to genre8 columns still contain unnecessary {}, etc., so let’s clean them up!

# str_replace(movie$genre1, c("\\{", "\\}"), "") misses some braces because
# the two patterns are recycled across the rows instead of both being applied
# to every value, so each pattern is removed in its own step below.

# Apply the same four clean-up steps to genre1 ... genre8 in one pass
movie <- movie %>%
  mutate(across(
    num_range("genre", 1:8),
    ~ .x %>%
      str_replace("\\{", "") %>%          # remove {
      str_replace("\\}", "") %>%          # remove }
      str_replace("'id':", "") %>%        # remove "'id':"
      str_replace(", 'name': ", "")       # remove ", 'name': "
  ))
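The reason a vector of patterns misses some braces can be seen on a small example. This is only a sketch on made-up strings: stringr pairs the pattern vector with the input element by element, so the first pattern is applied to the first string and the second pattern to the second string. Combining both braces into one pattern with “|” removes them from every string.

```r
library(stringr)

# Two made-up strings, each with an opening and a closing brace
x <- c("{'id': 16}", "{'id': 35}")

# The patterns are paired with the strings one by one:
# "\\{" is applied to x[1] only and "\\}" to x[2] only
recycled <- str_replace(x, c("\\{", "\\}"), "")
print(recycled)
# "'id': 16}" "{'id': 35"

# A single pattern with alternation removes both braces from every string
both <- str_remove_all(x, "\\{|\\}")
print(both)
# "'id': 16" "'id': 35"
```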

Next we will make the genre name columns.

pivot_wider(
  movie,
  names_from = genre1, values_from = num_genre
)

Warning: Values from `num_genre` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
  {data} %>%
    dplyr::group_by(adult, belongs_to_collection, budget, genres, homepage, id, imdb_id, original_language, original_title, overview, popularity, poster_path, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, video, vote_average, vote_count, genre2, genre3, genre4, genre5, genre6, genre7, genre8, genre1) %>%
    dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
    dplyr::filter(n > 1L)

Error in `values[spec$.name]`:
! Can't subset columns with `spec$.name`.
✖ Subscript `spec$.name` can't contain the empty string.
✖ It has an empty string at location 12.

Question: I wanted to turn the genres contained in the genre1 to genre8 columns into one column per genre, like the below image, but I haven’t figured out how. I tried pivot_wider, but it didn’t work the way I wanted it to.

Is there any function that I could use for this?
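One approach that might work is to go long first and then wide: extract every genre name, put each movie-genre pair on its own row, and only then call pivot_wider. The sketch below uses a small made-up tibble (toy) rather than the full dataset. The column has_genre and the label “No genre” are assumptions for illustration. The regex assumes genre names contain no apostrophes.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up data in the same shape as the title and genres columns
toy <- tibble(
  title  = c("Movie A", "Movie B", "Movie C"),
  genres = c(
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 12, 'name': 'Adventure'}]",
    "[]"
  )
)

genre_wide <- toy %>%
  # list-column holding every genre name found after 'name': '
  mutate(genre = str_extract_all(genres, "(?<='name': ')[^']+")) %>%
  # one row per movie-genre pair; keep_empty keeps movies with no genre (as NA)
  unnest(genre, keep_empty = TRUE) %>%
  # give the missing genre a real label so it can become a column name
  mutate(genre = replace_na(genre, "No genre"), has_genre = 1) %>%
  # one column per genre, 1 if the movie has it and 0 otherwise
  pivot_wider(names_from = genre, values_from = has_genre, values_fill = 0)

print(genre_wide)
```

On the full data, rows that are exact duplicates would probably need to be removed first (for example with distinct()), since the warning above suggests that some rows are not uniquely identified.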

After I figure out how to clean the “genres” column, I will do the same with the columns “belongs_to_collection”, “production_companies”, “production_countries”, and “spoken_languages”.

Then, I would love to join the Bechdel test dataset by using imdb_id. To pass the Bechdel test, a movie has to have at least 2 [named] female characters who talk to each other about something other than a man.
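A minimal sketch of what such a join could look like is shown below. Both tibbles are made up: the IDs are placeholders, and the Bechdel table's column name (bechdel_pass) is an assumption, not the real dataset's layout. If the real Bechdel data stores the ID in a different format, it would need to be converted to match imdb_id before joining.

```r
library(dplyr)

# Made-up movie table with placeholder IDs
movies_toy <- tibble(
  imdb_id = c("tt0000001", "tt0000002", "tt0000003"),
  title   = c("Movie A", "Movie B", "Movie C")
)

# Hypothetical Bechdel table: column names and values are invented
bechdel_toy <- tibble(
  imdb_id      = c("tt0000001", "tt0000002"),
  bechdel_pass = c(FALSE, TRUE)
)

# Keep every movie; movies without a Bechdel record get NA
joined <- left_join(movies_toy, bechdel_toy, by = "imdb_id")

# Movies that found no match, useful for checking how much data is lost
unmatched <- anti_join(movies_toy, bechdel_toy, by = "imdb_id")

print(joined)
```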

Research questions

How does female representation in a movie affect its popularity and profitability (≒ revenue)?
How has the degree of female representation in movies worldwide changed over time?