DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Final Project Erika Nagai

  • Final materials
    • Fall 2022 posts
    • final Posts

On this page

  • Introduction
  • Read in data
    • (1) “The Movie Data set” from Kaggle :
    • (2) Bechdel test API:
    • (3) Open Movie Database (OMDb):
  • Describe data
    • (1) “The Movie Data set”
    • (2) Bechdel test
  • Tidy data
    • (1) TMDb data
    • Checking missing values
    • Deleting unnecessary columns
    • Cleaning data in JSON nested list format
    • Adding a new column
    • (2) Bechdel data
    • Sense Check
  • Join data sets
  • Read in data / Describe data (OMDb)
  • Data Analysis and Visualization
    • Is female representation in movie stories improving over time?
    • What category represents women better?
    • Do movies with good female representation succeed in gaining more popularity on online review site?
    • Do movies with good female representation succeed in making more money?
  • Conclusion
  • Reflection
  • Contact
  • Bibliography and References

Final Project Erika Nagai

  • Show All Code
  • Hide All Code

  • View Source
movie
gender
bechdel test
female representation
Erika Nagai
Female representation in movie stories
Author

Erika Nagai

Published

December 23, 2022

Introduction

Even though half of the moviegoers in the United States are women, their story is not represented as much as that of men. Only 35% of the main characters in the best movies of 2021 were female, and male characters outnumbered female characters in 85% of the films. (Women and Hollywood, 2022) Also lacking are women behind the scenes in the film business. In 2021, just 12% of the top 100 films are directed by women. (Lauzen, n.d.)

In this project, I focused on analyzing female representation in movie stories rather than in their production. I used the data from three major movie databases, The Movie Database (TMDb), MovieLens, Open Movie Database (OMDb), and the Bechdel Test, one of the most common criteria to measure female presentation.

I would like to answer the following questions in this analysis:

  1. Is female representation in movie stories improving over time?

  2. What genres are doing better than others in terms of female representation?

  3. Are movies with good female representation more popular?

  4. Do movies with good female representation more money?

This analysis intends to document the data cleaning process, which is quite long. If you are curious only about the analysis results, please skip to Data Analysis and Visualization part.

Code
# install libraries

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'tibble' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(summarytools)
Warning: package 'summarytools' was built under R version 4.2.2

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
Code
library(lubridate)
Warning: package 'lubridate' was built under R version 4.2.2
Loading required package: timechange
Warning: package 'timechange' was built under R version 4.2.2

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
Code
library(ggridges)

Read in data

For this analysis, I used the following different data sets.

(1) “The Movie Data set” from Kaggle :

This dataset was obtained from Kaggle “The Movie Data set”(https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv).

Code
#(1) movies_metadata.csv obtained from Kaggle
movie = read_csv("_data/movies_metadata.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 45466 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
dbl   (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
lgl   (2): adult, video
date  (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(2) Bechdel test API:

I generated the data by using the bechdel test API https://bechdeltest.com/api/v1/doc.

Code
#(2) bechdel test obtained by using bechdel API

library(rjson)
library(jsonlite)
Warning: package 'jsonlite' was built under R version 4.2.2

Attaching package: 'jsonlite'
The following objects are masked from 'package:rjson':

    fromJSON, toJSON
The following object is masked from 'package:purrr':

    flatten
Code
#json_file <- "http://bechdeltest.com/api/v1/getAllMovies"
#bechdel <- read_json(path = json_file, simplifyVector = TRUE)
#bechdel$titleId <- paste("tt",bechdel$imdbid, sep = "")

#write.csv(bechdel, file = "_data/bechdel.csv")
bechdel <- read_csv("_data/bechdel.csv")
New names:
• `` -> `...1`
Rows: 9802 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): imdbid, title, titleId
dbl (4): ...1, year, rating, id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(3) Open Movie Database (OMDb):

This dataset was generated by using Open Movie Database (OMDb) API https://www.omdbapi.com/. OMDb provides detailed information such as director, writer, nominated award, the number/score of reviews on IMDb, etc… if you provide the name or IMDb id of movies. I decided to use this database because it gives me data related to movies’ popularity and financial success of the films, which “The Movie Data set” doesn’t include.

OMDb doesn’t give you a list of all movies registered on it. Instead, you need to provide the exact movie title or the IMDb id to get the list of information. Thus, I will use this API once I have the data that join movies_metadata.csv and the Bechdel test. (Please refer to “Read in data / Describe data (OMDb)” for this process.)

Describe data

(1) “The Movie Data set”

This data was originally created from The Movie Database (https://www.themoviedb.org/) and MovieLens (https://movielens.org/).This movie dataset was generated by Movielens, a (non-profit) movie review website (https://movielens.org/), and was obtained from the following Kaggle link. (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv)

The movie dataset contains 45,466 movies with release date between December 9th, 1874 and December 16th, 2020.

The data includes the general information of movies, such as genres, revenue, run time, languages, status (released/in production etc…).

Code
print(summarytools::dfSummary(movie),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

movie

Dimensions: 45466 x 24
Duplicates: 17
Variable Stats / Values Freqs (% of Valid) Graph Missing
adult [logical]
1. FALSE
2. TRUE
45454(100.0%)
9(0.0%)
3 (0.0%)
belongs_to_collection [character]
1. {'id': 415931, 'name': 'T
2. {'id': 421566, 'name': 'T
3. {'id': 645, 'name': 'Jame
4. {'id': 96887, 'name': 'Za
5. {'id': 37261, 'name': 'Th
6. {'id': 34055, 'name': 'Po
7. {'id': 413661, 'name': 'C
8. {'id': 374509, 'name': 'G
9. {'id': 148324, 'name': 'U
10. {'id': 38451, 'name': 'Ch
[ 1688 others ]
29(0.6%)
27(0.6%)
26(0.6%)
26(0.6%)
25(0.6%)
22(0.5%)
21(0.5%)
16(0.4%)
15(0.3%)
15(0.3%)
4272(95.1%)
40972 (90.1%)
budget [numeric]
Mean (sd) : 4224579 (17424133)
min ≤ med ≤ max:
0 ≤ 0 ≤ 3.8e+08
IQR (CV) : 0 (4.1)
1223 distinct values 3 (0.0%)
genres [character]
1. [{'id': 18, 'name': 'Dram
2. [{'id': 35, 'name': 'Come
3. [{'id': 99, 'name': 'Docu
4. []
5. [{'id': 18, 'name': 'Dram
6. [{'id': 35, 'name': 'Come
7. [{'id': 27, 'name': 'Horr
8. [{'id': 35, 'name': 'Come
9. [{'id': 35, 'name': 'Come
10. [{'id': 18, 'name': 'Dram
[ 4059 others ]
5000(11.0%)
3621(8.0%)
2723(6.0%)
2442(5.4%)
1301(2.9%)
1135(2.5%)
974(2.1%)
930(2.0%)
593(1.3%)
532(1.2%)
26215(57.7%)
0 (0.0%)
homepage [character]
1. http://www.georgecarlin.c
2. http://www.wernerherzog.c
3. http://breakblade.jp/
4. http://phantasm.com
5. http://www.crownintlpictu
6. http://www.crownintlpictu
7. http://www.crownintlpictu
8. http://www.kungfupanda.co
9. http://www.missionimpossi
10. http://www.thehungergames
[ 7663 others ]
12(0.2%)
7(0.1%)
6(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
7729(99.3%)
37684 (82.9%)
id [numeric]
Mean (sd) : 108359.9 (112460.7)
min ≤ med ≤ max:
2 ≤ 60003 ≤ 469172
IQR (CV) : 130878.5 (1)
45433 distinct values 3 (0.0%)
imdb_id [character]
1. 0
2. tt1180333
3. tt0022537
4. tt0022879
5. tt0046468
6. tt0062229
7. tt0067306
8. tt0080000
9. tt0082992
10. tt0084387
[ 45407 others ]
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
45427(100.0%)
17 (0.0%)
original_language [character]
1. en
2. fr
3. it
4. ja
5. de
6. es
7. ru
8. hi
9. ko
10. zh
[ 82 others ]
32269(71.0%)
2438(5.4%)
1529(3.4%)
1350(3.0%)
1080(2.4%)
994(2.2%)
826(1.8%)
508(1.1%)
444(1.0%)
409(0.9%)
3608(7.9%)
11 (0.0%)
original_title [character]
1. Alice in Wonderland
2. Hamlet
3. A Christmas Carol
4. Cinderella
5. Les Misérables
6. Macbeth
7. The Three Musketeers
8. Blackout
9. Frankenstein
10. Heidi
[ 43363 others ]
8(0.0%)
8(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
6(0.0%)
6(0.0%)
6(0.0%)
45397(99.8%)
0 (0.0%)
overview [character]
1. No overview found.
2. No Overview
3. A few funny little novels
4. Adaptation of the Jane Au
5. King Lear, old and tired,
6. No movie overview availab
7. Recovering from a nail gu
8. Released
9. A film by Jem Cohen
10. A former aristocrat Ippol
[ 44296 others ]
133(0.3%)
7(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
44345(99.6%)
959 (2.1%)
popularity [numeric]
Mean (sd) : 2.9 (6)
min ≤ med ≤ max:
0 ≤ 1.1 ≤ 547.5
IQR (CV) : 3.3 (2.1)
43757 distinct values 6 (0.0%)
poster_path [character]
1. /5D7UBSEgdyONE6Lql6xS7s6O
2. /2kslZXOaW0HmnGuVPCnQlCdX
3. /qW1oQlOHizRHXZQrpkimYr0o
4. /8VSZ9coCzxOCW2wE2Qene1H1
5. /cdwVC18URfEdQjjxqJyRMoGD
6. /2ngnUQX9Abkesfq72uvBF3uj
7. /2trk2fape4S0CYnRg38kNVe8
8. /4J6Ai4C5YRgfRUTlirrJ7Qsm
9. /5GasjPRAy5rlEyDOH7MeOyxy
10. /5ILjS6XB5deiHop8SXPsYxXW
[ 45014 others ]
5(0.0%)
4(0.0%)
4(0.0%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
45051(99.9%)
386 (0.8%)
production_companies [character]
1. []
2. [{'name': 'Metro-Goldwyn-
3. [{'name': 'Warner Bros.',
4. [{'name': 'Paramount Pict
5. [{'name': 'Twentieth Cent
6. [{'name': 'Universal Pict
7. [{'name': 'RKO Radio Pict
8. [{'name': 'Columbia Pictu
9. [{'name': 'Columbia Pictu
10. [{'name': 'Mosfilm', 'id'
[ 22698 others ]
11875(26.1%)
742(1.6%)
540(1.2%)
505(1.1%)
439(1.0%)
320(0.7%)
247(0.5%)
207(0.5%)
146(0.3%)
145(0.3%)
30297(66.6%)
3 (0.0%)
production_countries [character]
1. [{'iso_3166_1': 'US', 'na
2. []
3. [{'iso_3166_1': 'GB', 'na
4. [{'iso_3166_1': 'FR', 'na
5. [{'iso_3166_1': 'JP', 'na
6. [{'iso_3166_1': 'IT', 'na
7. [{'iso_3166_1': 'CA', 'na
8. [{'iso_3166_1': 'DE', 'na
9. [{'iso_3166_1': 'IN', 'na
10. [{'iso_3166_1': 'RU', 'na
[ 2383 others ]
17851(39.3%)
6282(13.8%)
2238(4.9%)
1654(3.6%)
1356(3.0%)
1030(2.3%)
840(1.8%)
749(1.6%)
735(1.6%)
735(1.6%)
11993(26.4%)
3 (0.0%)
release_date [Date]
min : 1874-12-09
med : 2001-08-30
max : 2020-12-16
range : 146y 0m 7d
17333 distinct values 90 (0.2%)
revenue [numeric]
Mean (sd) : 11209349 (64332247)
min ≤ med ≤ max:
0 ≤ 0 ≤ 2787965087
IQR (CV) : 0 (5.7)
6863 distinct values 6 (0.0%)
runtime [numeric]
Mean (sd) : 94.1 (38.4)
min ≤ med ≤ max:
0 ≤ 95 ≤ 1256
IQR (CV) : 22 (0.4)
353 distinct values 263 (0.6%)
spoken_languages [character]
1. [{'iso_639_1': 'en', 'nam
2. []
3. [{'iso_639_1': 'fr', 'nam
4. [{'iso_639_1': 'ja', 'nam
5. [{'iso_639_1': 'it', 'nam
6. [{'iso_639_1': 'es', 'nam
7. [{'iso_639_1': 'ru', 'nam
8. [{'iso_639_1': 'de', 'nam
9. [{'iso_639_1': 'en', 'nam
10. [{'iso_639_1': 'en', 'nam
[ 1921 others ]
22395(49.3%)
3829(8.4%)
1853(4.1%)
1289(2.8%)
1218(2.7%)
902(2.0%)
807(1.8%)
762(1.7%)
681(1.5%)
572(1.3%)
11152(24.5%)
6 (0.0%)
status [character]
1. Canceled
2. In Production
3. Planned
4. Post Production
5. Released
6. Rumored
2(0.0%)
20(0.0%)
15(0.0%)
98(0.2%)
45014(99.2%)
230(0.5%)
87 (0.2%)
tagline [character]
1. Based on a true story.
2. -
3. Be careful what you wish
4. Trust no one.
5. A Love Story
6. Classic Albums
7. Documentary
8. Drama
9. How far would you go?
10. Know Your Enemy
[ 20272 others ]
7(0.0%)
4(0.0%)
4(0.0%)
4(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
20374(99.8%)
25055 (55.1%)
title [character]
1. Cinderella
2. Alice in Wonderland
3. Hamlet
4. Beauty and the Beast
5. Les Misérables
6. A Christmas Carol
7. Blackout
8. The Three Musketeers
9. Treasure Island
10. Aftermath
[ 42267 others ]
11(0.0%)
9(0.0%)
9(0.0%)
8(0.0%)
8(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
6(0.0%)
45381(99.8%)
6 (0.0%)
video [logical]
1. FALSE
2. TRUE
45367(99.8%)
93(0.2%)
6 (0.0%)
vote_average [numeric]
Mean (sd) : 5.6 (1.9)
min ≤ med ≤ max:
0 ≤ 6 ≤ 10
IQR (CV) : 1.8 (0.3)
92 distinct values 6 (0.0%)
vote_count [numeric]
Mean (sd) : 109.9 (491.3)
min ≤ med ≤ max:
0 ≤ 10 ≤ 14075
IQR (CV) : 31 (4.5)
1820 distinct values 6 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

(2) Bechdel test

This data (bechdel) dataset documents the Bechdel test rating of 9630 movies released between 1874 and 2022. Each row represents a movie.

According to “Merriam-Webster”, the Bechdel test is “a set of criteria used as a test to evaluate a work of fiction (such as a film) based on its inclusion and representation of female characters” (https://www.merriam-webster.com/dictionary/Bechdel%20Test)

A movie passes the Bechdel test when it has at least two female featured characters that talk to each other about other than a man (men).

In bechdel, the Bechdel test rating is registered in the following manner.

0 ~ No two female characters

1 ~ Two female characters who don’t talk to each other

2 ~ Two female characters talk to each other about a man (men)

3 ~ Passes Bechdel test: Two female characters talk to each other about other than a man (men)

Apart from Bechdel ratings, this dataset (bechdel) contains the released year, title of movies, and id. id that starts with “tt” followed by 7 digits is the IMDb id and can work as a foreign key when joining this data with the movie dataset.

Code
print(summarytools::dfSummary(bechdel),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

bechdel

Dimensions: 9802 x 7
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
...1 [numeric]
Mean (sd) : 4901.5 (2829.7)
min ≤ med ≤ max:
1 ≤ 4901.5 ≤ 9802
IQR (CV) : 4900.5 (0.6)
9802 distinct values 0 (0.0%)
year [numeric]
Mean (sd) : 1996.2 (27)
min ≤ med ≤ max:
1010 ≤ 2006 ≤ 2022
IQR (CV) : 25 (0)
142 distinct values 0 (0.0%)
rating [numeric]
Mean (sd) : 2.1 (1.1)
min ≤ med ≤ max:
0 ≤ 3 ≤ 3
IQR (CV) : 2 (0.5)
0:1084(11.1%)
1:2124(21.7%)
2:1000(10.2%)
3:5594(57.1%)
0 (0.0%)
id [numeric]
Mean (sd) : 5224.9 (3052.9)
min ≤ med ≤ max:
1 ≤ 5211.5 ≤ 10641
IQR (CV) : 5233.5 (0.6)
9802 distinct values 0 (0.0%)
imdbid [character]
1. 0035279
2. 0086425
3. 0117056
4. 2043900
5. 2457282
6. 0000001
7. 0000002
8. 0000003
9. 0000004
10. 0000005
[ 9784 others ]
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
9784(99.8%)
3 (0.0%)
title [character]
1. Cinderella
2. Dracula
3. Little Women
4. Pride and Prejudice
5. Robin Hood
6. Shelter
7. A Star Is Born
8. Alice in Wonderland
9. Anna Karenina
10. Annie
[ 9547 others ]
5(0.1%)
4(0.0%)
4(0.0%)
4(0.0%)
4(0.0%)
4(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
9765(99.6%)
0 (0.0%)
titleId [character]
1. tt
2. tt0035279
3. tt0086425
4. tt0117056
5. tt2043900
6. tt2457282
7. tt0000001
8. tt0000002
9. tt0000003
10. tt0000004
[ 9785 others ]
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
9785(99.8%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

Tidy data

(1) TMDb data

Checking missing values

First of all, I took a look at missing values (NA values) in the data.

Code
movie %>% select(everything()) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  t()
Warning: `funs()` was deprecated in dplyr 0.8.0.
ℹ Please use a list of either functions or lambdas:

# Simple named list: list(mean = mean, median = median)

# Auto named with `tibble::lst()`: tibble::lst(mean, median)

# Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
                       [,1]
adult                     3
belongs_to_collection 40972
budget                    3
genres                    0
homepage              37684
id                        3
imdb_id                  17
original_language        11
original_title            0
overview                959
popularity                6
poster_path             386
production_companies      3
production_countries      3
release_date             90
revenue                   6
runtime                 263
spoken_languages          6
status                   87
tagline               25055
title                     6
video                     6
vote_average              6
vote_count                6

Deleting unnecessary columns

I deleted the following columns, which seem irrelevant for this analysis or have too many missing values.
- adult: This information doesn’t add any significant meaning as almost all of the observations have FALSE value for this column

- belong_to_collection: This column has 40000+ missing values
- homepage: Unnecessary for this analysis

- overview: Unnecessary for this analysis

- poster_path: Unnecessary for this analysis

- tagline: Unnecessary for this analysis
- video: This information doesn’t add any significant meaning as almost all of observations have FALSE value for this column

- popularity: This information may be interesting, however, it is not clear how this popularity is measured or where it was generated. Therefore, I decided to use popularity data from OMDb.

- vote_average: Same as popularity

-vote_count: Same as popularity

Then, I changed the order of the columns.

Code
movie <- movie %>% select(-c("adult", "homepage", "overview", "poster_path", "tagline", "belongs_to_collection", "poster_path", "video", "popularity", "vote_average", "vote_count"))

col_order <- c("title", "original_title", "imdb_id", "id", "production_companies", "production_countries", "status", "release_date", "runtime", "revenue", "budget", "original_language", "spoken_languages", "genres")
movie <- movie[, col_order]

colnames(movie)[4] <- "movielens_id"

Cleaning data in JSON nested list format

The values in some certain columns such as “genres”, “production_companies”, “production_countries”,“spoken_languages” are in a JSON list format for example:

” [{‘id’: XXXX, ‘content(name/genre/title)’: XXX}, {‘id’: XXXX, ’content(name/genre/title): XXX}] ”

Code
movie %>% select(c("genres", "production_companies", "production_countries", "spoken_languages"))
# A tibble: 45,466 × 4
   genres                                                produ…¹ produ…² spoke…³
   <chr>                                                 <chr>   <chr>   <chr>  
 1 [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name':… [{'nam… [{'iso… [{'iso…
 2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name':… [{'nam… [{'iso… [{'iso…
 3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name'… [{'nam… [{'iso… [{'iso…
 4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'D… [{'nam… [{'iso… [{'iso…
 5 [{'id': 35, 'name': 'Comedy'}]                        [{'nam… [{'iso… [{'iso…
 6 [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'C… [{'nam… [{'iso… [{'iso…
 7 [{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name':… [{'nam… [{'iso… [{'iso…
 8 [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'A… [{'nam… [{'iso… [{'iso…
 9 [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'A… [{'nam… [{'iso… [{'iso…
10 [{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name':… [{'nam… [{'iso… [{'iso…
# … with 45,456 more rows, and abbreviated variable names
#   ¹​production_companies, ²​production_countries, ³​spoken_languages
Code
# These columns include [] {} and ' in their values so I made a function remove_simbols that removes these unnecessary symbols.
remove_symbols <- function(x) {
  removed_x <- str_remove_all(x, "\\{|\\}|\\[|\\}|\\]|'|,|id|name|:| ")
  return(removed_x)
}

(1). Tidying Genres

Each movie contains information on one or more (up to 8) genres in a single column. To organize the information, the multiple genres contained in genre are split by “],” so that each genre is contained in eight new columns, genre1-genre8. Then, I removed unnecessary symbols such as ,.

Code
# Separating the `genre` by "},"
movie <- movie %>% 
  separate(genres, c("genre1", "genre2", "genre3", "genre4", "genre5", "genre6", "genre7", "genre8"), "\\},", remove = FALSE)
Warning: Expected 8 pieces. Missing pieces filled with `NA` in 45463 rows [1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Code
# Remove unnecessary symbols from each columns

movie$genre1 <- remove_symbols(movie$genre1)
movie$genre2 <- remove_symbols(movie$genre2)
movie$genre3 <- remove_symbols(movie$genre3)
movie$genre4 <- remove_symbols(movie$genre4)
movie$genre5 <- remove_symbols(movie$genre5)
movie$genre6 <- remove_symbols(movie$genre6)
movie$genre7 <- remove_symbols(movie$genre7)
movie$genre8 <- remove_symbols(movie$genre8)

movie %>% 
  select(matches("[1-9]"))
# A tibble: 45,466 × 8
   genre1       genre2       genre3       genre4     genre5 genre6 genre7 genre8
   <chr>        <chr>        <chr>        <chr>      <chr>  <chr>  <chr>  <chr> 
 1 16Animation  35Comedy     10751Family  <NA>       <NA>   <NA>   <NA>   <NA>  
 2 12Adventure  14Fantasy    10751Family  <NA>       <NA>   <NA>   <NA>   <NA>  
 3 10749Romance 35Comedy     <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 4 35Comedy     18Drama      10749Romance <NA>       <NA>   <NA>   <NA>   <NA>  
 5 35Comedy     <NA>         <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 6 28Action     80Crime      18Drama      53Thriller <NA>   <NA>   <NA>   <NA>  
 7 35Comedy     10749Romance <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 8 28Action     12Adventure  18Drama      10751Fami… <NA>   <NA>   <NA>   <NA>  
 9 28Action     12Adventure  53Thriller   <NA>       <NA>   <NA>   <NA>   <NA>  
10 12Adventure  28Action     53Thriller   <NA>       <NA>   <NA>   <NA>   <NA>  
# … with 45,456 more rows

I managed to separate the genre information above. Now all values in genre 1 - 8 has a value in the format of “number + name of genre” (example: 35Comedy)

However, there are 2,445 observations whose genre value doesn’t fit the above format. Most of them don’t have any genre assigned so their genre1 values are empty or blank, which means they don’t have any genre assigned to them.

Code
movie %>% filter(!str_detect(genre1, "^[0-9]")) %>% select(starts_with("genre"))
# A tibble: 2,445 × 9
   genres genre1 genre2 genre3 genre4 genre5 genre6 genre7 genre8
   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 2 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 3 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 4 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 5 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 6 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 7 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 8 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 9 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
10 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
# … with 2,435 more rows

However, looking at them closely, there are three observations whose genre value that is NOT blank but have strange strings.

The values in these observations such as “Carousel Production” and “Aniplex” don’t look like a name of genres but that of production studios. Also their original_title values don’t look like a title of movies but information of languages. It seems like these observations were not correctly read in because their values in other columns look weird.

Since these observations are not reliable, I decided to delete them from this dataset.

Code
movie %>% filter(!str_detect(genre1, "^[0-9]") &!str_detect(genre1, "^[ \t\n]*$")) %>% #^[ \t\n]*$ is a regular expression for blank.
  select(c(original_title, production_countries, genre1:genre3))
# A tibble: 3 × 5
  original_title                           production_cou…¹ genre1 genre2 genre3
  <chr>                                    <chr>            <chr>  <chr>  <chr> 
1 [{'iso_639_1': 'en', 'name': 'English'}] 6.0              Carou… Visio… Teles…
2 [{'iso_639_1': 'ja', 'name': '日本語'}]  7.0              Anipl… GoHan… BROST…
3 [{'iso_639_1': 'en', 'name': 'English'}] 4.3              Odyss… Pulse… Rogue…
# … with abbreviated variable name ¹​production_countries
Code
movie_clean1 <- movie %>% filter(!str_detect(genre1, "^[A-Z]"))

Genre information is more organized but is still not easy to be analyzed. Thus, I decided to make dummy variables of each genre.

For example, if movie A is categorized as “comedy” and “adventure”, the line of movie A should have 1 in the “comedy” column and the “adventure” respectively and 0 in columns of other genres.

Code
#https://community.rstudio.com/t/creating-dummy-columns-based-on-multiple-columns/58145/3
movie_clean1 <- movie_clean1 %>% 
  pivot_longer(cols = matches("genre[1-9]")) %>%
  add_column(count = 1) %>%
  arrange(value) %>%
  filter(str_detect(value, "^[0-9]")) %>%
  mutate(value1 = str_replace_all(value, "[0-9]+", "")) %>%
  select(-c(value, name)) %>%
  arrange(title) %>%
  pivot_wider(
    names_from = value1, 
    values_from = count, 
    values_fill = list(count=0),
    values_fn = list(count = mean)) %>%
  arrange(title)

# I also deleted genre "TVMovie" and "Foreign" because they're more format or origin country information rather than genre.
movie_clean1 <- movie_clean1 %>% select(-c("TVMovie","Foreign"))

(2). Tyding Production_countries

The values in production_countries are written in the following way.

[{‘iso_3166_1’: ‘abbreviation of country’,‘name’: ‘full country name’}]

When there is more than one country in the value, I took the first country in account for this analysis.

Code
movie_clean1$production_countries <- remove_symbols(movie_clean1$production_countries)
movie_clean1$production_countries <- str_extract(movie_clean1$production_countries,"(?<=_1)\\w{2}")

movie_clean1 %>% select(production_countries)
# A tibble: 42,991 × 1
   production_countries
   <chr>               
 1 GB                  
 2 IN                  
 3 US                  
 4 <NA>                
 5 US                  
 6 US                  
 7 FR                  
 8 JP                  
 9 <NA>                
10 US                  
# … with 42,981 more rows

(3). Tyding production_companies & spoken_languages

The values in both of production_companies and spoken_languages are in the following format.

[{‘name’: ‘XXXX’, ‘id’: —}] > [{‘iso_639_1’:‘XXXX’, ‘name’:“—-}]

I only need the information that is written as XXXX in this format.

Code
movie_clean1$production_companies <-str_remove_all(movie_clean1$production_companies, "'id': [0-9]*|'name':|\\[|\\]|'|\\{|,") %>%
  str_replace_all("\\},", ",") %>%
  str_replace_all(" \\} ", ",")%>%
  str_remove(" \\}")

movie_clean1$spoken_languages <-str_remove_all(movie_clean1$spoken_languages, "'iso_639_1':|'name': '\\w+'|\\[|\\]|'|\\{|,") %>%
  str_replace_all(" \\} ", ",") %>%
  str_remove_all(" \\}")

movie_clean1 %>% select(production_companies, spoken_languages)
# A tibble: 42,991 × 2
   production_companies                                                  spoke…¹
   <chr>                                                                 <chr>  
 1 " Screen Yorkshire, British Film Institute (BFI), Creative Scotland,… " en"  
 2 ""                                                                    " hi"  
 3 " Sebastian Films Limited"                                            " en"  
 4 " Lorimar Pictures"                                                   " en"  
 5 " Lone Star Production, Monogram Pictures"                            " en, …
 6 ""                                                                    " en"  
 7 " Barnholtz Entertainment"                                            " en, …
 8 ""                                                                    " zh"  
 9 ""                                                                    " fr"  
10 " Disney Channel"                                                     " en"  
# … with 42,981 more rows, and abbreviated variable name ¹​spoken_languages

Adding a new column

A new column years was created to group years by decade.

Code
movie_clean1 <- movie_clean1 %>% 
  mutate(
    years = case_when(
    lubridate::year(release_date) < 1920 ~ "Before 1920",
    lubridate::year(release_date) >= 1920 & lubridate::year(release_date) < 1930 ~ "1920s",
    lubridate::year(release_date) >= 1930 & lubridate::year(release_date) < 1940 ~ "1930s",
    lubridate::year(release_date) >= 1940 & lubridate::year(release_date) < 1950 ~ "1940s",
    lubridate::year(release_date) >= 1950 & lubridate::year(release_date) < 1960 ~ "1950s",
    lubridate::year(release_date) >= 1960 & lubridate::year(release_date) < 1970 ~ "1960s",
    lubridate::year(release_date) >= 1970 & lubridate::year(release_date) < 1980 ~ "1970s",
    lubridate::year(release_date) >= 1980 & lubridate::year(release_date) < 1990 ~ "1990s",
    lubridate::year(release_date) >= 1990 & lubridate::year(release_date) < 2000 ~ "1990s",
    lubridate::year(release_date) >= 2000 & lubridate::year(release_date) < 2010 ~ "2000s",
    TRUE ~ "2010s"
  )
  )

Now, the dataset is cleaner. Look at the summary again.

Code
print(summarytools::dfSummary(movie_clean1),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

movie_clean1

Dimensions: 42991 x 33
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
title [character]
1. Cinderella
2. Alice in Wonderland
3. Hamlet
4. Les Misérables
5. A Christmas Carol
6. Beauty and the Beast
7. The Three Musketeers
8. Treasure Island
9. Aftermath
10. Countdown
[ 40025 others ]
10(0.0%)
9(0.0%)
9(0.0%)
8(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
6(0.0%)
6(0.0%)
42912(99.8%)
3 (0.0%)
original_title [character]
1. Alice in Wonderland
2. Hamlet
3. A Christmas Carol
4. Les Misérables
5. Macbeth
6. The Three Musketeers
7. Cinderella
8. Frankenstein
9. Heidi
10. The Hound of the Baskervi
[ 41043 others ]
8(0.0%)
8(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
6(0.0%)
6(0.0%)
6(0.0%)
6(0.0%)
42923(99.8%)
0 (0.0%)
imdb_id [character]
1. tt0000001
2. tt0000003
3. tt0000005
4. tt0000008
5. tt0000010
6. tt0000012
7. tt0000014
8. tt0000023
9. tt0000029
10. tt0000070
[ 42969 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
42969(100.0%)
12 (0.0%)
movielens_id [numeric]
Mean (sd) : 104553.2 (111978.6)
min ≤ med ≤ max:
2 ≤ 55734 ≤ 469172
IQR (CV) : 118823.5 (1.1)
42991 distinct values 0 (0.0%)
production_companies [character]
1. (Empty string)
2.  
·
Metro-Goldwyn-Mayer (MGM
3.  
·
Warner Bros.
4.  
·
Paramount Pictures
5.  
·
Twentieth Century Fox Fi
6.  
·
Universal Pictures
7.  
·
RKO Radio Pictures
8.  
·
Columbia Pictures Corpor
9.  
·
Columbia Pictures
10.  
·
Mosfilm
[ 22474 others ]
9770(22.7%)
737(1.7%)
538(1.3%)
500(1.2%)
434(1.0%)
320(0.7%)
247(0.6%)
207(0.5%)
146(0.3%)
144(0.3%)
29945(69.7%)
3 (0.0%)
production_countries [character]
1. US
2. GB
3. FR
4. CA
5. JP
6. DE
7. IT
8. RU
9. IN
10. ES
[ 132 others ]
18202(47.5%)
3031(7.9%)
2639(6.9%)
1469(3.8%)
1430(3.7%)
1389(3.6%)
1382(3.6%)
777(2.0%)
754(2.0%)
576(1.5%)
6653(17.4%)
4689 (10.9%)
status [character]
1. In Production
2. Planned
3. Post Production
4. Released
5. Rumored
20(0.0%)
15(0.0%)
96(0.2%)
42601(99.2%)
205(0.5%)
54 (0.1%)
release_date [Date]
min : 1874-12-09
med : 2001-10-29
max : 2020-12-16
range : 146y 0m 7d
16860 distinct values 29 (0.1%)
runtime [numeric]
Mean (sd) : 95.5 (36.6)
min ≤ med ≤ max:
0 ≤ 95 ≤ 1256
IQR (CV) : 21 (0.4)
343 distinct values 182 (0.4%)
revenue [numeric]
Mean (sd) : 11845510 (66095743)
min ≤ med ≤ max:
0 ≤ 0 ≤ 2787965087
IQR (CV) : 0 (5.6)
6844 distinct values 3 (0.0%)
budget [numeric]
Mean (sd) : 4461834 (17886326)
min ≤ med ≤ max:
0 ≤ 0 ≤ 3.8e+08
IQR (CV) : 0 (4)
1215 distinct values 0 (0.0%)
original_language [character]
1. en
2. fr
3. it
4. ja
5. de
6. es
7. ru
8. hi
9. ko
10. zh
[ 79 others ]
30566(71.1%)
2348(5.5%)
1359(3.2%)
1287(3.0%)
1029(2.4%)
927(2.2%)
786(1.8%)
478(1.1%)
431(1.0%)
391(0.9%)
3379(7.9%)
10 (0.0%)
spoken_languages [character]
1.  
·
en
2. (Empty string)
3.  
·
fr
4.  
·
ja
5.  
·
it
6.  
·
es
7.  
·
ru
8.  
·
de
9.  
·
en, fr
10.  
·
en, es
[ 1896 others ]
21943(51.0%)
2660(6.2%)
1785(4.2%)
1220(2.8%)
1097(2.6%)
826(1.9%)
769(1.8%)
722(1.7%)
679(1.6%)
568(1.3%)
10719(24.9%)
3 (0.0%)
genres [character]
1. [{'id': 18, 'name': 'Dram
2. [{'id': 35, 'name': 'Come
3. [{'id': 99, 'name': 'Docu
4. [{'id': 18, 'name': 'Dram
5. [{'id': 35, 'name': 'Come
6. [{'id': 27, 'name': 'Horr
7. [{'id': 35, 'name': 'Come
8. [{'id': 35, 'name': 'Come
9. [{'id': 18, 'name': 'Dram
10. [{'id': 27, 'name': 'Horr
[ 4055 others ]
4996(11.6%)
3620(8.4%)
2721(6.3%)
1300(3.0%)
1133(2.6%)
974(2.3%)
930(2.2%)
593(1.4%)
531(1.2%)
528(1.2%)
25665(59.7%)
0 (0.0%)
War [numeric]
Min : 0
Mean : 0
Max : 1
0:41669(96.9%)
1:1322(3.1%)
0 (0.0%)
Drama [numeric]
Min : 0
Mean : 0.5
Max : 1
0:22747(52.9%)
1:20244(47.1%)
0 (0.0%)
Action [numeric]
Min : 0
Mean : 0.2
Max : 1
0:36399(84.7%)
1:6592(15.3%)
0 (0.0%)
Thriller [numeric]
Min : 0
Mean : 0.2
Max : 1
0:35372(82.3%)
1:7619(17.7%)
0 (0.0%)
Crime [numeric]
Min : 0
Mean : 0.1
Max : 1
0:38687(90.0%)
1:4304(10.0%)
0 (0.0%)
Western [numeric]
Min : 0
Mean : 0
Max : 1
0:41949(97.6%)
1:1042(2.4%)
0 (0.0%)
Documentary [numeric]
Min : 0
Mean : 0.1
Max : 1
0:39061(90.9%)
1:3930(9.1%)
0 (0.0%)
Romance [numeric]
Min : 0
Mean : 0.2
Max : 1
0:36261(84.3%)
1:6730(15.7%)
0 (0.0%)
Family [numeric]
Min : 0
Mean : 0.1
Max : 1
0:40224(93.6%)
1:2767(6.4%)
0 (0.0%)
Animation [numeric]
Min : 0
Mean : 0
Max : 1
0:41060(95.5%)
1:1931(4.5%)
0 (0.0%)
Comedy [numeric]
Min : 0
Mean : 0.3
Max : 1
0:29815(69.4%)
1:13176(30.6%)
0 (0.0%)
Horror [numeric]
Min : 0
Mean : 0.1
Max : 1
0:38320(89.1%)
1:4671(10.9%)
0 (0.0%)
Mystery [numeric]
Min : 0
Mean : 0.1
Max : 1
0:40527(94.3%)
1:2464(5.7%)
0 (0.0%)
Adventure [numeric]
Min : 0
Mean : 0.1
Max : 1
0:39501(91.9%)
1:3490(8.1%)
0 (0.0%)
History [numeric]
Min : 0
Mean : 0
Max : 1
0:41593(96.7%)
1:1398(3.3%)
0 (0.0%)
ScienceFiction [numeric]
Min : 0
Mean : 0.1
Max : 1
0:39947(92.9%)
1:3044(7.1%)
0 (0.0%)
Fantasy [numeric]
Min : 0
Mean : 0.1
Max : 1
0:40682(94.6%)
1:2309(5.4%)
0 (0.0%)
Music [numeric]
Min : 0
Mean : 0
Max : 1
0:41394(96.3%)
1:1597(3.7%)
0 (0.0%)
years [character]
1. 1920s
2. 1930s
3. 1940s
4. 1950s
5. 1960s
6. 1970s
7. 1990s
8. 2000s
9. 2010s
10. Before 1920
398(0.9%)
1264(2.9%)
1444(3.4%)
1997(4.6%)
2474(5.8%)
3270(7.6%)
8965(20.9%)
10605(24.7%)
12291(28.6%)
283(0.7%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

(2) Bechdel data

Removing and Mutating Columns

imdbid and titleId are duplicated because titleId is “tt+imdbid”. movie_clean1 data frame has IDs in the same format (starting with tt) as titleId of bechdel data frame so I deleted imdbid column.

Also, year and title information is available in movie data frame, thus I deleted them.

I made a new column bechdel_pass where 1 means the movie passes bechdel test and 0 means otherwise.

Code
#

bechdel <- bechdel %>% select(-c("imdbid","...1","title"))
colnames(bechdel) <- c("year", "bechdel_rating", "id", "titleid")

summary(bechdel)
      year      bechdel_rating        id          titleid         
 Min.   :1010   Min.   :0.000   Min.   :    1   Length:9802       
 1st Qu.:1988   1st Qu.:1.000   1st Qu.: 2558   Class :character  
 Median :2006   Median :3.000   Median : 5212   Mode  :character  
 Mean   :1996   Mean   :2.133   Mean   : 5225                     
 3rd Qu.:2013   3rd Qu.:3.000   3rd Qu.: 7792                     
 Max.   :2022   Max.   :3.000   Max.   :10641                     
Code
# Changing the column order
col_order <- c("id", "titleid","year", "bechdel_rating")
bechdel <- bechdel[, col_order]

# Mutate a new column bechdel_pass
bechdel <- bechdel %>% mutate(
  bechdel_pass = case_when(
    bechdel_rating == 3 ~ 1,
    TRUE ~ 0
  )
)

# Change data types of columns
bechdel$bechdel_rating <- factor(bechdel$bechdel_rating, levels = c("0", "1", "2", "3"))
bechdel$bechdel_pass <- factor(bechdel$bechdel_pass, levels = c("0", "1"))

Sense Check

This data looks quite clean, however, I realized that the minimum number of year is 1010, which is weird. “Inazuma Eleven The Movie” was released in 2010, but it seems to have been mistakenly recorded as 1010. So I manually corrected the data.

Code
summary(bechdel)
       id          titleid               year      bechdel_rating bechdel_pass
 Min.   :    1   Length:9802        Min.   :1010   0:1084         0:4208      
 1st Qu.: 2558   Class :character   1st Qu.:1988   1:2124         1:5594      
 Median : 5212   Mode  :character   Median :2006   2:1000                     
 Mean   : 5225                      Mean   :1996   3:5594                     
 3rd Qu.: 7792                      3rd Qu.:2013                              
 Max.   :10641                      Max.   :2022                              
Code
bechdel %>% 
  filter(year < 1800)
# A tibble: 1 × 5
     id titleid    year bechdel_rating bechdel_pass
  <dbl> <chr>     <dbl> <fct>          <fct>       
1 10556 tt1794796  1010 3              1           
Code
bechdel$year[bechdel$year == 1010] <- 2010

Join data sets

We have two data frames movie and bechdel , which have imdb id as a foreign key.

Code
movie_bechdel_join <- inner_join(movie_clean1, bechdel, 
                                 by=c("imdb_id" = "titleid"),
                                 copy = TRUE)

Read in data / Describe data (OMDb)

As we have a joined data, I read in the movie popularity data from OMDb using OMDb API.

omdb_df has 5 columns

  • imdbID: can be used as a foreign key when joining with other data frames

  • Director: Director’s name

  • Metascore: Review scores from 0 to 100 on the metacritic website (https://www.metacritic.com/movie). They are weighted averages. and reviews by certain reviewers are given more weight, however the detail is not revealed.

  • imdbRating: Review scores from 0 to 10 on IMDb website (https://www.imdb.com/). They are weighted averages.

  • imdbVotes: Number of votes on IMDb website.

Code
# Generating OMDb dataframe
# imdb_id_list <- c()
# 
# for (i in c(1:length(movie_bechdel_join$imdb_id))) {
#   imdb_id_list[[i]] <- movie_bechdel_join$imdb_id[i]
# }
# 
# omdb_list <- lapply(imdb_id_list, function(movie_id) {
#   actor_vector <- find_by_id(movie_id)
#   actor_vector
# })
# 
# omdb_df <- tibble(place = omdb_list) %>%
#   unnest_wider(place) %>%
#   select(c("imdbID", "Director", "Metascore", "imdbRating", "imdbVotes", "BoxOffice")) %>% unnest() %>% distinct()

#write.csv(omdb_df, "_data/omdb.csv")
omdb_df <- read_csv("_data/omdb.csv")
New names:
Rows: 7729 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): imdbID, Director, Metascore, BoxOffice dbl (3): ...1, imdbRating,
imdbVotes
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
omdb_df$Metascore <- as.numeric(omdb_df$Metascore)
Warning: NAs introduced by coercion
Code
omdb_df$BoxOffice <- omdb_df$BoxOffice %>%
  str_replace_all(",", "") %>% 
  str_extract("[0-9]*(?=$)") %>% 
  as.numeric()

omdb_df <- omdb_df %>% select(-...1)
glimpse(omdb_df)
Rows: 7,729
Columns: 6
$ imdbID     <chr> "tt2614684", "tt0074080", "tt0090556", "tt1022603", "tt1132…
$ Director   <chr> "Yann Demange", "Beverly Sebastian, Ferd Sebastian", "Tom M…
$ Metascore  <dbl> 83, NA, NA, 76, 58, NA, 71, 52, 45, 52, 60, 76, NA, NA, 70,…
$ imdbRating <dbl> 7.2, 5.4, 7.6, 7.7, 6.3, 6.3, 7.4, 6.5, 5.0, 6.5, 5.5, 7.2,…
$ imdbVotes  <dbl> 56942, 1278, 2371, 516486, 2230, 8603, 181857, 74038, 35911…
$ BoxOffice  <dbl> 1270847, NA, 441863, 32391374, 230600, NA, NA, 27766, 9600,…

Now I have the data from OMDb as omdb_df. Let’s join movie_bechdel_join and omdb_df.

Code
data <- inner_join(movie_bechdel_join, omdb_df, 
                                 by=c("imdb_id" = "imdbID"),
                                 copy = TRUE)

Data Analysis and Visualization

Is female representation in movie stories improving over time?

To find out the answer for this questions, I looked at the result of the Bechdel test.

The number of movies that pass the Bechdel test increases over time, especially after around 1980 as the total number of released movies increased.

Code
movie_bechdel_join %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x= "year", y = "Number", title = "The number of movies by Bechdel test ratings", fill = "Bechdel rating")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

However, the total number of released movies is increasing as well. Is the percentage of films that pass the Bechdel test increasing?

In the 1920s, which is 100 years ago, less than 20% of films passed the Bechdel test. There was a big increase in the percentage of Bechdel test-passing movies in the 1930s, however, the percentage of films passing the Bechdel test was stagnant at a bit lower than 50% from the 1930s to the 1950s. From 1950s to 1970s, the representation of women in the film somehow went backward, with less than 45% of films meeting the requirements of the Bechdel test; from the 1970s to the present, the percentage of films that pass the Bechdel test has continued to increase, reaching approximately 70% now. Although the representation of women seems to have improved since 1970, it has not improved dramatically compared to the 1930s and 1940s.

Code
data_with_p <- data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n = n()) %>%
  mutate(percentage = n/sum(n)*100)
`summarise()` has grouped output by 'years'. You can override using the
`.groups` argument.
Code
data_with_p %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col() +
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", font="bold", position = position_stack(vjust = 0.5)) +
  
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x="year", y="percent", title = "% of movies that pass/don't pass Bechdel Test", fill="Bechdel Rating") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(ncol = 2))
Warning in geom_text(aes(label = paste0(round(percentage), "%")), color =
"white", : Ignoring unknown parameters: `font`

Perhaps the percentage of films that pass the Bechdel test has not changed too dramatically compared to 70 years ago. However, if we focus on the movies that do NOT pass the Bechdel test, we can see a different trend happening.

In the 1920s, the majority of films did not feature two significant women; this trend changed greatly in the 1930s, with the percentage of films that did not feature two women decreasing significantly, peaking in the 1960s to about 20%, and then declining.
The percentage of films in which the only conversation between female characters is about men also decreased from 1930 to the 2010s.
Interestingly, however, the percentage of films in which female characters do not speak to each other (dark gray in the figure) has changed very little from the 1930s to the 2010s.

Code
data_with_p %>%
  filter(bechdel_rating!="3") %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col()+
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", font="bold", position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men")) +
  labs(x="year", y="percent", title = "% of movies that do NOT pass Bechdel test by Bechdel test criteria", fill = "Bechdel Rating", subtitle = "% doesn't add up to 100% because it is missing the % of the movies that pass Bechdel test") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
Warning in geom_text(aes(label = paste0(round(percentage), "%")), color =
"white", : Ignoring unknown parameters: `font`

What category represents women better?

I focused on 4428 movies released between 2000 and 2020 because I want to do category-based analysis of a recent situation, and it’s better to focus on the period where more movies were released to observe the trend.

The degree to which women are portrayed in a film’s story varies widely from genre to genre.

The genre with the highest percentage of films passing the bechdel test is Romance (72%) and the lowest is Western (22%), showing a large difference.

Interestingly, not only the percentage of films that pass the Bechdel test, but also the percentage that meet each of the three criteria for passing the Bechdel test varies widely by genre.

For example, Western, War, and Documentary struggle more than other genres to feature two female characters. (To be precise, the Bechdel Test is for evaluating fictional stories, so it may not be appropriate to evaluate Documentary.)

Code
viz_by_genre <- data %>%
  pivot_longer(cols = c(War:Music),names_to = "genre_name", values_drop_na = TRUE) %>%
  filter(value == 1) 


viz_by_genre_with_p <- viz_by_genre %>%
  group_by(genre_name, years, bechdel_rating) %>%
  dplyr::summarize(n=n()) %>%
  mutate(proportion = n/sum(n),
         percentage=proportion*100)
`summarise()` has grouped output by 'genre_name', 'years'. You can override
using the `.groups` argument.
Code
viz_by_genre_with_p %>%
  filter(years %in% c("2000s", "2010s")) %>%
  group_by(genre_name, bechdel_rating) %>%
  summarise(per=mean(percentage)) %>%
  arrange(desc(bechdel_rating), desc(per)) -> order_genre
`summarise()` has grouped output by 'genre_name'. You can override using the
`.groups` argument.
Code
viz_by_genre_with_p$genre_name <-factor(viz_by_genre_with_p$genre_name, levels = order_genre$genre_name[1:18])

viz_by_genre_with_p %>%
  filter(years=="2000s"|years=="2010s") %>%
  group_by(genre_name, bechdel_rating) %>%
  summarize(n=sum(n)) %>%
  mutate(p=n/sum(n),
         percentage = 100*p) %>%
  ggplot(aes(x=genre_name, y=percentage, fill= bechdel_rating)) +
  geom_col() +
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", position=position_stack(vjust=0.5), size=3)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x="Genre", y="%", fill="Bechdel rating")+
  coord_flip()  +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
`summarise()` has grouped output by 'genre_name'. You can override using the
`.groups` argument.

Above we looked at trends over the past 20 years, but what if we look at the historical trends over past 100 years?
Some genres continue to improve their representation of women compared to the past, while others remain the same.
For example, Horror, Adventure, Science Fiction, Animation, and Fantasy have gradually increased the percentage of films that pass the Bechdel test.
Western, History, War, and Crime, Comedy on the other hand, have not changed much in terms of female representation compared to 100 years ago.

Code
viz_by_genre_with_p %>%
  filter(years!="Before 1920") %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col() +
  facet_wrap(~genre_name, ncol = 5)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x = "year", y = "percent", title = "Proportion of movies that pass / don't pass Bechdel test by categories")+
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2)) +
  scale_x_discrete(breaks = c('1920s', '1960s', '2010s'),
                   labels = c('1920', '1960', '2010'))

Do movies with good female representation succeed in gaining more popularity on online review site?

Again, I focused on the movies released after 2000 to see the recent trends.

I compared Metascore and IMDb ratings of the films that pass and don’t pass the Bechdel test.

The movies that don’t pass the Bechdel test seem to score higher on Metascore, however, the difference is quite small, and there seems to be no relationship between whether a film passes the Bechdel test and its reputation (Metascore).

Code
data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=Metascore)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test")
Warning: Removed 915 rows containing non-finite values (`stat_boxplot()`).

T-test shows that there is no statistically significant difference in Metascores of the movies that pass and that don’t pass Bechdel test.

Code
ttest_pass <- data %>% 
  filter(bechdel_pass == "1" & lubridate::year(release_date) > 2000) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

ttest_not_pass <- data %>%
  filter(bechdel_pass == "0" & lubridate::year(release_date) > 2000) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

t.test(ttest_pass$Metascore, ttest_not_pass$Metascore)

    Welch Two Sample t-test

data:  ttest_pass$Metascore and ttest_not_pass$Metascore
t = 0.37842, df = 2936, p-value = 0.7051
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9448254  1.3967391
sample estimates:
mean of x mean of y 
 57.81359  57.58764 

Then, I compared the IMDb scores of the movies that pass and don’t pass Bechdel test. Those that don’t pass Bechdel test seem to score a bit higher.

Code
data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=imdbRating)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test", y="IMDb Rating")
Warning: Removed 8 rows containing non-finite values (`stat_boxplot()`).

T-test shows that the movies that do not meet Bechdel test criteria score statistically higher than those that do.

Code
t.test(ttest_pass$imdbRating, ttest_not_pass$imdbRating)

    Welch Two Sample t-test

data:  ttest_pass$imdbRating and ttest_not_pass$imdbRating
t = -5.6621, df = 3804.6, p-value = 1.605e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2230059 -0.1082905
sample estimates:
mean of x mean of y 
 6.383584  6.549233 

Do movies with good female representation succeed in making more money?

Code
data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=BoxOffice)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel"))
Warning: Removed 1115 rows containing non-finite values (`stat_boxplot()`).

Even though the average box office amount of not-Bechdel-passing movies is higher than that of Bechdel-passing movies, T-test shows that the difference is not significant.

Code
t.test(ttest_pass$BoxOffice, ttest_not_pass$BoxOffice)

    Welch Two Sample t-test

data:  ttest_pass$BoxOffice and ttest_not_pass$BoxOffice
t = -1.5212, df = 3042.1, p-value = 0.1283
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9207124  1162341
sample estimates:
mean of x mean of y 
 43773430  47795821 

Conclusion

The representation of women in film stories stagnated from the 1930s to the 1970s, but has gradually improved, especially since the 1970s.
However, this is not true for all genres: the percentage of films that pass the Bechdel test varies widely from the 20% range (Western) to the 70% range (Romance), depending on the genre. Some genres (Horror, Adventure, Science Fiction, Fantasy) have consistently increased the percentage of films that pass the Bechdel test, while others (War, Crime, History, Western) have remained stagnant. The percentage of films that pass the test is increasing constantly.

It turns out that whether or not a film passes the Bechdel test has little impact on its success (i.e., popularity and revenue). However, the reviews on the imdb website are higher for films that do not pass the Bechdel Test.

This analysis has so far shown how women’s representation has (or has not) improved, and which genres have been particularly well represented by women, but it has not explained why this has happened. I would like to conduct an analysis that can answer the question “Why?”. For example, I would like to test the hypothesis that the increase in female representation in fantasy and horror is due to the increase in female audiences.
Also, I would like to further investigate the relationship between female representation and film success (popularity and revenue). In particular, I am interested in how the gender of the audience affects their evaluation of films in which women are represented and those in which they are not.

Reflection

One of the most difficult parts of this analysis was finding the right data. Even though there are many movie database, the information was not complete or not clear enough to be used. Thus, I ended up using two different movie databases and one Bechdel test database because one movie database (The Movie Dataset) didn’t have a clear definition of review ratings and popularity.

Another difficulty that I faced is that I could not find a good dataset of the gender of directors/writers of movies. Even though I found “credit” dataset that included the name and the gender of people who were involved in the movie production, the information on director was limited and only a few thousand movies had director information. Originally, I wanted to visualize the relationship between the gender of movie directors and the female representation, however, I gave up doing that for this project due to the lack of an appropriate dataset.

Contact

The dataset that I used for this analysis can be found on my GitHub (https://github.com/Enagai-nagai/601_Fall_2022/tree/template/posts/_data).

I’m interested in analyzing and visualizing data related to the entertainment industry, behavior on digital platforms, and gender issues.

If you have any comments or questions, please contact me through Email.

  • enagai ★ umass.edu (please replace ★ with @)

  • nagainagai.e ★ gmail.com (please replace ★ with @)

Bibliography and References

I used R and RStudio to realize this analysis.

The below is the source of data sets:

  • Bechdel API: https://bechdeltest.com/api/v1/doc

  • The Movie Dataset from Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

  • OMDb API: https://www.omdbapi.com/

The below is the source of information:

  • Women and Hollywood. (2022, March 15) Study: Women made up 34% of speaking roles in 2021’s top films, majority of those characters were white. . Retrieved December 13, 2022, from https://womenandhollywood.com/study-women-made-up-34-of-speaking-roles-in-2021s-top-films-majority-of-those-characters-were-white/#:~:text=In%202021’s%20top%20films%2C%20females,and%2037%20percent%20in%202019.

  • Lauzen, M. M. (2021). (rep.). It’s a Man’s (Celluloid) World, Even in a Pandemic Year: Portrayals of Female Characters in the Top U.S. Films of 2021. San Diego, California: San Diego State University and The Center for the Study of Women in Television and Film.

Source Code
---
title: "Final Project Erika Nagai"
author: "Erika Nagai"
description: "Female representation in movie stories"
date: "`r Sys.Date()`"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - movie
  - gender
  - bechdel test
  - female representation
  - Erika Nagai
editor: 
  markdown: 
    wrap: 72
---

## Introduction

Even though half of the moviegoers in the United States are women, their
story is not represented as much as that of men. Only 35% of the main
characters in the best movies of 2021 were female, and male characters
outnumbered female characters in 85% of the films. (Women and Hollywood,
2022) Also lacking are women behind the scenes in the film business. In
2021, just 12% of the top 100 films are directed by women. (Lauzen,
n.d.)

In this project, I focused on analyzing female representation in movie
stories rather than in their production. I used the data from three
major movie databases, The Movie Database (TMDb), MovieLens, Open Movie
Database (OMDb), and the Bechdel Test, one of the most common criteria
to measure female presentation.

I would like to answer the following questions in this analysis:

1.  **Is female representation in movie stories improving over time?**

2.  **What genres are doing better than others in terms of female
    representation?**

3.  **Are movies with good female representation more popular?**

4.  **Do movies with good female representation more money?**

This analysis intends to document the data cleaning process, which is
quite long. If you are curious only about the analysis results, please
skip to [Data Analysis and Visualization] part.

```{r}
# install libraries

library(tidyverse)
library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(summarytools)
library(lubridate)
library(ggridges)
```

## Read in data

For this analysis, I used the following different data sets.

### (1) "The Movie Data set" from Kaggle :

This dataset was obtained from Kaggle "The Movie Data
set"(<https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv>).

```{r}
#(1) movies_metadata.csv obtained from Kaggle
movie = read_csv("_data/movies_metadata.csv")
```

### (2) Bechdel test API:

I generated the data by using the bechdel test API
<https://bechdeltest.com/api/v1/doc>.

```{r}
#(2) bechdel test obtained by using bechdel API

library(rjson)
library(jsonlite)

#json_file <- "http://bechdeltest.com/api/v1/getAllMovies"
#bechdel <- read_json(path = json_file, simplifyVector = TRUE)
#bechdel$titleId <- paste("tt",bechdel$imdbid, sep = "")

#write.csv(bechdel, file = "_data/bechdel.csv")
bechdel <- read_csv("_data/bechdel.csv")

```

### (3) Open Movie Database (OMDb):

This dataset was generated by using Open Movie Database (OMDb) API
<https://www.omdbapi.com/>. OMDb provides detailed information such as
director, writer, nominated award, the number/score of reviews on IMDb,
etc... if you provide the name or IMDb id of movies. I decided to use
this database because it gives me data related to movies' popularity and
financial success of the films, which "The Movie Data set" doesn't
include.

OMDb doesn't give you a list of all movies registered on it. Instead,
you need to provide the exact movie title or the IMDb id to get the list
of information. Thus, I will use this API once I have the data that join
movies_metadata.csv and the Bechdel test. (Please refer to "[Read in
data / Describe data (OMDb)](#read-in-data-describe-data-omdb)" for this
process.)

## Describe data

### (1) "The Movie Data set"

This data was originally created from The Movie Database
(<https://www.themoviedb.org/>) and MovieLens
(<https://movielens.org/>).This movie dataset was generated by
Movielens, a (non-profit) movie review website
(<https://movielens.org/>), and was obtained from the following Kaggle
link.
(<https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv>)

The movie dataset contains 45,466 movies with release date between
December 9th, 1874 and December 16th, 2020.

The data includes the general information of movies, such as genres,
revenue, run time, languages, status (released/in production etc...).

```{r}
print(summarytools::dfSummary(movie),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')
```

### (2) Bechdel test

This data (`bechdel`) dataset documents the Bechdel test rating of 9630
movies released between 1874 and 2022. Each row represents a movie.

According to "Merriam-Webster", the Bechdel test is "*a set of criteria
used as a test to evaluate a work of fiction (such as a film) based on
its inclusion and representation of female characters*"
(<https://www.merriam-webster.com/dictionary/Bechdel%20Test>)

A movie passes the Bechdel test when **it has at least two female
featured characters that talk to each other about other than a man
(men)**.

In `bechdel`, the Bechdel test rating is registered in the following
manner.

0 \~ No two female characters

1 \~ Two female characters who don't talk to each other

2 \~ Two female characters talk to each other about a man (men)

3 \~ Passes Bechdel test: Two female characters talk to each other about
other than a man (men)

Apart from Bechdel ratings, this dataset (`bechdel`) contains the
released year, title of movies, and id. id that starts with "tt"
followed by 7 digits is the IMDb id and can work as a foreign key when
joining this data with the `movie` dataset.

```{r}
print(summarytools::dfSummary(bechdel),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')
```

## Tidy data

### (1) TMDb data

### **Checking missing values**

First of all, I took a look at missing values (NA values) in the data.

```{r}
movie %>% select(everything()) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  t()
```

### **Deleting unnecessary columns**

I deleted the following columns, which seem irrelevant for this analysis
or have too many missing values.\
- `adult`: This information doesn't add any significant meaning as
almost all of the observations have FALSE value for this column

\- `belong_to_collection`: This column has 40000+ missing values\
- `homepage`: Unnecessary for this analysis

\- `overview`: Unnecessary for this analysis

\- `poster_path`: Unnecessary for this analysis

\- `tagline`: Unnecessary for this analysis\
- `video`: This information doesn't add any significant meaning as
almost all of observations have FALSE value for this column

\- `popularity`: This information may be interesting, however, it is not
clear how this popularity is measured or where it was generated.
Therefore, I decided to use popularity data from OMDb.

\- `vote_average`: Same as `popularity`

\-`vote_count`: Same as `popularity`

Then, I changed the order of the columns.

```{r}
movie <- movie %>% select(-c("adult", "homepage", "overview", "poster_path", "tagline", "belongs_to_collection", "poster_path", "video", "popularity", "vote_average", "vote_count"))

col_order <- c("title", "original_title", "imdb_id", "id", "production_companies", "production_countries", "status", "release_date", "runtime", "revenue", "budget", "original_language", "spoken_languages", "genres")
movie <- movie[, col_order]

colnames(movie)[4] <- "movielens_id"

```

### **Cleaning data in JSON nested list format**

The values in some certain columns such as "genres",
"production_companies", "production_countries","spoken_languages" are in
a JSON list format for example:

> " \[{'id': XXXX, 'content(name/genre/title)': XXX}, {'id': XXXX,
> 'content(name/genre/title): XXX}\] "

```{r}
movie %>% select(c("genres", "production_companies", "production_countries", "spoken_languages"))

# These columns include [] {} and ' in their values so I made a function remove_simbols that removes these unnecessary symbols.
remove_symbols <- function(x) {
  removed_x <- str_remove_all(x, "\\{|\\}|\\[|\\}|\\]|'|,|id|name|:| ")
  return(removed_x)
}
```

#### (1). Tidying `Genres`

Each movie contains information on one or more (up to 8) genres in a
single column. To organize the information, the multiple genres
contained in `genre` are split by "\]," so that each genre is contained
in eight new columns, genre1-genre8. Then, I removed unnecessary symbols
such as ,.

```{r}
# Separating the `genre` by "},"
movie <- movie %>% 
  separate(genres, c("genre1", "genre2", "genre3", "genre4", "genre5", "genre6", "genre7", "genre8"), "\\},", remove = FALSE)

# Remove unnecessary symbols from each columns

movie$genre1 <- remove_symbols(movie$genre1)
movie$genre2 <- remove_symbols(movie$genre2)
movie$genre3 <- remove_symbols(movie$genre3)
movie$genre4 <- remove_symbols(movie$genre4)
movie$genre5 <- remove_symbols(movie$genre5)
movie$genre6 <- remove_symbols(movie$genre6)
movie$genre7 <- remove_symbols(movie$genre7)
movie$genre8 <- remove_symbols(movie$genre8)

movie %>% 
  select(matches("[1-9]"))

```

I managed to separate the genre information above. Now all values in
genre 1 - 8 has a value in the format of "number + name of genre"
(example: 35Comedy)

However, there are 2,445 observations whose genre value doesn't fit the
above format. Most of them don't have any genre assigned so their
`genre1` values are empty or blank, which means they don't have any
genre assigned to them.

```{r}
movie %>% filter(!str_detect(genre1, "^[0-9]")) %>% select(starts_with("genre"))
```

However, looking at them closely, there are three observations whose
`genre` value that is NOT blank but have strange strings.

The values in these observations such as "Carousel Production" and
"Aniplex" don't look like a name of genres but that of production
studios. Also their `original_title` values don't look like a title of
movies but information of languages. It seems like these observations
were not correctly read in because their values in other columns look
weird.

Since these observations are not reliable, I decided to delete them from
this dataset.

```{r}
movie %>% filter(!str_detect(genre1, "^[0-9]") &!str_detect(genre1, "^[ \t\n]*$")) %>% #^[ \t\n]*$ is a regular expression for blank.
  select(c(original_title, production_countries, genre1:genre3))

movie_clean1 <- movie %>% filter(!str_detect(genre1, "^[A-Z]"))
```

Genre information is more organized but is still not easy to be
analyzed. Thus, I decided to make dummy variables of each genre.

For example, if movie A is categorized as "comedy" and "adventure", the
line of movie A should have 1 in the "comedy" column and the "adventure"
respectively and 0 in columns of other genres.

```{r}
#https://community.rstudio.com/t/creating-dummy-columns-based-on-multiple-columns/58145/3
movie_clean1 <- movie_clean1 %>% 
  pivot_longer(cols = matches("genre[1-9]")) %>%
  add_column(count = 1) %>%
  arrange(value) %>%
  filter(str_detect(value, "^[0-9]")) %>%
  mutate(value1 = str_replace_all(value, "[0-9]+", "")) %>%
  select(-c(value, name)) %>%
  arrange(title) %>%
  pivot_wider(
    names_from = value1, 
    values_from = count, 
    values_fill = list(count=0),
    values_fn = list(count = mean)) %>%
  arrange(title)

# I also deleted genre "TVMovie" and "Foreign" because they're more format or origin country information rather than genre.
movie_clean1 <- movie_clean1 %>% select(-c("TVMovie","Foreign"))

```

#### (2). Tyding `Production_countries`

The values in `production_countries` are written in the following way.

> \[{'iso_3166_1': 'abbreviation of country','name': 'full country
> name'}\]

When there is more than one country in the value, I took the first
country in account for this analysis.

```{r}
movie_clean1$production_countries <- remove_symbols(movie_clean1$production_countries)
movie_clean1$production_countries <- str_extract(movie_clean1$production_countries,"(?<=_1)\\w{2}")

movie_clean1 %>% select(production_countries)
```

#### (3). Tyding `production_companies` & `spoken_languages`

The values in both of `production_companies` and `spoken_languages` are
in the following format.

> \[{'name': 'XXXX', 'id': ---}\] \> \[{'iso_639_1':'XXXX',
> 'name':"----}\]

I only need the information that is written as XXXX in this format.

```{r}
movie_clean1$production_companies <-str_remove_all(movie_clean1$production_companies, "'id': [0-9]*|'name':|\\[|\\]|'|\\{|,") %>%
  str_replace_all("\\},", ",") %>%
  str_replace_all(" \\} ", ",")%>%
  str_remove(" \\}")

movie_clean1$spoken_languages <-str_remove_all(movie_clean1$spoken_languages, "'iso_639_1':|'name': '\\w+'|\\[|\\]|'|\\{|,") %>%
  str_replace_all(" \\} ", ",") %>%
  str_remove_all(" \\}")

movie_clean1 %>% select(production_companies, spoken_languages)


```

### Adding a new column

A new column `years` was created to group years by decade.

```{r}
movie_clean1 <- movie_clean1 %>% 
  mutate(
    years = case_when(
    lubridate::year(release_date) < 1920 ~ "Before 1920",
    lubridate::year(release_date) >= 1920 & lubridate::year(release_date) < 1930 ~ "1920s",
    lubridate::year(release_date) >= 1930 & lubridate::year(release_date) < 1940 ~ "1930s",
    lubridate::year(release_date) >= 1940 & lubridate::year(release_date) < 1950 ~ "1940s",
    lubridate::year(release_date) >= 1950 & lubridate::year(release_date) < 1960 ~ "1950s",
    lubridate::year(release_date) >= 1960 & lubridate::year(release_date) < 1970 ~ "1960s",
    lubridate::year(release_date) >= 1970 & lubridate::year(release_date) < 1980 ~ "1970s",
    lubridate::year(release_date) >= 1980 & lubridate::year(release_date) < 1990 ~ "1990s",
    lubridate::year(release_date) >= 1990 & lubridate::year(release_date) < 2000 ~ "1990s",
    lubridate::year(release_date) >= 2000 & lubridate::year(release_date) < 2010 ~ "2000s",
    TRUE ~ "2010s"
  )
  )
```

Now, the dataset is cleaner. Look at the summary again.

```{r}
print(summarytools::dfSummary(movie_clean1),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

```

### (2) Bechdel data

#### Removing and Mutating Columns

`imdbid` and `titleId` are duplicated because titleId is "tt+imdbid".
`movie_clean1` data frame has IDs in the same format (starting with tt)
as `titleId` of `bechdel` data frame so I deleted `imdbid` column.

Also, `year` and `title` information is available in movie data frame,
thus I deleted them.

I made a new column `bechdel_pass` where 1 means the movie passes
bechdel test and 0 means otherwise.

```{r}
#

bechdel <- bechdel %>% select(-c("imdbid","...1","title"))
colnames(bechdel) <- c("year", "bechdel_rating", "id", "titleid")

summary(bechdel)
# Changing the column order
col_order <- c("id", "titleid","year", "bechdel_rating")
bechdel <- bechdel[, col_order]

# Mutate a new column bechdel_pass
bechdel <- bechdel %>% mutate(
  bechdel_pass = case_when(
    bechdel_rating == 3 ~ 1,
    TRUE ~ 0
  )
)

# Change data types of columns
bechdel$bechdel_rating <- factor(bechdel$bechdel_rating, levels = c("0", "1", "2", "3"))
bechdel$bechdel_pass <- factor(bechdel$bechdel_pass, levels = c("0", "1"))
```

### Sense Check

This data looks quite clean, however, I realized that the minimum number
of year is 1010, which is weird. "Inazuma Eleven The Movie" was released
in 2010, but it seems to have been mistakenly recorded as 1010. So I
manually corrected the data.

```{r}
summary(bechdel)

bechdel %>% 
  filter(year < 1800)

bechdel$year[bechdel$year == 1010] <- 2010
```

## Join data sets

We have two data frames `movie` and `bechdel` , which have imdb id as a
foreign key.

```{r}
movie_bechdel_join <- inner_join(movie_clean1, bechdel, 
                                 by=c("imdb_id" = "titleid"),
                                 copy = TRUE)

```

## Read in data / Describe data (OMDb) {#read-in-data-describe-data-omdb}

As we have a joined data, I read in the movie popularity data from OMDb
using OMDb API.

`omdb_df` has 5 columns

-   imdbID: can be used as a foreign key when joining with other data
    frames

-   Director: Director's name

-   Metascore: Review scores from 0 to 100 on the metacritic website
    (<https://www.metacritic.com/movie>). They are weighted averages.
    and reviews by certain reviewers are given more weight, however the
    detail is not revealed.

-   imdbRating: Review scores from 0 to 10 on IMDb website
    (<https://www.imdb.com/>). They are weighted averages.

-   imdbVotes: Number of votes on IMDb website.

```{r}
# Generating OMDb dataframe
# imdb_id_list <- c()
# 
# for (i in c(1:length(movie_bechdel_join$imdb_id))) {
#   imdb_id_list[[i]] <- movie_bechdel_join$imdb_id[i]
# }
# 
# omdb_list <- lapply(imdb_id_list, function(movie_id) {
#   actor_vector <- find_by_id(movie_id)
#   actor_vector
# })
# 
# omdb_df <- tibble(place = omdb_list) %>%
#   unnest_wider(place) %>%
#   select(c("imdbID", "Director", "Metascore", "imdbRating", "imdbVotes", "BoxOffice")) %>% unnest() %>% distinct()

#write.csv(omdb_df, "_data/omdb.csv")
omdb_df <- read_csv("_data/omdb.csv")
omdb_df$Metascore <- as.numeric(omdb_df$Metascore)
omdb_df$BoxOffice <- omdb_df$BoxOffice %>%
  str_replace_all(",", "") %>% 
  str_extract("[0-9]*(?=$)") %>% 
  as.numeric()

omdb_df <- omdb_df %>% select(-...1)
glimpse(omdb_df)
```

Now I have the data from OMDb as `omdb_df`. Let's join
`movie_bechdel_join` and `omdb_df.`

```{r}
data <- inner_join(movie_bechdel_join, omdb_df, 
                                 by=c("imdb_id" = "imdbID"),
                                 copy = TRUE)

```

## Data Analysis and Visualization

### Is female representation in movie stories improving over time?

To find out the answer for this questions, I looked at the result of the
Bechdel test.

The number of movies that pass the Bechdel test increases over time,
especially after around 1980 as the total number of released movies
increased.

```{r}
movie_bechdel_join %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x= "year", y = "Number", title = "The number of movies by Bechdel test ratings", fill = "Bechdel rating")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
  

```

**However, the total number of released movies is increasing as well. Is
the percentage of films that pass the Bechdel test increasing?**

In the 1920s, which is 100 years ago, less than 20% of films passed the
Bechdel test. There was a big increase in the percentage of Bechdel
test-passing movies in the 1930s, however, the percentage of films
passing the Bechdel test was stagnant at a bit lower than 50% from the
1930s to the 1950s. From 1950s to 1970s, the representation of women in
the film somehow went backward, with less than 45% of films meeting the
requirements of the Bechdel test; from the 1970s to the present, the
percentage of films that pass the Bechdel test has continued to
increase, reaching approximately 70% now. Although the representation of
women seems to have improved since 1970, it has not improved
dramatically compared to the 1930s and 1940s.

```{r}
data_with_p <- data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n = n()) %>%
  mutate(percentage = n/sum(n)*100)
```

```{r}
data_with_p %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col() +
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", font="bold", position = position_stack(vjust = 0.5)) +
  
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x="year", y="percent", title = "% of movies that pass/don't pass Bechdel Test", fill="Bechdel Rating") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(ncol = 2))
```

Perhaps the percentage of films that pass the Bechdel test has not
changed too dramatically compared to 70 years ago. However, if we focus
on the movies that do NOT pass the Bechdel test, we can see a different
trend happening.

In the 1920s, the majority of films did not feature two significant
women; this trend changed greatly in the 1930s, with the percentage of
films that did not feature two women decreasing significantly, peaking
in the 1960s to about 20%, and then declining.\
The percentage of films in which the only conversation between female
characters is about men also decreased from 1930 to the 2010s.\
Interestingly, however, the percentage of films in which female
characters do not speak to each other (dark gray in the figure) has
changed very little from the 1930s to the 2010s.\

```{r}
data_with_p %>%
  filter(bechdel_rating!="3") %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col()+
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", font="bold", position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men")) +
  labs(x="year", y="percent", title = "% of movies that do NOT pass Bechdel test by Bechdel test criteria", fill = "Bechdel Rating", subtitle = "% doesn't add up to 100% because it is missing the % of the movies that pass Bechdel test") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
```

### **What category represents women better?**

I focused on 4428 movies released between 2000 and 2020 because I want
to do category-based analysis of a recent situation, and it's better to
focus on the period where more movies were released to observe the
trend.

The degree to which women are portrayed in a film's story varies widely
from genre to genre.

The genre with the highest percentage of films passing the bechdel test
is Romance (72%) and the lowest is Western (22%), showing a large
difference.

Interestingly, not only the percentage of films that pass the Bechdel
test, but also the percentage that meet each of the three criteria for
passing the Bechdel test varies widely by genre.

For example, Western, War, and Documentary struggle more than other
genres to feature two female characters. (To be precise, the Bechdel
Test is for evaluating fictional stories, so it may not be appropriate
to evaluate Documentary.)

```{r}
viz_by_genre <- data %>%
  pivot_longer(cols = c(War:Music),names_to = "genre_name", values_drop_na = TRUE) %>%
  filter(value == 1) 


viz_by_genre_with_p <- viz_by_genre %>%
  group_by(genre_name, years, bechdel_rating) %>%
  dplyr::summarize(n=n()) %>%
  mutate(proportion = n/sum(n),
         percentage=proportion*100)

viz_by_genre_with_p %>%
  filter(years %in% c("2000s", "2010s")) %>%
  group_by(genre_name, bechdel_rating) %>%
  summarise(per=mean(percentage)) %>%
  arrange(desc(bechdel_rating), desc(per)) -> order_genre


viz_by_genre_with_p$genre_name <-factor(viz_by_genre_with_p$genre_name, levels = order_genre$genre_name[1:18])

viz_by_genre_with_p %>%
  filter(years=="2000s"|years=="2010s") %>%
  group_by(genre_name, bechdel_rating) %>%
  summarize(n=sum(n)) %>%
  mutate(p=n/sum(n),
         percentage = 100*p) %>%
  ggplot(aes(x=genre_name, y=percentage, fill= bechdel_rating)) +
  geom_col() +
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", position=position_stack(vjust=0.5), size=3)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x="Genre", y="%", fill="Bechdel rating")+
  coord_flip()  +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
```

Above we looked at trends over the past 20 years, but what if we look at
the historical trends over past 100 years?\
Some genres continue to improve their representation of women compared
to the past, while others remain the same.\
For example, Horror, Adventure, Science Fiction, Animation, and Fantasy
have gradually increased the percentage of films that pass the Bechdel
test.\
Western, History, War, and Crime, Comedy on the other hand, have not
changed much in terms of female representation compared to 100 years
ago.\

```{r}
viz_by_genre_with_p %>%
  filter(years!="Before 1920") %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col() +
  facet_wrap(~genre_name, ncol = 5)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x = "year", y = "percent", title = "Proportion of movies that pass / don't pass Bechdel test by categories")+
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2)) +
  scale_x_discrete(breaks = c('1920s', '1960s', '2010s'),
                   labels = c('1920', '1960', '2010'))
  

```

### **Do movies with good female representation succeed in gaining more popularity on online review site?**

Again, I focused on the movies released after 2000 to see the recent
trends.

I compared **Metascore** and **IMDb** ratings of the films that pass and
don't pass the Bechdel test.

The movies that don't pass the Bechdel test seem to score higher on
Metascore, however, the difference is quite small, and there seems to be
no relationship between whether a film passes the Bechdel test and its
reputation (Metascore).

```{r}
data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=Metascore)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test")
```

T-test shows that there is no statistically significant difference in
Metascores of the movies that pass and that don't pass Bechdel test.

```{r}
ttest_pass <- data %>% 
  filter(bechdel_pass == "1" & lubridate::year(release_date) > 2000) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

ttest_not_pass <- data %>%
  filter(bechdel_pass == "0" & lubridate::year(release_date) > 2000) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

t.test(ttest_pass$Metascore, ttest_not_pass$Metascore)
```

Then, I compared the IMDb scores of the movies that pass and don't pass
Bechdel test. Those that don't pass Bechdel test seem to score a bit
higher.

```{r}
data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=imdbRating)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test", y="IMDb Rating")
```

T-test shows that the movies that do not meet Bechdel test criteria
score statistically higher than those that do.

```{r}
t.test(ttest_pass$imdbRating, ttest_not_pass$imdbRating)
```

### **Do movies with good female representation succeed in making more money?**

```{r}
data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=BoxOffice)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel"))

```

Even though the average box office amount of not-Bechdel-passing movies
is higher than that of Bechdel-passing movies, T-test shows that the
difference is not significant.

```{r}
t.test(ttest_pass$BoxOffice, ttest_not_pass$BoxOffice)
```

## Conclusion

The representation of women in film stories stagnated from the 1930s to
the 1970s, but has gradually improved, especially since the 1970s.\
However, this is not true for all genres: the percentage of films that
pass the Bechdel test varies widely from the 20% range (Western) to the
70% range (Romance), depending on the genre. Some genres (Horror,
Adventure, Science Fiction, Fantasy) have consistently increased the
percentage of films that pass the Bechdel test, while others (War,
Crime, History, Western) have remained stagnant. The percentage of films
that pass the test is increasing constantly.\
\
It turns out that whether or not a film passes the Bechdel test has
little impact on its success (i.e., popularity and revenue). However,
the reviews on the imdb website are higher for films that do not pass
the Bechdel Test.

This analysis has so far shown how women's representation has (or has
not) improved, and which genres have been particularly well represented
by women, but it has not explained why this has happened. I would like
to conduct an analysis that can answer the question "Why?". For example,
I would like to test the hypothesis that the increase in female
representation in fantasy and horror is due to the increase in female
audiences.\
Also, I would like to further investigate the relationship between
female representation and film success (popularity and revenue). In
particular, I am interested in how the gender of the audience affects
their evaluation of films in which women are represented and those in
which they are not.

## Reflection

One of the most difficult parts of this analysis was finding the right
data. Even though there are many movie database, the information was not
complete or not clear enough to be used. Thus, I ended up using two
different movie databases and one Bechdel test database because one
movie database (The Movie Dataset) didn't have a clear definition of
review ratings and popularity.

Another difficulty that I faced is that I could not find a good dataset
of the gender of directors/writers of movies. Even though I found
"credit" dataset that included the name and the gender of people who
were involved in the movie production, the information on director was
limited and only a few thousand movies had director information.
Originally, I wanted to visualize the relationship between the gender of
movie directors and the female representation, however, I gave up doing
that for this project due to the lack of an appropriate dataset.\

## Contact

The dataset that I used for this analysis can be found on my GitHub
(<https://github.com/Enagai-nagai/601_Fall_2022/tree/template/posts/_data>).

I'm interested in analyzing and visualizing data related to the
entertainment industry, behavior on digital platforms, and gender
issues.

If you have any comments or questions, please contact me through Email.

-   enagai ★ umass.edu (please replace ★ with \@)

-   nagainagai.e ★ gmail.com (please replace ★ with \@)

## Bibliography and References

I used R and RStudio to realize this analysis.

The below is the source of data sets:

-   Bechdel API: <https://bechdeltest.com/api/v1/doc>

-   The Movie Dataset from Kaggle:
    <https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset>

-   OMDb API: <https://www.omdbapi.com/>

The below is the source of information:

-   Women and Hollywood. (2022, March 15) *Study: Women made up 34% of
    speaking roles in 2021's top films, majority of those characters
    were white*. . Retrieved December 13, 2022, from
    [https://womenandhollywood.com/study-women-made-up-34-of-speaking-roles-in-2021s-top-films-majority-of-those-characters-were-white/#:\~:text=In%202021's%20top%20films%2C%20females,and%2037%20percent%20in%202019.](https://womenandhollywood.com/study-women-made-up-34-of-speaking-roles-in-2021s-top-films-majority-of-those-characters-were-white/#:~:text=In%202021's%20top%20films%2C%20females,and%2037%20percent%20in%202019.)

-   Lauzen, M. M. (2021). (rep.). *It's a Man's (Celluloid) World, Even
    in a Pandemic Year: Portrayals of Female Characters in the Top U.S.
    Films of 2021*. San Diego, California: San Diego State University
    and The Center for the Study of Women in Television and Film.