DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Assignment3 Erika Nagai

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Introduction
  • Read in data
    • (1) “The Movie Dataset” from Kaggle :
    • (2) Bechdel test API:
    • (3) Open Movie Database (OMDb):
  • Describe data
    • (1) TMDb
    • (2) Bechdel test
  • Tidy data
    • (1) TMDb data
    • Checking missing values
    • Deleting unnecessary columns
    • Cleaning data in JSON nested list format
    • Adding a new column
    • (2) Bechdel data
    • Sense Check
  • Join datasets
  • Read in data / Describe data (OMDb)
  • Data Analysis and Visualization
    • Is female representation in movie stories improving over time?
    • What category represents women better?
    • Why do movies do NOT pass Bechdel test?
    • Do movies with good female representation succeed in gaining more popularity on online review site?
    • Do movies with good female representation succeed in making more money?
  • Conclusion & Discussion
  • Contact
  • References

Assignment3 Erika Nagai

hw2
hw3
movie
gender
bechdel test
female representation
Erika Nagai
Female representation in movie stories
Author

Erika Nagai

Published

December 20, 2022

Introduction

Even though half of moviegoers in the United States is women, their story is not represented as much as that of men. Only 35% of the main characters in the best movies of 2021 were female, and male characters outnumbered female characters in 85% of the films. (Women and Hollywood, 2022) Also lacking are women behind the scenes in the film business. In 2021, just 12% of the top 100 films are directed by women.(Lauzen, n.d.)

In this project, I focused on analyzing female representation in the movie stories rather than in their production. I used the data from three major movie databases, The Movie Database (TMDb), MovieLens, Open Movie Database (OMDb) and Bechdel Test, one of the most common criterion to measure female presentation.

I would like to answer the following questions in this analysis:

  1. Is female representation in movie stories improving over time?

  2. What genres are doing better than others in terms of female representation?

  3. Are movies with good female representation more popular?

  4. Do movies with good female representation more money?

This analysis intends to document all the data cleaning process, which is quite long. If you are curious only about the analysis results, please skip to Data Analysis and Visualization part.

# install libraries

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'readr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(summarytools)

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
library(plyr)
------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------

Attaching package: 'plyr'

The following objects are masked from 'package:dplyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from 'package:purrr':

    compact
library(lubridate)

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(ggridges)

Read in data

For this analysis, I used the following different datasets.

(1) “The Movie Dataset” from Kaggle :

This dataset was obtained from Kaggle “The Movie Dataset”(https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv).

#(1) movies_metadata.csv obtained from Kaggle
movie = read_csv("_data/movies_metadata.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 45466 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
dbl   (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
lgl   (2): adult, video
date  (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(2) Bechdel test API:

I generated the data by using bechdel test API https://bechdeltest.com/api/v1/doc.

#(2) bechdel test obtained by using bechdel API

library(rjson)
library(jsonlite)
Warning: package 'jsonlite' was built under R version 4.2.2

Attaching package: 'jsonlite'
The following objects are masked from 'package:rjson':

    fromJSON, toJSON
The following object is masked from 'package:purrr':

    flatten
#json_file <- "http://bechdeltest.com/api/v1/getAllMovies"
#bechdel <- read_json(path = json_file, simplifyVector = TRUE)
#bechdel$titleId <- paste("tt",bechdel$imdbid, sep = "")

#write.csv(bechdel, file = "_data/bechdel.csv")
bechdel <- read_csv("_data/bechdel.csv")
New names:
• `` -> `...1`
Rows: 9802 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): imdbid, title, titleId
dbl (4): ...1, year, rating, id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(3) Open Movie Database (OMDb):

This dataset was generated by using Open Movie Database (OMDb) API https://www.omdbapi.com/. OMDb provides the detail information such as director, writer, nominated award, the number/score of reviews on imdb etc… if you provide the name or imdb id of movies. I decided to use this database because it gives me the data related to movies popularity and financial success of the films, which “The Movie Dataset” doesn’t include.

OMDb doesn’t give you a list of all movies registered on it. Instead, you need to provide the exact movie title or the imdb id to get the list of information. Thus, I will use this API once I have the data that join movies_metadata.csv and bechdel test. (Please refer to “Read in data / Describe data (OMDb)” for this process.)

Describe data

(1) TMDb

This data was originally created from The Movie Database (https://www.themoviedb.org/) and MovieLens (https://movielens.org/).This movie dataset was generated by Movielens, a (non-profit) movie review website (https://movielens.org/), and was obtained from the following Kaggle link. (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv)

The movie dataset contains 45466 movies with a released date between December 9th 1874 and December 16th 2020.

The data includes the general information of movies, such as genres, revenue, runtime, languages, status (released/in production etc…).

print(summarytools::dfSummary(movie),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

movie

Dimensions: 45466 x 24
Duplicates: 17
Variable Stats / Values Freqs (% of Valid) Graph Missing
adult [logical]
1. FALSE
2. TRUE
45454(100.0%)
9(0.0%)
3 (0.0%)
belongs_to_collection [character]
1. {'id': 415931, 'name': 'T
2. {'id': 421566, 'name': 'T
3. {'id': 645, 'name': 'Jame
4. {'id': 96887, 'name': 'Za
5. {'id': 37261, 'name': 'Th
6. {'id': 34055, 'name': 'Po
7. {'id': 413661, 'name': 'C
8. {'id': 374509, 'name': 'G
9. {'id': 148324, 'name': 'U
10. {'id': 38451, 'name': 'Ch
[ 1688 others ]
29(0.6%)
27(0.6%)
26(0.6%)
26(0.6%)
25(0.6%)
22(0.5%)
21(0.5%)
16(0.4%)
15(0.3%)
15(0.3%)
4272(95.1%)
40972 (90.1%)
budget [numeric]
Mean (sd) : 4224579 (17424133)
min ≤ med ≤ max:
0 ≤ 0 ≤ 3.8e+08
IQR (CV) : 0 (4.1)
1223 distinct values 3 (0.0%)
genres [character]
1. [{'id': 18, 'name': 'Dram
2. [{'id': 35, 'name': 'Come
3. [{'id': 99, 'name': 'Docu
4. []
5. [{'id': 18, 'name': 'Dram
6. [{'id': 35, 'name': 'Come
7. [{'id': 27, 'name': 'Horr
8. [{'id': 35, 'name': 'Come
9. [{'id': 35, 'name': 'Come
10. [{'id': 18, 'name': 'Dram
[ 4059 others ]
5000(11.0%)
3621(8.0%)
2723(6.0%)
2442(5.4%)
1301(2.9%)
1135(2.5%)
974(2.1%)
930(2.0%)
593(1.3%)
532(1.2%)
26215(57.7%)
0 (0.0%)
homepage [character]
1. http://www.georgecarlin.c
2. http://www.wernerherzog.c
3. http://breakblade.jp/
4. http://phantasm.com
5. http://www.crownintlpictu
6. http://www.crownintlpictu
7. http://www.crownintlpictu
8. http://www.kungfupanda.co
9. http://www.missionimpossi
10. http://www.thehungergames
[ 7663 others ]
12(0.2%)
7(0.1%)
6(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
4(0.1%)
7729(99.3%)
37684 (82.9%)
id [numeric]
Mean (sd) : 108359.9 (112460.7)
min ≤ med ≤ max:
2 ≤ 60003 ≤ 469172
IQR (CV) : 130878.5 (1)
45433 distinct values 3 (0.0%)
imdb_id [character]
1. 0
2. tt1180333
3. tt0022537
4. tt0022879
5. tt0046468
6. tt0062229
7. tt0067306
8. tt0080000
9. tt0082992
10. tt0084387
[ 45407 others ]
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
45427(100.0%)
17 (0.0%)
original_language [character]
1. en
2. fr
3. it
4. ja
5. de
6. es
7. ru
8. hi
9. ko
10. zh
[ 82 others ]
32269(71.0%)
2438(5.4%)
1529(3.4%)
1350(3.0%)
1080(2.4%)
994(2.2%)
826(1.8%)
508(1.1%)
444(1.0%)
409(0.9%)
3608(7.9%)
11 (0.0%)
original_title [character]
1. Alice in Wonderland
2. Hamlet
3. A Christmas Carol
4. Cinderella
5. Les Misérables
6. Macbeth
7. The Three Musketeers
8. Blackout
9. Frankenstein
10. Heidi
[ 43363 others ]
8(0.0%)
8(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
6(0.0%)
6(0.0%)
6(0.0%)
45397(99.8%)
0 (0.0%)
overview [character]
1. No overview found.
2. No Overview
3. A few funny little novels
4. Adaptation of the Jane Au
5. King Lear, old and tired,
6. No movie overview availab
7. Recovering from a nail gu
8. Released
9. A film by Jem Cohen
10. A former aristocrat Ippol
[ 44296 others ]
133(0.3%)
7(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
44345(99.6%)
959 (2.1%)
popularity [numeric]
Mean (sd) : 2.9 (6)
min ≤ med ≤ max:
0 ≤ 1.1 ≤ 547.5
IQR (CV) : 3.3 (2.1)
43757 distinct values 6 (0.0%)
poster_path [character]
1. /5D7UBSEgdyONE6Lql6xS7s6O
2. /2kslZXOaW0HmnGuVPCnQlCdX
3. /qW1oQlOHizRHXZQrpkimYr0o
4. /8VSZ9coCzxOCW2wE2Qene1H1
5. /cdwVC18URfEdQjjxqJyRMoGD
6. /2ngnUQX9Abkesfq72uvBF3uj
7. /2trk2fape4S0CYnRg38kNVe8
8. /4J6Ai4C5YRgfRUTlirrJ7Qsm
9. /5GasjPRAy5rlEyDOH7MeOyxy
10. /5ILjS6XB5deiHop8SXPsYxXW
[ 45014 others ]
5(0.0%)
4(0.0%)
4(0.0%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
45051(99.9%)
386 (0.8%)
production_companies [character]
1. []
2. [{'name': 'Metro-Goldwyn-
3. [{'name': 'Warner Bros.',
4. [{'name': 'Paramount Pict
5. [{'name': 'Twentieth Cent
6. [{'name': 'Universal Pict
7. [{'name': 'RKO Radio Pict
8. [{'name': 'Columbia Pictu
9. [{'name': 'Columbia Pictu
10. [{'name': 'Mosfilm', 'id'
[ 22698 others ]
11875(26.1%)
742(1.6%)
540(1.2%)
505(1.1%)
439(1.0%)
320(0.7%)
247(0.5%)
207(0.5%)
146(0.3%)
145(0.3%)
30297(66.6%)
3 (0.0%)
production_countries [character]
1. [{'iso_3166_1': 'US', 'na
2. []
3. [{'iso_3166_1': 'GB', 'na
4. [{'iso_3166_1': 'FR', 'na
5. [{'iso_3166_1': 'JP', 'na
6. [{'iso_3166_1': 'IT', 'na
7. [{'iso_3166_1': 'CA', 'na
8. [{'iso_3166_1': 'DE', 'na
9. [{'iso_3166_1': 'IN', 'na
10. [{'iso_3166_1': 'RU', 'na
[ 2383 others ]
17851(39.3%)
6282(13.8%)
2238(4.9%)
1654(3.6%)
1356(3.0%)
1030(2.3%)
840(1.8%)
749(1.6%)
735(1.6%)
735(1.6%)
11993(26.4%)
3 (0.0%)
release_date [Date]
min : 1874-12-09
med : 2001-08-30
max : 2020-12-16
range : 146y 0m 7d
17333 distinct values 90 (0.2%)
revenue [numeric]
Mean (sd) : 11209349 (64332247)
min ≤ med ≤ max:
0 ≤ 0 ≤ 2787965087
IQR (CV) : 0 (5.7)
6863 distinct values 6 (0.0%)
runtime [numeric]
Mean (sd) : 94.1 (38.4)
min ≤ med ≤ max:
0 ≤ 95 ≤ 1256
IQR (CV) : 22 (0.4)
353 distinct values 263 (0.6%)
spoken_languages [character]
1. [{'iso_639_1': 'en', 'nam
2. []
3. [{'iso_639_1': 'fr', 'nam
4. [{'iso_639_1': 'ja', 'nam
5. [{'iso_639_1': 'it', 'nam
6. [{'iso_639_1': 'es', 'nam
7. [{'iso_639_1': 'ru', 'nam
8. [{'iso_639_1': 'de', 'nam
9. [{'iso_639_1': 'en', 'nam
10. [{'iso_639_1': 'en', 'nam
[ 1921 others ]
22395(49.3%)
3829(8.4%)
1853(4.1%)
1289(2.8%)
1218(2.7%)
902(2.0%)
807(1.8%)
762(1.7%)
681(1.5%)
572(1.3%)
11152(24.5%)
6 (0.0%)
status [character]
1. Canceled
2. In Production
3. Planned
4. Post Production
5. Released
6. Rumored
2(0.0%)
20(0.0%)
15(0.0%)
98(0.2%)
45014(99.2%)
230(0.5%)
87 (0.2%)
tagline [character]
1. Based on a true story.
2. -
3. Be careful what you wish
4. Trust no one.
5. A Love Story
6. Classic Albums
7. Documentary
8. Drama
9. How far would you go?
10. Know Your Enemy
[ 20272 others ]
7(0.0%)
4(0.0%)
4(0.0%)
4(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
20374(99.8%)
25055 (55.1%)
title [character]
1. Cinderella
2. Alice in Wonderland
3. Hamlet
4. Beauty and the Beast
5. Les Misérables
6. A Christmas Carol
7. Blackout
8. The Three Musketeers
9. Treasure Island
10. Aftermath
[ 42267 others ]
11(0.0%)
9(0.0%)
9(0.0%)
8(0.0%)
8(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
6(0.0%)
45381(99.8%)
6 (0.0%)
video [logical]
1. FALSE
2. TRUE
45367(99.8%)
93(0.2%)
6 (0.0%)
vote_average [numeric]
Mean (sd) : 5.6 (1.9)
min ≤ med ≤ max:
0 ≤ 6 ≤ 10
IQR (CV) : 1.8 (0.3)
92 distinct values 6 (0.0%)
vote_count [numeric]
Mean (sd) : 109.9 (491.3)
min ≤ med ≤ max:
0 ≤ 10 ≤ 14075
IQR (CV) : 31 (4.5)
1820 distinct values 6 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

(2) Bechdel test

This data (bechdel) dataset documents the Bechdel test rating of 9630 movies released between 1874 and 2022.

According to “Merriam-Webster”, Bechdel test is “a set of criteria used as a test to evaluate a work of fiction (such as a film) on the basis of its inclusion and representation of female characters” (https://www.merriam-webster.com/dictionary/Bechdel%20Test)

A movie passes Bechdel test when it has at least two female featured characters that talk each other about other than a man (men).

In bechdel, Bechdel test rating is registered in a following manner.

0 ~ No two female characters

1 ~ Two female characters but don’t talk each other

2 ~ Two female characters talk each other about a man (men)

3 ~ Passes bechdel test: Two female characters talk each other about other than a man (men)

Apart from Bechdel ratings, this dataset (bechdel) contains the released year, title of movies, and id. id that starts with “tt” followed by 7 digits is imdb id and can work as a foreign key when joining this data with movie dataset.

print(summarytools::dfSummary(bechdel),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

bechdel

Dimensions: 9802 x 7
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
...1 [numeric]
Mean (sd) : 4901.5 (2829.7)
min ≤ med ≤ max:
1 ≤ 4901.5 ≤ 9802
IQR (CV) : 4900.5 (0.6)
9802 distinct values 0 (0.0%)
year [numeric]
Mean (sd) : 1996.2 (27)
min ≤ med ≤ max:
1010 ≤ 2006 ≤ 2022
IQR (CV) : 25 (0)
142 distinct values 0 (0.0%)
rating [numeric]
Mean (sd) : 2.1 (1.1)
min ≤ med ≤ max:
0 ≤ 3 ≤ 3
IQR (CV) : 2 (0.5)
0:1084(11.1%)
1:2124(21.7%)
2:1000(10.2%)
3:5594(57.1%)
0 (0.0%)
id [numeric]
Mean (sd) : 5224.9 (3052.9)
min ≤ med ≤ max:
1 ≤ 5211.5 ≤ 10641
IQR (CV) : 5233.5 (0.6)
9802 distinct values 0 (0.0%)
imdbid [character]
1. 0035279
2. 0086425
3. 0117056
4. 2043900
5. 2457282
6. 0000001
7. 0000002
8. 0000003
9. 0000004
10. 0000005
[ 9784 others ]
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
9784(99.8%)
3 (0.0%)
title [character]
1. Cinderella
2. Dracula
3. Little Women
4. Pride and Prejudice
5. Robin Hood
6. Shelter
7. A Star Is Born
8. Alice in Wonderland
9. Anna Karenina
10. Annie
[ 9547 others ]
5(0.1%)
4(0.0%)
4(0.0%)
4(0.0%)
4(0.0%)
4(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
9765(99.6%)
0 (0.0%)
titleId [character]
1. tt
2. tt0035279
3. tt0086425
4. tt0117056
5. tt2043900
6. tt2457282
7. tt0000001
8. tt0000002
9. tt0000003
10. tt0000004
[ 9785 others ]
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
9785(99.8%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

Tidy data

(1) TMDb data

Checking missing values

First of all, I took a look at missing values (NA values) in the data.

movie %>% select(everything()) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  t()
Warning: `funs()` was deprecated in dplyr 0.8.0.
ℹ Please use a list of either functions or lambdas:

# Simple named list: list(mean = mean, median = median)

# Auto named with `tibble::lst()`: tibble::lst(mean, median)

# Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
                       [,1]
adult                     3
belongs_to_collection 40972
budget                    3
genres                    0
homepage              37684
id                        3
imdb_id                  17
original_language        11
original_title            0
overview                959
popularity                6
poster_path             386
production_companies      3
production_countries      3
release_date             90
revenue                   6
runtime                 263
spoken_languages          6
status                   87
tagline               25055
title                     6
video                     6
vote_average              6
vote_count                6

Deleting unnecessary columns

I deleted several columns, which seem to irrelevant for this analysis or have too many missing values.
- adult: This information doesn’t add any significant meaning as almost all of observations have FALSE value for this column
- homepage: URL of homepage. Unnecessary for this analysis
- belong_to_collection: Has 40000+ missing values - poster_path
- video: This information doesn’t add any significant meaning as almost all of observations have FALSE value for this column

- popularity: This information may be interesting, however, it is not clear how this popularity is measured or where it was generated. Therefore, I decided to use popularity data from OMDb.

- vote_average: Same as popularity

-vote_count: Same as popularity

Then, I changed the order of the columns. Each row represents a movie so the movie title should be the first column.

movie <- movie %>% select(-c("adult", "homepage", "belongs_to_collection", "poster_path", "video", "popularity", "vote_average", "vote_count"))

col_order <- c("title", "original_title", "tagline", "imdb_id", "id", "overview", "production_companies", "production_countries", "status", "release_date", "runtime", "revenue", "budget", "original_language", "spoken_languages", "genres")
movie <- movie[, col_order]

colnames(movie)[5] <- "movielens_id"

Cleaning data in JSON nested list format

The values in some certain columns such as “genres”, “production_companies”, “production_countries”,“spoken_languages” are in a JSON list format for example:

” [{‘id’: XXXX, ‘content(name/genre/title)’: XXX}, {‘id’: XXXX, ’content(name/genre/title): XXX}] ”

movie %>% select(c("genres", "production_companies", "production_countries", "spoken_languages"))
# A tibble: 45,466 × 4
   genres                                                produ…¹ produ…² spoke…³
   <chr>                                                 <chr>   <chr>   <chr>  
 1 [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name':… [{'nam… [{'iso… [{'iso…
 2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name':… [{'nam… [{'iso… [{'iso…
 3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name'… [{'nam… [{'iso… [{'iso…
 4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'D… [{'nam… [{'iso… [{'iso…
 5 [{'id': 35, 'name': 'Comedy'}]                        [{'nam… [{'iso… [{'iso…
 6 [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'C… [{'nam… [{'iso… [{'iso…
 7 [{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name':… [{'nam… [{'iso… [{'iso…
 8 [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'A… [{'nam… [{'iso… [{'iso…
 9 [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'A… [{'nam… [{'iso… [{'iso…
10 [{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name':… [{'nam… [{'iso… [{'iso…
# … with 45,456 more rows, and abbreviated variable names
#   ¹​production_companies, ²​production_countries, ³​spoken_languages
# These columns include [] {} and ' in their values so I made a function remove_simbols that removes these unnecessary symbols.
remove_symbols <- function(x) {
  removed_x <- str_remove_all(x, "\\{|\\}|\\[|\\}|\\]|'|,|id|name|:| ")
  return(removed_x)
}

(1). Tidying Genres

Each movie contains information on one or more (up to 8) genres in a single column. To organize the information, the multiple genres contained in genre are split by “],” so that each genre is contained in eight new columns, genre1-genre8. Then, I removed unnecessary symbols such as ,.

# Separating the `genre` by "},"
movie <- movie %>% 
  separate(genres, c("genre1", "genre2", "genre3", "genre4", "genre5", "genre6", "genre7", "genre8"), "\\},", remove = FALSE)
Warning: Expected 8 pieces. Missing pieces filled with `NA` in 45463 rows [1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# Remove unnecessary symbols from each columns

movie$genre1 <- remove_symbols(movie$genre1)
movie$genre2 <- remove_symbols(movie$genre2)
movie$genre3 <- remove_symbols(movie$genre3)
movie$genre4 <- remove_symbols(movie$genre4)
movie$genre5 <- remove_symbols(movie$genre5)
movie$genre6 <- remove_symbols(movie$genre6)
movie$genre7 <- remove_symbols(movie$genre7)
movie$genre8 <- remove_symbols(movie$genre8)

movie %>% 
  select(matches("[1-9]"))
# A tibble: 45,466 × 8
   genre1       genre2       genre3       genre4     genre5 genre6 genre7 genre8
   <chr>        <chr>        <chr>        <chr>      <chr>  <chr>  <chr>  <chr> 
 1 16Animation  35Comedy     10751Family  <NA>       <NA>   <NA>   <NA>   <NA>  
 2 12Adventure  14Fantasy    10751Family  <NA>       <NA>   <NA>   <NA>   <NA>  
 3 10749Romance 35Comedy     <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 4 35Comedy     18Drama      10749Romance <NA>       <NA>   <NA>   <NA>   <NA>  
 5 35Comedy     <NA>         <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 6 28Action     80Crime      18Drama      53Thriller <NA>   <NA>   <NA>   <NA>  
 7 35Comedy     10749Romance <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 8 28Action     12Adventure  18Drama      10751Fami… <NA>   <NA>   <NA>   <NA>  
 9 28Action     12Adventure  53Thriller   <NA>       <NA>   <NA>   <NA>   <NA>  
10 12Adventure  28Action     53Thriller   <NA>       <NA>   <NA>   <NA>   <NA>  
# … with 45,456 more rows

I managed to separate the genre information above. Now all values in genre1-genre8 has a value in the format of “number name of genre” (example: 35Comedy)

However, there are 2,445 observations whose genre value doesn’t fit the above format. Most of them don’t have any genre assigned so their genre1 values are empty or blank, which means they don’t have any genre assigned to them.

movie %>% filter(!str_detect(genre1, "^[0-9]")) %>% select(starts_with("genre"))
# A tibble: 2,445 × 9
   genres genre1 genre2 genre3 genre4 genre5 genre6 genre7 genre8
   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 2 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 3 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 4 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 5 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 6 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 7 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 8 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 9 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
10 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
# … with 2,435 more rows

However, looking at them closely, there are three observations whose genre value that is NOT blank but have strange strings.

The values in these observations such as “Carousel Production” and “Aniplex” don’t look like a name of genres but that of production studios. Also their original_title values don’t look like a title of movies but information of languages. It seems like these observations were not correctly read in because their values in other columns look weird.

Since these observations are not reliable, I decided to delete them from this dataset.

movie %>% filter(!str_detect(genre1, "^[0-9]") &!str_detect(genre1, "^[ \t\n]*$")) %>% #^[ \t\n]*$ is a regular expression for blank.
  select(c(original_title, title, production_countries, genre1:genre8))
# A tibble: 3 × 11
  origin…¹ title produ…² genre1 genre2 genre3 genre4 genre5 genre6 genre7 genre8
  <chr>    <chr> <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 [{'iso_… <NA>  6.0     Carou… Visio… Teles… <NA>   <NA>   <NA>   <NA>   <NA>  
2 [{'iso_… <NA>  7.0     Anipl… GoHan… BROST… Mardo… Senta… <NA>   <NA>   <NA>  
3 [{'iso_… <NA>  4.3     Odyss… Pulse… Rogue… TheCa… <NA>   <NA>   <NA>   <NA>  
# … with abbreviated variable names ¹​original_title, ²​production_countries
movie_clean1 <- movie %>% filter(!str_detect(genre1, "^[A-Z]"))

Genre information is more organized but is still not easy to be analyzed. Thus, I decided to make dummy variables of each genre.

For example, if movie A is categorized as “comedy” and “adventure”, the line of movie A should have 1 in “comedy” column and “adventure” respectively and 0 in columns of other genres.

#https://community.rstudio.com/t/creating-dummy-columns-based-on-multiple-columns/58145/3
movie_clean1 <- movie_clean1 %>% 
  pivot_longer(cols = matches("genre[1-9]")) %>%
  add_column(count = 1) %>%
  arrange(value) %>%
  filter(str_detect(value, "^[0-9]")) %>%
  mutate(value1 = str_replace_all(value, "[0-9]+", "")) %>%
  select(-c(value, name)) %>%
  arrange(title) %>%
  pivot_wider(
    names_from = value1, 
    values_from = count, 
    values_fill = list(count=0),
    values_fn = list(count = mean)) %>%
  arrange(title)

# I also deleted genre "TVMovie" and "Foreign" because they're more format or origin country information rather than genre.
movie_clean1 <- movie_clean1 %>% select(-c("TVMovie","Foreign"))

(2). Tyding Production_countries

The values in production_countries are written in the following way. >[{‘iso_3166_1’: ‘abbreviation of country’,‘name’: ‘full country name’}]

When there are more than one country in the value, I took the first country in account for this analysis.

movie_clean1$production_countries <- remove_symbols(movie_clean1$production_countries)
movie_clean1$production_countries <- str_extract(movie_clean1$production_countries,"(?<=_1)\\w{2}")

movie_clean1 %>% select(production_countries)
# A tibble: 42,991 × 1
   production_countries
   <chr>               
 1 GB                  
 2 IN                  
 3 US                  
 4 <NA>                
 5 US                  
 6 US                  
 7 FR                  
 8 JP                  
 9 <NA>                
10 US                  
# … with 42,981 more rows

(3). Tyding production_companies & spoken_languages

The values in both of production_companies and spoken_languages are in the following format.

> [{‘name’: ‘XXXX’, ‘id’: —}] > [{‘iso_639_1’:‘XXXX’, ‘name’:“—-}]

I only need the information that is written as XXXX in this format.

movie_clean1$production_companies <-str_remove_all(movie_clean1$production_companies, "'id': [0-9]*|'name':|\\[|\\]|'|\\{|,") %>%
  str_replace_all("\\},", ",") %>%
  str_replace_all(" \\} ", ",")%>%
  str_remove(" \\}")

movie_clean1$spoken_languages <-str_remove_all(movie_clean1$spoken_languages, "'iso_639_1':|'name': '\\w+'|\\[|\\]|'|\\{|,") %>%
  str_replace_all(" \\} ", ",") %>%
  str_remove_all(" \\}")

movie_clean1 %>% select(production_companies, spoken_languages)
# A tibble: 42,991 × 2
   production_companies                                                  spoke…¹
   <chr>                                                                 <chr>  
 1 " Screen Yorkshire, British Film Institute (BFI), Creative Scotland,… " en"  
 2 ""                                                                    " hi"  
 3 " Sebastian Films Limited"                                            " en"  
 4 " Lorimar Pictures"                                                   " en"  
 5 " Lone Star Production, Monogram Pictures"                            " en, …
 6 ""                                                                    " en"  
 7 " Barnholtz Entertainment"                                            " en, …
 8 ""                                                                    " zh"  
 9 ""                                                                    " fr"  
10 " Disney Channel"                                                     " en"  
# … with 42,981 more rows, and abbreviated variable name ¹​spoken_languages

Adding a new column

A new column years was created to group years by decade.

movie_clean1 <- movie_clean1 %>% 
  mutate(
    years = case_when(
    lubridate::year(release_date) < 1920 ~ "Before 1920",
    lubridate::year(release_date) >= 1920 & lubridate::year(release_date) < 1930 ~ "1920s",
    lubridate::year(release_date) >= 1930 & lubridate::year(release_date) < 1940 ~ "1930s",
    lubridate::year(release_date) >= 1940 & lubridate::year(release_date) < 1950 ~ "1940s",
    lubridate::year(release_date) >= 1950 & lubridate::year(release_date) < 1960 ~ "1950s",
    lubridate::year(release_date) >= 1960 & lubridate::year(release_date) < 1970 ~ "1960s",
    lubridate::year(release_date) >= 1970 & lubridate::year(release_date) < 1980 ~ "1970s",
    lubridate::year(release_date) >= 1980 & lubridate::year(release_date) < 1990 ~ "1990s",
    lubridate::year(release_date) >= 1990 & lubridate::year(release_date) < 2000 ~ "1990s",
    lubridate::year(release_date) >= 2000 & lubridate::year(release_date) < 2010 ~ "2000s",
    TRUE ~ "2010s"
  )
  )

Now, the dataset is cleaner. Look at the summary again.

print(summarytools::dfSummary(movie_clean1),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

movie_clean1

Dimensions: 42991 x 35
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
title [character]
1. Cinderella
2. Alice in Wonderland
3. Hamlet
4. Les Misérables
5. A Christmas Carol
6. Beauty and the Beast
7. The Three Musketeers
8. Treasure Island
9. Aftermath
10. Countdown
[ 40025 others ]
10(0.0%)
9(0.0%)
9(0.0%)
8(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
6(0.0%)
6(0.0%)
42912(99.8%)
3 (0.0%)
original_title [character]
1. Alice in Wonderland
2. Hamlet
3. A Christmas Carol
4. Les Misérables
5. Macbeth
6. The Three Musketeers
7. Cinderella
8. Frankenstein
9. Heidi
10. The Hound of the Baskervi
[ 41043 others ]
8(0.0%)
8(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
7(0.0%)
6(0.0%)
6(0.0%)
6(0.0%)
6(0.0%)
42923(99.8%)
0 (0.0%)
tagline [character]
1. Based on a true story.
2. -
3. Be careful what you wish
4. Trust no one.
5. A Love Story
6. Classic Albums
7. Drama
8. How far would you go?
9. Know Your Enemy
10. Some doors should never b
[ 20008 others ]
7(0.0%)
4(0.0%)
4(0.0%)
4(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
3(0.0%)
20097(99.8%)
22857 (53.2%)
imdb_id [character]
1. tt0000001
2. tt0000003
3. tt0000005
4. tt0000008
5. tt0000010
6. tt0000012
7. tt0000014
8. tt0000023
9. tt0000029
10. tt0000070
[ 42969 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
42969(100.0%)
12 (0.0%)
movielens_id [numeric]
Mean (sd) : 104553.2 (111978.6)
min ≤ med ≤ max:
2 ≤ 55734 ≤ 469172
IQR (CV) : 118823.5 (1.1)
42991 distinct values 0 (0.0%)
overview [character]
1. No overview found.
2. A few funny little novels
3. No movie overview availab
4. A former aristocrat Ippol
5. A group of travelers, inc
6. A Russian engineer Petr G
7. A wooden boy Buratino tri
8. Adventurer Allan Quarterm
9. Alien pods come to Earth
10. Director Michael Apted re
[ 42137 others ]
120(0.3%)
3(0.0%)
3(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
2(0.0%)
42151(99.7%)
700 (1.6%)
production_companies [character]
1. (Empty string)
2.  
·
Metro-Goldwyn-Mayer (MGM
3.  
·
Warner Bros.
4.  
·
Paramount Pictures
5.  
·
Twentieth Century Fox Fi
6.  
·
Universal Pictures
7.  
·
RKO Radio Pictures
8.  
·
Columbia Pictures Corpor
9.  
·
Columbia Pictures
10.  
·
Mosfilm
[ 22474 others ]
9770(22.7%)
737(1.7%)
538(1.3%)
500(1.2%)
434(1.0%)
320(0.7%)
247(0.6%)
207(0.5%)
146(0.3%)
144(0.3%)
29945(69.7%)
3 (0.0%)
production_countries [character]
1. US
2. GB
3. FR
4. CA
5. JP
6. DE
7. IT
8. RU
9. IN
10. ES
[ 132 others ]
18202(47.5%)
3031(7.9%)
2639(6.9%)
1469(3.8%)
1430(3.7%)
1389(3.6%)
1382(3.6%)
777(2.0%)
754(2.0%)
576(1.5%)
6653(17.4%)
4689 (10.9%)
status [character]
1. In Production
2. Planned
3. Post Production
4. Released
5. Rumored
20(0.0%)
15(0.0%)
96(0.2%)
42601(99.2%)
205(0.5%)
54 (0.1%)
release_date [Date]
min : 1874-12-09
med : 2001-10-29
max : 2020-12-16
range : 146y 0m 7d
16860 distinct values 29 (0.1%)
runtime [numeric]
Mean (sd) : 95.5 (36.6)
min ≤ med ≤ max:
0 ≤ 95 ≤ 1256
IQR (CV) : 21 (0.4)
343 distinct values 182 (0.4%)
revenue [numeric]
Mean (sd) : 11845510 (66095743)
min ≤ med ≤ max:
0 ≤ 0 ≤ 2787965087
IQR (CV) : 0 (5.6)
6844 distinct values 3 (0.0%)
budget [numeric]
Mean (sd) : 4461834 (17886326)
min ≤ med ≤ max:
0 ≤ 0 ≤ 3.8e+08
IQR (CV) : 0 (4)
1215 distinct values 0 (0.0%)
original_language [character]
1. en
2. fr
3. it
4. ja
5. de
6. es
7. ru
8. hi
9. ko
10. zh
[ 79 others ]
30566(71.1%)
2348(5.5%)
1359(3.2%)
1287(3.0%)
1029(2.4%)
927(2.2%)
786(1.8%)
478(1.1%)
431(1.0%)
391(0.9%)
3379(7.9%)
10 (0.0%)
spoken_languages [character]
1.  
·
en
2. (Empty string)
3.  
·
fr
4.  
·
ja
5.  
·
it
6.  
·
es
7.  
·
ru
8.  
·
de
9.  
·
en, fr
10.  
·
en, es
[ 1896 others ]
21943(51.0%)
2660(6.2%)
1785(4.2%)
1220(2.8%)
1097(2.6%)
826(1.9%)
769(1.8%)
722(1.7%)
679(1.6%)
568(1.3%)
10719(24.9%)
3 (0.0%)
genres [character]
1. [{'id': 18, 'name': 'Dram
2. [{'id': 35, 'name': 'Come
3. [{'id': 99, 'name': 'Docu
4. [{'id': 18, 'name': 'Dram
5. [{'id': 35, 'name': 'Come
6. [{'id': 27, 'name': 'Horr
7. [{'id': 35, 'name': 'Come
8. [{'id': 35, 'name': 'Come
9. [{'id': 18, 'name': 'Dram
10. [{'id': 27, 'name': 'Horr
[ 4055 others ]
4996(11.6%)
3620(8.4%)
2721(6.3%)
1300(3.0%)
1133(2.6%)
974(2.3%)
930(2.2%)
593(1.4%)
531(1.2%)
528(1.2%)
25665(59.7%)
0 (0.0%)
War [numeric]
Min : 0
Mean : 0
Max : 1
0:41669(96.9%)
1:1322(3.1%)
0 (0.0%)
Drama [numeric]
Min : 0
Mean : 0.5
Max : 1
0:22747(52.9%)
1:20244(47.1%)
0 (0.0%)
Action [numeric]
Min : 0
Mean : 0.2
Max : 1
0:36399(84.7%)
1:6592(15.3%)
0 (0.0%)
Thriller [numeric]
Min : 0
Mean : 0.2
Max : 1
0:35372(82.3%)
1:7619(17.7%)
0 (0.0%)
Crime [numeric]
Min : 0
Mean : 0.1
Max : 1
0:38687(90.0%)
1:4304(10.0%)
0 (0.0%)
Western [numeric]
Min : 0
Mean : 0
Max : 1
0:41949(97.6%)
1:1042(2.4%)
0 (0.0%)
Documentary [numeric]
Min : 0
Mean : 0.1
Max : 1
0:39061(90.9%)
1:3930(9.1%)
0 (0.0%)
Romance [numeric]
Min : 0
Mean : 0.2
Max : 1
0:36261(84.3%)
1:6730(15.7%)
0 (0.0%)
Family [numeric]
Min : 0
Mean : 0.1
Max : 1
0:40224(93.6%)
1:2767(6.4%)
0 (0.0%)
Animation [numeric]
Min : 0
Mean : 0
Max : 1
0:41060(95.5%)
1:1931(4.5%)
0 (0.0%)
Comedy [numeric]
Min : 0
Mean : 0.3
Max : 1
0:29815(69.4%)
1:13176(30.6%)
0 (0.0%)
Horror [numeric]
Min : 0
Mean : 0.1
Max : 1
0:38320(89.1%)
1:4671(10.9%)
0 (0.0%)
Mystery [numeric]
Min : 0
Mean : 0.1
Max : 1
0:40527(94.3%)
1:2464(5.7%)
0 (0.0%)
Adventure [numeric]
Min : 0
Mean : 0.1
Max : 1
0:39501(91.9%)
1:3490(8.1%)
0 (0.0%)
History [numeric]
Min : 0
Mean : 0
Max : 1
0:41593(96.7%)
1:1398(3.3%)
0 (0.0%)
ScienceFiction [numeric]
Min : 0
Mean : 0.1
Max : 1
0:39947(92.9%)
1:3044(7.1%)
0 (0.0%)
Fantasy [numeric]
Min : 0
Mean : 0.1
Max : 1
0:40682(94.6%)
1:2309(5.4%)
0 (0.0%)
Music [numeric]
Min : 0
Mean : 0
Max : 1
0:41394(96.3%)
1:1597(3.7%)
0 (0.0%)
years [character]
1. 1920s
2. 1930s
3. 1940s
4. 1950s
5. 1960s
6. 1970s
7. 1990s
8. 2000s
9. 2010s
10. Before 1920
398(0.9%)
1264(2.9%)
1444(3.4%)
1997(4.6%)
2474(5.8%)
3270(7.6%)
8965(20.9%)
10605(24.7%)
12291(28.6%)
283(0.7%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

(2) Bechdel data

Removing and Mutating Columns

imdbid and titleId are duplicated because titleId is “tt+imdbid”. movie_clean1 dataframe has IDs in the same format (starting with tt) as titleId of bechdel dataframe so I deleted imdbid column.

Also, year and title information is available in movie dataframe, thus I deleted them.

I made a new column bechdel_pass where 1 means the movie passes bechdel test and 0 means otherwise.

#

bechdel <- bechdel %>% select(-c("imdbid","...1","title"))
colnames(bechdel) <- c("year", "bechdel_rating", "id", "titleid")

summary(bechdel)
      year      bechdel_rating        id          titleid         
 Min.   :1010   Min.   :0.000   Min.   :    1   Length:9802       
 1st Qu.:1988   1st Qu.:1.000   1st Qu.: 2558   Class :character  
 Median :2006   Median :3.000   Median : 5212   Mode  :character  
 Mean   :1996   Mean   :2.133   Mean   : 5225                     
 3rd Qu.:2013   3rd Qu.:3.000   3rd Qu.: 7792                     
 Max.   :2022   Max.   :3.000   Max.   :10641                     
# Changing the column order
col_order <- c("id", "titleid","year", "bechdel_rating")
bechdel <- bechdel[, col_order]

# Mutate a new column bechdel_pass
bechdel <- bechdel %>% mutate(
  bechdel_pass = case_when(
    bechdel_rating == 3 ~ 1,
    TRUE ~ 0
  )
)

# Change data types of columns
bechdel$bechdel_rating <- factor(bechdel$bechdel_rating, levels = c("0", "1", "2", "3"))
bechdel$bechdel_pass <- factor(bechdel$bechdel_pass, levels = c("0", "1"))

Sense Check

This data look quite clean, however, I looked at the summary for sense check, then I realized that the minimum number of year is 1010, which is weird. “Inazuma Eleven The Movie” was released in 2010, but it seems to have been mistakenly recorded as 1010. So I manually corrected the data.

summary(bechdel)
       id          titleid               year      bechdel_rating bechdel_pass
 Min.   :    1   Length:9802        Min.   :1010   0:1084         0:4208      
 1st Qu.: 2558   Class :character   1st Qu.:1988   1:2124         1:5594      
 Median : 5212   Mode  :character   Median :2006   2:1000                     
 Mean   : 5225                      Mean   :1996   3:5594                     
 3rd Qu.: 7792                      3rd Qu.:2013                              
 Max.   :10641                      Max.   :2022                              
bechdel %>% 
  filter(year < 1800)
# A tibble: 1 × 5
     id titleid    year bechdel_rating bechdel_pass
  <dbl> <chr>     <dbl> <fct>          <fct>       
1 10556 tt1794796  1010 3              1           
bechdel$year[bechdel$year == 1010] <- 2010

Join datasets

We have two dataframes movie and bechdel , which have imdb id as a foreign key.

movie_bechdel_join <- inner_join(movie_clean1, bechdel, 
                                 by=c("imdb_id" = "titleid"),
                                 copy = TRUE)

glimpse(movie_bechdel_join)
Rows: 7,736
Columns: 39
$ title                <chr> "'71", "'Gator Bait", "'night, Mother", "(500) Da…
$ original_title       <chr> "'71", "'Gator Bait", "'night, Mother", "(500) Da…
$ tagline              <chr> NA, "UNTAMED AND DEADLY, she ruled the swamp with…
$ imdb_id              <chr> "tt2614684", "tt0074080", "tt0090556", "tt1022603…
$ movielens_id         <dbl> 252178, 88950, 34760, 19913, 45261, 8420, 8329, 1…
$ overview             <chr> "A young British soldier must find his way back t…
$ production_companies <chr> " Screen Yorkshire, British Film Institute (BFI),…
$ production_countries <chr> "GB", "US", "US", "US", "US", "FR", "ES", "ES", "…
$ status               <chr> "Released", "Released", "Released", "Released", "…
$ release_date         <date> 2014-10-10, 1974-06-01, 1986-09-12, 2009-07-17, …
$ runtime              <dbl> 99, 88, 96, 95, 96, 95, 78, 85, 80, 102, 95, 103,…
$ revenue              <dbl> 1625847, 0, 442000, 60722734, 230600, 0, 30448000…
$ budget               <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 7.50e+06, 0.00e+00,…
$ original_language    <chr> "en", "en", "en", "en", "en", "fr", "es", "es", "…
$ spoken_languages     <chr> " en", " en", " en", " en, fr, sv", " en", " en, …
$ genres               <chr> "[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'nam…
$ War                  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Drama                <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0…
$ Action               <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ Thriller             <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0…
$ Crime                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ Western              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ Documentary          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Romance              <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ Family               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ Animation            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Comedy               <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
$ Horror               <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ Mystery              <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Adventure            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ History              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ScienceFiction       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0…
$ Fantasy              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ Music                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ years                <chr> "2010s", "1970s", "1990s", "2000s", "2000s", "195…
$ id                   <dbl> 5159, 9738, 3155, 434, 7382, 3409, 3646, 4764, 50…
$ year                 <dbl> 2014, 1973, 1986, 2009, 2009, 1956, 2007, 2009, 2…
$ bechdel_rating       <fct> 0, 0, 3, 1, 1, 3, 3, 1, 3, 2, 3, 3, 3, 1, 3, 1, 3…
$ bechdel_pass         <fct> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1…

Read in data / Describe data (OMDb)

As we have a joined data, I read in the movie popularity data from OMDb using OMDb API.

omdb_df has 5 columns

  • imdbID: can be used as a foreign key when joining with other dataframes

  • Director: Director’s name

  • Metascore: Review scores from 0 to 100 on the metacritic website (https://www.metacritic.com/movie). They are weighted averages. and reviews by certain reviewers are given more weight, however the detail is not revealed.

  • imdbRating: Review scores from 0 to 10 on IMDb website (https://www.imdb.com/). They are weighted averages.

  • imdbVotes: Number of votes on IMDb website.

# Generating OMDb dataframe
# imdb_id_list <- c()
# 
# for (i in c(1:length(movie_bechdel_join$imdb_id))) {
#   imdb_id_list[[i]] <- movie_bechdel_join$imdb_id[i]
# }
# 
# omdb_list <- lapply(imdb_id_list, function(movie_id) {
#   actor_vector <- find_by_id(movie_id)
#   actor_vector
# })
# 
# omdb_df <- tibble(place = omdb_list) %>%
#   unnest_wider(place) %>%
#   select(c("imdbID", "Director", "Metascore", "imdbRating", "imdbVotes", "BoxOffice")) %>% unnest() %>% distinct()

#write.csv(omdb_df, "_data/omdb.csv")
omdb_df <- read_csv("_data/omdb.csv")
New names:
Rows: 7729 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): imdbID, Director, Metascore, BoxOffice dbl (3): ...1, imdbRating,
imdbVotes
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
omdb_df$Metascore <- as.numeric(omdb_df$Metascore)
Warning: NAs introduced by coercion
omdb_df$BoxOffice <- omdb_df$BoxOffice %>%
  str_replace_all(",", "") %>% 
  str_extract("[0-9]*(?=$)") %>% 
  as.numeric()

omdb_df <- omdb_df %>% select(-...1)
glimpse(omdb_df)
Rows: 7,729
Columns: 6
$ imdbID     <chr> "tt2614684", "tt0074080", "tt0090556", "tt1022603", "tt1132…
$ Director   <chr> "Yann Demange", "Beverly Sebastian, Ferd Sebastian", "Tom M…
$ Metascore  <dbl> 83, NA, NA, 76, 58, NA, 71, 52, 45, 52, 60, 76, NA, NA, 70,…
$ imdbRating <dbl> 7.2, 5.4, 7.6, 7.7, 6.3, 6.3, 7.4, 6.5, 5.0, 6.5, 5.5, 7.2,…
$ imdbVotes  <dbl> 56942, 1278, 2371, 516486, 2230, 8603, 181857, 74038, 35911…
$ BoxOffice  <dbl> 1270847, NA, 441863, 32391374, 230600, NA, NA, 27766, 9600,…

Now I have the data from OMDb as omdb_df. Let’s join movie_bechdel_join and omdb_df.

data <- inner_join(movie_bechdel_join, omdb_df, 
                                 by=c("imdb_id" = "imdbID"),
                                 copy = TRUE)

Data Analysis and Visualization

Is female representation in movie stories improving over time?

To find out the answer for this questions, I looked at the result of the Bechdel test.

The number of movies that pass the Bechdel test increases over time, especially after around 1980 as the total number of released movies increased.

movie_bechdel_join %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No 2 female characters", "2 female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x= "year", y = "Total number of movies", title = "The number of movies by Bechdel test ratings")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

However, the total number of released movies is increasing as well. Is the percentage of films that pass the Bechdel test increasing?

In 1920’s, which is 100 years ago, less than 20% of films passed Bechdel test. There was a big increase in percentage of Bechdel test passing movies in 1930’s. However, the percentage of films passing Bechdel test was stagnant a bit lower than 50% from 1930’s to 1950’s. Somehow, from 1950 to 1970, the representation of women in film went backwards, with less than 45% of films meeting the requirements of the Bechdel test; from the 1970s to the present, the percentage of films that pass the Bechdel test has continued to increase, reaching approximately 70% now. Although the representation of women seems to have improved since 1970, it has not improved dramatically compared to the 1930s and 1940s.

data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=years, y= n_total, fill = bechdel_rating)) + geom_bar(stat = "identity", position="fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two female characters", "1.Two female characters", "2.That talk each other", "3.Other than a man (Passes Bechdel Test)")) +
  labs(x="year", y="percentage", title = "% of movies that pass/don't pass Bechdel Test") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(ncol = 2))
`summarise()` has grouped output by 'years'. You can override using the
`.groups` argument.

If we look at the number of each year, we can see that some years in 1930’s and 1940’s did as good as in 2010s. For example, 1943 - 1944 had a quite high percentage of Bechdel Test passing rate.

data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No 2 female characters", "2 female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x= "year", y = "Total number of movies", title = "The number of movies by Bechdel test ratings")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Perhaps the percentage of films that pass the Bechdel test has not changed too dramatically compared to 70 years ago. However, a breakdown of the Bechdel test criteria for films that do not pass the Bechdel Test shows a different trend compared to 70 years ago.

Currently, most of not-bechdel-passing films fail to Bechdel test because there are more than one female character but they don’t talk each other. There are few movies (still corresponding nearly 10% of total movies released every year) that don’t have two female characters.

It has not been always the case. In 1920s, quite many movies didn’t even have two featured female characters.

data %>%
  filter(lubridate::year(release_date) > 1920 & bechdel_pass=="0") %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=years, y= n_total, fill = bechdel_rating)) + geom_bar(stat = "identity", position="fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two female characters", "1.Two female characters", "2.That talk each other about men", "3.Other than a man (Passes Bechdel Test)")) +
  labs(x="year", y="percentage", title = "% of movies that do NOT pass Bechdel test by Bechdel test criteria") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
`summarise()` has grouped output by 'years'. You can override using the
`.groups` argument.

What category represents women better?

I focused on 6514 movies released between 1970 and 2022 because there is a small number of movies released before 1970 (25 or less movies a year) and it’s better to focus on the priod where more movies were released to observe the trend.

The degree to which women are portrayed in a film’s story varies widely from genre to genre.

For example, many movies in Comedy, Animation, Drama, Family, Fantasy, Romance, and Music pass the Bechdel test, while Action, Adventure, Crime, History, War, and Western The percentage of films that pass the Bechdel test is about 50% or less.

viz_by_genre <- data %>%
  pivot_longer(cols = c(War:Music),names_to = "genre_name", values_drop_na = TRUE) %>%
  filter(value == 1) 

viz_by_genre %>%
  filter(lubridate::year(release_date) > 1970) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating, genre_name) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + 
  geom_bar(stat = "identity", position = "fill") +
  facet_wrap(~genre_name, ncol = 3)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x = "year", y = "proportion", title = "Proportion of movies that pass / don't pass Bechdel test by categories")+
  scale_x_continuous(n.breaks=7) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
`summarise()` has grouped output by 'year', 'bechdel_rating'. You can override
using the `.groups` argument.

viz_by_genre %>%
  filter(lubridate::year(release_date)>1970) %>%
  group_by(bechdel_rating, genre_name) %>%
  dplyr::summarize(number = n()) %>%
  ggplot(aes(x=genre_name, y=number, fill=as.character(bechdel_rating))) +
  geom_bar(stat="identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  coord_flip() +
  labs(x="Genre", y="%")
`summarise()` has grouped output by 'bechdel_rating'. You can override using
the `.groups` argument.

Why do movies do NOT pass Bechdel test?

viz_by_genre %>%
  filter(lubridate::year(release_date)>1970 & bechdel_pass == "0") %>%
  group_by(bechdel_rating, genre_name) %>%
  dplyr::summarize(number = n()) %>%
  ggplot(aes(x=genre_name, y=number, fill=as.character(bechdel_rating))) +
  geom_bar(stat="identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  coord_flip()+
  labs(x="Genre", y="%")
`summarise()` has grouped output by 'bechdel_rating'. You can override using
the `.groups` argument.

Do movies with good female representation succeed in gaining more popularity on online review site?

I compared Metascore and IMDb ratings of the films that pass and don’t pass Bechdel test.

The movies that don’t pass Bechdel test seem to score higher on Metascore, however the difference is quite small, and there seems to be no relationship between whether a film passes the bechdel test and its reputation (Metascore).

data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=bechdel_pass, y=Metascore)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test")
Warning: Removed 1530 rows containing non-finite values (stat_boxplot).

T-test shows that there is no statistically significant difference in Metascores of the movies that pass and that don’t pass Bechdel test.

ttest_pass <- data %>% 
  filter(bechdel_pass == "1" & lubridate::year(release_date) > 1970) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

ttest_not_pass <- data %>%
  filter(bechdel_pass == "0" & lubridate::year(release_date) > 1970) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

t.test(ttest_pass$Metascore, ttest_not_pass$Metascore)

    Welch Two Sample t-test

data:  ttest_pass$Metascore and ttest_not_pass$Metascore
t = 0.059821, df = 4647, p-value = 0.9523
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9335741  0.9923409
sample estimates:
mean of x mean of y 
 58.66466  58.63527 

Then, I compared the IMDb scores of the movies that pass and don’t pass Bechdel test. Those don’t pass Bechdel test seem to score a bit higher.

data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=bechdel_pass, y=imdbRating)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test", y="IMDb Rating")
Warning: Removed 18 rows containing non-finite values (stat_boxplot).

T-test shows that the movies that do not meet Bechdel test criteria score statistically higher than those that do.

t.test(ttest_pass$imdbRating, ttest_not_pass$imdbRating)

    Welch Two Sample t-test

data:  ttest_pass$imdbRating and ttest_not_pass$imdbRating
t = -8.029, df = 6104.8, p-value = 1.17e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2395005 -0.1454990
sample estimates:
mean of x mean of y 
 6.463997  6.656497 

Do movies with good female representation succeed in making more money?

data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=as.character(bechdel_pass), y=BoxOffice)) + geom_boxplot() 
Warning: Removed 1642 rows containing non-finite values (stat_boxplot).

t.test(ttest_pass$BoxOffice, ttest_not_pass$BoxOffice)

    Welch Two Sample t-test

data:  ttest_pass$BoxOffice and ttest_not_pass$BoxOffice
t = -2.0793, df = 4805.5, p-value = 0.03765
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7909685.6  -232633.7
sample estimates:
mean of x mean of y 
 41825853  45897013 

Conclusion & Discussion

% of the movies that pass Bechdel test:

Compared to 100 years ago, the rate of movies that pass Bechdel test significantly increased, however,

Contact

The dataset that I used for this analysis can be found on my GitHub (https://github.com/Enagai-nagai/601_Fall_2022/tree/template/posts/_data).

I’m interested in analyzing and visualizing data related to entertainment industry, behavior on digital platforms, and gender issues,

If you have any comments or questions, please contact me through Email.

  • enagai ★ umass.edu (please replace ★ with @)

  • nagainagai.e ★ gmail.com (please replace ★ with @)

References

Study: Women made up 34% of speaking roles in 2021’s top films, majority of those characters were white. Women and Hollywood. (2022, March 15). Retrieved December 13, 2022, from https://womenandhollywood.com/study-women-made-up-34-of-speaking-roles-in-2021s-top-films-majority-of-those-characters-were-white/#:~:text=In%202021’s%20top%20films%2C%20females,and%2037%20percent%20in%202019.

Source Code
---
title: "Assignment3 Erika Nagai"
author: "Erika Nagai"
description: "Female representation in movie stories"
date: "`r Sys.Date()`"
format:
  html:
    toc: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - hw3
  - movie
  - gender
  - bechdel test
  - female representation
  - Erika Nagai
editor: 
  markdown: 
    wrap: 72
---

## Introduction

Even though half of moviegoers in the United States is women, their
story is not represented as much as that of men. Only 35% of the main
characters in the best movies of 2021 were female, and male characters
outnumbered female characters in 85% of the films. (Women and Hollywood,
2022) Also lacking are women behind the scenes in the film business. In
2021, just 12% of the top 100 films are directed by women.(Lauzen, n.d.)

In this project, I focused on analyzing female representation in the
movie stories rather than in their production. I used the data from
three major movie databases, The Movie Database (TMDb), MovieLens, Open
Movie Database (OMDb) and Bechdel Test, one of the most common criterion
to measure female presentation.

I would like to answer the following questions in this analysis:

1.  **Is female representation in movie stories improving over time?**

2.  **What genres are doing better than others in terms of female
    representation?**

3.  **Are movies with good female representation more popular?**

4.  **Do movies with good female representation more money?**

This analysis intends to document all the data cleaning process, which
is quite long. If you are curious only about the analysis results,
please skip to [Data Analysis and Visualization] part.

```{r}
# install libraries

library(tidyverse)
library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(summarytools)
library(plyr)
library(lubridate)
library(ggridges)
```

## Read in data

For this analysis, I used the following different datasets.

### **(1) "The Movie Dataset" from Kaggle** :

This dataset was obtained from Kaggle "The Movie
Dataset"(<https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv>).

```{r}
#(1) movies_metadata.csv obtained from Kaggle
movie = read_csv("_data/movies_metadata.csv")
```

### **(2) Bechdel test** API:

I generated the data by using bechdel test API
<https://bechdeltest.com/api/v1/doc>.

```{r}
#(2) bechdel test obtained by using bechdel API

library(rjson)
library(jsonlite)

#json_file <- "http://bechdeltest.com/api/v1/getAllMovies"
#bechdel <- read_json(path = json_file, simplifyVector = TRUE)
#bechdel$titleId <- paste("tt",bechdel$imdbid, sep = "")

#write.csv(bechdel, file = "_data/bechdel.csv")
bechdel <- read_csv("_data/bechdel.csv")

```

### **(3) Open Movie Database (OMDb)**:

This dataset was generated by using Open Movie Database (OMDb) API
<https://www.omdbapi.com/>. OMDb provides the detail information such as
director, writer, nominated award, the number/score of reviews on imdb
etc... if you provide the name or imdb id of movies. I decided to use
this database because it gives me the data related to movies popularity
and financial success of the films, which "The Movie Dataset" doesn't
include.

OMDb doesn't give you a list of all movies registered on it. Instead,
you need to provide the exact movie title or the imdb id to get the list
of information. Thus, I will use this API once I have the data that join
movies_metadata.csv and bechdel test. (Please refer to "[Read in data /
Describe data (OMDb)](#read-in-data-describe-data-omdb)" for this
process.)

## Describe data

### (1) TMDb

This data was originally created from The Movie Database
(<https://www.themoviedb.org/>) and MovieLens
(<https://movielens.org/>).This movie dataset was generated by
Movielens, a (non-profit) movie review website
(<https://movielens.org/>), and was obtained from the following Kaggle
link.
(<https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv>)

The movie dataset contains 45466 movies with a released date between
December 9th 1874 and December 16th 2020.

The data includes the general information of movies, such as genres,
revenue, runtime, languages, status (released/in production etc...).

```{r}
print(summarytools::dfSummary(movie),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')
```

### (2) Bechdel test

This data (`bechdel`) dataset documents the Bechdel test rating of 9630
movies released between 1874 and 2022.

According to "Merriam-Webster", Bechdel test is "*a set of criteria used
as a test to evaluate a work of fiction (such as a film) on the basis of
its inclusion and representation of female characters*"
(<https://www.merriam-webster.com/dictionary/Bechdel%20Test>)

A movie passes Bechdel test when **it has at least two female featured
characters that talk each other about other than a man (men)**.

In `bechdel`, Bechdel test rating is registered in a following manner.

0 \~ No two female characters

1 \~ Two female characters but don't talk each other

2 \~ Two female characters talk each other about a man (men)

3 \~ Passes bechdel test: Two female characters talk each other about
other than a man (men)

Apart from Bechdel ratings, this dataset (`bechdel`) contains the
released year, title of movies, and id. id that starts with "tt"
followed by 7 digits is imdb id and can work as a foreign key when
joining this data with `movie` dataset.

```{r}
print(summarytools::dfSummary(bechdel),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')
```

## Tidy data

### (1) TMDb data

### **Checking missing values**

First of all, I took a look at missing values (NA values) in the data.

```{r}
movie %>% select(everything()) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  t()
```

### **Deleting unnecessary columns**

I deleted several columns, which seem to irrelevant for this analysis or
have too many missing values.\
- `adult`: This information doesn't add any significant meaning as
almost all of observations have FALSE value for this column\
- `homepage`: URL of homepage. Unnecessary for this analysis\
- `belong_to_collection`: Has 40000+ missing values - poster_path\
- `video`: This information doesn't add any significant meaning as
almost all of observations have FALSE value for this column

\- `popularity`: This information may be interesting, however, it is not
clear how this popularity is measured or where it was generated.
Therefore, I decided to use popularity data from OMDb.

\- `vote_average`: Same as `popularity`

\-`vote_count`: Same as `popularity`

Then, I changed the order of the columns. Each row represents a movie so
the movie title should be the first column.

```{r}
movie <- movie %>% select(-c("adult", "homepage", "belongs_to_collection", "poster_path", "video", "popularity", "vote_average", "vote_count"))

col_order <- c("title", "original_title", "tagline", "imdb_id", "id", "overview", "production_companies", "production_countries", "status", "release_date", "runtime", "revenue", "budget", "original_language", "spoken_languages", "genres")
movie <- movie[, col_order]

colnames(movie)[5] <- "movielens_id"

```

### **Cleaning data in JSON nested list format**

The values in some certain columns such as "genres",
"production_companies", "production_countries","spoken_languages" are in
a JSON list format for example:

> " \[{'id': XXXX, 'content(name/genre/title)': XXX}, {'id': XXXX,
> 'content(name/genre/title): XXX}\] "

```{r}
movie %>% select(c("genres", "production_companies", "production_countries", "spoken_languages"))

# These columns include [] {} and ' in their values so I made a function remove_simbols that removes these unnecessary symbols.
remove_symbols <- function(x) {
  removed_x <- str_remove_all(x, "\\{|\\}|\\[|\\}|\\]|'|,|id|name|:| ")
  return(removed_x)
}
```

#### (1). Tidying `Genres`

Each movie contains information on one or more (up to 8) genres in a
single column. To organize the information, the multiple genres
contained in `genre` are split by "\]," so that each genre is contained
in eight new columns, genre1-genre8. Then, I removed unnecessary symbols
such as ,.

```{r}
# Separating the `genre` by "},"
movie <- movie %>% 
  separate(genres, c("genre1", "genre2", "genre3", "genre4", "genre5", "genre6", "genre7", "genre8"), "\\},", remove = FALSE)

# Remove unnecessary symbols from each columns

movie$genre1 <- remove_symbols(movie$genre1)
movie$genre2 <- remove_symbols(movie$genre2)
movie$genre3 <- remove_symbols(movie$genre3)
movie$genre4 <- remove_symbols(movie$genre4)
movie$genre5 <- remove_symbols(movie$genre5)
movie$genre6 <- remove_symbols(movie$genre6)
movie$genre7 <- remove_symbols(movie$genre7)
movie$genre8 <- remove_symbols(movie$genre8)

movie %>% 
  select(matches("[1-9]"))

```

I managed to separate the genre information above. Now all values in
genre1-genre8 has a value in the format of "number name of genre"
(example: 35Comedy)

However, there are 2,445 observations whose genre value doesn't fit the
above format. Most of them don't have any genre assigned so their
`genre1` values are empty or blank, which means they don't have any
genre assigned to them.

```{r}
movie %>% filter(!str_detect(genre1, "^[0-9]")) %>% select(starts_with("genre"))
```

However, looking at them closely, there are three observations whose
`genre` value that is NOT blank but have strange strings.

The values in these observations such as "Carousel Production" and
"Aniplex" don't look like a name of genres but that of production
studios. Also their `original_title` values don't look like a title of
movies but information of languages. It seems like these observations
were not correctly read in because their values in other columns look
weird.

Since these observations are not reliable, I decided to delete them from
this dataset.

```{r}
movie %>% filter(!str_detect(genre1, "^[0-9]") &!str_detect(genre1, "^[ \t\n]*$")) %>% #^[ \t\n]*$ is a regular expression for blank.
  select(c(original_title, title, production_countries, genre1:genre8))

movie_clean1 <- movie %>% filter(!str_detect(genre1, "^[A-Z]"))
```

Genre information is more organized but is still not easy to be
analyzed. Thus, I decided to make dummy variables of each genre.

For example, if movie A is categorized as "comedy" and "adventure", the
line of movie A should have 1 in "comedy" column and "adventure"
respectively and 0 in columns of other genres.

```{r}
#https://community.rstudio.com/t/creating-dummy-columns-based-on-multiple-columns/58145/3
movie_clean1 <- movie_clean1 %>% 
  pivot_longer(cols = matches("genre[1-9]")) %>%
  add_column(count = 1) %>%
  arrange(value) %>%
  filter(str_detect(value, "^[0-9]")) %>%
  mutate(value1 = str_replace_all(value, "[0-9]+", "")) %>%
  select(-c(value, name)) %>%
  arrange(title) %>%
  pivot_wider(
    names_from = value1, 
    values_from = count, 
    values_fill = list(count=0),
    values_fn = list(count = mean)) %>%
  arrange(title)

# I also deleted genre "TVMovie" and "Foreign" because they're more format or origin country information rather than genre.
movie_clean1 <- movie_clean1 %>% select(-c("TVMovie","Foreign"))

```

#### (2). Tyding `Production_countries`

The values in `production_countries` are written in the following way.
\>\[{'iso_3166_1': 'abbreviation of country','name': 'full country
name'}\]

When there are more than one country in the value, I took the first
country in account for this analysis.

```{r}
movie_clean1$production_countries <- remove_symbols(movie_clean1$production_countries)
movie_clean1$production_countries <- str_extract(movie_clean1$production_countries,"(?<=_1)\\w{2}")

movie_clean1 %>% select(production_countries)
```

#### (3). Tyding `production_companies` & `spoken_languages`

The values in both of `production_companies` and `spoken_languages` are
in the following format.

> \> \[{'name': 'XXXX', 'id': ---}\] \> \[{'iso_639_1':'XXXX',
> 'name':"----}\]

I only need the information that is written as XXXX in this format.

```{r}
movie_clean1$production_companies <-str_remove_all(movie_clean1$production_companies, "'id': [0-9]*|'name':|\\[|\\]|'|\\{|,") %>%
  str_replace_all("\\},", ",") %>%
  str_replace_all(" \\} ", ",")%>%
  str_remove(" \\}")

movie_clean1$spoken_languages <-str_remove_all(movie_clean1$spoken_languages, "'iso_639_1':|'name': '\\w+'|\\[|\\]|'|\\{|,") %>%
  str_replace_all(" \\} ", ",") %>%
  str_remove_all(" \\}")

movie_clean1 %>% select(production_companies, spoken_languages)


```

### Adding a new column

A new column `years` was created to group years by decade.

```{r}
movie_clean1 <- movie_clean1 %>% 
  mutate(
    years = case_when(
    lubridate::year(release_date) < 1920 ~ "Before 1920",
    lubridate::year(release_date) >= 1920 & lubridate::year(release_date) < 1930 ~ "1920s",
    lubridate::year(release_date) >= 1930 & lubridate::year(release_date) < 1940 ~ "1930s",
    lubridate::year(release_date) >= 1940 & lubridate::year(release_date) < 1950 ~ "1940s",
    lubridate::year(release_date) >= 1950 & lubridate::year(release_date) < 1960 ~ "1950s",
    lubridate::year(release_date) >= 1960 & lubridate::year(release_date) < 1970 ~ "1960s",
    lubridate::year(release_date) >= 1970 & lubridate::year(release_date) < 1980 ~ "1970s",
    lubridate::year(release_date) >= 1980 & lubridate::year(release_date) < 1990 ~ "1990s",
    lubridate::year(release_date) >= 1990 & lubridate::year(release_date) < 2000 ~ "1990s",
    lubridate::year(release_date) >= 2000 & lubridate::year(release_date) < 2010 ~ "2000s",
    TRUE ~ "2010s"
  )
  )
```

Now, the dataset is cleaner. Look at the summary again.

```{r}
print(summarytools::dfSummary(movie_clean1),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

```

### (2) Bechdel data

#### Removing and Mutating Columns

`imdbid` and `titleId` are duplicated because titleId is "tt+imdbid".
`movie_clean1` dataframe has IDs in the same format (starting with tt)
as `titleId` of `bechdel` dataframe so I deleted `imdbid` column.

Also, `year` and `title` information is available in movie dataframe,
thus I deleted them.

I made a new column `bechdel_pass` where 1 means the movie passes
bechdel test and 0 means otherwise.

```{r}
#

bechdel <- bechdel %>% select(-c("imdbid","...1","title"))
colnames(bechdel) <- c("year", "bechdel_rating", "id", "titleid")

summary(bechdel)
# Changing the column order
col_order <- c("id", "titleid","year", "bechdel_rating")
bechdel <- bechdel[, col_order]

# Mutate a new column bechdel_pass
bechdel <- bechdel %>% mutate(
  bechdel_pass = case_when(
    bechdel_rating == 3 ~ 1,
    TRUE ~ 0
  )
)

# Change data types of columns
bechdel$bechdel_rating <- factor(bechdel$bechdel_rating, levels = c("0", "1", "2", "3"))
bechdel$bechdel_pass <- factor(bechdel$bechdel_pass, levels = c("0", "1"))
```

### Sense Check

This data look quite clean, however, I looked at the summary for sense
check, then I realized that the minimum number of year is 1010, which is
weird. "Inazuma Eleven The Movie" was released in 2010, but it seems to
have been mistakenly recorded as 1010. So I manually corrected the data.

```{r}
summary(bechdel)

bechdel %>% 
  filter(year < 1800)

bechdel$year[bechdel$year == 1010] <- 2010
```

## Join datasets

We have two dataframes `movie` and `bechdel` , which have imdb id as a
foreign key.

```{r}
movie_bechdel_join <- inner_join(movie_clean1, bechdel, 
                                 by=c("imdb_id" = "titleid"),
                                 copy = TRUE)

glimpse(movie_bechdel_join)
```

## Read in data / Describe data (OMDb) {#read-in-data-describe-data-omdb}

As we have a joined data, I read in the movie popularity data from OMDb
using OMDb API.

`omdb_df` has 5 columns

-   imdbID: can be used as a foreign key when joining with other
    dataframes

-   Director: Director's name

-   Metascore: Review scores from 0 to 100 on the metacritic website
    (<https://www.metacritic.com/movie>). They are weighted averages.
    and reviews by certain reviewers are given more weight, however the
    detail is not revealed.

-   imdbRating: Review scores from 0 to 10 on IMDb website
    (<https://www.imdb.com/>). They are weighted averages.

-   imdbVotes: Number of votes on IMDb website.

```{r}
# Generating OMDb dataframe
# imdb_id_list <- c()
# 
# for (i in c(1:length(movie_bechdel_join$imdb_id))) {
#   imdb_id_list[[i]] <- movie_bechdel_join$imdb_id[i]
# }
# 
# omdb_list <- lapply(imdb_id_list, function(movie_id) {
#   actor_vector <- find_by_id(movie_id)
#   actor_vector
# })
# 
# omdb_df <- tibble(place = omdb_list) %>%
#   unnest_wider(place) %>%
#   select(c("imdbID", "Director", "Metascore", "imdbRating", "imdbVotes", "BoxOffice")) %>% unnest() %>% distinct()

#write.csv(omdb_df, "_data/omdb.csv")
omdb_df <- read_csv("_data/omdb.csv")
omdb_df$Metascore <- as.numeric(omdb_df$Metascore)
omdb_df$BoxOffice <- omdb_df$BoxOffice %>%
  str_replace_all(",", "") %>% 
  str_extract("[0-9]*(?=$)") %>% 
  as.numeric()

omdb_df <- omdb_df %>% select(-...1)
glimpse(omdb_df)
```

Now I have the data from OMDb as `omdb_df`. Let's join
`movie_bechdel_join` and `omdb_df.`

```{r}
data <- inner_join(movie_bechdel_join, omdb_df, 
                                 by=c("imdb_id" = "imdbID"),
                                 copy = TRUE)

```

## Data Analysis and Visualization

### Is female representation in movie stories improving over time?

To find out the answer for this questions, I looked at the result of the
Bechdel test.

The number of movies that pass the Bechdel test increases over time,
especially after around 1980 as the total number of released movies
increased.

```{r}
movie_bechdel_join %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No 2 female characters", "2 female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x= "year", y = "Total number of movies", title = "The number of movies by Bechdel test ratings")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
  

```

**However, the total number of released movies is increasing as well. Is
the percentage of films that pass the Bechdel test increasing?**

In 1920's, which is 100 years ago, less than 20% of films passed Bechdel
test. There was a big increase in percentage of Bechdel test passing
movies in 1930's. However, the percentage of films passing Bechdel test
was stagnant a bit lower than 50% from 1930's to 1950's. Somehow, from
1950 to 1970, the representation of women in film went backwards, with
less than 45% of films meeting the requirements of the Bechdel test;
from the 1970s to the present, the percentage of films that pass the
Bechdel test has continued to increase, reaching approximately 70% now.
Although the representation of women seems to have improved since 1970,
it has not improved dramatically compared to the 1930s and 1940s.

```{r}
data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=years, y= n_total, fill = bechdel_rating)) + geom_bar(stat = "identity", position="fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two female characters", "1.Two female characters", "2.That talk each other", "3.Other than a man (Passes Bechdel Test)")) +
  labs(x="year", y="percentage", title = "% of movies that pass/don't pass Bechdel Test") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(ncol = 2))
```

If we look at the number of each year, we can see that some years in
1930's and 1940's did as good as in 2010s. For example, 1943 - 1944 had
a quite high percentage of Bechdel Test passing rate.

```{r}
data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No 2 female characters", "2 female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x= "year", y = "Total number of movies", title = "The number of movies by Bechdel test ratings")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
```

Perhaps the percentage of films that pass the Bechdel test has not
changed too dramatically compared to 70 years ago. However, **a
breakdown of the Bechdel test criteria for films that do not pass the
Bechdel Test shows a different trend compared to 70 years ago**.

Currently, most of not-bechdel-passing films fail to Bechdel test
because there are more than one female character but they don't talk
each other. There are few movies (still corresponding nearly 10% of
total movies released every year) that don't have two female characters.

It has not been always the case. In 1920s, quite many movies didn't even
have two featured female characters.

```{r}
data %>%
  filter(lubridate::year(release_date) > 1920 & bechdel_pass=="0") %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=years, y= n_total, fill = bechdel_rating)) + geom_bar(stat = "identity", position="fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two female characters", "1.Two female characters", "2.That talk each other about men", "3.Other than a man (Passes Bechdel Test)")) +
  labs(x="year", y="percentage", title = "% of movies that do NOT pass Bechdel test by Bechdel test criteria") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
```

### **What category represents women better?**

I focused on 6514 movies released between 1970 and 2022 because there is
a small number of movies released before 1970 (25 or less movies a year)
and it's better to focus on the priod where more movies were released to
observe the trend.

The degree to which women are portrayed in a film's story varies widely
from genre to genre.

For example, many movies in Comedy, Animation, Drama, Family, Fantasy,
Romance, and Music pass the Bechdel test, while Action, Adventure,
Crime, History, War, and Western The percentage of films that pass the
Bechdel test is about 50% or less.

```{r}
viz_by_genre <- data %>%
  pivot_longer(cols = c(War:Music),names_to = "genre_name", values_drop_na = TRUE) %>%
  filter(value == 1) 

viz_by_genre %>%
  filter(lubridate::year(release_date) > 1970) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating, genre_name) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + 
  geom_bar(stat = "identity", position = "fill") +
  facet_wrap(~genre_name, ncol = 3)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x = "year", y = "proportion", title = "Proportion of movies that pass / don't pass Bechdel test by categories")+
  scale_x_continuous(n.breaks=7) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))
  

```

```{r}
viz_by_genre %>%
  filter(lubridate::year(release_date)>1970) %>%
  group_by(bechdel_rating, genre_name) %>%
  dplyr::summarize(number = n()) %>%
  ggplot(aes(x=genre_name, y=number, fill=as.character(bechdel_rating))) +
  geom_bar(stat="identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  coord_flip() +
  labs(x="Genre", y="%")

```

### **Why do movies do NOT pass Bechdel test?**

```{r}
viz_by_genre %>%
  filter(lubridate::year(release_date)>1970 & bechdel_pass == "0") %>%
  group_by(bechdel_rating, genre_name) %>%
  dplyr::summarize(number = n()) %>%
  ggplot(aes(x=genre_name, y=number, fill=as.character(bechdel_rating))) +
  geom_bar(stat="identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  coord_flip()+
  labs(x="Genre", y="%")
```

### **Do movies with good female representation succeed in gaining more popularity on online review site?**

I compared **Metascore** and **IMDb** ratings of the films that pass and
don't pass Bechdel test.

The movies that don't pass Bechdel test seem to score higher on
**Metascore**, however the difference is quite small, and there seems to
be no relationship between whether a film passes the bechdel test and
its reputation (Metascore).

```{r}
data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=bechdel_pass, y=Metascore)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test")
```

T-test shows that there is no statistically significant difference in
Metascores of the movies that pass and that don't pass Bechdel test.

```{r}
ttest_pass <- data %>% 
  filter(bechdel_pass == "1" & lubridate::year(release_date) > 1970) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

ttest_not_pass <- data %>%
  filter(bechdel_pass == "0" & lubridate::year(release_date) > 1970) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

t.test(ttest_pass$Metascore, ttest_not_pass$Metascore)
```

Then, I compared the IMDb scores of the movies that pass and don't pass
Bechdel test. Those don't pass Bechdel test seem to score a bit higher.

```{r}
data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=bechdel_pass, y=imdbRating)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test", y="IMDb Rating")
```

T-test shows that the movies that do not meet Bechdel test criteria
score statistically higher than those that do.

```{r}
t.test(ttest_pass$imdbRating, ttest_not_pass$imdbRating)
```

### **Do movies with good female representation succeed in making more money?**

```{r}
data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=as.character(bechdel_pass), y=BoxOffice)) + geom_boxplot() 

```

```{r}
t.test(ttest_pass$BoxOffice, ttest_not_pass$BoxOffice)
```

## Conclusion & Discussion

**% of the movies that pass Bechdel test:**

Compared to 100 years ago, the rate of movies that pass Bechdel test
significantly increased, however,

## Contact

The dataset that I used for this analysis can be found on my GitHub
(<https://github.com/Enagai-nagai/601_Fall_2022/tree/template/posts/_data>).

I'm interested in analyzing and visualizing data related to
entertainment industry, behavior on digital platforms, and gender
issues,

If you have any comments or questions, please contact me through Email.

-   enagai ★ umass.edu (please replace ★ with \@)

-   nagainagai.e ★ gmail.com (please replace ★ with \@)

## References

*Study: Women made up 34% of speaking roles in 2021's top films,
majority of those characters were white*. Women and Hollywood. (2022,
March 15). Retrieved December 13, 2022, from
https://womenandhollywood.com/study-women-made-up-34-of-speaking-roles-in-2021s-top-films-majority-of-those-characters-were-white/#:\~:text=In%202021's%20top%20films%2C%20females,and%2037%20percent%20in%202019.