Final Project Erika Nagai

movie

gender

bechdel test

female representation

Erika Nagai

Female representation in movie stories

Author

Erika Nagai

Published

December 23, 2022

Introduction

Even though half of the moviegoers in the United States are women, their story is not represented as much as that of men. Only 35% of the main characters in the best movies of 2021 were female, and male characters outnumbered female characters in 85% of the films. (Women and Hollywood, 2022) Also lacking are women behind the scenes in the film business. In 2021, just 12% of the top 100 films are directed by women. (Lauzen, n.d.)

In this project, I focused on analyzing female representation in movie stories rather than in their production. I used the data from three major movie databases, The Movie Database (TMDb), MovieLens, Open Movie Database (OMDb), and the Bechdel Test, one of the most common criteria to measure female presentation.

I would like to answer the following questions in this analysis:

Is female representation in movie stories improving over time?
What genres are doing better than others in terms of female representation?
Are movies with good female representation more popular?
Do movies with good female representation more money?

This analysis intends to document the data cleaning process, which is quite long. If you are curious only about the analysis results, please skip to Data Analysis and Visualization part.

Code

# install libraries

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

Warning: package 'tibble' was built under R version 4.2.2

Warning: package 'stringr' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(summarytools)

Warning: package 'summarytools' was built under R version 4.2.2


Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view

Code

library(lubridate)

Warning: package 'lubridate' was built under R version 4.2.2

Loading required package: timechange

Warning: package 'timechange' was built under R version 4.2.2


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Code

library(ggridges)

Read in data

For this analysis, I used the following different data sets.

(1) “The Movie Data set” from Kaggle :

This dataset was obtained from Kaggle “The Movie Data set”(https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv).

Code

#(1) movies_metadata.csv obtained from Kaggle
movie = read_csv("_data/movies_metadata.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 45466 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
dbl   (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
lgl   (2): adult, video
date  (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(2) Bechdel test API:

I generated the data by using the bechdel test API https://bechdeltest.com/api/v1/doc.

Code

#(2) bechdel test obtained by using bechdel API

library(rjson)
library(jsonlite)

Warning: package 'jsonlite' was built under R version 4.2.2


Attaching package: 'jsonlite'

The following objects are masked from 'package:rjson':

    fromJSON, toJSON

The following object is masked from 'package:purrr':

    flatten

Code

#json_file <- "http://bechdeltest.com/api/v1/getAllMovies"
#bechdel <- read_json(path = json_file, simplifyVector = TRUE)
#bechdel$titleId <- paste("tt",bechdel$imdbid, sep = "")

#write.csv(bechdel, file = "_data/bechdel.csv")
bechdel <- read_csv("_data/bechdel.csv")

New names:
• `` -> `...1`

Rows: 9802 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): imdbid, title, titleId
dbl (4): ...1, year, rating, id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(3) Open Movie Database (OMDb):

This dataset was generated by using Open Movie Database (OMDb) API https://www.omdbapi.com/. OMDb provides detailed information such as director, writer, nominated award, the number/score of reviews on IMDb, etc… if you provide the name or IMDb id of movies. I decided to use this database because it gives me data related to movies’ popularity and financial success of the films, which “The Movie Data set” doesn’t include.

OMDb doesn’t give you a list of all movies registered on it. Instead, you need to provide the exact movie title or the IMDb id to get the list of information. Thus, I will use this API once I have the data that join movies_metadata.csv and the Bechdel test. (Please refer to “Read in data / Describe data (OMDb)” for this process.)

Describe data

(1) “The Movie Data set”

This data was originally created from The Movie Database (https://www.themoviedb.org/) and MovieLens (https://movielens.org/).This movie dataset was generated by Movielens, a (non-profit) movie review website (https://movielens.org/), and was obtained from the following Kaggle link. (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv)

The movie dataset contains 45,466 movies with release date between December 9th, 1874 and December 16th, 2020.

The data includes the general information of movies, such as genres, revenue, run time, languages, status (released/in production etc…).

Code

print(summarytools::dfSummary(movie),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

movie

Dimensions: 45466 x 24
Duplicates: 17

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

adult [logical]

1. FALSE

2. TRUE

45454	(	100.0%	)
9	(	0.0%	)

3 (0.0%)

belongs_to_collection [character]

1. {'id': 415931, 'name': 'T

2. {'id': 421566, 'name': 'T

3. {'id': 645, 'name': 'Jame

4. {'id': 96887, 'name': 'Za

5. {'id': 37261, 'name': 'Th

6. {'id': 34055, 'name': 'Po

7. {'id': 413661, 'name': 'C

8. {'id': 374509, 'name': 'G

9. {'id': 148324, 'name': 'U

10. {'id': 38451, 'name': 'Ch

[ 1688 others ]

29	(	0.6%	)
27	(	0.6%	)
26	(	0.6%	)
26	(	0.6%	)
25	(	0.6%	)
22	(	0.5%	)
21	(	0.5%	)
16	(	0.4%	)
15	(	0.3%	)
15	(	0.3%	)
4272	(	95.1%	)

40972 (90.1%)

budget [numeric]

Mean (sd) : 4224579 (17424133)

min ≤ med ≤ max:

0 ≤ 0 ≤ 3.8e+08

IQR (CV) : 0 (4.1)

1223 distinct values

3 (0.0%)

genres [character]

1. [{'id': 18, 'name': 'Dram

2. [{'id': 35, 'name': 'Come

3. [{'id': 99, 'name': 'Docu

4. []

5. [{'id': 18, 'name': 'Dram

6. [{'id': 35, 'name': 'Come

7. [{'id': 27, 'name': 'Horr

8. [{'id': 35, 'name': 'Come

9. [{'id': 35, 'name': 'Come

10. [{'id': 18, 'name': 'Dram

[ 4059 others ]

5000	(	11.0%	)
3621	(	8.0%	)
2723	(	6.0%	)
2442	(	5.4%	)
1301	(	2.9%	)
1135	(	2.5%	)
974	(	2.1%	)
930	(	2.0%	)
593	(	1.3%	)
532	(	1.2%	)
26215	(	57.7%	)

0 (0.0%)

homepage [character]

1. http://www.georgecarlin.c

2. http://www.wernerherzog.c

3. http://breakblade.jp/

4. http://phantasm.com

5. http://www.crownintlpictu

6. http://www.crownintlpictu

7. http://www.crownintlpictu

8. http://www.kungfupanda.co

9. http://www.missionimpossi

10. http://www.thehungergames

[ 7663 others ]

12	(	0.2%	)
7	(	0.1%	)
6	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
7729	(	99.3%	)

37684 (82.9%)

id [numeric]

Mean (sd) : 108359.9 (112460.7)

min ≤ med ≤ max:

2 ≤ 60003 ≤ 469172

IQR (CV) : 130878.5 (1)

45433 distinct values

3 (0.0%)

imdb_id [character]

1. 0

2. tt1180333

3. tt0022537

4. tt0022879

5. tt0046468

6. tt0062229

7. tt0067306

8. tt0080000

9. tt0082992

10. tt0084387

[ 45407 others ]

3	(	0.0%	)
3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
45427	(	100.0%	)

17 (0.0%)

original_language [character]

1. en

2. fr

3. it

4. ja

5. de

6. es

7. ru

8. hi

9. ko

10. zh

[ 82 others ]

32269	(	71.0%	)
2438	(	5.4%	)
1529	(	3.4%	)
1350	(	3.0%	)
1080	(	2.4%	)
994	(	2.2%	)
826	(	1.8%	)
508	(	1.1%	)
444	(	1.0%	)
409	(	0.9%	)
3608	(	7.9%	)

11 (0.0%)

original_title [character]

1. Alice in Wonderland

2. Hamlet

3. A Christmas Carol

4. Cinderella

5. Les Misérables

6. Macbeth

7. The Three Musketeers

8. Blackout

9. Frankenstein

10. Heidi

[ 43363 others ]

8	(	0.0%	)
8	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
45397	(	99.8%	)

0 (0.0%)

overview [character]

1. No overview found.

2. No Overview

3. A few funny little novels

4. Adaptation of the Jane Au

5. King Lear, old and tired,

6. No movie overview availab

7. Recovering from a nail gu

8. Released

9. A film by Jem Cohen

10. A former aristocrat Ippol

[ 44296 others ]

133	(	0.3%	)
7	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
44345	(	99.6%	)

959 (2.1%)

popularity [numeric]

Mean (sd) : 2.9 (6)

min ≤ med ≤ max:

0 ≤ 1.1 ≤ 547.5

IQR (CV) : 3.3 (2.1)

43757 distinct values

6 (0.0%)

poster_path [character]

1. /5D7UBSEgdyONE6Lql6xS7s6O

2. /2kslZXOaW0HmnGuVPCnQlCdX

3. /qW1oQlOHizRHXZQrpkimYr0o

4. /8VSZ9coCzxOCW2wE2Qene1H1

5. /cdwVC18URfEdQjjxqJyRMoGD

6. /2ngnUQX9Abkesfq72uvBF3uj

7. /2trk2fape4S0CYnRg38kNVe8

8. /4J6Ai4C5YRgfRUTlirrJ7Qsm

9. /5GasjPRAy5rlEyDOH7MeOyxy

10. /5ILjS6XB5deiHop8SXPsYxXW

[ 45014 others ]

5	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
45051	(	99.9%	)

386 (0.8%)

production_companies [character]

1. []

2. [{'name': 'Metro-Goldwyn-

3. [{'name': 'Warner Bros.',

4. [{'name': 'Paramount Pict

5. [{'name': 'Twentieth Cent

6. [{'name': 'Universal Pict

7. [{'name': 'RKO Radio Pict

8. [{'name': 'Columbia Pictu

9. [{'name': 'Columbia Pictu

10. [{'name': 'Mosfilm', 'id'

[ 22698 others ]

11875	(	26.1%	)
742	(	1.6%	)
540	(	1.2%	)
505	(	1.1%	)
439	(	1.0%	)
320	(	0.7%	)
247	(	0.5%	)
207	(	0.5%	)
146	(	0.3%	)
145	(	0.3%	)
30297	(	66.6%	)

3 (0.0%)

production_countries [character]

1. [{'iso_3166_1': 'US', 'na

2. []

3. [{'iso_3166_1': 'GB', 'na

4. [{'iso_3166_1': 'FR', 'na

5. [{'iso_3166_1': 'JP', 'na

6. [{'iso_3166_1': 'IT', 'na

7. [{'iso_3166_1': 'CA', 'na

8. [{'iso_3166_1': 'DE', 'na

9. [{'iso_3166_1': 'IN', 'na

10. [{'iso_3166_1': 'RU', 'na

[ 2383 others ]

17851	(	39.3%	)
6282	(	13.8%	)
2238	(	4.9%	)
1654	(	3.6%	)
1356	(	3.0%	)
1030	(	2.3%	)
840	(	1.8%	)
749	(	1.6%	)
735	(	1.6%	)
735	(	1.6%	)
11993	(	26.4%	)

3 (0.0%)

release_date [Date]

min : 1874-12-09

med : 2001-08-30

max : 2020-12-16

range : 146y 0m 7d

17333 distinct values

90 (0.2%)

revenue [numeric]

Mean (sd) : 11209349 (64332247)

min ≤ med ≤ max:

0 ≤ 0 ≤ 2787965087

IQR (CV) : 0 (5.7)

6863 distinct values

6 (0.0%)

runtime [numeric]

Mean (sd) : 94.1 (38.4)

min ≤ med ≤ max:

0 ≤ 95 ≤ 1256

IQR (CV) : 22 (0.4)

353 distinct values

263 (0.6%)

spoken_languages [character]

1. [{'iso_639_1': 'en', 'nam

2. []

3. [{'iso_639_1': 'fr', 'nam

4. [{'iso_639_1': 'ja', 'nam

5. [{'iso_639_1': 'it', 'nam

6. [{'iso_639_1': 'es', 'nam

7. [{'iso_639_1': 'ru', 'nam

8. [{'iso_639_1': 'de', 'nam

9. [{'iso_639_1': 'en', 'nam

10. [{'iso_639_1': 'en', 'nam

[ 1921 others ]

22395	(	49.3%	)
3829	(	8.4%	)
1853	(	4.1%	)
1289	(	2.8%	)
1218	(	2.7%	)
902	(	2.0%	)
807	(	1.8%	)
762	(	1.7%	)
681	(	1.5%	)
572	(	1.3%	)
11152	(	24.5%	)

6 (0.0%)

status [character]

1. Canceled

2. In Production

3. Planned

4. Post Production

5. Released

6. Rumored

2	(	0.0%	)
20	(	0.0%	)
15	(	0.0%	)
98	(	0.2%	)
45014	(	99.2%	)
230	(	0.5%	)

87 (0.2%)

tagline [character]

1. Based on a true story.

2. -

3. Be careful what you wish

4. Trust no one.

5. A Love Story

6. Classic Albums

7. Documentary

8. Drama

9. How far would you go?

10. Know Your Enemy

[ 20272 others ]

7	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
20374	(	99.8%	)

25055 (55.1%)

title [character]

1. Cinderella

2. Alice in Wonderland

3. Hamlet

4. Beauty and the Beast

5. Les Misérables

6. A Christmas Carol

7. Blackout

8. The Three Musketeers

9. Treasure Island

10. Aftermath

[ 42267 others ]

11	(	0.0%	)
9	(	0.0%	)
9	(	0.0%	)
8	(	0.0%	)
8	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
6	(	0.0%	)
45381	(	99.8%	)

6 (0.0%)

video [logical]

1. FALSE

2. TRUE

45367	(	99.8%	)
93	(	0.2%	)

6 (0.0%)

vote_average [numeric]

Mean (sd) : 5.6 (1.9)

min ≤ med ≤ max:

0 ≤ 6 ≤ 10

IQR (CV) : 1.8 (0.3)

92 distinct values

6 (0.0%)

vote_count [numeric]

Mean (sd) : 109.9 (491.3)

min ≤ med ≤ max:

0 ≤ 10 ≤ 14075

IQR (CV) : 31 (4.5)

1820 distinct values

6 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

(2) Bechdel test

This data (bechdel) dataset documents the Bechdel test rating of 9630 movies released between 1874 and 2022. Each row represents a movie.

According to “Merriam-Webster”, the Bechdel test is “a set of criteria used as a test to evaluate a work of fiction (such as a film) based on its inclusion and representation of female characters” (https://www.merriam-webster.com/dictionary/Bechdel%20Test)

A movie passes the Bechdel test when it has at least two female featured characters that talk to each other about other than a man (men).

In bechdel, the Bechdel test rating is registered in the following manner.

0 ~ No two female characters

1 ~ Two female characters who don’t talk to each other

2 ~ Two female characters talk to each other about a man (men)

3 ~ Passes Bechdel test: Two female characters talk to each other about other than a man (men)

Apart from Bechdel ratings, this dataset (bechdel) contains the released year, title of movies, and id. id that starts with “tt” followed by 7 digits is the IMDb id and can work as a foreign key when joining this data with the movie dataset.

Code

print(summarytools::dfSummary(bechdel),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

bechdel

Dimensions: 9802 x 7
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

...1 [numeric]

Mean (sd) : 4901.5 (2829.7)

min ≤ med ≤ max:

1 ≤ 4901.5 ≤ 9802

IQR (CV) : 4900.5 (0.6)

9802 distinct values

0 (0.0%)

year [numeric]

Mean (sd) : 1996.2 (27)

min ≤ med ≤ max:

1010 ≤ 2006 ≤ 2022

IQR (CV) : 25 (0)

142 distinct values

0 (0.0%)

rating [numeric]

Mean (sd) : 2.1 (1.1)

min ≤ med ≤ max:

0 ≤ 3 ≤ 3

IQR (CV) : 2 (0.5)

0	:	1084	(	11.1%	)
1	:	2124	(	21.7%	)
2	:	1000	(	10.2%	)
3	:	5594	(	57.1%	)

0 (0.0%)

id [numeric]

Mean (sd) : 5224.9 (3052.9)

min ≤ med ≤ max:

1 ≤ 5211.5 ≤ 10641

IQR (CV) : 5233.5 (0.6)

9802 distinct values

0 (0.0%)

imdbid [character]

1. 0035279
2. 0086425
3. 0117056
4. 2043900
5. 2457282
6. 0000001
7. 0000002
8. 0000003
9. 0000004
10. 0000005
[ 9784 others ]

2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
9784	(	99.8%	)

3 (0.0%)

title [character]

1. Cinderella

2. Dracula

3. Little Women

4. Pride and Prejudice

5. Robin Hood

6. Shelter

7. A Star Is Born

8. Alice in Wonderland

9. Anna Karenina

10. Annie

[ 9547 others ]

5	(	0.1%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
9765	(	99.6%	)

0 (0.0%)

titleId [character]

1. tt

2. tt0035279

3. tt0086425

4. tt0117056

5. tt2043900

6. tt2457282

7. tt0000001

8. tt0000002

9. tt0000003

10. tt0000004

[ 9785 others ]

3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
9785	(	99.8%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

Tidy data

(1) TMDb data

Checking missing values

First of all, I took a look at missing values (NA values) in the data.

Code

movie %>% select(everything()) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  t()

Warning: `funs()` was deprecated in dplyr 0.8.0.
ℹ Please use a list of either functions or lambdas:

# Simple named list: list(mean = mean, median = median)

# Auto named with `tibble::lst()`: tibble::lst(mean, median)

# Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

                       [,1]
adult                     3
belongs_to_collection 40972
budget                    3
genres                    0
homepage              37684
id                        3
imdb_id                  17
original_language        11
original_title            0
overview                959
popularity                6
poster_path             386
production_companies      3
production_countries      3
release_date             90
revenue                   6
runtime                 263
spoken_languages          6
status                   87
tagline               25055
title                     6
video                     6
vote_average              6
vote_count                6

Deleting unnecessary columns

I deleted the following columns, which seem irrelevant for this analysis or have too many missing values.
- adult: This information doesn’t add any significant meaning as almost all of the observations have FALSE value for this column

- belong_to_collection: This column has 40000+ missing values
- homepage: Unnecessary for this analysis

- overview: Unnecessary for this analysis

- poster_path: Unnecessary for this analysis

- tagline: Unnecessary for this analysis
- video: This information doesn’t add any significant meaning as almost all of observations have FALSE value for this column

- popularity: This information may be interesting, however, it is not clear how this popularity is measured or where it was generated. Therefore, I decided to use popularity data from OMDb.

- vote_average: Same as popularity

-vote_count: Same as popularity

Then, I changed the order of the columns.

Code

movie <- movie %>% select(-c("adult", "homepage", "overview", "poster_path", "tagline", "belongs_to_collection", "poster_path", "video", "popularity", "vote_average", "vote_count"))

col_order <- c("title", "original_title", "imdb_id", "id", "production_companies", "production_countries", "status", "release_date", "runtime", "revenue", "budget", "original_language", "spoken_languages", "genres")
movie <- movie[, col_order]

colnames(movie)[4] <- "movielens_id"

Cleaning data in JSON nested list format

The values in some certain columns such as “genres”, “production_companies”, “production_countries”,“spoken_languages” are in a JSON list format for example:

” [{‘id’: XXXX, ‘content(name/genre/title)’: XXX}, {‘id’: XXXX, ’content(name/genre/title): XXX}] ”

Code

movie %>% select(c("genres", "production_companies", "production_countries", "spoken_languages"))

# A tibble: 45,466 × 4
   genres                                                produ…¹ produ…² spoke…³
   <chr>                                                 <chr>   <chr>   <chr>  
 1 [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name':… [{'nam… [{'iso… [{'iso…
 2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name':… [{'nam… [{'iso… [{'iso…
 3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name'… [{'nam… [{'iso… [{'iso…
 4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'D… [{'nam… [{'iso… [{'iso…
 5 [{'id': 35, 'name': 'Comedy'}]                        [{'nam… [{'iso… [{'iso…
 6 [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'C… [{'nam… [{'iso… [{'iso…
 7 [{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name':… [{'nam… [{'iso… [{'iso…
 8 [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'A… [{'nam… [{'iso… [{'iso…
 9 [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'A… [{'nam… [{'iso… [{'iso…
10 [{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name':… [{'nam… [{'iso… [{'iso…
# … with 45,456 more rows, and abbreviated variable names
#   ¹production_companies, ²production_countries, ³spoken_languages

Code

# These columns include [] {} and ' in their values so I made a function remove_simbols that removes these unnecessary symbols.
remove_symbols <- function(x) {
  removed_x <- str_remove_all(x, "\\{|\\}|\\[|\\}|\\]|'|,|id|name|:| ")
  return(removed_x)
}

(1). Tidying `Genres`

Each movie contains information on one or more (up to 8) genres in a single column. To organize the information, the multiple genres contained in genre are split by “],” so that each genre is contained in eight new columns, genre1-genre8. Then, I removed unnecessary symbols such as ,.

Code

# Separating the `genre` by "},"
movie <- movie %>% 
  separate(genres, c("genre1", "genre2", "genre3", "genre4", "genre5", "genre6", "genre7", "genre8"), "\\},", remove = FALSE)

Warning: Expected 8 pieces. Missing pieces filled with `NA` in 45463 rows [1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

Code

# Remove unnecessary symbols from each columns

movie$genre1 <- remove_symbols(movie$genre1)
movie$genre2 <- remove_symbols(movie$genre2)
movie$genre3 <- remove_symbols(movie$genre3)
movie$genre4 <- remove_symbols(movie$genre4)
movie$genre5 <- remove_symbols(movie$genre5)
movie$genre6 <- remove_symbols(movie$genre6)
movie$genre7 <- remove_symbols(movie$genre7)
movie$genre8 <- remove_symbols(movie$genre8)

movie %>% 
  select(matches("[1-9]"))

# A tibble: 45,466 × 8
   genre1       genre2       genre3       genre4     genre5 genre6 genre7 genre8
   <chr>        <chr>        <chr>        <chr>      <chr>  <chr>  <chr>  <chr> 
 1 16Animation  35Comedy     10751Family  <NA>       <NA>   <NA>   <NA>   <NA>  
 2 12Adventure  14Fantasy    10751Family  <NA>       <NA>   <NA>   <NA>   <NA>  
 3 10749Romance 35Comedy     <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 4 35Comedy     18Drama      10749Romance <NA>       <NA>   <NA>   <NA>   <NA>  
 5 35Comedy     <NA>         <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 6 28Action     80Crime      18Drama      53Thriller <NA>   <NA>   <NA>   <NA>  
 7 35Comedy     10749Romance <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 8 28Action     12Adventure  18Drama      10751Fami… <NA>   <NA>   <NA>   <NA>  
 9 28Action     12Adventure  53Thriller   <NA>       <NA>   <NA>   <NA>   <NA>  
10 12Adventure  28Action     53Thriller   <NA>       <NA>   <NA>   <NA>   <NA>  
# … with 45,456 more rows

I managed to separate the genre information above. Now all values in genre 1 - 8 has a value in the format of “number + name of genre” (example: 35Comedy)

However, there are 2,445 observations whose genre value doesn’t fit the above format. Most of them don’t have any genre assigned so their genre1 values are empty or blank, which means they don’t have any genre assigned to them.

Code

movie %>% filter(!str_detect(genre1, "^[0-9]")) %>% select(starts_with("genre"))

# A tibble: 2,445 × 9
   genres genre1 genre2 genre3 genre4 genre5 genre6 genre7 genre8
   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 2 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 3 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 4 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 5 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 6 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 7 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 8 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 9 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
10 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
# … with 2,435 more rows

However, looking at them closely, there are three observations whose genre value that is NOT blank but have strange strings.

The values in these observations such as “Carousel Production” and “Aniplex” don’t look like a name of genres but that of production studios. Also their original_title values don’t look like a title of movies but information of languages. It seems like these observations were not correctly read in because their values in other columns look weird.

Since these observations are not reliable, I decided to delete them from this dataset.

Code

movie %>% filter(!str_detect(genre1, "^[0-9]") &!str_detect(genre1, "^[ \t\n]*$")) %>% #^[ \t\n]*$ is a regular expression for blank.
  select(c(original_title, production_countries, genre1:genre3))

# A tibble: 3 × 5
  original_title                           production_cou…¹ genre1 genre2 genre3
  <chr>                                    <chr>            <chr>  <chr>  <chr> 
1 [{'iso_639_1': 'en', 'name': 'English'}] 6.0              Carou… Visio… Teles…
2 [{'iso_639_1': 'ja', 'name': '日本語'}]  7.0              Anipl… GoHan… BROST…
3 [{'iso_639_1': 'en', 'name': 'English'}] 4.3              Odyss… Pulse… Rogue…
# … with abbreviated variable name ¹production_countries

Code

movie_clean1 <- movie %>% filter(!str_detect(genre1, "^[A-Z]"))

Genre information is more organized but is still not easy to be analyzed. Thus, I decided to make dummy variables of each genre.

For example, if movie A is categorized as “comedy” and “adventure”, the line of movie A should have 1 in the “comedy” column and the “adventure” respectively and 0 in columns of other genres.

Code

#https://community.rstudio.com/t/creating-dummy-columns-based-on-multiple-columns/58145/3
movie_clean1 <- movie_clean1 %>% 
  pivot_longer(cols = matches("genre[1-9]")) %>%
  add_column(count = 1) %>%
  arrange(value) %>%
  filter(str_detect(value, "^[0-9]")) %>%
  mutate(value1 = str_replace_all(value, "[0-9]+", "")) %>%
  select(-c(value, name)) %>%
  arrange(title) %>%
  pivot_wider(
    names_from = value1, 
    values_from = count, 
    values_fill = list(count=0),
    values_fn = list(count = mean)) %>%
  arrange(title)

# I also deleted genre "TVMovie" and "Foreign" because they're more format or origin country information rather than genre.
movie_clean1 <- movie_clean1 %>% select(-c("TVMovie","Foreign"))

(2). Tyding `Production_countries`

The values in production_countries are written in the following way.

[{‘iso_3166_1’: ‘abbreviation of country’,‘name’: ‘full country name’}]

When there is more than one country in the value, I took the first country in account for this analysis.

Code

movie_clean1$production_countries <- remove_symbols(movie_clean1$production_countries)
movie_clean1$production_countries <- str_extract(movie_clean1$production_countries,"(?<=_1)\\w{2}")

movie_clean1 %>% select(production_countries)

# A tibble: 42,991 × 1
   production_countries
   <chr>               
 1 GB                  
 2 IN                  
 3 US                  
 4 <NA>                
 5 US                  
 6 US                  
 7 FR                  
 8 JP                  
 9 <NA>                
10 US                  
# … with 42,981 more rows

(3). Tyding `production_companies` & `spoken_languages`

The values in both of production_companies and spoken_languages are in the following format.

[{‘name’: ‘XXXX’, ‘id’: —}] > [{‘iso_639_1’:‘XXXX’, ‘name’:“—-}]

I only need the information that is written as XXXX in this format.

Code

movie_clean1$production_companies <-str_remove_all(movie_clean1$production_companies, "'id': [0-9]*|'name':|\\[|\\]|'|\\{|,") %>%
  str_replace_all("\\},", ",") %>%
  str_replace_all(" \\} ", ",")%>%
  str_remove(" \\}")

movie_clean1$spoken_languages <-str_remove_all(movie_clean1$spoken_languages, "'iso_639_1':|'name': '\\w+'|\\[|\\]|'|\\{|,") %>%
  str_replace_all(" \\} ", ",") %>%
  str_remove_all(" \\}")

movie_clean1 %>% select(production_companies, spoken_languages)

# A tibble: 42,991 × 2
   production_companies                                                  spoke…¹
   <chr>                                                                 <chr>  
 1 " Screen Yorkshire, British Film Institute (BFI), Creative Scotland,… " en"  
 2 ""                                                                    " hi"  
 3 " Sebastian Films Limited"                                            " en"  
 4 " Lorimar Pictures"                                                   " en"  
 5 " Lone Star Production, Monogram Pictures"                            " en, …
 6 ""                                                                    " en"  
 7 " Barnholtz Entertainment"                                            " en, …
 8 ""                                                                    " zh"  
 9 ""                                                                    " fr"  
10 " Disney Channel"                                                     " en"  
# … with 42,981 more rows, and abbreviated variable name ¹spoken_languages

Adding a new column

A new column years was created to group years by decade.

Code

movie_clean1 <- movie_clean1 %>% 
  mutate(
    years = case_when(
    lubridate::year(release_date) < 1920 ~ "Before 1920",
    lubridate::year(release_date) >= 1920 & lubridate::year(release_date) < 1930 ~ "1920s",
    lubridate::year(release_date) >= 1930 & lubridate::year(release_date) < 1940 ~ "1930s",
    lubridate::year(release_date) >= 1940 & lubridate::year(release_date) < 1950 ~ "1940s",
    lubridate::year(release_date) >= 1950 & lubridate::year(release_date) < 1960 ~ "1950s",
    lubridate::year(release_date) >= 1960 & lubridate::year(release_date) < 1970 ~ "1960s",
    lubridate::year(release_date) >= 1970 & lubridate::year(release_date) < 1980 ~ "1970s",
    lubridate::year(release_date) >= 1980 & lubridate::year(release_date) < 1990 ~ "1990s",
    lubridate::year(release_date) >= 1990 & lubridate::year(release_date) < 2000 ~ "1990s",
    lubridate::year(release_date) >= 2000 & lubridate::year(release_date) < 2010 ~ "2000s",
    TRUE ~ "2010s"
  )
  )

Now, the dataset is cleaner. Look at the summary again.

Code

print(summarytools::dfSummary(movie_clean1),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

movie_clean1

Dimensions: 42991 x 33
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

title [character]

1. Cinderella

2. Alice in Wonderland

3. Hamlet

4. Les Misérables

5. A Christmas Carol

6. Beauty and the Beast

7. The Three Musketeers

8. Treasure Island

9. Aftermath

10. Countdown

[ 40025 others ]

10	(	0.0%	)
9	(	0.0%	)
9	(	0.0%	)
8	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
42912	(	99.8%	)

3 (0.0%)

original_title [character]

1. Alice in Wonderland

2. Hamlet

3. A Christmas Carol

4. Les Misérables

5. Macbeth

6. The Three Musketeers

7. Cinderella

8. Frankenstein

9. Heidi

10. The Hound of the Baskervi

[ 41043 others ]

8	(	0.0%	)
8	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
42923	(	99.8%	)

0 (0.0%)

imdb_id [character]

1. tt0000001

2. tt0000003

3. tt0000005

4. tt0000008

5. tt0000010

6. tt0000012

7. tt0000014

8. tt0000023

9. tt0000029

10. tt0000070

[ 42969 others ]

1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
42969	(	100.0%	)

12 (0.0%)

movielens_id [numeric]

Mean (sd) : 104553.2 (111978.6)

min ≤ med ≤ max:

2 ≤ 55734 ≤ 469172

IQR (CV) : 118823.5 (1.1)

42991 distinct values

0 (0.0%)

production_companies [character]

1. (Empty string)

Metro-Goldwyn-Mayer (MGM

Warner Bros.

Paramount Pictures

Twentieth Century Fox Fi

Universal Pictures

RKO Radio Pictures

Columbia Pictures Corpor

Columbia Pictures

10.

Mosfilm

[ 22474 others ]

9770	(	22.7%	)
737	(	1.7%	)
538	(	1.3%	)
500	(	1.2%	)
434	(	1.0%	)
320	(	0.7%	)
247	(	0.6%	)
207	(	0.5%	)
146	(	0.3%	)
144	(	0.3%	)
29945	(	69.7%	)

3 (0.0%)

production_countries [character]

1. US

2. GB

3. FR

4. CA

5. JP

6. DE

7. IT

8. RU

9. IN

10. ES

[ 132 others ]

18202	(	47.5%	)
3031	(	7.9%	)
2639	(	6.9%	)
1469	(	3.8%	)
1430	(	3.7%	)
1389	(	3.6%	)
1382	(	3.6%	)
777	(	2.0%	)
754	(	2.0%	)
576	(	1.5%	)
6653	(	17.4%	)

4689 (10.9%)

status [character]

1. In Production

2. Planned

3. Post Production

4. Released

5. Rumored

20	(	0.0%	)
15	(	0.0%	)
96	(	0.2%	)
42601	(	99.2%	)
205	(	0.5%	)

54 (0.1%)

release_date [Date]

min : 1874-12-09

med : 2001-10-29

max : 2020-12-16

range : 146y 0m 7d

16860 distinct values

29 (0.1%)

runtime [numeric]

Mean (sd) : 95.5 (36.6)

min ≤ med ≤ max:

0 ≤ 95 ≤ 1256

IQR (CV) : 21 (0.4)

343 distinct values

182 (0.4%)

revenue [numeric]

Mean (sd) : 11845510 (66095743)

min ≤ med ≤ max:

0 ≤ 0 ≤ 2787965087

IQR (CV) : 0 (5.6)

6844 distinct values

3 (0.0%)

budget [numeric]

Mean (sd) : 4461834 (17886326)

min ≤ med ≤ max:

0 ≤ 0 ≤ 3.8e+08

IQR (CV) : 0 (4)

1215 distinct values

0 (0.0%)

original_language [character]

1. en

2. fr

3. it

4. ja

5. de

6. es

7. ru

8. hi

9. ko

10. zh

[ 79 others ]

30566	(	71.1%	)
2348	(	5.5%	)
1359	(	3.2%	)
1287	(	3.0%	)
1029	(	2.4%	)
927	(	2.2%	)
786	(	1.8%	)
478	(	1.1%	)
431	(	1.0%	)
391	(	0.9%	)
3379	(	7.9%	)

10 (0.0%)

spoken_languages [character]

2. (Empty string)

en, fr

10.

en, es

[ 1896 others ]

21943	(	51.0%	)
2660	(	6.2%	)
1785	(	4.2%	)
1220	(	2.8%	)
1097	(	2.6%	)
826	(	1.9%	)
769	(	1.8%	)
722	(	1.7%	)
679	(	1.6%	)
568	(	1.3%	)
10719	(	24.9%	)

3 (0.0%)

genres [character]

1. [{'id': 18, 'name': 'Dram

2. [{'id': 35, 'name': 'Come

3. [{'id': 99, 'name': 'Docu

4. [{'id': 18, 'name': 'Dram

5. [{'id': 35, 'name': 'Come

6. [{'id': 27, 'name': 'Horr

7. [{'id': 35, 'name': 'Come

8. [{'id': 35, 'name': 'Come

9. [{'id': 18, 'name': 'Dram

10. [{'id': 27, 'name': 'Horr

[ 4055 others ]

4996	(	11.6%	)
3620	(	8.4%	)
2721	(	6.3%	)
1300	(	3.0%	)
1133	(	2.6%	)
974	(	2.3%	)
930	(	2.2%	)
593	(	1.4%	)
531	(	1.2%	)
528	(	1.2%	)
25665	(	59.7%	)

0 (0.0%)

War [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41669	(	96.9%	)
1	:	1322	(	3.1%	)

0 (0.0%)

Drama [numeric]

Min : 0

Mean : 0.5

Max : 1

0	:	22747	(	52.9%	)
1	:	20244	(	47.1%	)

0 (0.0%)

Action [numeric]

Min : 0

Mean : 0.2

Max : 1

0	:	36399	(	84.7%	)
1	:	6592	(	15.3%	)

0 (0.0%)

Thriller [numeric]

Min : 0

Mean : 0.2

Max : 1

0	:	35372	(	82.3%	)
1	:	7619	(	17.7%	)

0 (0.0%)

Crime [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	38687	(	90.0%	)
1	:	4304	(	10.0%	)

0 (0.0%)

Western [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41949	(	97.6%	)
1	:	1042	(	2.4%	)

0 (0.0%)

Documentary [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	39061	(	90.9%	)
1	:	3930	(	9.1%	)

0 (0.0%)

Romance [numeric]

Min : 0

Mean : 0.2

Max : 1

0	:	36261	(	84.3%	)
1	:	6730	(	15.7%	)

0 (0.0%)

Family [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	40224	(	93.6%	)
1	:	2767	(	6.4%	)

0 (0.0%)

Animation [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41060	(	95.5%	)
1	:	1931	(	4.5%	)

0 (0.0%)

Comedy [numeric]

Min : 0

Mean : 0.3

Max : 1

0	:	29815	(	69.4%	)
1	:	13176	(	30.6%	)

0 (0.0%)

Horror [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	38320	(	89.1%	)
1	:	4671	(	10.9%	)

0 (0.0%)

Mystery [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	40527	(	94.3%	)
1	:	2464	(	5.7%	)

0 (0.0%)

Adventure [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	39501	(	91.9%	)
1	:	3490	(	8.1%	)

0 (0.0%)

History [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41593	(	96.7%	)
1	:	1398	(	3.3%	)

0 (0.0%)

ScienceFiction [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	39947	(	92.9%	)
1	:	3044	(	7.1%	)

0 (0.0%)

Fantasy [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	40682	(	94.6%	)
1	:	2309	(	5.4%	)

0 (0.0%)

Music [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41394	(	96.3%	)
1	:	1597	(	3.7%	)

0 (0.0%)

years [character]

1. 1920s

2. 1930s

3. 1940s

4. 1950s

5. 1960s

6. 1970s

7. 1990s

8. 2000s

9. 2010s

10. Before 1920

398	(	0.9%	)
1264	(	2.9%	)
1444	(	3.4%	)
1997	(	4.6%	)
2474	(	5.8%	)
3270	(	7.6%	)
8965	(	20.9%	)
10605	(	24.7%	)
12291	(	28.6%	)
283	(	0.7%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

(2) Bechdel data

Removing and Mutating Columns

imdbid and titleId are duplicated because titleId is “tt+imdbid”. movie_clean1 data frame has IDs in the same format (starting with tt) as titleId of bechdel data frame so I deleted imdbid column.

Also, year and title information is available in movie data frame, thus I deleted them.

I made a new column bechdel_pass where 1 means the movie passes bechdel test and 0 means otherwise.

Code

#

bechdel <- bechdel %>% select(-c("imdbid","...1","title"))
colnames(bechdel) <- c("year", "bechdel_rating", "id", "titleid")

summary(bechdel)

      year      bechdel_rating        id          titleid         
 Min.   :1010   Min.   :0.000   Min.   :    1   Length:9802       
 1st Qu.:1988   1st Qu.:1.000   1st Qu.: 2558   Class :character  
 Median :2006   Median :3.000   Median : 5212   Mode  :character  
 Mean   :1996   Mean   :2.133   Mean   : 5225                     
 3rd Qu.:2013   3rd Qu.:3.000   3rd Qu.: 7792                     
 Max.   :2022   Max.   :3.000   Max.   :10641

Code

# Changing the column order
col_order <- c("id", "titleid","year", "bechdel_rating")
bechdel <- bechdel[, col_order]

# Mutate a new column bechdel_pass
bechdel <- bechdel %>% mutate(
  bechdel_pass = case_when(
    bechdel_rating == 3 ~ 1,
    TRUE ~ 0
  )
)

# Change data types of columns
bechdel$bechdel_rating <- factor(bechdel$bechdel_rating, levels = c("0", "1", "2", "3"))
bechdel$bechdel_pass <- factor(bechdel$bechdel_pass, levels = c("0", "1"))

Sense Check

This data looks quite clean, however, I realized that the minimum number of year is 1010, which is weird. “Inazuma Eleven The Movie” was released in 2010, but it seems to have been mistakenly recorded as 1010. So I manually corrected the data.

Code

summary(bechdel)

       id          titleid               year      bechdel_rating bechdel_pass
 Min.   :    1   Length:9802        Min.   :1010   0:1084         0:4208      
 1st Qu.: 2558   Class :character   1st Qu.:1988   1:2124         1:5594      
 Median : 5212   Mode  :character   Median :2006   2:1000                     
 Mean   : 5225                      Mean   :1996   3:5594                     
 3rd Qu.: 7792                      3rd Qu.:2013                              
 Max.   :10641                      Max.   :2022

Code

bechdel %>% 
  filter(year < 1800)

# A tibble: 1 × 5
     id titleid    year bechdel_rating bechdel_pass
  <dbl> <chr>     <dbl> <fct>          <fct>       
1 10556 tt1794796  1010 3              1

Code

bechdel$year[bechdel$year == 1010] <- 2010

Join data sets

We have two data frames movie and bechdel , which have imdb id as a foreign key.

Code

movie_bechdel_join <- inner_join(movie_clean1, bechdel, 
                                 by=c("imdb_id" = "titleid"),
                                 copy = TRUE)

Read in data / Describe data (OMDb)

As we have a joined data, I read in the movie popularity data from OMDb using OMDb API.

omdb_df has 5 columns

imdbID: can be used as a foreign key when joining with other data frames
Director: Director’s name
Metascore: Review scores from 0 to 100 on the metacritic website (https://www.metacritic.com/movie). They are weighted averages. and reviews by certain reviewers are given more weight, however the detail is not revealed.
imdbRating: Review scores from 0 to 10 on IMDb website (https://www.imdb.com/). They are weighted averages.
imdbVotes: Number of votes on IMDb website.

Code

# Generating OMDb dataframe
# imdb_id_list <- c()
# 
# for (i in c(1:length(movie_bechdel_join$imdb_id))) {
#   imdb_id_list[[i]] <- movie_bechdel_join$imdb_id[i]
# }
# 
# omdb_list <- lapply(imdb_id_list, function(movie_id) {
#   actor_vector <- find_by_id(movie_id)
#   actor_vector
# })
# 
# omdb_df <- tibble(place = omdb_list) %>%
#   unnest_wider(place) %>%
#   select(c("imdbID", "Director", "Metascore", "imdbRating", "imdbVotes", "BoxOffice")) %>% unnest() %>% distinct()

#write.csv(omdb_df, "_data/omdb.csv")
omdb_df <- read_csv("_data/omdb.csv")

New names:
Rows: 7729 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): imdbID, Director, Metascore, BoxOffice dbl (3): ...1, imdbRating,
imdbVotes
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

Code

omdb_df$Metascore <- as.numeric(omdb_df$Metascore)

Warning: NAs introduced by coercion

Code

omdb_df$BoxOffice <- omdb_df$BoxOffice %>%
  str_replace_all(",", "") %>% 
  str_extract("[0-9]*(?=$)") %>% 
  as.numeric()

omdb_df <- omdb_df %>% select(-...1)
glimpse(omdb_df)

Rows: 7,729
Columns: 6
$ imdbID     <chr> "tt2614684", "tt0074080", "tt0090556", "tt1022603", "tt1132…
$ Director   <chr> "Yann Demange", "Beverly Sebastian, Ferd Sebastian", "Tom M…
$ Metascore  <dbl> 83, NA, NA, 76, 58, NA, 71, 52, 45, 52, 60, 76, NA, NA, 70,…
$ imdbRating <dbl> 7.2, 5.4, 7.6, 7.7, 6.3, 6.3, 7.4, 6.5, 5.0, 6.5, 5.5, 7.2,…
$ imdbVotes  <dbl> 56942, 1278, 2371, 516486, 2230, 8603, 181857, 74038, 35911…
$ BoxOffice  <dbl> 1270847, NA, 441863, 32391374, 230600, NA, NA, 27766, 9600,…

Now I have the data from OMDb as omdb_df. Let’s join movie_bechdel_join and omdb_df.

Code

data <- inner_join(movie_bechdel_join, omdb_df, 
                                 by=c("imdb_id" = "imdbID"),
                                 copy = TRUE)

Data Analysis and Visualization

Is female representation in movie stories improving over time?

To find out the answer for this questions, I looked at the result of the Bechdel test.

The number of movies that pass the Bechdel test increases over time, especially after around 1980 as the total number of released movies increased.

Code

movie_bechdel_join %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x= "year", y = "Number", title = "The number of movies by Bechdel test ratings", fill = "Bechdel rating")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

However, the total number of released movies is increasing as well. Is the percentage of films that pass the Bechdel test increasing?

In the 1920s, which is 100 years ago, less than 20% of films passed the Bechdel test. There was a big increase in the percentage of Bechdel test-passing movies in the 1930s, however, the percentage of films passing the Bechdel test was stagnant at a bit lower than 50% from the 1930s to the 1950s. From 1950s to 1970s, the representation of women in the film somehow went backward, with less than 45% of films meeting the requirements of the Bechdel test; from the 1970s to the present, the percentage of films that pass the Bechdel test has continued to increase, reaching approximately 70% now. Although the representation of women seems to have improved since 1970, it has not improved dramatically compared to the 1930s and 1940s.

Code

data_with_p <- data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n = n()) %>%
  mutate(percentage = n/sum(n)*100)

`summarise()` has grouped output by 'years'. You can override using the
`.groups` argument.

Code

data_with_p %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col() +
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", font="bold", position = position_stack(vjust = 0.5)) +
  
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x="year", y="percent", title = "% of movies that pass/don't pass Bechdel Test", fill="Bechdel Rating") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(ncol = 2))

Warning in geom_text(aes(label = paste0(round(percentage), "%")), color =
"white", : Ignoring unknown parameters: `font`

Perhaps the percentage of films that pass the Bechdel test has not changed too dramatically compared to 70 years ago. However, if we focus on the movies that do NOT pass the Bechdel test, we can see a different trend happening.

In the 1920s, the majority of films did not feature two significant women; this trend changed greatly in the 1930s, with the percentage of films that did not feature two women decreasing significantly, peaking in the 1960s to about 20%, and then declining.
The percentage of films in which the only conversation between female characters is about men also decreased from 1930 to the 2010s.
Interestingly, however, the percentage of films in which female characters do not speak to each other (dark gray in the figure) has changed very little from the 1930s to the 2010s.

Code

data_with_p %>%
  filter(bechdel_rating!="3") %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col()+
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", font="bold", position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men")) +
  labs(x="year", y="percent", title = "% of movies that do NOT pass Bechdel test by Bechdel test criteria", fill = "Bechdel Rating", subtitle = "% doesn't add up to 100% because it is missing the % of the movies that pass Bechdel test") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))

Warning in geom_text(aes(label = paste0(round(percentage), "%")), color =
"white", : Ignoring unknown parameters: `font`

What category represents women better?

I focused on 4428 movies released between 2000 and 2020 because I want to do category-based analysis of a recent situation, and it’s better to focus on the period where more movies were released to observe the trend.

The degree to which women are portrayed in a film’s story varies widely from genre to genre.

The genre with the highest percentage of films passing the bechdel test is Romance (72%) and the lowest is Western (22%), showing a large difference.

Interestingly, not only the percentage of films that pass the Bechdel test, but also the percentage that meet each of the three criteria for passing the Bechdel test varies widely by genre.

For example, Western, War, and Documentary struggle more than other genres to feature two female characters. (To be precise, the Bechdel Test is for evaluating fictional stories, so it may not be appropriate to evaluate Documentary.)

Code

viz_by_genre <- data %>%
  pivot_longer(cols = c(War:Music),names_to = "genre_name", values_drop_na = TRUE) %>%
  filter(value == 1) 


viz_by_genre_with_p <- viz_by_genre %>%
  group_by(genre_name, years, bechdel_rating) %>%
  dplyr::summarize(n=n()) %>%
  mutate(proportion = n/sum(n),
         percentage=proportion*100)

`summarise()` has grouped output by 'genre_name', 'years'. You can override
using the `.groups` argument.

Code

viz_by_genre_with_p %>%
  filter(years %in% c("2000s", "2010s")) %>%
  group_by(genre_name, bechdel_rating) %>%
  summarise(per=mean(percentage)) %>%
  arrange(desc(bechdel_rating), desc(per)) -> order_genre

`summarise()` has grouped output by 'genre_name'. You can override using the
`.groups` argument.

Code

viz_by_genre_with_p$genre_name <-factor(viz_by_genre_with_p$genre_name, levels = order_genre$genre_name[1:18])

viz_by_genre_with_p %>%
  filter(years=="2000s"|years=="2010s") %>%
  group_by(genre_name, bechdel_rating) %>%
  summarize(n=sum(n)) %>%
  mutate(p=n/sum(n),
         percentage = 100*p) %>%
  ggplot(aes(x=genre_name, y=percentage, fill= bechdel_rating)) +
  geom_col() +
  geom_text(aes(label=paste0(round(percentage), "%")), color="white", position=position_stack(vjust=0.5), size=3)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two women", 
                             "1.Two women", 
                             "2.Two women talk to each other about men", 
                             "3.Two women talk to each other about other than men (Pass)")) +
  labs(x="Genre", y="%", fill="Bechdel rating")+
  coord_flip()  +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))

`summarise()` has grouped output by 'genre_name'. You can override using the
`.groups` argument.

Above we looked at trends over the past 20 years, but what if we look at the historical trends over past 100 years?
Some genres continue to improve their representation of women compared to the past, while others remain the same.
For example, Horror, Adventure, Science Fiction, Animation, and Fantasy have gradually increased the percentage of films that pass the Bechdel test.
Western, History, War, and Crime, Comedy on the other hand, have not changed much in terms of female representation compared to 100 years ago.

Code

viz_by_genre_with_p %>%
  filter(years!="Before 1920") %>%
  ggplot(aes(x=years, y= percentage, fill = bechdel_rating)) + 
  geom_col() +
  facet_wrap(~genre_name, ncol = 5)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x = "year", y = "percent", title = "Proportion of movies that pass / don't pass Bechdel test by categories")+
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2)) +
  scale_x_discrete(breaks = c('1920s', '1960s', '2010s'),
                   labels = c('1920', '1960', '2010'))

Do movies with good female representation succeed in gaining more popularity on online review site?

Again, I focused on the movies released after 2000 to see the recent trends.

I compared Metascore and IMDb ratings of the films that pass and don’t pass the Bechdel test.

The movies that don’t pass the Bechdel test seem to score higher on Metascore, however, the difference is quite small, and there seems to be no relationship between whether a film passes the Bechdel test and its reputation (Metascore).

Code

data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=Metascore)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test")

Warning: Removed 915 rows containing non-finite values (`stat_boxplot()`).

T-test shows that there is no statistically significant difference in Metascores of the movies that pass and that don’t pass Bechdel test.

Code

ttest_pass <- data %>% 
  filter(bechdel_pass == "1" & lubridate::year(release_date) > 2000) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

ttest_not_pass <- data %>%
  filter(bechdel_pass == "0" & lubridate::year(release_date) > 2000) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

t.test(ttest_pass$Metascore, ttest_not_pass$Metascore)


    Welch Two Sample t-test

data:  ttest_pass$Metascore and ttest_not_pass$Metascore
t = 0.37842, df = 2936, p-value = 0.7051
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9448254  1.3967391
sample estimates:
mean of x mean of y 
 57.81359  57.58764

Then, I compared the IMDb scores of the movies that pass and don’t pass Bechdel test. Those that don’t pass Bechdel test seem to score a bit higher.

Code

data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=imdbRating)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test", y="IMDb Rating")

Warning: Removed 8 rows containing non-finite values (`stat_boxplot()`).

T-test shows that the movies that do not meet Bechdel test criteria score statistically higher than those that do.

Code

t.test(ttest_pass$imdbRating, ttest_not_pass$imdbRating)


    Welch Two Sample t-test

data:  ttest_pass$imdbRating and ttest_not_pass$imdbRating
t = -5.6621, df = 3804.6, p-value = 1.605e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2230059 -0.1082905
sample estimates:
mean of x mean of y 
 6.383584  6.549233

Do movies with good female representation succeed in making more money?

Code

data %>%
  filter(lubridate::year(release_date) > 2000) %>%
  ggplot(aes(x=bechdel_pass, y=BoxOffice)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel"))

Warning: Removed 1115 rows containing non-finite values (`stat_boxplot()`).

Even though the average box office amount of not-Bechdel-passing movies is higher than that of Bechdel-passing movies, T-test shows that the difference is not significant.

Code

t.test(ttest_pass$BoxOffice, ttest_not_pass$BoxOffice)


    Welch Two Sample t-test

data:  ttest_pass$BoxOffice and ttest_not_pass$BoxOffice
t = -1.5212, df = 3042.1, p-value = 0.1283
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9207124  1162341
sample estimates:
mean of x mean of y 
 43773430  47795821

Conclusion

The representation of women in film stories stagnated from the 1930s to the 1970s, but has gradually improved, especially since the 1970s.
However, this is not true for all genres: the percentage of films that pass the Bechdel test varies widely from the 20% range (Western) to the 70% range (Romance), depending on the genre. Some genres (Horror, Adventure, Science Fiction, Fantasy) have consistently increased the percentage of films that pass the Bechdel test, while others (War, Crime, History, Western) have remained stagnant. The percentage of films that pass the test is increasing constantly.

It turns out that whether or not a film passes the Bechdel test has little impact on its success (i.e., popularity and revenue). However, the reviews on the imdb website are higher for films that do not pass the Bechdel Test.

This analysis has so far shown how women’s representation has (or has not) improved, and which genres have been particularly well represented by women, but it has not explained why this has happened. I would like to conduct an analysis that can answer the question “Why?”. For example, I would like to test the hypothesis that the increase in female representation in fantasy and horror is due to the increase in female audiences.
Also, I would like to further investigate the relationship between female representation and film success (popularity and revenue). In particular, I am interested in how the gender of the audience affects their evaluation of films in which women are represented and those in which they are not.

Reflection

One of the most difficult parts of this analysis was finding the right data. Even though there are many movie database, the information was not complete or not clear enough to be used. Thus, I ended up using two different movie databases and one Bechdel test database because one movie database (The Movie Dataset) didn’t have a clear definition of review ratings and popularity.

Another difficulty that I faced is that I could not find a good dataset of the gender of directors/writers of movies. Even though I found “credit” dataset that included the name and the gender of people who were involved in the movie production, the information on director was limited and only a few thousand movies had director information. Originally, I wanted to visualize the relationship between the gender of movie directors and the female representation, however, I gave up doing that for this project due to the lack of an appropriate dataset.

Contact

The dataset that I used for this analysis can be found on my GitHub (https://github.com/Enagai-nagai/601_Fall_2022/tree/template/posts/_data).

I’m interested in analyzing and visualizing data related to the entertainment industry, behavior on digital platforms, and gender issues.

If you have any comments or questions, please contact me through Email.

enagai ★ umass.edu (please replace ★ with @)
nagainagai.e ★ gmail.com (please replace ★ with @)

Bibliography and References

I used R and RStudio to realize this analysis.

The below is the source of data sets:

Bechdel API: https://bechdeltest.com/api/v1/doc
The Movie Dataset from Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
OMDb API: https://www.omdbapi.com/

The below is the source of information:

Women and Hollywood. (2022, March 15) Study: Women made up 34% of speaking roles in 2021’s top films, majority of those characters were white. . Retrieved December 13, 2022, from https://womenandhollywood.com/study-women-made-up-34-of-speaking-roles-in-2021s-top-films-majority-of-those-characters-were-white/#:~:text=In%202021’s%20top%20films%2C%20females,and%2037%20percent%20in%202019.
Lauzen, M. M. (2021). (rep.). It’s a Man’s (Celluloid) World, Even in a Pandemic Year: Portrayals of Female Characters in the Top U.S. Films of 2021. San Diego, California: San Diego State University and The Center for the Study of Women in Television and Film.

Introduction

Read in data

(1) “The Movie Data set” from Kaggle :

(2) Bechdel test API:

(3) Open Movie Database (OMDb):

Describe data

(1) “The Movie Data set”

Data Frame Summary

movie

(2) Bechdel test

Data Frame Summary

bechdel

Tidy data

(1) TMDb data

Checking missing values

Deleting unnecessary columns

Cleaning data in JSON nested list format

(1). Tidying Genres

(2). Tyding Production_countries

(3). Tyding production_companies & spoken_languages

Adding a new column

Data Frame Summary

movie_clean1

(2) Bechdel data

Removing and Mutating Columns

Sense Check

Join data sets

Read in data / Describe data (OMDb)

Data Analysis and Visualization

Is female representation in movie stories improving over time?

What category represents women better?

Do movies with good female representation succeed in gaining more popularity on online review site?

Do movies with good female representation succeed in making more money?

Conclusion

Reflection

Contact

Bibliography and References

(1). Tidying `Genres`

(2). Tyding `Production_countries`

(3). Tyding `production_companies` & `spoken_languages`