Assignment3 Erika Nagai

hw2

hw3

movie

gender

bechdel test

female representation

Erika Nagai

Female representation in movie stories

Author

Erika Nagai

Published

December 20, 2022

Introduction

Even though half of moviegoers in the United States is women, their story is not represented as much as that of men. Only 35% of the main characters in the best movies of 2021 were female, and male characters outnumbered female characters in 85% of the films. (Women and Hollywood, 2022) Also lacking are women behind the scenes in the film business. In 2021, just 12% of the top 100 films are directed by women.(Lauzen, n.d.)

In this project, I focused on analyzing female representation in the movie stories rather than in their production. I used the data from three major movie databases, The Movie Database (TMDb), MovieLens, Open Movie Database (OMDb) and Bechdel Test, one of the most common criterion to measure female presentation.

I would like to answer the following questions in this analysis:

Is female representation in movie stories improving over time?
What genres are doing better than others in terms of female representation?
Are movies with good female representation more popular?
Do movies with good female representation more money?

This analysis intends to document all the data cleaning process, which is quite long. If you are curious only about the analysis results, please skip to Data Analysis and Visualization part.

# install libraries

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2

Warning: package 'readr' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(summarytools)


Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view

library(plyr)

------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------

Attaching package: 'plyr'

The following objects are masked from 'package:dplyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from 'package:purrr':

    compact

library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(ggridges)

Read in data

For this analysis, I used the following different datasets.

(1) “The Movie Dataset” from Kaggle :

This dataset was obtained from Kaggle “The Movie Dataset”(https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv).

#(1) movies_metadata.csv obtained from Kaggle
movie = read_csv("_data/movies_metadata.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 45466 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
dbl   (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
lgl   (2): adult, video
date  (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(2) Bechdel test API:

I generated the data by using bechdel test API https://bechdeltest.com/api/v1/doc.

#(2) bechdel test obtained by using bechdel API

library(rjson)
library(jsonlite)

Warning: package 'jsonlite' was built under R version 4.2.2


Attaching package: 'jsonlite'

The following objects are masked from 'package:rjson':

    fromJSON, toJSON

The following object is masked from 'package:purrr':

    flatten

#json_file <- "http://bechdeltest.com/api/v1/getAllMovies"
#bechdel <- read_json(path = json_file, simplifyVector = TRUE)
#bechdel$titleId <- paste("tt",bechdel$imdbid, sep = "")

#write.csv(bechdel, file = "_data/bechdel.csv")
bechdel <- read_csv("_data/bechdel.csv")

New names:
• `` -> `...1`

Rows: 9802 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): imdbid, title, titleId
dbl (4): ...1, year, rating, id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(3) Open Movie Database (OMDb):

This dataset was generated by using Open Movie Database (OMDb) API https://www.omdbapi.com/. OMDb provides the detail information such as director, writer, nominated award, the number/score of reviews on imdb etc… if you provide the name or imdb id of movies. I decided to use this database because it gives me the data related to movies popularity and financial success of the films, which “The Movie Dataset” doesn’t include.

OMDb doesn’t give you a list of all movies registered on it. Instead, you need to provide the exact movie title or the imdb id to get the list of information. Thus, I will use this API once I have the data that join movies_metadata.csv and bechdel test. (Please refer to “Read in data / Describe data (OMDb)” for this process.)

Describe data

(1) TMDb

This data was originally created from The Movie Database (https://www.themoviedb.org/) and MovieLens (https://movielens.org/).This movie dataset was generated by Movielens, a (non-profit) movie review website (https://movielens.org/), and was obtained from the following Kaggle link. (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv)

The movie dataset contains 45466 movies with a released date between December 9th 1874 and December 16th 2020.

The data includes the general information of movies, such as genres, revenue, runtime, languages, status (released/in production etc…).

print(summarytools::dfSummary(movie),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

movie

Dimensions: 45466 x 24
Duplicates: 17

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

adult [logical]

1. FALSE

2. TRUE

45454	(	100.0%	)
9	(	0.0%	)

3 (0.0%)

belongs_to_collection [character]

1. {'id': 415931, 'name': 'T

2. {'id': 421566, 'name': 'T

3. {'id': 645, 'name': 'Jame

4. {'id': 96887, 'name': 'Za

5. {'id': 37261, 'name': 'Th

6. {'id': 34055, 'name': 'Po

7. {'id': 413661, 'name': 'C

8. {'id': 374509, 'name': 'G

9. {'id': 148324, 'name': 'U

10. {'id': 38451, 'name': 'Ch

[ 1688 others ]

29	(	0.6%	)
27	(	0.6%	)
26	(	0.6%	)
26	(	0.6%	)
25	(	0.6%	)
22	(	0.5%	)
21	(	0.5%	)
16	(	0.4%	)
15	(	0.3%	)
15	(	0.3%	)
4272	(	95.1%	)

40972 (90.1%)

budget [numeric]

Mean (sd) : 4224579 (17424133)

min ≤ med ≤ max:

0 ≤ 0 ≤ 3.8e+08

IQR (CV) : 0 (4.1)

1223 distinct values

3 (0.0%)

genres [character]

1. [{'id': 18, 'name': 'Dram

2. [{'id': 35, 'name': 'Come

3. [{'id': 99, 'name': 'Docu

4. []

5. [{'id': 18, 'name': 'Dram

6. [{'id': 35, 'name': 'Come

7. [{'id': 27, 'name': 'Horr

8. [{'id': 35, 'name': 'Come

9. [{'id': 35, 'name': 'Come

10. [{'id': 18, 'name': 'Dram

[ 4059 others ]

5000	(	11.0%	)
3621	(	8.0%	)
2723	(	6.0%	)
2442	(	5.4%	)
1301	(	2.9%	)
1135	(	2.5%	)
974	(	2.1%	)
930	(	2.0%	)
593	(	1.3%	)
532	(	1.2%	)
26215	(	57.7%	)

0 (0.0%)

homepage [character]

1. http://www.georgecarlin.c

2. http://www.wernerherzog.c

3. http://breakblade.jp/

4. http://phantasm.com

5. http://www.crownintlpictu

6. http://www.crownintlpictu

7. http://www.crownintlpictu

8. http://www.kungfupanda.co

9. http://www.missionimpossi

10. http://www.thehungergames

[ 7663 others ]

12	(	0.2%	)
7	(	0.1%	)
6	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
4	(	0.1%	)
7729	(	99.3%	)

37684 (82.9%)

id [numeric]

Mean (sd) : 108359.9 (112460.7)

min ≤ med ≤ max:

2 ≤ 60003 ≤ 469172

IQR (CV) : 130878.5 (1)

45433 distinct values

3 (0.0%)

imdb_id [character]

1. 0

2. tt1180333

3. tt0022537

4. tt0022879

5. tt0046468

6. tt0062229

7. tt0067306

8. tt0080000

9. tt0082992

10. tt0084387

[ 45407 others ]

3	(	0.0%	)
3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
45427	(	100.0%	)

17 (0.0%)

original_language [character]

1. en

2. fr

3. it

4. ja

5. de

6. es

7. ru

8. hi

9. ko

10. zh

[ 82 others ]

32269	(	71.0%	)
2438	(	5.4%	)
1529	(	3.4%	)
1350	(	3.0%	)
1080	(	2.4%	)
994	(	2.2%	)
826	(	1.8%	)
508	(	1.1%	)
444	(	1.0%	)
409	(	0.9%	)
3608	(	7.9%	)

11 (0.0%)

original_title [character]

1. Alice in Wonderland

2. Hamlet

3. A Christmas Carol

4. Cinderella

5. Les Misérables

6. Macbeth

7. The Three Musketeers

8. Blackout

9. Frankenstein

10. Heidi

[ 43363 others ]

8	(	0.0%	)
8	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
45397	(	99.8%	)

0 (0.0%)

overview [character]

1. No overview found.

2. No Overview

3. A few funny little novels

4. Adaptation of the Jane Au

5. King Lear, old and tired,

6. No movie overview availab

7. Recovering from a nail gu

8. Released

9. A film by Jem Cohen

10. A former aristocrat Ippol

[ 44296 others ]

133	(	0.3%	)
7	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
44345	(	99.6%	)

959 (2.1%)

popularity [numeric]

Mean (sd) : 2.9 (6)

min ≤ med ≤ max:

0 ≤ 1.1 ≤ 547.5

IQR (CV) : 3.3 (2.1)

43757 distinct values

6 (0.0%)

poster_path [character]

1. /5D7UBSEgdyONE6Lql6xS7s6O

2. /2kslZXOaW0HmnGuVPCnQlCdX

3. /qW1oQlOHizRHXZQrpkimYr0o

4. /8VSZ9coCzxOCW2wE2Qene1H1

5. /cdwVC18URfEdQjjxqJyRMoGD

6. /2ngnUQX9Abkesfq72uvBF3uj

7. /2trk2fape4S0CYnRg38kNVe8

8. /4J6Ai4C5YRgfRUTlirrJ7Qsm

9. /5GasjPRAy5rlEyDOH7MeOyxy

10. /5ILjS6XB5deiHop8SXPsYxXW

[ 45014 others ]

5	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
45051	(	99.9%	)

386 (0.8%)

production_companies [character]

1. []

2. [{'name': 'Metro-Goldwyn-

3. [{'name': 'Warner Bros.',

4. [{'name': 'Paramount Pict

5. [{'name': 'Twentieth Cent

6. [{'name': 'Universal Pict

7. [{'name': 'RKO Radio Pict

8. [{'name': 'Columbia Pictu

9. [{'name': 'Columbia Pictu

10. [{'name': 'Mosfilm', 'id'

[ 22698 others ]

11875	(	26.1%	)
742	(	1.6%	)
540	(	1.2%	)
505	(	1.1%	)
439	(	1.0%	)
320	(	0.7%	)
247	(	0.5%	)
207	(	0.5%	)
146	(	0.3%	)
145	(	0.3%	)
30297	(	66.6%	)

3 (0.0%)

production_countries [character]

1. [{'iso_3166_1': 'US', 'na

2. []

3. [{'iso_3166_1': 'GB', 'na

4. [{'iso_3166_1': 'FR', 'na

5. [{'iso_3166_1': 'JP', 'na

6. [{'iso_3166_1': 'IT', 'na

7. [{'iso_3166_1': 'CA', 'na

8. [{'iso_3166_1': 'DE', 'na

9. [{'iso_3166_1': 'IN', 'na

10. [{'iso_3166_1': 'RU', 'na

[ 2383 others ]

17851	(	39.3%	)
6282	(	13.8%	)
2238	(	4.9%	)
1654	(	3.6%	)
1356	(	3.0%	)
1030	(	2.3%	)
840	(	1.8%	)
749	(	1.6%	)
735	(	1.6%	)
735	(	1.6%	)
11993	(	26.4%	)

3 (0.0%)

release_date [Date]

min : 1874-12-09

med : 2001-08-30

max : 2020-12-16

range : 146y 0m 7d

17333 distinct values

90 (0.2%)

revenue [numeric]

Mean (sd) : 11209349 (64332247)

min ≤ med ≤ max:

0 ≤ 0 ≤ 2787965087

IQR (CV) : 0 (5.7)

6863 distinct values

6 (0.0%)

runtime [numeric]

Mean (sd) : 94.1 (38.4)

min ≤ med ≤ max:

0 ≤ 95 ≤ 1256

IQR (CV) : 22 (0.4)

353 distinct values

263 (0.6%)

spoken_languages [character]

1. [{'iso_639_1': 'en', 'nam

2. []

3. [{'iso_639_1': 'fr', 'nam

4. [{'iso_639_1': 'ja', 'nam

5. [{'iso_639_1': 'it', 'nam

6. [{'iso_639_1': 'es', 'nam

7. [{'iso_639_1': 'ru', 'nam

8. [{'iso_639_1': 'de', 'nam

9. [{'iso_639_1': 'en', 'nam

10. [{'iso_639_1': 'en', 'nam

[ 1921 others ]

22395	(	49.3%	)
3829	(	8.4%	)
1853	(	4.1%	)
1289	(	2.8%	)
1218	(	2.7%	)
902	(	2.0%	)
807	(	1.8%	)
762	(	1.7%	)
681	(	1.5%	)
572	(	1.3%	)
11152	(	24.5%	)

6 (0.0%)

status [character]

1. Canceled

2. In Production

3. Planned

4. Post Production

5. Released

6. Rumored

2	(	0.0%	)
20	(	0.0%	)
15	(	0.0%	)
98	(	0.2%	)
45014	(	99.2%	)
230	(	0.5%	)

87 (0.2%)

tagline [character]

1. Based on a true story.

2. -

3. Be careful what you wish

4. Trust no one.

5. A Love Story

6. Classic Albums

7. Documentary

8. Drama

9. How far would you go?

10. Know Your Enemy

[ 20272 others ]

7	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
20374	(	99.8%	)

25055 (55.1%)

title [character]

1. Cinderella

2. Alice in Wonderland

3. Hamlet

4. Beauty and the Beast

5. Les Misérables

6. A Christmas Carol

7. Blackout

8. The Three Musketeers

9. Treasure Island

10. Aftermath

[ 42267 others ]

11	(	0.0%	)
9	(	0.0%	)
9	(	0.0%	)
8	(	0.0%	)
8	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
6	(	0.0%	)
45381	(	99.8%	)

6 (0.0%)

video [logical]

1. FALSE

2. TRUE

45367	(	99.8%	)
93	(	0.2%	)

6 (0.0%)

vote_average [numeric]

Mean (sd) : 5.6 (1.9)

min ≤ med ≤ max:

0 ≤ 6 ≤ 10

IQR (CV) : 1.8 (0.3)

92 distinct values

6 (0.0%)

vote_count [numeric]

Mean (sd) : 109.9 (491.3)

min ≤ med ≤ max:

0 ≤ 10 ≤ 14075

IQR (CV) : 31 (4.5)

1820 distinct values

6 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

(2) Bechdel test

This data (bechdel) dataset documents the Bechdel test rating of 9630 movies released between 1874 and 2022.

According to “Merriam-Webster”, Bechdel test is “a set of criteria used as a test to evaluate a work of fiction (such as a film) on the basis of its inclusion and representation of female characters” (https://www.merriam-webster.com/dictionary/Bechdel%20Test)

A movie passes Bechdel test when it has at least two female featured characters that talk each other about other than a man (men).

In bechdel, Bechdel test rating is registered in a following manner.

0 ~ No two female characters

1 ~ Two female characters but don’t talk each other

2 ~ Two female characters talk each other about a man (men)

3 ~ Passes bechdel test: Two female characters talk each other about other than a man (men)

Apart from Bechdel ratings, this dataset (bechdel) contains the released year, title of movies, and id. id that starts with “tt” followed by 7 digits is imdb id and can work as a foreign key when joining this data with movie dataset.

print(summarytools::dfSummary(bechdel),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

bechdel

Dimensions: 9802 x 7
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

...1 [numeric]

Mean (sd) : 4901.5 (2829.7)

min ≤ med ≤ max:

1 ≤ 4901.5 ≤ 9802

IQR (CV) : 4900.5 (0.6)

9802 distinct values

0 (0.0%)

year [numeric]

Mean (sd) : 1996.2 (27)

min ≤ med ≤ max:

1010 ≤ 2006 ≤ 2022

IQR (CV) : 25 (0)

142 distinct values

0 (0.0%)

rating [numeric]

Mean (sd) : 2.1 (1.1)

min ≤ med ≤ max:

0 ≤ 3 ≤ 3

IQR (CV) : 2 (0.5)

0	:	1084	(	11.1%	)
1	:	2124	(	21.7%	)
2	:	1000	(	10.2%	)
3	:	5594	(	57.1%	)

0 (0.0%)

id [numeric]

Mean (sd) : 5224.9 (3052.9)

min ≤ med ≤ max:

1 ≤ 5211.5 ≤ 10641

IQR (CV) : 5233.5 (0.6)

9802 distinct values

0 (0.0%)

imdbid [character]

1. 0035279
2. 0086425
3. 0117056
4. 2043900
5. 2457282
6. 0000001
7. 0000002
8. 0000003
9. 0000004
10. 0000005
[ 9784 others ]

2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
9784	(	99.8%	)

3 (0.0%)

title [character]

1. Cinderella

2. Dracula

3. Little Women

4. Pride and Prejudice

5. Robin Hood

6. Shelter

7. A Star Is Born

8. Alice in Wonderland

9. Anna Karenina

10. Annie

[ 9547 others ]

5	(	0.1%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
9765	(	99.6%	)

0 (0.0%)

titleId [character]

1. tt

2. tt0035279

3. tt0086425

4. tt0117056

5. tt2043900

6. tt2457282

7. tt0000001

8. tt0000002

9. tt0000003

10. tt0000004

[ 9785 others ]

3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
9785	(	99.8%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

Tidy data

(1) TMDb data

Checking missing values

First of all, I took a look at missing values (NA values) in the data.

movie %>% select(everything()) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  t()

Warning: `funs()` was deprecated in dplyr 0.8.0.
ℹ Please use a list of either functions or lambdas:

# Simple named list: list(mean = mean, median = median)

# Auto named with `tibble::lst()`: tibble::lst(mean, median)

# Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

                       [,1]
adult                     3
belongs_to_collection 40972
budget                    3
genres                    0
homepage              37684
id                        3
imdb_id                  17
original_language        11
original_title            0
overview                959
popularity                6
poster_path             386
production_companies      3
production_countries      3
release_date             90
revenue                   6
runtime                 263
spoken_languages          6
status                   87
tagline               25055
title                     6
video                     6
vote_average              6
vote_count                6

Deleting unnecessary columns

I deleted several columns, which seem to irrelevant for this analysis or have too many missing values.
- adult: This information doesn’t add any significant meaning as almost all of observations have FALSE value for this column
- homepage: URL of homepage. Unnecessary for this analysis
- belong_to_collection: Has 40000+ missing values - poster_path
- video: This information doesn’t add any significant meaning as almost all of observations have FALSE value for this column

- popularity: This information may be interesting, however, it is not clear how this popularity is measured or where it was generated. Therefore, I decided to use popularity data from OMDb.

- vote_average: Same as popularity

-vote_count: Same as popularity

Then, I changed the order of the columns. Each row represents a movie so the movie title should be the first column.

movie <- movie %>% select(-c("adult", "homepage", "belongs_to_collection", "poster_path", "video", "popularity", "vote_average", "vote_count"))

col_order <- c("title", "original_title", "tagline", "imdb_id", "id", "overview", "production_companies", "production_countries", "status", "release_date", "runtime", "revenue", "budget", "original_language", "spoken_languages", "genres")
movie <- movie[, col_order]

colnames(movie)[5] <- "movielens_id"

Cleaning data in JSON nested list format

The values in some certain columns such as “genres”, “production_companies”, “production_countries”,“spoken_languages” are in a JSON list format for example:

” [{‘id’: XXXX, ‘content(name/genre/title)’: XXX}, {‘id’: XXXX, ’content(name/genre/title): XXX}] ”

movie %>% select(c("genres", "production_companies", "production_countries", "spoken_languages"))

# A tibble: 45,466 × 4
   genres                                                produ…¹ produ…² spoke…³
   <chr>                                                 <chr>   <chr>   <chr>  
 1 [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name':… [{'nam… [{'iso… [{'iso…
 2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name':… [{'nam… [{'iso… [{'iso…
 3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name'… [{'nam… [{'iso… [{'iso…
 4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'D… [{'nam… [{'iso… [{'iso…
 5 [{'id': 35, 'name': 'Comedy'}]                        [{'nam… [{'iso… [{'iso…
 6 [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'C… [{'nam… [{'iso… [{'iso…
 7 [{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name':… [{'nam… [{'iso… [{'iso…
 8 [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'A… [{'nam… [{'iso… [{'iso…
 9 [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'A… [{'nam… [{'iso… [{'iso…
10 [{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name':… [{'nam… [{'iso… [{'iso…
# … with 45,456 more rows, and abbreviated variable names
#   ¹production_companies, ²production_countries, ³spoken_languages

# These columns include [] {} and ' in their values so I made a function remove_simbols that removes these unnecessary symbols.
remove_symbols <- function(x) {
  removed_x <- str_remove_all(x, "\\{|\\}|\\[|\\}|\\]|'|,|id|name|:| ")
  return(removed_x)
}

(1). Tidying `Genres`

Each movie contains information on one or more (up to 8) genres in a single column. To organize the information, the multiple genres contained in genre are split by “],” so that each genre is contained in eight new columns, genre1-genre8. Then, I removed unnecessary symbols such as ,.

# Separating the `genre` by "},"
movie <- movie %>% 
  separate(genres, c("genre1", "genre2", "genre3", "genre4", "genre5", "genre6", "genre7", "genre8"), "\\},", remove = FALSE)

Warning: Expected 8 pieces. Missing pieces filled with `NA` in 45463 rows [1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

# Remove unnecessary symbols from each columns

movie$genre1 <- remove_symbols(movie$genre1)
movie$genre2 <- remove_symbols(movie$genre2)
movie$genre3 <- remove_symbols(movie$genre3)
movie$genre4 <- remove_symbols(movie$genre4)
movie$genre5 <- remove_symbols(movie$genre5)
movie$genre6 <- remove_symbols(movie$genre6)
movie$genre7 <- remove_symbols(movie$genre7)
movie$genre8 <- remove_symbols(movie$genre8)

movie %>% 
  select(matches("[1-9]"))

# A tibble: 45,466 × 8
   genre1       genre2       genre3       genre4     genre5 genre6 genre7 genre8
   <chr>        <chr>        <chr>        <chr>      <chr>  <chr>  <chr>  <chr> 
 1 16Animation  35Comedy     10751Family  <NA>       <NA>   <NA>   <NA>   <NA>  
 2 12Adventure  14Fantasy    10751Family  <NA>       <NA>   <NA>   <NA>   <NA>  
 3 10749Romance 35Comedy     <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 4 35Comedy     18Drama      10749Romance <NA>       <NA>   <NA>   <NA>   <NA>  
 5 35Comedy     <NA>         <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 6 28Action     80Crime      18Drama      53Thriller <NA>   <NA>   <NA>   <NA>  
 7 35Comedy     10749Romance <NA>         <NA>       <NA>   <NA>   <NA>   <NA>  
 8 28Action     12Adventure  18Drama      10751Fami… <NA>   <NA>   <NA>   <NA>  
 9 28Action     12Adventure  53Thriller   <NA>       <NA>   <NA>   <NA>   <NA>  
10 12Adventure  28Action     53Thriller   <NA>       <NA>   <NA>   <NA>   <NA>  
# … with 45,456 more rows

I managed to separate the genre information above. Now all values in genre1-genre8 has a value in the format of “number name of genre” (example: 35Comedy)

However, there are 2,445 observations whose genre value doesn’t fit the above format. Most of them don’t have any genre assigned so their genre1 values are empty or blank, which means they don’t have any genre assigned to them.

movie %>% filter(!str_detect(genre1, "^[0-9]")) %>% select(starts_with("genre"))

# A tibble: 2,445 × 9
   genres genre1 genre2 genre3 genre4 genre5 genre6 genre7 genre8
   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 2 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 3 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 4 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 5 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 6 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 7 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 8 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 9 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
10 []     ""     <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
# … with 2,435 more rows

However, looking at them closely, there are three observations whose genre value that is NOT blank but have strange strings.

The values in these observations such as “Carousel Production” and “Aniplex” don’t look like a name of genres but that of production studios. Also their original_title values don’t look like a title of movies but information of languages. It seems like these observations were not correctly read in because their values in other columns look weird.

Since these observations are not reliable, I decided to delete them from this dataset.

movie %>% filter(!str_detect(genre1, "^[0-9]") &!str_detect(genre1, "^[ \t\n]*$")) %>% #^[ \t\n]*$ is a regular expression for blank.
  select(c(original_title, title, production_countries, genre1:genre8))

# A tibble: 3 × 11
  origin…¹ title produ…² genre1 genre2 genre3 genre4 genre5 genre6 genre7 genre8
  <chr>    <chr> <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 [{'iso_… <NA>  6.0     Carou… Visio… Teles… <NA>   <NA>   <NA>   <NA>   <NA>  
2 [{'iso_… <NA>  7.0     Anipl… GoHan… BROST… Mardo… Senta… <NA>   <NA>   <NA>  
3 [{'iso_… <NA>  4.3     Odyss… Pulse… Rogue… TheCa… <NA>   <NA>   <NA>   <NA>  
# … with abbreviated variable names ¹original_title, ²production_countries

movie_clean1 <- movie %>% filter(!str_detect(genre1, "^[A-Z]"))

Genre information is more organized but is still not easy to be analyzed. Thus, I decided to make dummy variables of each genre.

For example, if movie A is categorized as “comedy” and “adventure”, the line of movie A should have 1 in “comedy” column and “adventure” respectively and 0 in columns of other genres.

#https://community.rstudio.com/t/creating-dummy-columns-based-on-multiple-columns/58145/3
movie_clean1 <- movie_clean1 %>% 
  pivot_longer(cols = matches("genre[1-9]")) %>%
  add_column(count = 1) %>%
  arrange(value) %>%
  filter(str_detect(value, "^[0-9]")) %>%
  mutate(value1 = str_replace_all(value, "[0-9]+", "")) %>%
  select(-c(value, name)) %>%
  arrange(title) %>%
  pivot_wider(
    names_from = value1, 
    values_from = count, 
    values_fill = list(count=0),
    values_fn = list(count = mean)) %>%
  arrange(title)

# I also deleted genre "TVMovie" and "Foreign" because they're more format or origin country information rather than genre.
movie_clean1 <- movie_clean1 %>% select(-c("TVMovie","Foreign"))

(2). Tyding `Production_countries`

The values in production_countries are written in the following way. >[{‘iso_3166_1’: ‘abbreviation of country’,‘name’: ‘full country name’}]

When there are more than one country in the value, I took the first country in account for this analysis.

movie_clean1$production_countries <- remove_symbols(movie_clean1$production_countries)
movie_clean1$production_countries <- str_extract(movie_clean1$production_countries,"(?<=_1)\\w{2}")

movie_clean1 %>% select(production_countries)

# A tibble: 42,991 × 1
   production_countries
   <chr>               
 1 GB                  
 2 IN                  
 3 US                  
 4 <NA>                
 5 US                  
 6 US                  
 7 FR                  
 8 JP                  
 9 <NA>                
10 US                  
# … with 42,981 more rows

(3). Tyding `production_companies` & `spoken_languages`

The values in both of production_companies and spoken_languages are in the following format.

> [{‘name’: ‘XXXX’, ‘id’: —}] > [{‘iso_639_1’:‘XXXX’, ‘name’:“—-}]

I only need the information that is written as XXXX in this format.

movie_clean1$production_companies <-str_remove_all(movie_clean1$production_companies, "'id': [0-9]*|'name':|\\[|\\]|'|\\{|,") %>%
  str_replace_all("\\},", ",") %>%
  str_replace_all(" \\} ", ",")%>%
  str_remove(" \\}")

movie_clean1$spoken_languages <-str_remove_all(movie_clean1$spoken_languages, "'iso_639_1':|'name': '\\w+'|\\[|\\]|'|\\{|,") %>%
  str_replace_all(" \\} ", ",") %>%
  str_remove_all(" \\}")

movie_clean1 %>% select(production_companies, spoken_languages)

# A tibble: 42,991 × 2
   production_companies                                                  spoke…¹
   <chr>                                                                 <chr>  
 1 " Screen Yorkshire, British Film Institute (BFI), Creative Scotland,… " en"  
 2 ""                                                                    " hi"  
 3 " Sebastian Films Limited"                                            " en"  
 4 " Lorimar Pictures"                                                   " en"  
 5 " Lone Star Production, Monogram Pictures"                            " en, …
 6 ""                                                                    " en"  
 7 " Barnholtz Entertainment"                                            " en, …
 8 ""                                                                    " zh"  
 9 ""                                                                    " fr"  
10 " Disney Channel"                                                     " en"  
# … with 42,981 more rows, and abbreviated variable name ¹spoken_languages

Adding a new column

A new column years was created to group years by decade.

movie_clean1 <- movie_clean1 %>% 
  mutate(
    years = case_when(
    lubridate::year(release_date) < 1920 ~ "Before 1920",
    lubridate::year(release_date) >= 1920 & lubridate::year(release_date) < 1930 ~ "1920s",
    lubridate::year(release_date) >= 1930 & lubridate::year(release_date) < 1940 ~ "1930s",
    lubridate::year(release_date) >= 1940 & lubridate::year(release_date) < 1950 ~ "1940s",
    lubridate::year(release_date) >= 1950 & lubridate::year(release_date) < 1960 ~ "1950s",
    lubridate::year(release_date) >= 1960 & lubridate::year(release_date) < 1970 ~ "1960s",
    lubridate::year(release_date) >= 1970 & lubridate::year(release_date) < 1980 ~ "1970s",
    lubridate::year(release_date) >= 1980 & lubridate::year(release_date) < 1990 ~ "1990s",
    lubridate::year(release_date) >= 1990 & lubridate::year(release_date) < 2000 ~ "1990s",
    lubridate::year(release_date) >= 2000 & lubridate::year(release_date) < 2010 ~ "2000s",
    TRUE ~ "2010s"
  )
  )

Now, the dataset is cleaner. Look at the summary again.

print(summarytools::dfSummary(movie_clean1),
      varnumbers = FALSE,
      plain.ascii  = FALSE,
      style        = "grid",
      graph.magnif = 0.80,
      valid.col    = FALSE,
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

movie_clean1

Dimensions: 42991 x 35
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

title [character]

1. Cinderella

2. Alice in Wonderland

3. Hamlet

4. Les Misérables

5. A Christmas Carol

6. Beauty and the Beast

7. The Three Musketeers

8. Treasure Island

9. Aftermath

10. Countdown

[ 40025 others ]

10	(	0.0%	)
9	(	0.0%	)
9	(	0.0%	)
8	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
42912	(	99.8%	)

3 (0.0%)

original_title [character]

1. Alice in Wonderland

2. Hamlet

3. A Christmas Carol

4. Les Misérables

5. Macbeth

6. The Three Musketeers

7. Cinderella

8. Frankenstein

9. Heidi

10. The Hound of the Baskervi

[ 41043 others ]

8	(	0.0%	)
8	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
7	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
6	(	0.0%	)
42923	(	99.8%	)

0 (0.0%)

tagline [character]

1. Based on a true story.

2. -

3. Be careful what you wish

4. Trust no one.

5. A Love Story

6. Classic Albums

7. Drama

8. How far would you go?

9. Know Your Enemy

10. Some doors should never b

[ 20008 others ]

7	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
4	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
3	(	0.0%	)
20097	(	99.8%	)

22857 (53.2%)

imdb_id [character]

1. tt0000001

2. tt0000003

3. tt0000005

4. tt0000008

5. tt0000010

6. tt0000012

7. tt0000014

8. tt0000023

9. tt0000029

10. tt0000070

[ 42969 others ]

1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
42969	(	100.0%	)

12 (0.0%)

movielens_id [numeric]

Mean (sd) : 104553.2 (111978.6)

min ≤ med ≤ max:

2 ≤ 55734 ≤ 469172

IQR (CV) : 118823.5 (1.1)

42991 distinct values

0 (0.0%)

overview [character]

1. No overview found.

2. A few funny little novels

3. No movie overview availab

4. A former aristocrat Ippol

5. A group of travelers, inc

6. A Russian engineer Petr G

7. A wooden boy Buratino tri

8. Adventurer Allan Quarterm

9. Alien pods come to Earth

10. Director Michael Apted re

[ 42137 others ]

120	(	0.3%	)
3	(	0.0%	)
3	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
2	(	0.0%	)
42151	(	99.7%	)

700 (1.6%)

production_companies [character]

1. (Empty string)

Metro-Goldwyn-Mayer (MGM

Warner Bros.

Paramount Pictures

Twentieth Century Fox Fi

Universal Pictures

RKO Radio Pictures

Columbia Pictures Corpor

Columbia Pictures

10.

Mosfilm

[ 22474 others ]

9770	(	22.7%	)
737	(	1.7%	)
538	(	1.3%	)
500	(	1.2%	)
434	(	1.0%	)
320	(	0.7%	)
247	(	0.6%	)
207	(	0.5%	)
146	(	0.3%	)
144	(	0.3%	)
29945	(	69.7%	)

3 (0.0%)

production_countries [character]

1. US

2. GB

3. FR

4. CA

5. JP

6. DE

7. IT

8. RU

9. IN

10. ES

[ 132 others ]

18202	(	47.5%	)
3031	(	7.9%	)
2639	(	6.9%	)
1469	(	3.8%	)
1430	(	3.7%	)
1389	(	3.6%	)
1382	(	3.6%	)
777	(	2.0%	)
754	(	2.0%	)
576	(	1.5%	)
6653	(	17.4%	)

4689 (10.9%)

status [character]

1. In Production

2. Planned

3. Post Production

4. Released

5. Rumored

20	(	0.0%	)
15	(	0.0%	)
96	(	0.2%	)
42601	(	99.2%	)
205	(	0.5%	)

54 (0.1%)

release_date [Date]

min : 1874-12-09

med : 2001-10-29

max : 2020-12-16

range : 146y 0m 7d

16860 distinct values

29 (0.1%)

runtime [numeric]

Mean (sd) : 95.5 (36.6)

min ≤ med ≤ max:

0 ≤ 95 ≤ 1256

IQR (CV) : 21 (0.4)

343 distinct values

182 (0.4%)

revenue [numeric]

Mean (sd) : 11845510 (66095743)

min ≤ med ≤ max:

0 ≤ 0 ≤ 2787965087

IQR (CV) : 0 (5.6)

6844 distinct values

3 (0.0%)

budget [numeric]

Mean (sd) : 4461834 (17886326)

min ≤ med ≤ max:

0 ≤ 0 ≤ 3.8e+08

IQR (CV) : 0 (4)

1215 distinct values

0 (0.0%)

original_language [character]

1. en

2. fr

3. it

4. ja

5. de

6. es

7. ru

8. hi

9. ko

10. zh

[ 79 others ]

30566	(	71.1%	)
2348	(	5.5%	)
1359	(	3.2%	)
1287	(	3.0%	)
1029	(	2.4%	)
927	(	2.2%	)
786	(	1.8%	)
478	(	1.1%	)
431	(	1.0%	)
391	(	0.9%	)
3379	(	7.9%	)

10 (0.0%)

spoken_languages [character]

2. (Empty string)

en, fr

10.

en, es

[ 1896 others ]

21943	(	51.0%	)
2660	(	6.2%	)
1785	(	4.2%	)
1220	(	2.8%	)
1097	(	2.6%	)
826	(	1.9%	)
769	(	1.8%	)
722	(	1.7%	)
679	(	1.6%	)
568	(	1.3%	)
10719	(	24.9%	)

3 (0.0%)

genres [character]

1. [{'id': 18, 'name': 'Dram

2. [{'id': 35, 'name': 'Come

3. [{'id': 99, 'name': 'Docu

4. [{'id': 18, 'name': 'Dram

5. [{'id': 35, 'name': 'Come

6. [{'id': 27, 'name': 'Horr

7. [{'id': 35, 'name': 'Come

8. [{'id': 35, 'name': 'Come

9. [{'id': 18, 'name': 'Dram

10. [{'id': 27, 'name': 'Horr

[ 4055 others ]

4996	(	11.6%	)
3620	(	8.4%	)
2721	(	6.3%	)
1300	(	3.0%	)
1133	(	2.6%	)
974	(	2.3%	)
930	(	2.2%	)
593	(	1.4%	)
531	(	1.2%	)
528	(	1.2%	)
25665	(	59.7%	)

0 (0.0%)

War [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41669	(	96.9%	)
1	:	1322	(	3.1%	)

0 (0.0%)

Drama [numeric]

Min : 0

Mean : 0.5

Max : 1

0	:	22747	(	52.9%	)
1	:	20244	(	47.1%	)

0 (0.0%)

Action [numeric]

Min : 0

Mean : 0.2

Max : 1

0	:	36399	(	84.7%	)
1	:	6592	(	15.3%	)

0 (0.0%)

Thriller [numeric]

Min : 0

Mean : 0.2

Max : 1

0	:	35372	(	82.3%	)
1	:	7619	(	17.7%	)

0 (0.0%)

Crime [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	38687	(	90.0%	)
1	:	4304	(	10.0%	)

0 (0.0%)

Western [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41949	(	97.6%	)
1	:	1042	(	2.4%	)

0 (0.0%)

Documentary [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	39061	(	90.9%	)
1	:	3930	(	9.1%	)

0 (0.0%)

Romance [numeric]

Min : 0

Mean : 0.2

Max : 1

0	:	36261	(	84.3%	)
1	:	6730	(	15.7%	)

0 (0.0%)

Family [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	40224	(	93.6%	)
1	:	2767	(	6.4%	)

0 (0.0%)

Animation [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41060	(	95.5%	)
1	:	1931	(	4.5%	)

0 (0.0%)

Comedy [numeric]

Min : 0

Mean : 0.3

Max : 1

0	:	29815	(	69.4%	)
1	:	13176	(	30.6%	)

0 (0.0%)

Horror [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	38320	(	89.1%	)
1	:	4671	(	10.9%	)

0 (0.0%)

Mystery [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	40527	(	94.3%	)
1	:	2464	(	5.7%	)

0 (0.0%)

Adventure [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	39501	(	91.9%	)
1	:	3490	(	8.1%	)

0 (0.0%)

History [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41593	(	96.7%	)
1	:	1398	(	3.3%	)

0 (0.0%)

ScienceFiction [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	39947	(	92.9%	)
1	:	3044	(	7.1%	)

0 (0.0%)

Fantasy [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	40682	(	94.6%	)
1	:	2309	(	5.4%	)

0 (0.0%)

Music [numeric]

Min : 0

Mean : 0

Max : 1

0	:	41394	(	96.3%	)
1	:	1597	(	3.7%	)

0 (0.0%)

years [character]

1. 1920s

2. 1930s

3. 1940s

4. 1950s

5. 1960s

6. 1970s

7. 1990s

8. 2000s

9. 2010s

10. Before 1920

398	(	0.9%	)
1264	(	2.9%	)
1444	(	3.4%	)
1997	(	4.6%	)
2474	(	5.8%	)
3270	(	7.6%	)
8965	(	20.9%	)
10605	(	24.7%	)
12291	(	28.6%	)
283	(	0.7%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

(2) Bechdel data

Removing and Mutating Columns

imdbid and titleId are duplicated because titleId is “tt+imdbid”. movie_clean1 dataframe has IDs in the same format (starting with tt) as titleId of bechdel dataframe so I deleted imdbid column.

Also, year and title information is available in movie dataframe, thus I deleted them.

I made a new column bechdel_pass where 1 means the movie passes bechdel test and 0 means otherwise.

#

bechdel <- bechdel %>% select(-c("imdbid","...1","title"))
colnames(bechdel) <- c("year", "bechdel_rating", "id", "titleid")

summary(bechdel)

      year      bechdel_rating        id          titleid         
 Min.   :1010   Min.   :0.000   Min.   :    1   Length:9802       
 1st Qu.:1988   1st Qu.:1.000   1st Qu.: 2558   Class :character  
 Median :2006   Median :3.000   Median : 5212   Mode  :character  
 Mean   :1996   Mean   :2.133   Mean   : 5225                     
 3rd Qu.:2013   3rd Qu.:3.000   3rd Qu.: 7792                     
 Max.   :2022   Max.   :3.000   Max.   :10641

# Changing the column order
col_order <- c("id", "titleid","year", "bechdel_rating")
bechdel <- bechdel[, col_order]

# Mutate a new column bechdel_pass
bechdel <- bechdel %>% mutate(
  bechdel_pass = case_when(
    bechdel_rating == 3 ~ 1,
    TRUE ~ 0
  )
)

# Change data types of columns
bechdel$bechdel_rating <- factor(bechdel$bechdel_rating, levels = c("0", "1", "2", "3"))
bechdel$bechdel_pass <- factor(bechdel$bechdel_pass, levels = c("0", "1"))

Sense Check

This data look quite clean, however, I looked at the summary for sense check, then I realized that the minimum number of year is 1010, which is weird. “Inazuma Eleven The Movie” was released in 2010, but it seems to have been mistakenly recorded as 1010. So I manually corrected the data.

summary(bechdel)

       id          titleid               year      bechdel_rating bechdel_pass
 Min.   :    1   Length:9802        Min.   :1010   0:1084         0:4208      
 1st Qu.: 2558   Class :character   1st Qu.:1988   1:2124         1:5594      
 Median : 5212   Mode  :character   Median :2006   2:1000                     
 Mean   : 5225                      Mean   :1996   3:5594                     
 3rd Qu.: 7792                      3rd Qu.:2013                              
 Max.   :10641                      Max.   :2022

bechdel %>% 
  filter(year < 1800)

# A tibble: 1 × 5
     id titleid    year bechdel_rating bechdel_pass
  <dbl> <chr>     <dbl> <fct>          <fct>       
1 10556 tt1794796  1010 3              1

bechdel$year[bechdel$year == 1010] <- 2010

Join datasets

We have two dataframes movie and bechdel , which have imdb id as a foreign key.

movie_bechdel_join <- inner_join(movie_clean1, bechdel, 
                                 by=c("imdb_id" = "titleid"),
                                 copy = TRUE)

glimpse(movie_bechdel_join)

Rows: 7,736
Columns: 39
$ title                <chr> "'71", "'Gator Bait", "'night, Mother", "(500) Da…
$ original_title       <chr> "'71", "'Gator Bait", "'night, Mother", "(500) Da…
$ tagline              <chr> NA, "UNTAMED AND DEADLY, she ruled the swamp with…
$ imdb_id              <chr> "tt2614684", "tt0074080", "tt0090556", "tt1022603…
$ movielens_id         <dbl> 252178, 88950, 34760, 19913, 45261, 8420, 8329, 1…
$ overview             <chr> "A young British soldier must find his way back t…
$ production_companies <chr> " Screen Yorkshire, British Film Institute (BFI),…
$ production_countries <chr> "GB", "US", "US", "US", "US", "FR", "ES", "ES", "…
$ status               <chr> "Released", "Released", "Released", "Released", "…
$ release_date         <date> 2014-10-10, 1974-06-01, 1986-09-12, 2009-07-17, …
$ runtime              <dbl> 99, 88, 96, 95, 96, 95, 78, 85, 80, 102, 95, 103,…
$ revenue              <dbl> 1625847, 0, 442000, 60722734, 230600, 0, 30448000…
$ budget               <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 7.50e+06, 0.00e+00,…
$ original_language    <chr> "en", "en", "en", "en", "en", "fr", "es", "es", "…
$ spoken_languages     <chr> " en", " en", " en", " en, fr, sv", " en", " en, …
$ genres               <chr> "[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'nam…
$ War                  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Drama                <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0…
$ Action               <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ Thriller             <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0…
$ Crime                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ Western              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ Documentary          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Romance              <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ Family               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ Animation            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Comedy               <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
$ Horror               <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ Mystery              <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Adventure            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ History              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ScienceFiction       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0…
$ Fantasy              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ Music                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ years                <chr> "2010s", "1970s", "1990s", "2000s", "2000s", "195…
$ id                   <dbl> 5159, 9738, 3155, 434, 7382, 3409, 3646, 4764, 50…
$ year                 <dbl> 2014, 1973, 1986, 2009, 2009, 1956, 2007, 2009, 2…
$ bechdel_rating       <fct> 0, 0, 3, 1, 1, 3, 3, 1, 3, 2, 3, 3, 3, 1, 3, 1, 3…
$ bechdel_pass         <fct> 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1…

Read in data / Describe data (OMDb)

As we have a joined data, I read in the movie popularity data from OMDb using OMDb API.

omdb_df has 5 columns

imdbID: can be used as a foreign key when joining with other dataframes
Director: Director’s name
Metascore: Review scores from 0 to 100 on the metacritic website (https://www.metacritic.com/movie). They are weighted averages. and reviews by certain reviewers are given more weight, however the detail is not revealed.
imdbRating: Review scores from 0 to 10 on IMDb website (https://www.imdb.com/). They are weighted averages.
imdbVotes: Number of votes on IMDb website.

# Generating OMDb dataframe
# imdb_id_list <- c()
# 
# for (i in c(1:length(movie_bechdel_join$imdb_id))) {
#   imdb_id_list[[i]] <- movie_bechdel_join$imdb_id[i]
# }
# 
# omdb_list <- lapply(imdb_id_list, function(movie_id) {
#   actor_vector <- find_by_id(movie_id)
#   actor_vector
# })
# 
# omdb_df <- tibble(place = omdb_list) %>%
#   unnest_wider(place) %>%
#   select(c("imdbID", "Director", "Metascore", "imdbRating", "imdbVotes", "BoxOffice")) %>% unnest() %>% distinct()

#write.csv(omdb_df, "_data/omdb.csv")
omdb_df <- read_csv("_data/omdb.csv")

New names:
Rows: 7729 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): imdbID, Director, Metascore, BoxOffice dbl (3): ...1, imdbRating,
imdbVotes
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

omdb_df$Metascore <- as.numeric(omdb_df$Metascore)

Warning: NAs introduced by coercion

omdb_df$BoxOffice <- omdb_df$BoxOffice %>%
  str_replace_all(",", "") %>% 
  str_extract("[0-9]*(?=$)") %>% 
  as.numeric()

omdb_df <- omdb_df %>% select(-...1)
glimpse(omdb_df)

Rows: 7,729
Columns: 6
$ imdbID     <chr> "tt2614684", "tt0074080", "tt0090556", "tt1022603", "tt1132…
$ Director   <chr> "Yann Demange", "Beverly Sebastian, Ferd Sebastian", "Tom M…
$ Metascore  <dbl> 83, NA, NA, 76, 58, NA, 71, 52, 45, 52, 60, 76, NA, NA, 70,…
$ imdbRating <dbl> 7.2, 5.4, 7.6, 7.7, 6.3, 6.3, 7.4, 6.5, 5.0, 6.5, 5.5, 7.2,…
$ imdbVotes  <dbl> 56942, 1278, 2371, 516486, 2230, 8603, 181857, 74038, 35911…
$ BoxOffice  <dbl> 1270847, NA, 441863, 32391374, 230600, NA, NA, 27766, 9600,…

Now I have the data from OMDb as omdb_df. Let’s join movie_bechdel_join and omdb_df.

data <- inner_join(movie_bechdel_join, omdb_df, 
                                 by=c("imdb_id" = "imdbID"),
                                 copy = TRUE)

Data Analysis and Visualization

Is female representation in movie stories improving over time?

To find out the answer for this questions, I looked at the result of the Bechdel test.

The number of movies that pass the Bechdel test increases over time, especially after around 1980 as the total number of released movies increased.

movie_bechdel_join %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No 2 female characters", "2 female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x= "year", y = "Total number of movies", title = "The number of movies by Bechdel test ratings")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

However, the total number of released movies is increasing as well. Is the percentage of films that pass the Bechdel test increasing?

In 1920’s, which is 100 years ago, less than 20% of films passed Bechdel test. There was a big increase in percentage of Bechdel test passing movies in 1930’s. However, the percentage of films passing Bechdel test was stagnant a bit lower than 50% from 1930’s to 1950’s. Somehow, from 1950 to 1970, the representation of women in film went backwards, with less than 45% of films meeting the requirements of the Bechdel test; from the 1970s to the present, the percentage of films that pass the Bechdel test has continued to increase, reaching approximately 70% now. Although the representation of women seems to have improved since 1970, it has not improved dramatically compared to the 1930s and 1940s.

data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=years, y= n_total, fill = bechdel_rating)) + geom_bar(stat = "identity", position="fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two female characters", "1.Two female characters", "2.That talk each other", "3.Other than a man (Passes Bechdel Test)")) +
  labs(x="year", y="percentage", title = "% of movies that pass/don't pass Bechdel Test") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(ncol = 2))

`summarise()` has grouped output by 'years'. You can override using the
`.groups` argument.

If we look at the number of each year, we can see that some years in 1930’s and 1940’s did as good as in 2010s. For example, 1943 - 1944 had a quite high percentage of Bechdel Test passing rate.

data %>%
  filter(lubridate::year(release_date) > 1920) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + geom_area(stat = "identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No 2 female characters", "2 female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x= "year", y = "Total number of movies", title = "The number of movies by Bechdel test ratings")+
  scale_x_continuous(n.breaks=14) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Perhaps the percentage of films that pass the Bechdel test has not changed too dramatically compared to 70 years ago. However, a breakdown of the Bechdel test criteria for films that do not pass the Bechdel Test shows a different trend compared to 70 years ago.

Currently, most of not-bechdel-passing films fail to Bechdel test because there are more than one female character but they don’t talk each other. There are few movies (still corresponding nearly 10% of total movies released every year) that don’t have two female characters.

It has not been always the case. In 1920s, quite many movies didn’t even have two featured female characters.

data %>%
  filter(lubridate::year(release_date) > 1920 & bechdel_pass=="0") %>%
  group_by(years, bechdel_rating) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=years, y= n_total, fill = bechdel_rating)) + geom_bar(stat = "identity", position="fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("0.No two female characters", "1.Two female characters", "2.That talk each other about men", "3.Other than a man (Passes Bechdel Test)")) +
  labs(x="year", y="percentage", title = "% of movies that do NOT pass Bechdel test by Bechdel test criteria") +
  scale_y_continuous(n.breaks=10) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))

`summarise()` has grouped output by 'years'. You can override using the
`.groups` argument.

What category represents women better?

I focused on 6514 movies released between 1970 and 2022 because there is a small number of movies released before 1970 (25 or less movies a year) and it’s better to focus on the priod where more movies were released to observe the trend.

The degree to which women are portrayed in a film’s story varies widely from genre to genre.

For example, many movies in Comedy, Animation, Drama, Family, Fantasy, Romance, and Music pass the Bechdel test, while Action, Adventure, Crime, History, War, and Western The percentage of films that pass the Bechdel test is about 50% or less.

viz_by_genre <- data %>%
  pivot_longer(cols = c(War:Music),names_to = "genre_name", values_drop_na = TRUE) %>%
  filter(value == 1) 

viz_by_genre %>%
  filter(lubridate::year(release_date) > 1970) %>%
  group_by(year = lubridate::year(release_date), bechdel_rating, genre_name) %>%
  dplyr::summarize(n_total = n()) %>%
  ggplot(aes(x=year, y= n_total, fill = bechdel_rating)) + 
  geom_bar(stat = "identity", position = "fill") +
  facet_wrap(~genre_name, ncol = 3)+
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  labs(x = "year", y = "proportion", title = "Proportion of movies that pass / don't pass Bechdel test by categories")+
  scale_x_continuous(n.breaks=7) +
  theme(legend.position="bottom") +
  guides(fill = guide_legend(nrow = 2))

`summarise()` has grouped output by 'year', 'bechdel_rating'. You can override
using the `.groups` argument.

viz_by_genre %>%
  filter(lubridate::year(release_date)>1970) %>%
  group_by(bechdel_rating, genre_name) %>%
  dplyr::summarize(number = n()) %>%
  ggplot(aes(x=genre_name, y=number, fill=as.character(bechdel_rating))) +
  geom_bar(stat="identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  coord_flip() +
  labs(x="Genre", y="%")

`summarise()` has grouped output by 'bechdel_rating'. You can override using
the `.groups` argument.

Why do movies do NOT pass Bechdel test?

viz_by_genre %>%
  filter(lubridate::year(release_date)>1970 & bechdel_pass == "0") %>%
  group_by(bechdel_rating, genre_name) %>%
  dplyr::summarize(number = n()) %>%
  ggplot(aes(x=genre_name, y=number, fill=as.character(bechdel_rating))) +
  geom_bar(stat="identity", position = "fill") +
  scale_fill_manual(values = c("gray0", "gray40", "gray80", "orange"),
                    labels=c("No two female characters", "Two female characters", "That talk each other", "Other than a man (Passes Bechdel Test)")) +
  coord_flip()+
  labs(x="Genre", y="%")

`summarise()` has grouped output by 'bechdel_rating'. You can override using
the `.groups` argument.

Do movies with good female representation succeed in gaining more popularity on online review site?

I compared Metascore and IMDb ratings of the films that pass and don’t pass Bechdel test.

The movies that don’t pass Bechdel test seem to score higher on Metascore, however the difference is quite small, and there seems to be no relationship between whether a film passes the bechdel test and its reputation (Metascore).

data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=bechdel_pass, y=Metascore)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test")

Warning: Removed 1530 rows containing non-finite values (stat_boxplot).

T-test shows that there is no statistically significant difference in Metascores of the movies that pass and that don’t pass Bechdel test.

ttest_pass <- data %>% 
  filter(bechdel_pass == "1" & lubridate::year(release_date) > 1970) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

ttest_not_pass <- data %>%
  filter(bechdel_pass == "0" & lubridate::year(release_date) > 1970) %>%
  select(c("bechdel_pass", "imdbRating", "imdbVotes", "Metascore", "BoxOffice"))

t.test(ttest_pass$Metascore, ttest_not_pass$Metascore)


    Welch Two Sample t-test

data:  ttest_pass$Metascore and ttest_not_pass$Metascore
t = 0.059821, df = 4647, p-value = 0.9523
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9335741  0.9923409
sample estimates:
mean of x mean of y 
 58.66466  58.63527

Then, I compared the IMDb scores of the movies that pass and don’t pass Bechdel test. Those don’t pass Bechdel test seem to score a bit higher.

data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=bechdel_pass, y=imdbRating)) + 
  geom_boxplot() +
  scale_x_discrete(labels=c("0: Don't pass Bechel", "1: Pass Bechdel")) +
  labs(x="Bechdel Test", y="IMDb Rating")

Warning: Removed 18 rows containing non-finite values (stat_boxplot).

T-test shows that the movies that do not meet Bechdel test criteria score statistically higher than those that do.

t.test(ttest_pass$imdbRating, ttest_not_pass$imdbRating)


    Welch Two Sample t-test

data:  ttest_pass$imdbRating and ttest_not_pass$imdbRating
t = -8.029, df = 6104.8, p-value = 1.17e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2395005 -0.1454990
sample estimates:
mean of x mean of y 
 6.463997  6.656497

Do movies with good female representation succeed in making more money?

data %>%
  filter(lubridate::year(release_date) > 1970) %>%
  ggplot(aes(x=as.character(bechdel_pass), y=BoxOffice)) + geom_boxplot()

Warning: Removed 1642 rows containing non-finite values (stat_boxplot).

t.test(ttest_pass$BoxOffice, ttest_not_pass$BoxOffice)


    Welch Two Sample t-test

data:  ttest_pass$BoxOffice and ttest_not_pass$BoxOffice
t = -2.0793, df = 4805.5, p-value = 0.03765
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7909685.6  -232633.7
sample estimates:
mean of x mean of y 
 41825853  45897013

Conclusion & Discussion

% of the movies that pass Bechdel test:

Compared to 100 years ago, the rate of movies that pass Bechdel test significantly increased, however,

Contact

The dataset that I used for this analysis can be found on my GitHub (https://github.com/Enagai-nagai/601_Fall_2022/tree/template/posts/_data).

I’m interested in analyzing and visualizing data related to entertainment industry, behavior on digital platforms, and gender issues,

If you have any comments or questions, please contact me through Email.

enagai ★ umass.edu (please replace ★ with @)
nagainagai.e ★ gmail.com (please replace ★ with @)

References

Study: Women made up 34% of speaking roles in 2021’s top films, majority of those characters were white. Women and Hollywood. (2022, March 15). Retrieved December 13, 2022, from https://womenandhollywood.com/study-women-made-up-34-of-speaking-roles-in-2021s-top-films-majority-of-those-characters-were-white/#:~:text=In%202021’s%20top%20films%2C%20females,and%2037%20percent%20in%202019.

Introduction

Read in data

(1) “The Movie Dataset” from Kaggle :

(2) Bechdel test API:

(3) Open Movie Database (OMDb):

Describe data

(1) TMDb

Data Frame Summary

movie

(2) Bechdel test

Data Frame Summary

bechdel

Tidy data

(1) TMDb data

Checking missing values

Deleting unnecessary columns

Cleaning data in JSON nested list format

(1). Tidying Genres

(2). Tyding Production_countries

(3). Tyding production_companies & spoken_languages

Adding a new column

Data Frame Summary

movie_clean1

(2) Bechdel data

Removing and Mutating Columns

Sense Check

Join datasets

Read in data / Describe data (OMDb)

Data Analysis and Visualization

Is female representation in movie stories improving over time?

What category represents women better?

Why do movies do NOT pass Bechdel test?

Do movies with good female representation succeed in gaining more popularity on online review site?

Do movies with good female representation succeed in making more money?

Conclusion & Discussion

Contact

References

(1). Tidying `Genres`

(2). Tyding `Production_countries`

(3). Tyding `production_companies` & `spoken_languages`