---
title: "Homework2 Erika Nagai"
author: "Erika Nagai"
description: "Introduction to Visualization"
date: "10/12/2022"
format:
  html:
    toc: true
    code-copy: true
    code-tools: true
categories:
- hw2
- movie
- gender
---
## Challenge Overview
- Read in a dataset from the \_data folder in the course blog repository, or choose your own data. If you decide to use one of the datasets we have provided, please use a challenging dataset - check with us if you are not sure.
- Clean the data as needed using dplyr and related tidyverse packages.
- Provide a narrative about the data set (look it up if you aren't sure what you have got) and the variables in your dataset, including what type of data each variable is. The goal of this step is to communicate in a visually appealing way to non-experts - not to replicate r-code.
- Identify potential research questions that your dataset can help answer.
```{r}
# load libraries
library(tidyverse)
library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(summarytools)
```
## Read in the data
```{r}
movie = read_csv("_data/movies_metadata.csv")
```
This movie dataset was generated by MovieLens, a non-profit movie review website (https://movielens.org/), and was obtained from Kaggle (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download&select=movies_metadata.csv).
The dataset contains 45,466 movies with release dates between December 9th, 1874 and December 16th, 2020.
It includes information on genres, revenue, runtime, languages, and status (released, in production, etc.).
The dataset has the following columns.
```{r}
colnames(movie)
```
The data type of each column is as follows.
```{r}
str(movie)
```
```{r}
movieSummary = dfSummary(movie)
```
## Tidy data
The values in certain columns, such as "belongs_to_collection", "genres", "production_companies", "production_countries", and "spoken_languages", are stored as strings in a list-like format.
```{r}
movie %>% select(c("belongs_to_collection", "genres", "production_companies", "production_countries", "spoken_languages"))
```
1. Genre
First, I need to delete the "\[" and "\]" characters. I used qdapRegex, a package that can remove brackets of any kind (round, square, or curly).
```{r}
head(movie$genres)
```
**Question**: I used this package to remove the square brackets, but I originally wanted to do so with str_extract or str_replace. That didn't work because "\[" and "\]" have a special meaning in regex. I would appreciate it if you could show me how to do it with the str\_ functions.
```{r}
library(qdapRegex)
movie$clean_genres <- rm_square(movie$genres, extract = TRUE)
head(movie$clean_genres)
```
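One possible answer to the question above, as a minimal sketch on a made-up vector `x` that imitates the format of `movie$genres`: square brackets can be matched in stringr either by escaping them with a double backslash or by wrapping the pattern in `fixed()`. Note that, as far as I can tell, `rm_square(..., extract = TRUE)` returns a list, while the str\_ functions below return a plain character vector.

```r
library(stringr)

# Made-up values in the same format as movie$genres (for illustration only)
x <- c("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]", "[]")

# Option 1: escape each bracket with a double backslash and join them with |
no_brackets_1 <- str_remove_all(x, "\\[|\\]")

# Option 2: put both escaped brackets inside a character class
no_brackets_2 <- str_remove_all(x, "[\\[\\]]")

# Option 3: treat the pattern as a literal string instead of a regex
# (str_remove() only drops the first match, which is enough here)
no_brackets_3 <- str_remove(str_remove(x, fixed("[")), fixed("]"))

no_brackets_1
```

All three options give the same result on this toy input, so the choice is mostly a matter of readability.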
I counted how many times "id" appears in the "genres" column to see how many genres each movie has.
The maximum number of genres is 8, and some movies do not have any genre assigned.
```{r}
movie$num_genre <- str_count(movie$genres, "id")
summary(movie$num_genre)
```
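One caveat with counting the bare string "id": it would also match a name that happens to contain those letters. The sketch below uses made-up values (the name "Midnight" is hypothetical and not taken from the dataset) to show the difference between counting "id" and counting the full key "'id':".

```r
library(stringr)

# Made-up values; "Midnight" is a hypothetical name that contains "id"
toy_genres <- c(
  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
  "[{'id': 18, 'name': 'Midnight'}]",
  "[]"
)

# Counting the bare string also counts the "id" inside "Midnight"
loose_count <- str_count(toy_genres, "id")

# Counting the quoted key plus colon only matches the dictionary key
strict_count <- str_count(toy_genres, "'id':")

loose_count   # 2 2 0
strict_count  # 2 1 0
```

Whether this matters for the real data depends on the values in the column, so it is worth comparing both counts before relying on either.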
Since the single column "genres" holds several genres per movie, let's split the string so that each column contains only one genre.
```{r}
#str_split(movie$clean_genres, "\\},")
movie <- movie %>%
separate(clean_genres, c("genre1", "genre2", "genre3", "genre4", "genre5", "genre6", "genre7", "genre8"), "\\},")
```
The values in the genre1 to genre8 columns still contain unnecessary characters such as "{" and "}", so let's clean them up!
```{r}
# Passing c("\\{", "\\}") as the pattern to str_replace() misses some braces,
# because a pattern vector is paired with the strings element by element
# instead of being applied to every string.
# Joining the patterns with | removes all of them in one pass:
# "{", "}", "'id':", and ", 'name': "
movie <- movie %>%
  mutate(across(
    genre1:genre8,
    ~ str_remove_all(.x, "\\{|\\}|'id':|, 'name': ")
  ))
```
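To illustrate why a vector of patterns does not remove every brace, here is a small sketch on a made-up two-element vector. It shows the element-by-element pairing, and two ways to apply several patterns to every string: a single pattern joined with `|`, or a named vector passed to `str_replace_all()`.

```r
library(stringr)

# Made-up strings for illustration
x <- c("{a}", "{b}")

# Pattern 1 is applied to string 1 and pattern 2 to string 2,
# so each string loses only one of its two braces
paired <- str_replace(x, c("\\{", "\\}"), "")

# One regex with alternation is applied to every string
combined <- str_remove_all(x, "\\{|\\}")

# A named vector (pattern = replacement) applies each pattern in turn
named <- str_replace_all(x, c("\\{" = "", "\\}" = ""))

paired    # "a}" "{b"
combined  # "a"  "b"
named     # "a"  "b"
```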
Next we will make the genre name columns.
```{r}
pivot_wider(
movie,
names_from = genre1, values_from = num_genre
)
```
**Question**: I wanted to turn the genre values contained in the genre1 to genre8 columns into separate columns, as in the image below, but I haven't figured out how. I tried pivot_wider, but it didn't work the way I wanted it to.
Is there a function that I could use for this?
![](images/paste-B3340D59.png)
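One way this could be done, sketched on a made-up table `toy` rather than the real data: pull the genre names out of the original string into a long format (one row per movie and genre), add a flag column, and only then call `pivot_wider()`. The names `toy`, `genre_long`, `genre_wide`, and `flag` are placeholders chosen for this example, and the regex assumes that genre names contain no apostrophes.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up rows in the same format as the genres column
toy <- tibble(
  id = c(1, 2, 3),
  genres = c(
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]"
  )
)

# Long format: one row per movie and genre name.
# The lookbehind keeps only the text that follows 'name': '
genre_long <- toy %>%
  mutate(genre = str_extract_all(genres, "(?<='name': ')[^']+")) %>%
  select(id, genre) %>%
  unnest(genre)  # movies without any genre drop out here

# Wide format: one 0/1 column per genre
genre_wide <- genre_long %>%
  mutate(flag = 1L) %>%
  pivot_wider(names_from = genre, values_from = flag, values_fill = 0L)

# Join back so that movies without genres are kept, with 0 in every genre column
result <- toy %>%
  left_join(genre_wide, by = "id") %>%
  mutate(across(-c(id, genres), ~ replace_na(.x, 0L)))

result
```

Working from the long format avoids the genre1 to genre8 columns entirely, because `pivot_wider()` needs a single column to take the new column names from.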
After I figure out how to clean the "genres" column, I will do the same with the columns "belongs_to_collection", "production_companies", "production_countries", and "spoken_languages".
Then, I would like to join the Bechdel test dataset by using imdb_id. (To pass the Bechdel test, a movie has to have at least two \[named\] female characters who talk to each other about something other than a man.)
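The planned join could look roughly like the sketch below. Everything in it is made up for illustration: the tables `movies_toy` and `bechdel_toy`, the column names `imdbid` and `bechdel_rating`, the IDs, and the scores. It also assumes that the Bechdel data stores the ID without the "tt" prefix, which would need to be checked against the actual file.

```r
library(dplyr)
library(stringr)

# Made-up movie table with an imdb_id column
movies_toy <- tibble(
  imdb_id = c("tt0000001", "tt0000002", "tt0000003"),
  title = c("Film A", "Film B", "Film C")
)

# Hypothetical Bechdel table; the column names and scores are assumptions
bechdel_toy <- tibble(
  imdbid = c("0000001", "0000002"),
  bechdel_rating = c(1L, 3L)
)

joined <- movies_toy %>%
  left_join(
    bechdel_toy %>%
      mutate(imdb_id = str_c("tt", imdbid)) %>%  # make the key formats match
      select(imdb_id, bechdel_rating),
    by = "imdb_id"
  )

joined
```

A `left_join()` keeps every movie and leaves `NA` where no Bechdel score exists, which makes it easy to see how much of the movie data is covered.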
## Research questions
- How does female representation in a movie affect its popularity and profitability (≒ revenue)?
- How has the degree of female representation in movies worldwide changed over time?