Submission for Homework 4
The following table lists the mean, median and standard deviation (sd) of each numeric column. The column name suffixes _mean, _median and _sd identify the corresponding statistic.
library(dplyr)
movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")
# Compute mean, median and sd for every numeric column, ignoring missing values
movie_data %>% summarise_if(is.numeric, list(mean = mean, median = median, sd = sd), na.rm = TRUE)
column                       _mean       _median     _sd
num_critic_for_reviews       140.1943    110         121.6017
duration                     107.2011    103         25.19744
director_facebook_likes      686.5092    49          2813.329
actor_3_facebook_likes       645.0098    371.5       1665.042
actor_1_facebook_likes       6560.047    988         15020.76
gross                        48468408    25517500    68452990
num_voted_users              83668.16    34359       138485.3
cast_total_facebook_likes    9699.064    3090        18163.8
facenumber_in_poster         1.371173    1           2.013576
num_user_for_reviews         272.7708    156         377.9829
budget                       39752620    2e+07       206114898
title_year                   2002.471    2005        12.4746
actor_2_facebook_likes       1651.754    595         4042.439
imdb_score                   6.442138    6.6         1.125116
aspect_ratio                 2.220403    2.35        1.385113
movie_facebook_likes         7525.965    166         19320.45
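As a point of comparison, the same kind of summary can be written with across(), which newer dplyr versions offer in place of summarise_if(). This is only a minimal sketch: the data frame toy and its values are hypothetical stand-ins for movie_data, and the resulting columns come out grouped by variable rather than by statistic.

```r
library(dplyr)

# Hypothetical toy data standing in for movie_metadata.csv
toy <- data.frame(
  title      = c("A", "B", "C", "D"),
  duration   = c(90, 100, NA, 110),
  imdb_score = c(6.0, 7.0, 8.0, 7.0),
  stringsAsFactors = FALSE
)

# One summary row; result columns are named <column>_<statistic>
stats <- toy %>%
  summarise(across(
    where(is.numeric),
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE)
    )
  ))

print(stats)
```

Passing na.rm = TRUE inside each function keeps the handling of missing values explicit for every statistic.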
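The following tables list the frequency of each class for the non-numeric columns language, country and content_rating. The remaining non-numeric columns are skipped because they have too many distinct values. A row with a blank label counts the movies for which the value is missing.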
language n
1 English 4704
2 French 73
3 Spanish 40
4 Hindi 28
5 Mandarin 26
6 German 19
7 Japanese 18
8 12
9 Cantonese 11
10 Italian 11
11 Russian 11
12 Korean 8
13 Portuguese 8
14 Arabic 5
15 Danish 5
16 Hebrew 5
17 Swedish 5
18 Dutch 4
19 Norwegian 4
20 Persian 4
21 Polish 4
22 Chinese 3
23 Thai 3
24 Aboriginal 2
25 Dari 2
26 Icelandic 2
27 Indonesian 2
28 None 2
29 Romanian 2
30 Zulu 2
31 Aramaic 1
32 Bosnian 1
33 Czech 1
34 Dzongkha 1
35 Filipino 1
36 Greek 1
37 Hungarian 1
38 Kannada 1
39 Kazakh 1
40 Maya 1
41 Mongolian 1
42 Panjabi 1
43 Slovenian 1
44 Swahili 1
45 Tamil 1
46 Telugu 1
47 Urdu 1
48 Vietnamese 1
country n
1 USA 3807
2 UK 448
3 France 154
4 Canada 126
5 Germany 97
6 Australia 55
7 India 34
8 Spain 33
9 China 30
10 Italy 23
11 Japan 23
12 Hong Kong 17
13 Mexico 17
14 New Zealand 15
15 South Korea 14
16 Ireland 12
17 Denmark 11
18 Russia 11
19 Brazil 8
20 Norway 8
21 South Africa 8
22 Sweden 6
23 5
24 Netherlands 5
25 Poland 5
26 Thailand 5
27 Argentina 4
28 Belgium 4
29 Iran 4
30 Israel 4
31 Romania 4
32 Czech Republic 3
33 Iceland 3
34 Switzerland 3
35 West Germany 3
36 Greece 2
37 Hungary 2
38 Taiwan 2
39 Afghanistan 1
40 Aruba 1
41 Bahamas 1
42 Bulgaria 1
43 Cambodia 1
44 Cameroon 1
45 Chile 1
46 Colombia 1
47 Dominican Republic 1
48 Egypt 1
49 Finland 1
50 Georgia 1
51 Indonesia 1
52 Kenya 1
53 Kyrgyzstan 1
54 Libya 1
55 New Line 1
56 Nigeria 1
57 Official site 1
58 Pakistan 1
59 Panama 1
60 Peru 1
61 Philippines 1
62 Slovakia 1
63 Slovenia 1
64 Soviet Union 1
65 Turkey 1
66 United Arab Emirates 1
content_rating n
1 R 2118
2 PG-13 1461
3 PG 701
4 303
5 Not Rated 116
6 G 112
7 Unrated 62
8 Approved 55
9 TV-14 30
10 TV-MA 20
11 TV-PG 13
12 X 13
13 TV-G 10
14 Passed 9
15 NC-17 7
16 GP 6
17 M 5
18 TV-Y 1
19 TV-Y7 1
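The code behind the frequency tables above is not shown. One way they might be produced is with dplyr's count(), sketched here under that assumption. The data frame toy is a hypothetical stand-in for movie_data, and converting empty strings to NA with na_if() is an optional extra step that makes missing values explicit.

```r
library(dplyr)

# Hypothetical toy data; the empty string mimics a missing language entry
toy <- data.frame(
  language = c("English", "French", "English", "", "English", "French"),
  stringsAsFactors = FALSE
)

language_counts <- toy %>%
  mutate(language = na_if(language, "")) %>%  # optional: treat "" as missing
  count(language, sort = TRUE)                # most frequent class first

print(language_counts)
```

The same call with country or content_rating in place of language would give the other two tables.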
The following three tables show the top 10 directors, lead actors (actor_1_name) and languages, ranked by their mean IMDB score.
# funs() is deprecated in dplyr; summarise() with a named expression gives the same result
movie_data %>% group_by(director_name) %>% summarise(imdb_score = mean(imdb_score)) %>% arrange(desc(imdb_score))
# A tibble: 2,399 x 2
director_name imdb_score
<chr> <dbl>
1 John Blanchard 9.5
2 Cary Bell 8.7
3 Mitchell Altieri 8.7
4 Sadyk Sher-Niyaz 8.7
5 Charles Chaplin 8.6
6 Mike Mayhall 8.6
7 Damien Chazelle 8.5
8 Majid Majidi 8.5
9 Raja Menon 8.5
10 Ron Fricke 8.5
# ... with 2,389 more rows
movie_data %>% group_by(actor_1_name) %>% summarise(imdb_score = mean(imdb_score)) %>% arrange(desc(imdb_score))
# A tibble: 2,098 x 2
actor_1_name imdb_score
<chr> <dbl>
1 Krystyna Janda 9.1
2 Jack Warden 8.9
3 Rob McElhenney 8.8
4 Abigail Evans 8.7
5 Elina Abai Kyzy 8.7
6 Jackie Gleason 8.7
7 Kimberley Crossman 8.7
8 Maria Pia Calzone 8.7
9 Takashi Shimura 8.7
10 Bunta Sugawara 8.6
# ... with 2,088 more rows
movie_data %>% group_by(language) %>% summarise(imdb_score = mean(imdb_score)) %>% arrange(desc(imdb_score))
# A tibble: 48 x 2
language imdb_score
<chr> <dbl>
1 Telugu 8.4
2 Polish 8.25
3 None 7.95
4 Indonesian 7.9
5 Maya 7.8
6 Hebrew 7.58
7 Persian 7.58
8 Icelandic 7.55
9 Danish 7.5
10 Dari 7.5
# ... with 38 more rows
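The rankings above are group means, so a group with a single movie can reach the top on one score. The sketch below shows one possible refinement: it reports the number of movies per group and keeps only the first rows with slice_head(). The data frame toy, the column name n_movies and the minimum of 2 movies are all hypothetical choices made for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for movie_data
toy <- data.frame(
  director_name = c("X", "Y", "Y", "Z", "Z", "Z"),
  imdb_score    = c(9.0, 8.0, 7.0, 6.0, 6.0, 9.0),
  stringsAsFactors = FALSE
)

# Mean score and number of movies per director, best mean first
ranked <- toy %>%
  group_by(director_name) %>%
  summarise(imdb_score = mean(imdb_score), n_movies = n()) %>%
  arrange(desc(imdb_score))

# Keep only the first rows (2 here; n = 10 would give a top 10)
top_directors <- ranked %>% slice_head(n = 2)

# Optional: require a minimum number of movies (the threshold is an assumption)
top_established <- ranked %>% filter(n_movies >= 2) %>% slice_head(n = 2)

print(top_directors)
print(top_established)
```

Replacing director_name with actor_1_name or language would apply the same idea to the other two rankings.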
The following bar graph shows how the rated movies in this dataset are distributed across release years. The variable used is ‘title_year’, the year in which the movie was released. The visualization suggests that most of the movies in the dataset were released between 2000 and 2010.
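The code for this bar graph is not included in the post. A minimal sketch of how such a graph might be drawn with ggplot2 follows, assuming one bar per release year and the number of movies as bar height. The data frame toy and the axis labels are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data standing in for movie_data
toy <- data.frame(title_year = c(1999, 2001, 2001, 2005, 2005, 2005, NA, 2012))

# Drop rows without a release year so they do not trigger a warning
plot_data <- subset(toy, !is.na(title_year))

year_plot <- ggplot(data = plot_data, aes(x = title_year)) +
  geom_bar() +  # bar height is the count of movies per year
  labs(x = "Release year (title_year)", y = "Number of movies")

print(year_plot)
```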
Limitations:
The following plot shows how the IMDB score of a movie varies with the number of critic reviews it received. The plot suggests that as the number of critic reviews increases, a higher IMDB rating becomes more likely.
library(ggplot2)
# Scatter plot of IMDB score against the number of critic reviews
ggplot(data = movie_data, aes(x = num_critic_for_reviews, y = imdb_score)) + geom_point()
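A scatter plot gives only a visual impression of the relationship. One way to put a number on it is a correlation coefficient, sketched below with base R's cor(). The data frame toy and its values are hypothetical, and Pearson correlation is just one possible choice of measure.

```r
# Hypothetical toy data standing in for movie_data
toy <- data.frame(
  num_critic_for_reviews = c(10, 20, 30, NA, 50),
  imdb_score             = c(5.0, 6.5, 6.0, 8.0, 9.0)
)

# Pearson correlation using only rows where both values are present
r <- cor(toy$num_critic_for_reviews, toy$imdb_score, use = "complete.obs")

print(r)
```

A value close to 1 would support the reading of the plot, while a value near 0 would indicate that the trend is weak.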
The following plot shows how the IMDB score of a movie varies with the number of likes on the movie’s Facebook page. The plot suggests that as the number of Facebook likes increases, a higher IMDB rating becomes more likely.
ggplot(data = movie_data, aes(x = movie_facebook_likes, y = imdb_score)) + geom_point()
Limitations of the above bivariate plots:
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Barlaya (2022, Jan. 14). Data Analytics and Computational Social Science: Homework 4 : Samhith Barlaya. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayahw4/
BibTeX citation
@misc{barlaya2022homework,
  author = {Barlaya, Samhith},
  title = {Data Analytics and Computational Social Science: Homework 4 : Samhith Barlaya},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomsbarlayahw4/},
  year = {2022}
}