Data Analytics and Computational Social Science: Homework 4 : Samhith Barlaya

Samhith Barlaya

1. Read in your dataset, and compute descriptive statistics for each of your variables using dplyr.

This should include mean, median, and standard deviation for numerical variables, and frequencies for categorical variables.

The following table lists the mean, median, and standard deviation of each numeric column. The column name suffixes _mean, _median, and _sd identify the statistic.

library(dplyr)
movie_data <- read.csv("C:/Users/gbsam/Desktop/movie_metadata.csv")

movie_data %>% summarise_if(is.numeric, list(mean = mean, median = median, sd = sd), na.rm = TRUE)

  num_critic_for_reviews_mean duration_mean
1                    140.1943      107.2011
  director_facebook_likes_mean actor_3_facebook_likes_mean
1                     686.5092                    645.0098
  actor_1_facebook_likes_mean gross_mean num_voted_users_mean
1                    6560.047   48468408             83668.16
  cast_total_facebook_likes_mean facenumber_in_poster_mean
1                       9699.064                  1.371173
  num_user_for_reviews_mean budget_mean title_year_mean
1                  272.7708    39752620        2002.471
  actor_2_facebook_likes_mean imdb_score_mean aspect_ratio_mean
1                    1651.754        6.442138          2.220403
  movie_facebook_likes_mean num_critic_for_reviews_median
1                  7525.965                           110
  duration_median director_facebook_likes_median
1             103                             49
  actor_3_facebook_likes_median actor_1_facebook_likes_median
1                         371.5                           988
  gross_median num_voted_users_median
1     25517500                  34359
  cast_total_facebook_likes_median facenumber_in_poster_median
1                             3090                           1
  num_user_for_reviews_median budget_median title_year_median
1                         156         2e+07              2005
  actor_2_facebook_likes_median imdb_score_median aspect_ratio_median
1                           595               6.6                2.35
  movie_facebook_likes_median num_critic_for_reviews_sd duration_sd
1                         166                  121.6017    25.19744
  director_facebook_likes_sd actor_3_facebook_likes_sd
1                   2813.329                  1665.042
  actor_1_facebook_likes_sd gross_sd num_voted_users_sd
1                  15020.76 68452990           138485.3
  cast_total_facebook_likes_sd facenumber_in_poster_sd
1                      18163.8                2.013576
  num_user_for_reviews_sd budget_sd title_year_sd
1                377.9829 206114898       12.4746
  actor_2_facebook_likes_sd imdb_score_sd aspect_ratio_sd
1                  4042.439      1.125116        1.385113
  movie_facebook_likes_sd
1                19320.45
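The one-row output above is wide and wraps across many lines. As a minimal sketch of one way to make it easier to read, the summary can be reshaped so that each variable gets its own row. The toy data frame and the names toy and summary_long are hypothetical stand-ins for movie_data, and the tidyr package is an assumption that the original code does not use.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for movie_data
toy <- data.frame(
  duration   = c(100, 120, NA, 90),
  imdb_score = c(6.5, 7.2, 8.0, 5.9),
  language   = c("English", "French", "English", "Hindi")
)

summary_long <- toy %>%
  # Compute the three statistics for every numeric column, ignoring NAs.
  # The "__" separator keeps column names such as imdb_score splittable.
  summarise(across(where(is.numeric),
                   list(mean   = ~ mean(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE),
                        sd     = ~ sd(.x, na.rm = TRUE)),
                   .names = "{.col}__{.fn}")) %>%
  # Reshape to one row per variable with mean, median, and sd columns
  pivot_longer(everything(),
               names_to = c("variable", ".value"),
               names_sep = "__")

print(summary_long)
```

Replacing toy with movie_data should give one row per numeric column, with the same statistics as the wide table.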

The following tables list the frequency of each value for the non-numeric columns language, country, and content_rating. A few other non-numeric columns are skipped because they have too many distinct values.

movie_data %>% count(language, sort = TRUE)

     language    n
1     English 4704
2      French   73
3     Spanish   40
4       Hindi   28
5    Mandarin   26
6      German   19
7    Japanese   18
8               12
9   Cantonese   11
10    Italian   11
11    Russian   11
12     Korean    8
13 Portuguese    8
14     Arabic    5
15     Danish    5
16     Hebrew    5
17    Swedish    5
18      Dutch    4
19  Norwegian    4
20    Persian    4
21     Polish    4
22    Chinese    3
23       Thai    3
24 Aboriginal    2
25       Dari    2
26  Icelandic    2
27 Indonesian    2
28       None    2
29   Romanian    2
30       Zulu    2
31    Aramaic    1
32    Bosnian    1
33      Czech    1
34   Dzongkha    1
35   Filipino    1
36      Greek    1
37  Hungarian    1
38    Kannada    1
39     Kazakh    1
40       Maya    1
41  Mongolian    1
42    Panjabi    1
43  Slovenian    1
44    Swahili    1
45      Tamil    1
46     Telugu    1
47       Urdu    1
48 Vietnamese    1

movie_data %>% count(country, sort = TRUE)

                country    n
1                   USA 3807
2                    UK  448
3                France  154
4                Canada  126
5               Germany   97
6             Australia   55
7                 India   34
8                 Spain   33
9                 China   30
10                Italy   23
11                Japan   23
12            Hong Kong   17
13               Mexico   17
14          New Zealand   15
15          South Korea   14
16              Ireland   12
17              Denmark   11
18               Russia   11
19               Brazil    8
20               Norway    8
21         South Africa    8
22               Sweden    6
23                         5
24          Netherlands    5
25               Poland    5
26             Thailand    5
27            Argentina    4
28              Belgium    4
29                 Iran    4
30               Israel    4
31              Romania    4
32       Czech Republic    3
33              Iceland    3
34          Switzerland    3
35         West Germany    3
36               Greece    2
37              Hungary    2
38               Taiwan    2
39          Afghanistan    1
40                Aruba    1
41              Bahamas    1
42             Bulgaria    1
43             Cambodia    1
44             Cameroon    1
45                Chile    1
46             Colombia    1
47   Dominican Republic    1
48                Egypt    1
49              Finland    1
50              Georgia    1
51            Indonesia    1
52                Kenya    1
53           Kyrgyzstan    1
54                Libya    1
55             New Line    1
56              Nigeria    1
57        Official site    1
58             Pakistan    1
59               Panama    1
60                 Peru    1
61          Philippines    1
62             Slovakia    1
63             Slovenia    1
64         Soviet Union    1
65               Turkey    1
66 United Arab Emirates    1

movie_data %>% count(content_rating, sort = TRUE)

   content_rating    n
1               R 2118
2           PG-13 1461
3              PG  701
4                  303
5       Not Rated  116
6               G  112
7         Unrated   62
8        Approved   55
9           TV-14   30
10          TV-MA   20
11          TV-PG   13
12              X   13
13           TV-G   10
14         Passed    9
15          NC-17    7
16             GP    6
17              M    5
18           TV-Y    1
19          TV-Y7    1
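Each of the three frequency tables above contains a row with a blank label (12 for language, 5 for country, 303 for content_rating), which suggests that missing values were read in as empty strings. A minimal sketch of one way to handle this is to convert the empty strings to NA before counting and to add each value's share of the total. The toy data frame and the names toy, rating_freq, and share are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for movie_data$content_rating
toy <- data.frame(
  content_rating = c("R", "PG-13", "", "R", "PG", "", "R"),
  stringsAsFactors = FALSE
)

rating_freq <- toy %>%
  # Treat empty strings as missing so they show up as NA, not as a blank label
  mutate(content_rating = na_if(content_rating, "")) %>%
  count(content_rating, sort = TRUE) %>%
  # Share of all rows that fall in each category
  mutate(share = n / sum(n))

print(rating_freq)
```

The same pattern should apply to language and country by swapping the column name.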

In addition to overall means, medians, and SDs, use group_by() and summarise() to compute mean/median/SD for any relevant groupings.

The following three tables show the top 10 directors, lead actors (actor_1_name), and languages, ranked by mean IMDB score.

movie_data %>% group_by(director_name) %>% summarise(imdb_score = mean(imdb_score)) %>% arrange(desc(imdb_score))

# A tibble: 2,399 x 2
   director_name    imdb_score
   <chr>                 <dbl>
 1 John Blanchard          9.5
 2 Cary Bell               8.7
 3 Mitchell Altieri        8.7
 4 Sadyk Sher-Niyaz        8.7
 5 Charles Chaplin         8.6
 6 Mike Mayhall            8.6
 7 Damien Chazelle         8.5
 8 Majid Majidi            8.5
 9 Raja Menon              8.5
10 Ron Fricke              8.5
# ... with 2,389 more rows

movie_data %>% group_by(actor_1_name) %>% summarise(imdb_score = mean(imdb_score)) %>% arrange(desc(imdb_score))

# A tibble: 2,098 x 2
   actor_1_name       imdb_score
   <chr>                   <dbl>
 1 Krystyna Janda            9.1
 2 Jack Warden               8.9
 3 Rob McElhenney            8.8
 4 Abigail Evans             8.7
 5 Elina Abai Kyzy           8.7
 6 Jackie Gleason            8.7
 7 Kimberley Crossman        8.7
 8 Maria Pia Calzone         8.7
 9 Takashi Shimura           8.7
10 Bunta Sugawara            8.6
# ... with 2,088 more rows

movie_data %>% group_by(language) %>% summarise(imdb_score = mean(imdb_score)) %>% arrange(desc(imdb_score))

# A tibble: 48 x 2
   language   imdb_score
   <chr>           <dbl>
 1 Telugu           8.4 
 2 Polish           8.25
 3 None             7.95
 4 Indonesian       7.9 
 5 Maya             7.8 
 6 Hebrew           7.58
 7 Persian          7.58
 8 Icelandic        7.55
 9 Danish           7.5 
10 Dari             7.5 
# ... with 38 more rows
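The grouped tables above report only the mean, and a group with a single movie can reach the top of such a ranking on the strength of one score. As a minimal sketch, the grouped summary could report the count, mean, median, and standard deviation together, and keep only groups with some minimum number of movies. The toy data frame, the cutoff of 2, and the names toy, director_stats, and n_films are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for movie_data
toy <- data.frame(
  director_name = c("A", "A", "A", "B", "C", "C"),
  imdb_score    = c(7.0, 8.0, 9.0, 9.5, 6.0, 7.0)
)

director_stats <- toy %>%
  group_by(director_name) %>%
  summarise(
    n_films      = n(),
    mean_score   = mean(imdb_score, na.rm = TRUE),
    median_score = median(imdb_score, na.rm = TRUE),
    sd_score     = sd(imdb_score, na.rm = TRUE)
  ) %>%
  # Hypothetical cutoff: drop directors with fewer than 2 movies
  filter(n_films >= 2) %>%
  arrange(desc(mean_score))

print(director_stats)
```

An appropriate cutoff depends on the data. It is a judgment call rather than a fixed rule.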

2. Visualization

3. Explanation for each visualization.

4. Limitations of the visualizations

The following bar graph shows the number of movies in this dataset by release year. The variable I am using is ‘title_year’, which is the year in which the movie was released. The plot suggests that releases in this dataset are concentrated between 2000 and 2010.

Limitations:

The x-axis labels could be more granular, so that the exact range of years is easier to read.
Years with very few releases could be left out to make the plot clearer to a first-time reader.
For the project, I intend to add more variables, such as language and country, to give a fuller picture of the dataset.

library(ggplot2)
ggplot(data = movie_data, aes(x = title_year)) + geom_bar()
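The first two limitations listed above could be addressed along the following lines. This is a minimal sketch: it counts movies per year, drops years below a minimum number of releases, and sets finer x-axis breaks. The toy data frame, the cutoff min_releases, the five-year break spacing, and the names toy, year_counts, and p_years are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for movie_data$title_year
toy <- data.frame(
  title_year = c(1995, 2001, 2001, 2001, 2005, 2005, 2005, 2010, NA)
)

min_releases <- 2  # hypothetical cutoff for "very few releases"

year_counts <- toy %>%
  filter(!is.na(title_year)) %>%   # drop movies with no release year
  count(title_year) %>%            # movies per year
  filter(n >= min_releases)        # keep only years with enough releases

p_years <- ggplot(year_counts, aes(x = title_year, y = n)) +
  geom_col() +
  # A tick every 5 years makes the year range easier to read
  scale_x_continuous(breaks = seq(1900, 2020, by = 5)) +
  labs(x = "Release year", y = "Number of movies")

print(p_years)
```

Since the years are counted before plotting, geom_col() is used in place of geom_bar().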

The following plot shows how the IMDB score of a movie varies with the number of critic reviews it received. The plot suggests that movies with more critic reviews tend to have higher IMDB scores.

ggplot(data = movie_data, aes(x = num_critic_for_reviews, y = imdb_score)) + geom_point()

The following plot shows how the IMDB score of a movie varies with the number of likes on the movie’s Facebook page. The plot suggests that movies with more Facebook likes tend to have higher IMDB scores.

ggplot(data = movie_data, aes(x = movie_facebook_likes, y = imdb_score)) + geom_point()
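The trends read off the two scatter plots could also be checked numerically. As a minimal sketch, a Spearman rank correlation measures whether one variable tends to increase with the other without assuming a straight-line relationship. The toy data frame and the names toy and rho are hypothetical, and the choice of Spearman over Pearson is an assumption made here because the counts look heavily skewed.

```r
# Hypothetical toy data standing in for movie_data
toy <- data.frame(
  num_critic_for_reviews = c(5, 40, 110, NA, 300, 520),
  imdb_score             = c(5.2, 6.9, 6.1, 7.0, 7.4, 7.9)
)

# Rank correlation, using only rows where both values are present
rho <- cor(toy$num_critic_for_reviews, toy$imdb_score,
           method = "spearman", use = "complete.obs")

print(rho)
```

A value near 1 would support the reading of the plot, while a value near 0 would suggest the apparent trend is weak.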

Limitations of the above bivariate plots:

Patterns are hard to see because too many points are clustered at the left end of each plot.
There is not enough data towards the right end to conclude confidently that more critic reviews go with a higher IMDB rating.
A line showing the general trend could be added to make the plots easier for a first-time reader to follow.
For the final project, I plan to use more visualizations to examine how IMDB score varies with other variables, such as director_facebook_likes, the names of the actors and directors, the budget of the movie, and its gross earnings.
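The clustering and trend-line limitations above could be addressed roughly as follows. This is a minimal sketch: a log-scaled x-axis spreads out the points crowded at the left, partial transparency reduces overplotting, and geom_smooth() adds a fitted line. The toy data frame, the +1 offset (used so that movies with zero likes can be shown on a log scale), the linear fit, and the names toy and p_likes are hypothetical choices.

```r
library(ggplot2)

# Hypothetical toy data standing in for movie_data
toy <- data.frame(
  movie_facebook_likes = c(0, 10, 150, 1200, 9000, 45000, 120000),
  imdb_score           = c(5.1, 5.8, 6.2, 6.6, 7.1, 7.4, 8.0)
)

p_likes <- ggplot(toy, aes(x = movie_facebook_likes + 1, y = imdb_score)) +
  geom_point(alpha = 0.3) +   # transparency reduces overplotting
  scale_x_log10() +           # spreads out points clustered near zero
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # trend line
  labs(x = "Facebook likes + 1 (log scale)", y = "IMDB score")

print(p_likes)
```

The same approach should carry over to num_critic_for_reviews by changing the x variable.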
