-
Notifications
You must be signed in to change notification settings - Fork 4
/
movie_predictions.Rmd
811 lines (590 loc) · 35.6 KB
/
movie_predictions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
---
title: "Predicting Movie popularity"
output:
html_document:
fig_height: 4
highlight: pygments
theme: spacelab
pdf_document: default
---
### Setup
### Load packages
```{r load-packages, message = FALSE}
library(ggplot2)
library(dplyr)
library(statsr)
library(grid)
library(gridExtra)
library(tidyverse)
library(broom)
library(plyr)
library(BAS)
library(pander)
library(GGally)
```
### Load data
```{r load-data}
load("movies.Rdata")
```
* * *
## Data
In the `movies` dataset, there is 651 **randomly sampled** movies which were released in United States movie theater in the period of 1970-2016. The data was obtained from [Rotten Tomatoes](https://www.rottentomatoes.com/) and [IMDB](https://www.imdb.com/). The dataset contains 32 features of each movie, including genre, MPAA rating, production studio, and whether they received Oscar nominations.
Because the movies included in this dataset were randomly sampled from the above two mentioned sources and no bias was created by the sampling method, we can assume that the results obtained can be **generalized to all U.S movies released between 1970 and 2016**. On the other hand, because this is an observational study, **the relationships that could be found from this data indicate association, but _not causation_**
* * *
## Data wrangling
We will first create some new variables that will be used later in the Bayesian modeling making use of the function `mutate` as follows:
* `feature_film`: with two levels:
+ 'yes': if `title_type` is 'Feature Film'
+ 'no': otherwise
```{r feature}
movies <- movies%>%
mutate(feature_film = factor(ifelse(title_type == 'Feature Film', 'yes', 'no')))
```
* `drama`: with two levels:
+ 'yes': if `genre` is 'Drama'
+ 'no': otherwise
```{r drama}
movies <- movies%>%
mutate(drama = factor(ifelse(genre == 'Drama', 'yes', 'no')))
```
* `mpaa_rating_R`: with two levels:
+ 'yes': if `mpaa_rating` is 'R'
+ 'no': otherwise
```{r mpaa}
movies <- movies%>%
mutate(mpaa_rating_R = factor(ifelse(mpaa_rating == 'R', 'yes', 'no')))
```
* `oscar_season`: with two levels:
+ 'yes': if `thtr_rel_month` is 'November' or 'October' or 'December'
+ 'no': otherwise
```{r oscar}
movies <- movies%>%
mutate(oscar_season = factor(ifelse(thtr_rel_month == 10|
thtr_rel_month == 11|
thtr_rel_month == 12,
'yes', 'no')))
```
* `summer_season`: with two levels:
+ 'yes': if `thtr_rel_month` is 'May' or 'June' or 'July' or 'August'
+ 'no': otherwise
```{r summer}
movies <- movies%>%
mutate(summer_season = factor(ifelse(thtr_rel_month == 5|
thtr_rel_month == 6|
thtr_rel_month == 7|
thtr_rel_month == 8,
'yes', 'no')))
```
* * *
The aim of this project is to assess whether:
> Can the popularity of a movie be predicted by considering certain characteristic information about it such as type, genre, MPAA rating, number of IMDb votes, and whether it has won an award?
* **Why predicting popularity of a movie?**
Movie popularity can help people to decide which movie to watch, or whether they want to go the cinema to watch it or wait till the DVD is release and watch it at home. Consequently, it could also help theater owner to choose which movies to show or how many times to show it or for how long.
* * *
## FIRST PART: LINEAR REGRESSION
For this first part of the project, we will considered the following variables:
* **Variables to consider:**
+ We will analyze movie popularity by considering the variables:
+ `audience_score`(Audience score on Rotten Tomatoes)
+ `imdb_rating`(Rating on IMDB)
+ For the characteristics of the movie we will considered the following variables:
+ `title_type` (Type of movie)
+ `genre` (Genre of movie)
+ `mpaa_rating` (MPAA rating of the movie)
+ `imdb_num_votes`( Number of votes on IMDB)
+ `best_pic_win` (Whether or not the movie won a best picture Oscar)
+ `best_actor_win` (Whether or not one of the main actors in the movie ever won an Oscar)
+ `best_actress_win` (Whether or not one of the main actress in the movie ever won an Oscar)
+ `best_dir_win` (Whether or not one of the director in the movie ever won an Oscar)
## Exploratory data analysis
First of all, we will check whether `audience_score` and `imdb_rating` show a correlation between them. For doing this, we will plot both variables in a scatter plot:
```{r fig.width=5, fig.height=3}
ggplot(movies, aes(x=audience_score, y=imdb_rating))+
theme_minimal()+
geom_point()+
geom_smooth(method = lm)+
labs(x = "Score in Rotten Tomatoes", y = "Rating in Imdb")
```
We can see that the plot shows a possible positive correlation between the two variables. We will confirm this by using the function `cor` to numerically calculate this correlation:
```{r correlation calc}
cor(movies$audience_score, movies$imdb_rating)
```
As it can be observed from the plot and the correlation coefficient, there is a high correlation between both variables. So in this case, we can use for the model only one of them.
#### -- Distribution of `imbd_rating`:
We will check how our response variable is distributed by making use of `histogram` and summary statistics:
```{r histo, fig.width=5, fig.height=3}
ggplot(movies, aes(x=imdb_rating)) +
geom_histogram(fill="lightgreen", alpha = 0.7)+
theme_bw()+
labs(x = "Imdb rating", y= "Count", title = "Distribution of Imdb rating")
```
```{r table}
grid.newpage()
grid.table(movies %>%
summarise(mean = round(mean(imdb_rating), 3),
sd = round(sd(imdb_rating), 3),
median = median(imdb_rating),
IQR = IQR(imdb_rating),
min = min(imdb_rating),
max = max(imdb_rating)))
```
The variable __imbd_rating__ shows a distribution close to normal with a slightly left skew with a mean of 6.493 and a median of 6.00.
#### -- Distribution of `audience_score`:
We will check how the `audience_score` variable is distributed by making again use of `histogram` and summary statistics:
```{r histo2, fig.width=5, fig.height=3}
ggplot(movies, aes(x=audience_score)) +
geom_histogram(fill="pink", alpha = 0.7)+
theme_bw()+
labs(x = "Audience Score", y= "Count", title = "Distribution of audience score")
```
```{r summary2}
grid.newpage()
grid.table(movies %>%
summarise(mean = round(mean(audience_score), 3),
sd = round(sd(audience_score), 3),
median = median(audience_score),
IQR = IQR(audience_score),
min = min(audience_score),
max = max(audience_score)))
```
The variable __audience_score__ shows a more uniform distribution with a mean of 62.36 and a median of 65.00.
Because of its distribution, we will choose to considered only `imdb_rating` as the response variable.
We will subset the dataset to keep only those variables that we are interested in:
```{r subseting}
movies_interesting <- movies %>%
select(title_type, genre, mpaa_rating, imdb_num_votes, best_pic_win, best_actor_win,
best_actress_win, best_dir_win, imdb_rating)
```
#### -- Distribution of variables under consideration:
We will now considered how the variables that we are interested in including in our model are distributed. For this, we will plot a histogram for each of the variables.
```{r variables, fig.width=7, fig.height=7}
g1<- ggplot(movies_interesting, aes(x=genre)) +
geom_bar(fill="lightblue", alpha = 0.7)+
theme_bw()+
labs(x = "genre", y= "Count")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))
g2 <- ggplot(movies_interesting, aes(x=title_type)) +
geom_bar(fill="pink", alpha = 0.7)+
theme_bw()+
labs(x = "Type of movie", y= "Count")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))
g3 <- ggplot(movies_interesting, aes(x=mpaa_rating)) +
geom_bar(fill="lightgreen", alpha = 0.7)+
theme_bw()+
labs(x = "MPAA Rating", y= "Count")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))
g4 <- ggplot(movies_interesting, aes(x=imdb_num_votes)) +
geom_histogram(binwidth =50000, fill="orange", alpha = 0.7)+
theme_bw()+
labs(x = "Number of votes in Imdb", y= "Count")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))
g5 <- ggplot(movies_interesting, aes(x=best_pic_win)) +
geom_bar(fill="yellow", alpha = 0.7)+
theme_bw()+
labs(x = "Film won an oscar", y= "Count")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))
g6 <- ggplot(movies_interesting, aes(x=best_actor_win)) +
geom_bar(fill="grey", alpha = 0.7)+
theme_bw()+
labs(x = "Actor won an oscar", y= "Count")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))
g7 <- ggplot(movies_interesting, aes(x=best_actress_win)) +
geom_bar(fill="red", alpha = 0.7)+
theme_bw()+
labs(x = "Actress won an oscar", y= "Count")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))
g8 <- ggplot(movies_interesting, aes(x=best_dir_win)) +
geom_bar(fill="blue", alpha = 0.4)+
theme_bw()+
labs(x = "Director won an oscar", y= "Count")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))
grid.arrange(g2, g4, g3, g1, nrow=2, top = "Movie Characteristics")
```
```{r oscars, fig.height=4}
grid.arrange(g5, g6, g7, g8, nrow =2, top = "Film or Staff involved won an Oscar")
```
We need to obtained summary descriptive tables. For those variables that are categorical, we can use a proportion table in order to summarise them. For this task, the `table` build-in function can be used. On the other hand, we will create a data frame with the two continous variables and apply `summary` to obtained the descriptive statistics:
```{r summary}
summary(movies_interesting$imdb_num_votes)
table(movies_interesting$mpaa_rating)
table(movies_interesting$title_type)
table(movies_interesting$genre)
table(movies_interesting$best_pic_win)
table(movies_interesting$best_actress_win)
table(movies_interesting$best_actor_win)
table(movies_interesting$best_dir_win)
```
There are 591 movies in the dataset that are type “Feature Film”. We can observed that there are also 60 movies that belong to the category "Documentary" and "TV movies". Regarding the MPAA rating, we found 52 movies with MPAA ratings of NC-17 or unrated, 118 of PG, 133 of PG-13, 19 of G, and most of the movies, particularly 329, are rated as R. It's interesting to see that 305 films are clasified as Drama.
Because "Documentary" and "TV movies" are not likely to be displayed at Cinema theaters, we would exclude those type of movies and only include in the analysis "feature film"
```{r skip doc}
movies_interesting <- movies_interesting %>%
filter(title_type == "Feature Film")
```
#### -- Interaction between the variables under consideration:
Now, we can analyze the interaction between our exploratory variables and the response variable. For this task, we will plot boxplot or scatter plots according to whether the exploratory variable is numerical o categorical.
```{r plots, fig.width=7, fig.height=7}
p1 <- ggplot(movies_interesting, aes(x=genre, y = imdb_rating, fill=genre))+
geom_boxplot(alpha = 0.7)+
theme_bw()+
labs(x = "genre", y= "Imdb rating")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p2 <- ggplot(movies_interesting, aes(x=mpaa_rating, y = imdb_rating, fill=mpaa_rating))+
geom_boxplot(alpha = 0.7)+
theme_bw()+
labs(x = "mpaa_rating", y= "Imdb rating")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p3 <- ggplot(movies_interesting, aes(x=imdb_num_votes, y = imdb_rating))+
geom_point(colour = "blue", alpha = 0.5)+
theme_bw()+
geom_smooth()+
labs(x = "Number votes", y= "Imdb rating", fill = "won_oscar")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p4 <- ggplot(movies_interesting, aes(x=best_pic_win, y = imdb_rating, fill = best_pic_win))+
geom_boxplot(alpha = 0.7)+
theme_bw()+
labs(x = "Film_won_Oscar", y= "Imdb rating", fill = "best_pic_win")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p5 <- ggplot(movies_interesting, aes(x=best_actress_win, y = imdb_rating, fill = best_actress_win))+
geom_boxplot(alpha = 0.7)+
theme_bw()+
labs(x = "Actress_won_Oscar", y= "Imdb rating", fill = "best_actress_win")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p6 <- ggplot(movies_interesting, aes(x=best_actor_win, y = imdb_rating, fill = best_actor_win))+
geom_boxplot(alpha = 0.7)+
theme_bw()+
labs(x = "Actor_won_Oscar", y= "Imdb rating", fill = "best_actor_win")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p7 <- ggplot(movies_interesting, aes(x=best_dir_win, y = imdb_rating, fill = best_dir_win))+
geom_boxplot(alpha = 0.7)+
theme_bw()+
labs(x = "Director_won_Oscar", y= "Imdb rating", fill = "best_dir_win")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
grid.arrange(p3, p4, p7, p5, p6, nrow = 3)
```
```{r staff oscar, fig.height=3}
grid.arrange(p5, p6, nrow =1)
```
```{r summary st}
ddply(movies_interesting,~best_dir_win,summarise,mean=mean(imdb_rating),sd=sd(imdb_rating))
```
```{r summary sta}
ddply(movies_interesting,~best_pic_win,summarise,mean=mean(imdb_rating),sd=sd(imdb_rating))
```
From the plots and the summary descriptive, we learn that MPAA rating and genre don't appear to have a clear association with the rating given in IMDB. However, those movies that won an oscar or the director ever won an oscar appear to have a sightly higher rating. Moreover, the number of votes given show a weak positive association with the IMDB rating.
We can see also that the distribution of `imdb_num_votes` looks rigth skew. In order to adjust this, we can apply a log-transformation to the values:
```{r log}
movies_interesting <- movies_interesting %>% mutate(log_votes = log(imdb_num_votes))
```
Last, the variables `best_actor_win` and `best_actress_win` appear to have the same distribution and a similar association with `imdb_rating`, so we will combine these two variables in a new one called `main_oscar_win`.
```{r main}
movies_interesting <- movies_interesting%>%
mutate(main_oscar_win = ifelse(best_actor_win == 'yes' | best_actress_win == 'yes', 'yes', 'no'))
movies_interesting <- movies_interesting%>%
select(-c(best_actor_win, best_actress_win))
```
* * *
## Modeling
Multiple linear regression seek to model the relationship between two or more independent or explanatory variables and the response variable by fitting a linear equation to the data.
There is a very important concept to have in mind for linear regression: Collinearity. Two variables are considered to be collinear when they are highly correlated with each other. The inclusion of collinear predictors complicates the model estimation.
Our goal is to reach a parsimonious model, this is the simpler model with great explanatory predictive power. In order to do this, we have two options for model selection: Forward selection and backwards elimination. In the first case, we start with an empty model and we add one predictor at a time. We will choose the second option: **Backwards elimination** implies starting with a model comprising all candidates and dropping one predictor at a time until the __parsimonious model__ is reached.
### Baseline model
Backwards elimination implies starting with a model comprising all candidates. In our case, our first full model includes all six variables. We will use `lm` for this task and include the variables `genre`, `best_pic_win`, `best_dir_win`, `main_oscar_win`, `log_votes` and `mpaa_rating`
```{r modeling}
fullmodel <- lm(imdb_rating ~ genre+best_pic_win+best_dir_win+main_oscar_win+log_votes+mpaa_rating,
data = movies_interesting)
summary(fullmodel)
```
### Model Selection
After running the full model with all the variables involved, we have obtained an $R_{adj}^2$ of 0.3582, which means that we can still improve the model. In order to do so, we can use the one-by-one remove method and start by removing the variable which has the highest p-value each time, until all the variables remaining in the model are significant. P-value was chosen as elimination criteria due to the fact that in this case, the aim was to create a model that shows the highest predictive value using only variables with sigificance.
So the variable that has the highest p-value in our model is `main_oscar_win`.
```{r step1 model}
step1_model <- lm(imdb_rating ~ genre+best_pic_win+best_dir_win+log_votes+mpaa_rating,
data = movies_interesting)
summary(step1_model)
```
After running again our simpler model, we can see that now our $R_{adj}^2$ is 0.3594. We can try to improve our model more by eliminating again the variable with the highest p-value. In this case, it will be `best_pic_win`.
```{r step2 model}
step2_model <- lm(imdb_rating ~ genre+best_dir_win+log_votes+mpaa_rating,
data = movies_interesting)
summary(step2_model)
```
We now see that the $R_{adj}^2$ is 0.3595, not different from our previous model in step1, but with the difference that this time all variables involved are significant. I will not show it here for practical sake, but removal of any of other variables will decrease $R_{adj}^2$. So we considered this our final model.
### Collinearity
So at this point, we can look into our variables and see if the variables we are interested in show some degree of collinearity. In our dataset, we have mixed variables, this is we have some variables that are categorical and some that are continuous, so in this case, a way to measure collinearity is using the variance inflation factor (VIF). The VIF, that quantifies the extent of multicollinearity in an ordinary linear regression, is calculated as the ratio between the variance of the model with multiple terms and the variance of the model with one term alone. In simple words, it tells us how much the variance of a regression coefficient increases due to collinearity existent in the model. So, let's go ahead and calculate this:
```{r collinearity}
library(car)
car::vif(step2_model)
```
None of our predictors has a high VIF, so we can assume that multicollinearity is not playing a role in our model.
### Model Diagnostics
The multiple regression model depends on the following four assumptions:
1) Each numerical explanatory variable is linearly related to the response variable
2) Residuals are distributed nearly normal with a mean of 0
3) Variability of residuals is nearly constant
4) The residuals are independant
We will test one-by-one the assumptions in the context of our model:
1) The only numerical variable that we have in our model is `log_values`. So we can explore the first assumption by checking the residual plots (e vs. _X_).
```{r linear relationship, fig.width=5, fig.height=3}
ggplot(augment(step2_model), aes(x = movies_interesting$log_votes, y = .resid))+
geom_point(colour = "red", size = 2, alpha = 0.5)+
theme_classic()+
geom_smooth(se=FALSE)+
labs(x = "log(number of votes)", y= "residuals")+
geom_hline(yintercept = 0, linetype = "dashed")
```
The plot shows that the residuals are random scatter around 0, which indicates a linear relationship between the numerical exploratory variable and the response variable.
2) To check this condition, we will perform first the histogram of the residuals and then a residuals Q-Q plot.
```{r residuals normal, fig.height= 4}
par(mfrow=c(1,2))
hist(step2_model$residuals, main = "Residual Distribution",
xlab = "Residuals", col = "lightgreen")
qqnorm(step2_model$residuals, col = "blue")
qqline(step2_model$residuals, col = "red")
```
As we can see above, the distribution histogram and the residuals Q-Q plot show a close to normal distribution, and also mimics the left-hand skew that was observed in the original `imdb rating` variable.
3) Now, we need to check that the residuals are equally variable for low and high values of the predicted response variable. Then, we will check the plot of residuals vs. predicted (e vs. $\hat{y}$).
```{r variability, fig.width=5, fig.height=3}
ggplot(augment(step2_model), aes(x= .fitted, y= .resid))+
geom_point(colour = "blue", size = 2, alpha = 0.5)+
theme_classic()+
geom_smooth()+
geom_hline(yintercept = 0, linetype = "dashed")+
labs(x = "Fitted", y = "Residuals")
```
The residuals are randomly scattered in a band with a constant width around 0.
4) Lastly, we will check for the independecy of the residuals:
```{r independence, fig.width=5, fig.height=3}
ggplot(augment(step2_model), aes(x = seq_along(.resid), y = .resid)) +
geom_point(colour = "lightgreen", size = 2.5, alpha = 0.6)+
theme_classic()+
labs(x = "Order of Collection", y = "Residuals", title = "Residuals vs. Order of Collection")
```
The plot above does not display any particulat pattern, so it is possible to assume that the residuals and as a consequence, the observations are independant.
* * *
## Prediction
Now, we can test the predictive capability of the developed model using the movie: "Zootropolis" released in 2016. The corresponding information was obtained from the IMDB website to be consistent with the analysis data.
```{r zootropolis_linar}
zoo <- data.frame(genre="Comedy", mpaa_rating="PG", best_dir_win="yes",
log_votes = log(345340))
predict_1 <- predict(step2_model, zoo, interval="predict")
imdb_rating_predictions <- c(8.0, 7.8)
predictions <- data.frame("Movie" = "Zootropolis",
"Predicted rating" = sprintf("%2.1f", predict_1[1]),
"95% CI" = sprintf("%2.1f-%2.1f", predict_1[2], predict_1[3]),
"IMDb rating" = imdb_rating_predictions[1])
predictions
```
First of all, we can say that in this case the 95% confidence interval can be interpreted as the interval around the predicted rating score within which we are 95% confident the real movie rating would fall.
From the table we can observed that the model was close in predicting the rating for Zootropolis if we considered that the real rating is inside the 95% confidence prediction interval.
* * *
## SECOND PART: BAYESIAN MODELING
Another way to predict movie popularity is to use a Bayesian modeling instead of a linear regression model.
So, we will start by selecting the variables that are interesting for this part of the project:
```{r selecting}
movies_final <- movies%>%
select(feature_film, drama, runtime, mpaa_rating_R,
thtr_rel_year, oscar_season, summer_season, imdb_rating,
imdb_num_votes, critics_score, best_pic_nom, best_pic_win,
best_actor_win, best_actress_win, best_dir_win, top200_box)
```
* * *
## Exploratory data analysis
### Relation between new exploratory and response variables:
First of all, we will investigate which is the relationship between the response variable `imdb_rating` and the new exploratory variables created. In doing so, we will create summary statistic tables and side-by-side boxplot:
```{r feature_plot, fig.height= 8, fig.width=8}
p1 <- ggplot(movies_final, aes(x=feature_film, y = imdb_rating, fill=feature_film))+
geom_boxplot(alpha = 0.7)+
theme_minimal()+
scale_fill_brewer(palette="Set2")+
labs(x = "Feature Film", y= "imdb_rating")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p2 <- ggplot(movies_final, aes(x=drama, y = imdb_rating, fill=drama))+
geom_boxplot(alpha = 0.7)+
theme_minimal()+
scale_fill_brewer(palette="Set3")+
labs(x = "Drama", y= "imdb_rating")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p3<- ggplot(movies_final, aes(x=mpaa_rating_R, y = imdb_rating, fill=mpaa_rating_R))+
geom_boxplot(alpha = 0.7)+
theme_minimal()+
scale_fill_brewer(palette="Set1")+
labs(x = "MPAA Rating R", y= "imdb_rating")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p4 <- ggplot(movies_final, aes(x=oscar_season, y = imdb_rating, fill=oscar_season))+
geom_boxplot(alpha = 0.7)+
theme_minimal()+
scale_fill_brewer(palette="Dark2")+
labs(x = "Oscar season", y= "imdb_rating")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
p5 <- ggplot(movies_final, aes(x=summer_season, y = imdb_rating, fill=summer_season))+
geom_boxplot(alpha = 0.7)+
theme_minimal()+
scale_fill_brewer(palette="RdBu")+
labs(x = "Summer season", y= "imdb_rating")+
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0))+
theme(legend.position="none")
grid.arrange(p1, p2, p3, p4, p5, nrow = 3)
```
#### - `feature film`
```{r summary feature}
table_feature <- movies_final %>%
tbl_df%>%
group_by(feature_film)%>%
dplyr::summarize(n = n(), Mean = mean(round(mean(imdb_rating), 2)), Sd = sd(imdb_rating))
pandoc.table(table_feature)
```
From the table and the plots, we can observe that:
* Even though there are only 60 non-feature films vs. 591 feature films, a potential relationship between `feature_film` and `imdb_rating` is present in this dataset due to the fact that non feature films appear to have an `imdb_rating` mean of 7.53 point higher than feature films.
* Taking into consideration the variance of both groups, it is neccesary to use inferential tools to distinguish if this difference is statistically significant.
#### - `drama`:
```{r summary drama}
table_drama <- movies_final %>%
tbl_df%>%
group_by(drama)%>%
dplyr::summarize(n = n(), Mean = mean(round(mean(imdb_rating), 2)), Sd = sd(imdb_rating))
pandoc.table(table_drama)
```
From the plot and the table, we can observe the following:
* There are 346 movies that belong to the category drama and 305 that do not.
* Contrary to what we observed with `feature_film` variable, there is not clear relationship between `drama` and `imdb_rating` as the mean of both groups is similar.
#### - `mpaa_rating_R`:
```{r summary rating}
table_rating <- movies_final %>%
tbl_df%>%
group_by(mpaa_rating_R)%>%
dplyr::summarize(n = n(), Mean = mean(round(mean(imdb_rating), 2)), Sd = sd(imdb_rating))
pandoc.table(table_rating)
```
From the plot and the table, we learnt that:
* The variable `mpaa_rating` shows no relationship with `imdb_rating`.
* Not only we can see that half of the movies (329) belongs to the category 'R' of MPAA Rating, but also the mean of `audience_score` is equal in both groups as well as the variance.
#### - `oscar_season`:
```{r summary oscar}
table_oscar <- movies_final %>%
tbl_df%>%
group_by(oscar_season)%>%
dplyr::summarize(n = n(), Mean = mean(round(mean(imdb_rating), 2)), Sd = sd(imdb_rating))
pandoc.table(table_oscar)
```
From the plot and the table, we observe that:
* The variable `oscar_season` do not show a evident relationship with `imdb_rating`.
* There is fewer movies released within Oscar season (191) that outside it (460).
* The mean of `imdb_rating` is similar in both groups as well as the variance.
#### - `summer_season`:
```{r summary summer}
table_summer <- movies_final %>%
tbl_df%>%
group_by(summer_season)%>%
dplyr::summarize(n = n(), Mean = mean(round(mean(imdb_rating), 2)), Sd = sd(imdb_rating))
pandoc.table(table_summer)
```
From the plot and the table, we observe that:
* The variable `summer_season` do not show a evident relationship with `audience_score`.
* There is fewer movies released within summer season (208) that outside it (443).
* The mean of `audience_score` is similar in both groups as well as the variance.
From the exploratory analysis performed, we can speculate that the new created variable `feature_film` would have the strongest relationship with our response variable `imdb_rating` while the other new variables created could have weak or no relationship.
* * *
## Modeling
We will now proceed to conduct a Bayesian regression using the `BAS` package. We will use Bayesian Model Average (BMA) and Zellner-Siow Cauchy prior along with an uniform model prior to assign equal probabilities to all models. Regarding the option method, we will use "MCMC" (Markov chain Monte Carlo) that improves the model search efficiency.
Fist of all, we will discard any rows with `NAs`:
```{r}
movies_final <- movies_final %>%
filter(complete.cases(.))
```
The code below creates the full Bayesian model:
```{r bayesian}
movies_bas <- bas.lm(imdb_rating ~ .,
data = movies_final,
method = "MCMC",
prior = "ZS-null",
modelprior = uniform())
```
We will now print the marginal inclusion probabilities obtained for the model:
```{r summary3}
movies_bas
```
After that, we can use the function `summary`
```{r summarybas}
summary(movies_bas)
```
to see the top 5 models with the zero-one indicators for variable inclusion. It is also displayed a column with the Bayes factor ($BF$) for each model to the highest probability model, the posterior probabilities of the models ($PostProbs$), the $R^2$ of the models, the dimension of the models ($dim$) and the log marginal likelihood ($logmarg$) under the selected prior distribution.
Last, we can make use of the function `image`
```{r image, fig.height=6, fig.width=6}
image(movies_bas, rotate=F)
```
to visualize the Log Posterior Odds and Model Rank. In the picture above, each row correspond to each variable included in the full model as well as one extra row for the intercept. In each column, we can see all possible models ($2^{16}$ because we have 16 variables included) sorted by their posterior probability from the best to worst rank on the top (from left to right).
> From the model and the image above, we can see that:
> * `feature_film` has a marginal probability of 0.999, and appears in all five top models
> * `critics_score` has a marginal probability of 0.999 and also appears in all five top models
> * `runtime` has a marginal probability of 0.98 and appears in all five top models
> * `drama` has a marginal probability of 0.57 and appears in three of the five top models
> * `imbd_num_votes` has a marginal probability of 0.99 and appears in three of the five top models
> * the intercept also has a marginal probability of 1, and appears in all five top models
According to this, the best model includes the intercept, `feature_film`, `critics_score`, `drama`, `imbd_num_votes` and `runtime`
#### Posterior Distributions of Coefficients
We can now obtain the coefficients estimates and standard deviations under BMA in order to be able to examine the marginal distributions for the important variables coefficients. To do so, we will use the function `coef` and plot them using `plot`:
```{r coefficients}
coef_movies <- coef(movies_bas)
par(mfrow=c(3,2))
plot(coef_movies, subset = c(1, 2, 3, 4, 9, 10), ask=F)
```
The vertical line corresponds to the posterior probability that the coefficient equals to 0. On the other hand, the shaped curve shows the density of posiible values where the coefficient is non-zero. It is worthy to mention that the height of the line is scaled to its probability. This implies that intercept and `feature_film`, `critics_score`, `imbd_num_votes` and `runtime` show no line denoting non-zero probability.
Last, we can obtain credible intervals for coefficients using `confint` method:
```{r cin}
confint(coef_movies)
```
#### Graphical Summaries
`BAS` package provides us with an easy way to get graphical summaries for our model just using the function `plot` and the `which` option
**Residual vs. fitted plot**
```{r residuals, fig.height=4}
plot(movies_bas, which = 1, ask=F)
```
We can observe here a plot of the "Residuals vs. Fitted values"" under BMA. Ideally, we will expect to not see outliers or non-constant variance. However, in this case we can see that there is a constant spread over the prediction but there are two outliers.
**Model probabilities**
```{r prob, fig.height=4}
plot(movies_bas, which = 2, ask=F)
```
This plot displays the cumulative probability of the models in the order that they are sampled. This plot shows that the cumulative probability starts to level off after 300 model trials as each additional model adds only a small increment to the cumulative probability. The model search stops at ~1400 instead of enumerations of 2^15 combinations.
**Model complexity**
```{r compl, fig.height=4}
plot(movies_bas, which = 3, ask=F)
```
This plot shows the dimension of each model, that is the number of regression coefficients including the intercept versus the log of the marginal likelihood of the model. In this case, we can see that highest log marginal can be reached from 5 to 12 dimensions.
**Marginal inclusion probabilities**
```{r marg, fig.height=4}
plot(movies_bas, which = 4, ask=F, cex.lab=0.5)
```
In this case, we can observe the marginal posterior inclusion probabilities for each of the covariates, with marginal posterior inclusion probabilities that are greater than 0.5 shown in red (important variables for explaining the data and prediction). In the graph, we can see what it was show already before about which variables contribute to the final scores.
* * *
## Prediction
Now, we can test the predictive capability of the developed model using the movie: "Zootropolis" released in 2016. The corresponding information was obtained from the [IMDB website](https://www.imdb.com/title/tt2948356/?ref_=nv_sr_1) and [RottenTomatoes](https://www.rottentomatoes.com/m/zootopia) to be consistent with the analysis data.
```{r zootropolis_bayesian}
zootropolis <- data.frame(feature_film = "yes", drama="no",
runtime=108, mpaa_rating_R = "no",
thtr_rel_year = 2016, oscar_season = "no",
summer_season = "no",
imdb_num_votes = 345433, critics_score=98,
best_pic_nom = "yes", best_pic_win = "yes",
best_actor_win = "no", best_actress_win = "no",
best_dir_win = "yes", top200_box = "no")
predict_1 <- predict(movies_bas, zootropolis, estimator="BMA", interval = "predict", se.fit=TRUE)
data.frame('Movie' = 'Zootropolis',
'Estimated IMDB rating' = predict_1$Ybma,
'Real IMDB rating' = 8.0)
```
The true `imdb_rating` is 8, which is pretty close to what our model predicted.
* * *
From the linear regression and the Bayesian model we learnt that in fact the popularity of a movie can be predicted by considering characteristic data of each movie.
In the linear regression analysis, it was possible to build a parsimonious, multivariable, linear model that is able to some extend to predict the movie popularity, understood as $IMDb\ rating$, with the four statistically significant predictors chosen. However, it is important to remember that the $R_{adj}^2$ of our final model is only 0.3595, so this means that 35.95% of the variability is explained by the model. In the Bayesian model, we finally got a parsimonious model that also fullfilled the Bayesian assumptions.
From both models, we can see that the Bayesian model is the one which prediction was close to the real `imdb_rating`.