Case Studies: IMDB Ratings: TV's golden age is real

Data Description

The data is curated courtesy of Sara Stoudt and comes from the recently created GitHub data repository of The Economist.

Their November 24th article on TV ratings covers ‘all TV dramas … via IMDb from 1990 to 2018’.

Data Dictionary

Data dictionary courtesy of skimr and kable, with credit to Phillip Knor for the pull-request.

type       variable      missing  complete  n     min         max
character  genres        0        2266      2266  5           25
character  title         0        2266      2266  1           51
character  titleId       0        2266      2266  9           9
Date       date          0        2266      2266  1990-01-03  2018-10-10
integer    seasonNumber  0        2266      2266  NA          NA
numeric    av_rating     0        2266      2266  NA          NA
numeric    share         0        2266      2266  NA          NA

1. Data Sourcing/Exploration

Step 0: Read Data

Read the data from the IMDb Economist TV ratings file:

library(readr)  # provides read_csv()
dat <- read_csv("data/IMDb_Economist_tv_ratings.csv")

1. Data Exploration

First, let’s answer the following questions:

  1. What is the series with the longest duration (by date)? Hint: You can use diff(range(date)) to calculate duration.
  2. Which series has the highest/lowest average rating?
  3. Which series had the most episodes?
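
As a rough sketch of the hint in question 1, diff(range(date)) can be evaluated per series inside summarise(). The small data frame toy below is made up for illustration and only mimics the title, date and av_rating columns; the names per_series, duration, mean_rating and n_rows are likewise hypothetical.

```r
library(dplyr)

# Hypothetical toy data mimicking three columns of the real dataset
toy <- tibble(
  title     = c("Show A", "Show A", "Show A", "Show B", "Show B"),
  date      = as.Date(c("2000-01-01", "2001-01-01", "2003-01-01",
                        "2010-06-01", "2011-06-01")),
  av_rating = c(7.5, 8.0, 8.5, 9.0, 6.0)
)

per_series <- toy %>%
  group_by(title) %>%
  summarise(
    duration    = diff(range(date)),  # time between first and last date
    mean_rating = mean(av_rating),    # average rating per series
    n_rows      = n()                 # number of rows per series
  ) %>%
  arrange(desc(duration))             # longest-running series first

per_series
```

Sorting the same summary by mean_rating or n_rows instead of duration would be one way to approach questions 2 and 3.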

2. Ratings for Selected Series

We would now like to plot and compare ratings for the following series in the dataset:

series <- c("Star Trek", "Breaking Bad", "Game of Thrones", "Sopranos", "Big Bang")

  1. Use filter() to reduce the dataset to only the series above. You can use str_detect() to match strings in the title column. To match several series at once, construct a pattern as follows: paste(series, collapse = "|").
  2. Plot the average rating av_rating for each series over date. You can color by title and set size by share as in the plot below.
  3. What other series could be plotted and would be interesting?
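
The filtering step can be sketched as follows. The titles in toy are invented for illustration, and the names pattern and selected are hypothetical; only series comes from the text above.

```r
library(dplyr)
library(stringr)

series <- c("Star Trek", "Breaking Bad", "Game of Thrones", "Sopranos", "Big Bang")

# Hypothetical titles, for illustration only
toy <- tibble(
  title = c("Star Trek: Voyager", "The Sopranos", "Some Other Show",
            "The Big Bang Theory", "Breaking Bad")
)

# Join the names with "|" so the regular expression matches any of them
pattern <- paste(series, collapse = "|")

selected <- toy %>%
  filter(str_detect(title, pattern))  # keep rows whose title matches

selected$title
```

Because str_detect() matches substrings, "Star Trek" keeps every title that contains it, so spin-offs of a series are retained as well.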

3. Ratings by Genre

Plot the average ratings by date for each genre. Which genre seems to be most/least popular?
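
Before plotting, the genres column may need to be split so that each row carries a single genre. The sketch below assumes that genres holds comma-separated values (an assumption about the file, not something stated above) and uses made-up rows; by_genre, year and mean_rating are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows; the comma-separated genres format is an assumption
toy <- tibble(
  genres    = c("Drama,Crime", "Drama,Comedy", "Comedy"),
  date      = as.Date(c("2005-03-01", "2005-09-01", "2006-01-15")),
  av_rating = c(8.0, 7.0, 6.0)
)

by_genre <- toy %>%
  separate_rows(genres, sep = ",") %>%               # one row per genre
  mutate(year = as.integer(format(date, "%Y"))) %>%  # calendar year of each row
  group_by(genres, year) %>%
  summarise(mean_rating = mean(av_rating), .groups = "drop")

by_genre
```

Averaging per year is only one possible choice here; the same summary could be computed per month, or the raw points could be plotted with a smoother for each genre.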

4. Forecast Number of Ratings

We would now like to examine the overall pattern of ratings over time and forecast the number of ratings for the next 12 months.

  1. First, we would like to aggregate the number of ratings over time. To aggregate the data monthly, we can use group_by() together with ceiling_date(), which rounds each date up to the next month boundary. Subsequently, we summarise() and count observations using n().
  2. Filter data from 2008-01-01 onwards.
  3. Convert the data.frame to a ts() object.
  4. Use your favorite forecasting technique, e.g. ARIMA with auto.arima() or neural networks with nnetar(), to forecast the time series.
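
Steps 1 to 3 above can be sketched as follows. The dates in toy are invented, and monthly, period, n_ratings and y are hypothetical names chosen for this example.

```r
library(dplyr)
library(lubridate)

# Hypothetical dates standing in for the date column
toy <- tibble(
  date = as.Date(c("2007-12-15", "2008-01-05", "2008-01-20",
                   "2008-02-10", "2008-03-03", "2008-03-28"))
)

monthly <- toy %>%
  group_by(period = ceiling_date(date, "month")) %>%  # round up to month boundary
  summarise(n_ratings = n()) %>%                      # observations per month
  filter(period >= as.Date("2008-01-01"))

# Monthly series; ts() assumes consecutive months with no gaps
y <- ts(monthly$n_ratings,
        start = c(year(min(monthly$period)), month(min(monthly$period))),
        frequency = 12)

y
```

Note that ceiling_date() labels each group with the first day of the following month, so the row labelled 2008-01-01 holds the December 2007 observation. If a month has no observations it is simply absent from monthly, and the gap would need to be filled (for example with zeros) before calling ts().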

How do the different forecasting techniques compare? What (seasonal) patterns do you see in the time series over time?
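
One possible way to compare the two techniques is to hold out the last 12 months and compare the out-of-sample error. The sketch below uses a synthetic series rather than the IMDb data, assumes the forecast package is installed, and uses hypothetical names (train, holdout, fc_arima, fc_nnet, rmse).

```r
library(forecast)

# Synthetic monthly series (hypothetical, not the IMDb data):
# linear trend + yearly seasonality + noise
set.seed(1)
idx <- 1:120
y <- ts(20 + 0.1 * idx + 5 * sin(2 * pi * idx / 12) + rnorm(120),
        start = c(2008, 1), frequency = 12)

# Hold out the last 12 months to compare the two methods
train   <- window(y, end = c(2016, 12))
holdout <- window(y, start = c(2017, 1))

fc_arima <- forecast(auto.arima(train), h = 12)  # ARIMA forecast
fc_nnet  <- forecast(nnetar(train), h = 12)      # neural network forecast

# Out-of-sample RMSE for each method
rmse <- c(
  arima  = accuracy(fc_arima, holdout)["Test set", "RMSE"],
  nnetar = accuracy(fc_nnet,  holdout)["Test set", "RMSE"]
)
rmse
```

A single hold-out window gives only a rough comparison; plotting both forecasts against the held-out months with plot() or autoplot() helps to see whether the seasonal pattern is captured.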