Predicting Friends Characters with Text and Tidymodels

Using the text of Friends characters to predict the lines by Joey Tribbiani.

title: "Predicting Friends Character by Line" author: "Quantargo" date: "`r format(Sys.time(), '%d %B, %Y')`" output: html_document

# see also https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-08/readme.md
library(tidyverse)
library(tidytuesdayR)
library(tidytext)
library(stopwords)
library(rsample)
library(textrecipes)
library(textfeatures)
library(dplyr)
library(parsnip)
library(tune)
library(glmnet)

tuesdata <- tt_load('2020-09-08')

Introduction

In this project we will work with a data set containing the transcript of the television sitcom Friends. Using a simple approach to analyze the character lines, we are going to train a model that predicts the probability whether a sentence was said by the character named Joey Tribbiani.

Create Data Set Splits

There are 700 characters contained in the data set, who appear in the show. For the sake of simplicity, we will restrict the analysis to the six main characters. We also limit the variables to the text and the speaker as we are not interested in other information at the moment:

characters <- c("Rachel Green", "Ross Geller", "Chandler Bing",
                "Joey Tribbiani", "Monica Geller", "Phoebe Buffay")

friends_lines <- tuesdata$friends %>%
  filter(speaker %in% characters) %>%
  select(text, speaker)

friends_lines

Next, we split the data set into separate training, validation and testing sets:

set.seed(42) 
friends_split <- initial_split(friends_lines)
friends_train <- training(friends_split)
friends_test <- testing(friends_split)
friends_cv <- vfold_cv(friends_train)

Preprocessing

The preprocessing steps taken for this project can be summarized as follows:

Strip each sentence into a set of individual terms (tokens) using step_tokenize().
Reduce the number of tokens by removing stopwords (e.g. a, an, and etc.) using step_stopwords().
Include only tokens that appear at least 30 times overall, since we are searching for typical words that are said frequently by a character. In addition, limit the number of overall unique tokens so that we have a reasonable data set size.
Calculate the term frequency per sentence using step_tf().
Handle all characters except for Joey Tribbiani as a common other class.

rec <- recipe(speaker ~ ., data = friends_train) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, min_times = 30, max_tokens = 1000) %>%
  step_tf(text) %>%
  step_mutate(speaker = fct_collapse(speaker, 
                                     other = c("Chandler Bing", "Monica Geller", "Phoebe Buffay",
                                               "Rachel Green", "Ross Geller")))

If we prep() and bake() the recipe, we get the following training data set as output:

train_preprocessed <- rec %>% 
  prep() %>%
  bake(new_data = NULL) 

train_preprocessed

Each of the 38,286 rows is a sentence, said by either Joey Tribbiani or some other character. There are 839 columns, each of them representing a unique word (i.e. the max_tokens limit of 1000 has not been reached). The numeric values describe the number of occurrences of a given word in a specific sentence. As we can see this produces in a rather sparse data set.

Modeling and Tuning

As the training data has a very large number of columns we require some form of regularization that will efficiently eliminate irrelevant predictors (i.e. words). Therefore, we apply a logistic regression and tune the regularization hyperparameters to find the optimal settings:

model <- logistic_reg(penalty = tune(), 
                      mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

tune_res_logreg <- tune_grid(object = model,
                             preprocessor = rec,
                             resamples = friends_cv,
                             control = control_grid(verbose = TRUE, event_level = "second"))

show_best(tune_res_logreg, metric = "roc_auc")

model <- logistic_reg(penalty = tune(), 
                      mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

tune_res_logreg <- readRDS("tune_res_logreg.RDS")
show_best(tune_res_logreg, metric = "roc_auc")

The best hyperparameter settings achieve an area under the curve (AUC) performance of ~0.66. Given that we rely exclusively on unique words as predictors, this is a reasonable performance. It indicates, that there are in fact words that are typical for a character like Joey. However, these only help to distinguish a fraction of the sentences.

Next we apply the optimal hyperparameter settings and finalize the model:

best_parameters <- select_best(tune_res_logreg, metric = "roc_auc")
model <- finalize_model(model, best_parameters)

Last Fit

Using the finalized model, we run a last_fit() to get an estimate of the model performance on a truly unseen test set:

lf <- last_fit(object = model, 
               preprocessor = rec, 
               split = friends_split)

collect_metrics(lf)

The model performance is similar to the results we have seen during the model tuning. Next, we take a look at the lines that have the highest predicted probability to be a typical Joey line:

collect_predictions(lf) %>%
  slice_max(`.pred_Joey Tribbiani`, n = 10)

According to the predictions the most obvious Joey line from the test set is the following one:

friends_lines %>%
  slice(29512) %>% 
  select(text)

Variable Importance

To get an insight into the words that proved to be useful predictors, we train a model with the optimal hyperparameter settings for inspection:

model_inspect <- model %>%
  fit(speaker ~ ., data = train_preprocessed)

To extract the coefficients of the single words, we apply the tidy() function on the output:

coeffs <- tidy(model_inspect)
coeffs

Next we visualize the most important predictors. For this we take the following steps:

We exclude the estimated Intercept for them model, as it is not an actual word.
We group the estimates by their sign (i.e. positive or negative).
For both groups we extract the top 15 absolute values.
We plot the terms and their respective coefficient estimates.

coeffs %>%
  filter(term != "(Intercept)") %>% 
  group_by(estimate > 0) %>%
  slice_max(abs(estimate), n = 15) %>%
  ungroup() %>%
  ggplot() +
  geom_col(aes(x = fct_reorder(term, estimate), y = estimate, fill = estimate > 0), 
           alpha = 0.8, show.legend = FALSE) +
  labs(x = "Term", y = "Coefficient Estimate") +
  coord_flip()

If you are familiar with the show, you can easily identify how the terms with the highest positive coefficient might be typical for Joey. When it comes to terms with negative coefficients, they represent words that are rarely used by Joey and tend to be typical for some other character(s).