Game of Thrones and Data Science: Cleaning the Dataset and a Simple Classifier
This blog post uses data science to predict who is going to die in the new Game of Thrones Season 7. In this entry I’ll build a simple random forest classifier and use it to predict whether or not a character should be labeled as dead. The prediction is based on the assumption that living characters labeled as dead are the most likely to die next. Based on this simple model, my first set of predictions of which characters will die in season 7 is:
- Daario Naharis
- Gregor Clegane
- Meera Reed
- Melisandre
- Tormund Giantsbane
- Podrick Payne
Before I get there, the data needs a fair bit of work. I’ll summarise the main bits, but leave out a lot of the more tedious work on validating the data and getting it into a usable state. I’ll then have a look at how I can evaluate how well the model performs. Later, I’ll be building some more complicated models, and making updated predictions about deaths in season 7.
Cleaning the Dataset
Cleaning the dataset can often be a lot of work, and this example required a fair bit of effort to get the data into a usable form. I’m going to omit most of the details as I think they’re probably not that interesting (but do get in touch if you’re keen to find out more). I will go over some of the last steps, though, as they provide a nice example of using the pandas merge and groupby functions. I’ll begin at a stage where the dataset is in a pretty simple state: it lives in two files, deaths.csv and appearances.csv. Both have data structured in a similar way, with character_name, season_index, and episode_index columns. Basically, each character has a row in the appearances file for each episode they appear in (recording the episode and season numbers). If that character has died, they’ll also have a row in the deaths file with the season and episode number of their death.
A very easy way to view and play with the data is using the Pandas module in Python. For example, it’s straightforward to add extra columns. I can combine the episode_index and season_index into a master_episode_index (which takes a value between 1 and 60, as there are 60 episodes in total, split over the first 6 seasons).
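As a sketch of that step (the data here is made up for illustration, and I’m assuming 1-based indices with 10 episodes per season):

```python
import pandas as pd

# Toy appearances data with 1-based season and episode indices.
appearances = pd.DataFrame({
    'character_name': ['Jon Snow', 'Jon Snow', 'Arya Stark'],
    'season_index':   [1, 2, 6],
    'episode_index':  [1, 3, 10],
})

# Each season has 10 episodes, so e.g. episode 3 of season 2
# becomes overall episode 13.
appearances['master_episode_index'] = (
    (appearances['season_index'] - 1) * 10 + appearances['episode_index']
)
print(appearances['master_episode_index'].tolist())  # [1, 13, 60]
```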
Another useful thing to do is to join my two datasets together using pandas’ merge function:
appearances_and_deaths = pd.merge(
    appearances, deaths,
    on='character_name',
    suffixes=['_appearances', '_deaths'],
    how='left')
This works by finding all the matching rows in both datasets which have the same value in the on column. (It works the same as a SQL JOIN.) The suffixes option is handy: each suffix is appended to the column names that appear in both datasets, so you can easily tell which dataset a column originated in (e.g. here episode_index appears in both datasets, so the merged dataset has episode_index_appearances and episode_index_deaths). The how option lets you specify what to do with non-matching rows. In this case, if a character isn’t dead they won’t be found in the deaths dataset. By default, rows in one dataset with no match in the other are left out. I can be a bit less strict by keeping non-matching rows from the dataset on the ‘left’ (so it’s just like a LEFT JOIN in SQL).
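To make the left-join behaviour concrete, here’s a tiny made-up example (the names and values are illustrative, not from the real dataset):

```python
import pandas as pd

# Miniature stand-ins for the two files.
appearances = pd.DataFrame({
    'character_name': ['Ned Stark', 'Jon Snow'],
    'episode_index':  [1, 1],
})
deaths = pd.DataFrame({
    'character_name': ['Ned Stark'],
    'episode_index':  [9],
})

merged = pd.merge(appearances, deaths, on='character_name',
                  suffixes=['_appearances', '_deaths'], how='left')

# Jon Snow has no row in deaths, so with how='left' he is kept
# and his episode_index_deaths is NaN.
print(merged)
```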
Characters who aren’t dead will have NaN values in the _deaths columns. I can make this a bit more obvious by adding a new column called is_dead:
appearances_and_deaths['is_dead'] = \
    ~appearances_and_deaths.episode_index_deaths.map(np.isnan)
(using the NumPy function isnan). In the next section I’ll train a model aimed at predicting this value from the other data.
Finally, I aggregate all this down so that each character has a single row, containing a summary of all this data. I’ll count the total number of episodes and seasons that the character appears in, and the episode and season ‘distances’ (how long it was between their first and last appearance, in terms of number of episodes or seasons). This is slightly fiddly, but not too bad:
aggregated = appearances_and_deaths.groupby(['character_name', 'is_dead'])
groups = aggregated.groups.items()
aggregated = []
for (character, is_dead), indexes in groups:
    sub_data = appearances_and_deaths.loc[indexes, :]
    aggregated.append([
        character,
        is_dead,
        sub_data.master_episode_index.unique().size,
        sub_data.season_index_appearances.unique().size,
        sub_data.master_episode_index.max() - sub_data.master_episode_index.min(),
        sub_data.season_index_appearances.max() - sub_data.season_index_appearances.min()
    ])
aggregated = pd.DataFrame(aggregated, columns=[
    'character', 'is_dead', 'total_nr_episodes', 'total_nr_seasons',
    'episode_distance', 'season_distance'
])
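For reference, the same summary can be written more compactly with pandas’ named-aggregation syntax. This is a sketch, assuming the merged frame has the columns described above (named aggregation needs pandas 0.25 or later):

```python
import pandas as pd

# A compact version of the per-character summary. Column names are
# assumed to match those in the merged dataset described above.
def summarise(df):
    return df.groupby(['character_name', 'is_dead'], as_index=False).agg(
        total_nr_episodes=('master_episode_index', 'nunique'),
        total_nr_seasons=('season_index_appearances', 'nunique'),
        episode_distance=('master_episode_index', lambda s: s.max() - s.min()),
        season_distance=('season_index_appearances', lambda s: s.max() - s.min()),
    )

# Tiny made-up check: 'A' appears in episodes 1 and 5 of season 1.
toy = pd.DataFrame({
    'character_name': ['A', 'A', 'B'],
    'is_dead': [False, False, True],
    'master_episode_index': [1, 5, 10],
    'season_index_appearances': [1, 1, 2],
})
print(summarise(toy))
```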
I now have enough to build a very simple model. I will use the columns total_nr_episodes, total_nr_seasons, episode_distance and season_distance (I’ll call these the features from now on) to try to ‘predict’ is_dead. Before I get on to that, this is how the distributions looked, for both living and dead characters.
Random Forest Model
I’ll start with a random forest model. This is based on the decision tree classifier, which is pretty simple to understand. I split the dataset in two by choosing some feature (one of the columns used to build the model) and a critical value: all the rows with values less than the critical value go into one subset, and the rows with greater values go into the other. I choose the feature and critical value so that one subset is more likely to have is_dead true, and the other more likely to have it false. I then take each subset in turn and repeat the process, again and again, until I have lots of small subsets. Hopefully each subset will end up with mostly the same value of is_dead.
The problem with this is that it’s hard to know when to stop splitting. If I split too much I would end up with very small subsets which match the training data very well, but might not be able to handle new data accurately. (This is called overfitting.) This can be overcome by building lots of trees, each one trained on a different, small, randomly selected subset of the total number of rows. When I want to classify a new bit of data, each tree makes its own prediction and then they all ‘vote’ for which category the new item should be in (i.e. whether is_dead is true or false). Such a set of trees is called a Random Forest, and they’re pretty easy to use straight out of the box. I’ll use the random forest classifier provided by Scikit-Learn.
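To see that the forest really is just a collection of trees voting, here’s a small sketch on synthetic data (the data and parameters here are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: class 1 whenever the first feature exceeds 0.5.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

# By default each tree is fit on a bootstrap sample of the rows.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# The forest is literally a list of decision trees.
print(len(forest.estimators_))  # 10

# predict_proba averages the trees' individual probability estimates,
# which behaves roughly like a vote share per class.
print(forest.predict_proba(X[:1]))
```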
One problem here is that I’m trying to predict whether a character should be labeled as dead, but I already have that in the training set. So I’ll use a simple strategy: I’ll randomly split the dataset into two parts: a training set (containing 70% of the data) and a validation set (containing the remaining 30%). I’ll build the model using the training set, and then see how well it performs using the validation set. The code looks like this:
import numpy as np
import sklearn.ensemble
import sklearn.model_selection

features = ['total_nr_episodes', 'total_nr_seasons',
            'episode_distance', 'season_distance']
target = 'is_dead'
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    aggregated[features], aggregated[target],
    test_size=0.3, random_state=np.random.randint(0, 1000))
classifier = sklearn.ensemble.RandomForestClassifier()
classifier.fit(x_train, y_train)
print(classifier.score(x_test, y_test))
That’s it! The data is split into a training and validation (or test) set using train_test_split, then I train or ‘fit’ the model, and lastly I see how well it has performed using the score method (which gives the fraction of the test set that is predicted correctly). The scores that I get here vary depending on the training-test split, but are typically about 85%. Not too bad! But I can dig deeper and see what’s going on.
One thing to look at is the confusion matrix. Basically, I have four types of results: characters who are predicted to be dead and who are actually dead, characters who are predicted to be alive and who are actually alive, characters who are predicted to be dead but who are actually alive, and characters who are predicted to be alive but who are actually dead. Expressing these as percentages, I find that:
| Measure | Percentage |
|---|---|
| Death predictions that are accurate | 60 |
| Alive predictions that are accurate | 89 |
| Dead characters predicted to be dead | 44 |
| Alive characters predicted to be alive | 94 |
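These percentages come straight from the confusion matrix. Here’s a sketch of how to compute them with scikit-learn’s confusion_matrix, using made-up labels where 1 means dead and 0 means alive:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = dead, 0 = alive.
y_true = np.array([1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 0])

# Rows are the true class, columns the predicted class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_dead  = tp / (tp + fp)   # death predictions that are accurate
precision_alive = tn / (tn + fn)   # alive predictions that are accurate
recall_dead     = tp / (tp + fn)   # dead characters predicted to be dead
recall_alive    = tn / (tn + fp)   # alive characters predicted to be alive
```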
The predictions for characters being alive look very good. But only 44% of dead characters are actually predicted to be dead. That’s worse than choosing by flipping a coin! Can I do better?
Yes! One problem is that my dataset contains way more alive characters (506) than dead characters (143), so it’s likely that the model will be biased towards guessing that a character is alive. One way to fix this is to use down-sampling: I randomly throw away some of the alive characters in the dataset, so that the numbers of alive and dead characters are the same. Sadly this means I throw away some data. In general, there’s a trade-off between having a balanced set (often good) and having less data (usually bad). In this case the overall accuracy is down to 78%, which sounds worse. But here’s a look at the confusion matrix percentages:
| Measure | Percentage |
|---|---|
| Death predictions that are accurate | 87 |
| Alive predictions that are accurate | 73 |
| Dead characters predicted to be dead | 66 |
| Alive characters predicted to be alive | 90 |
This is much better at predicting that dead characters are actually dead, but I’ve paid a price: the percentage of correct alive predictions has gone down.
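The down-sampling step itself is short. Here’s a sketch, assuming an aggregated frame like the one built earlier with a boolean is_dead column:

```python
import pandas as pd

# Keep all dead characters and sample an equal number of alive ones,
# then shuffle. The function and seed are illustrative choices.
def downsample(aggregated, seed=0):
    dead = aggregated[aggregated['is_dead']]
    alive = aggregated[~aggregated['is_dead']].sample(
        n=len(dead), random_state=seed)
    return pd.concat([dead, alive]).sample(frac=1, random_state=seed)

# Toy frame: 5 alive, 2 dead -> balanced frame of 4 rows.
toy = pd.DataFrame({
    'character': list('ABCDEFG'),
    'is_dead': [True, True, False, False, False, False, False],
})
print(downsample(toy)['is_dead'].value_counts())
```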
So can I predict who might be more likely to die next season? One way is to look at characters who are currently alive but are predicted to be dead. Actually, I can do a bit better. One thing I glossed over is that when the random forest votes on whether a character should be alive or dead, I can look not just at which side wins but at how strong the vote is. Characters where more than 99% of the trees voted for dead, but who are in fact alive, might be good candidates for dying next season. Some characters to watch are: Daario Naharis, Gregor Clegane, Meera Reed, Melisandre, Tormund Giantsbane and Podrick Payne.
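Those vote strengths come from predict_proba. Here’s a sketch on synthetic data of pulling out the ‘living’ rows whose trees overwhelmingly vote for dead (the data, threshold and variable names are illustrative, not the real analysis):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the aggregated features; in the real analysis
# the classifier and feature columns come from the training step above.
rng = np.random.RandomState(42)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # 1 = 'dead'

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Column 1 of predict_proba is the (averaged) vote share for 'dead'.
dead_votes = clf.predict_proba(X)[:, 1]

# Rows labelled alive (y == 0) but with near-unanimous 'dead' votes
# are the interesting candidates.
candidates = np.where((y == 0) & (dead_votes > 0.99))[0]
print(candidates)
```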
Where to Next?
There’s a lot more that can be done, and more information to extract. What if I wanted to predict when a character would die? Luckily there are other approaches I can use, and I’ll update you on them shortly!