Who dies in Game of Thrones Season 7 and why is it relevant to your business?

In this blog I’m going to set out how I’ve used data science to predict the deaths of characters in the upcoming Game of Thrones TV series. I can confirm, hand on heart, that I have no insider knowledge; using a very simple model, my predictions are that these characters WILL DIE:

  • Daario Naharis
  • Gregor Clegane
  • Meera Reed
  • Melisandre
  • Tormund Giantsbane
  • Podrick Payne

So although I’ve had fun doing this, you may well ask: how is this relevant to data science in my business?

Simply put, data science is about using scientific thinking and computing power to make the best use of your data. There are many different approaches, depending on what you need. In this Game of Thrones example I’ve used machine learning techniques, making predictions about which characters will die based on a very simple classification model. Despite its simplicity, it achieves a reasonable degree of accuracy; in this case 75%. It is based on the concept of supervised learning: it takes a set of example data, each entry with a category attached, and uses that data to infer, or learn, rules that let us categorise other data.

This is just one technique a data scientist may use, and here’s an example of how it could be applied to a business-related problem. If a grocery shop had demographic information about some of its customers, it could use their shopping behaviour to infer demographics about other customers.

Some of the rules you could use for this would be very simple: how can you tell which customers own cats? They buy cat food.

What about other categories? There may not be simple rules in those cases, but maybe we can devise a complex series of rules, such as a decision tree, that allows us to categorise other shoppers.

For example, if a customer buys at least 2 pints of milk, buys a particular brand of sliced bread, and so forth, then we categorise them as having teenage sons. (I used a more sophisticated version of this sort of model in the Game of Thrones predictions.) A human would struggle to go through the data and discover those rules, but it is the perfect job for a computer.

If you want to find out more about the mechanics of this, I’ve written up another blog that sets out the technical aspects.

Game of Thrones and Data Science: Cleaning the Dataset and a Simple Classifier

This blog post uses data science to predict who is going to die in the new Game of Thrones Season 7. In this entry I’ll discuss building a simple random forest classifier, and then use it to predict whether or not a character should be labeled as dead. The prediction is based on the assumption that living characters labeled as dead are the most likely to die next. Based on this simple model, my first set of predictions of which characters will die in season 7 are:

  • Daario Naharis
  • Gregor Clegane
  • Meera Reed
  • Melisandre
  • Tormund Giantsbane
  • Podrick Payne

Before I get there, the data needs a fair bit of work. I’ll summarise the main bits, but leave out a lot of the more tedious work on validating the data and getting it into a usable state. I’ll then have a look at how I can evaluate how well the model performs. Later, I’ll be building some more complicated models and making updated predictions about deaths in season 7.

Cleaning the Dataset

Cleaning the dataset can often be a lot of work, and this example required a fair bit of effort to get the data into a usable form. I’m going to omit most of the details as I think they’re probably not that interesting (but do get in touch if you’re keen to find out more). I will go over some of the last steps, though, as they provide a nice example of using the merge and groupby pandas functions. I’m starting from a stage where the dataset is in a pretty simple state: it lives in two files, deaths.csv and appearances.csv. Both have their data structured in a similar way, with character_name, season_index, and episode_index columns. Each character has a row in the appearances file for each episode they appear in (recording the episode and season numbers). If that character has died, they’ll also have a row in the deaths file with the season and episode number of their death.

A very easy way to view and play with the data is using the pandas module in Python. For example, it’s straightforward to add extra columns. I can combine the episode_index and season_index into a master_episode_index (which takes a value between 1 and 60, as there are 60 episodes in total, split over six seasons of 10 episodes each).
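
As a rough illustration, here’s a minimal sketch of how that column could be computed, assuming the season and episode indices are 1-based and every season has 10 episodes (the file and column names are the ones described above):

    import pandas as pd

    appearances = pd.read_csv('appearances.csv')
    deaths = pd.read_csv('deaths.csv')

    # Number episodes 1-60 across the whole show: episodes 1-10 are season 1,
    # 11-20 are season 2, and so on.
    appearances['master_episode_index'] = (
        (appearances.season_index - 1) * 10 + appearances.episode_index
    )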

Another useful thing to do was to join my two datasets together using pandas’s merge function:

    appearances_and_deaths = pd.merge(
        appearances,
        deaths,
        on='character_name',
        suffixes=['_appearances', '_deaths'],
        how='left')

This works by finding all the matching rows in both datasets which have the same value in the on column (it works like a SQL JOIN). The suffixes option is handy: the suffixes are added to the end of any column names that appear in both datasets, so you can easily tell which dataset a column originated in (e.g. here I have episode_index in both datasets, so I’ll have episode_index_appearances and episode_index_deaths in the merged dataset). The how option lets you specify what to do with non-matching rows. In this case, if a character isn’t dead they won’t be found in the deaths dataset. By default, if a row in one dataset has no match in the other it is left out. I can be a bit less strict by saying that I’ll keep non-matching rows from the dataset on the ‘left’ (so it’s just like a LEFT JOIN in SQL).

Characters who aren’t dead will have NaN values in the _deaths columns. I can make this a bit more obvious by adding a new column called is_dead:

    import numpy as np

    appearances_and_deaths['is_dead'] = \
        ~appearances_and_deaths.episode_index_deaths.map(np.isnan)

(using the Numpy function isnan). In the next section I’ll train a model aimed at predicting this value from the other data.

Finally, I aggregate all this down so that each character has a single row containing a summary of all this data. I’ll count the total number of episodes and seasons that the character appears in, and the episode and season ‘distances’ (how long it was between their first and last appearance, in terms of the number of episodes or seasons). This is slightly fiddly, but not too bad:

    grouped = appearances_and_deaths.groupby([
        'character_name', 'is_dead'
    ])

    # Build one summary row per character: counts of distinct episodes and
    # seasons, plus the spans between first and last appearance.
    rows = []
    for (character, is_dead), indexes in grouped.groups.items():
        sub_data = appearances_and_deaths.loc[indexes, :]
        rows.append([
            character,
            is_dead,
            sub_data.master_episode_index.unique().size,
            sub_data.season_index_appearances.unique().size,
            sub_data.master_episode_index.max()
                - sub_data.master_episode_index.min(),
            sub_data.season_index_appearances.max()
                - sub_data.season_index_appearances.min()
        ])
    aggregated = pd.DataFrame(rows, columns=[
        'character', 'is_dead', 'total_nr_episodes',
        'total_nr_seasons', 'episode_distance',
        'season_distance'
    ])
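
(For reference, roughly the same summary can also be written more compactly with pandas’s named aggregation; this is just an alternative sketch, not the code I used above.)

    aggregated = (
        appearances_and_deaths
        .groupby(['character_name', 'is_dead'], as_index=False)
        .agg(
            total_nr_episodes=('master_episode_index', 'nunique'),
            total_nr_seasons=('season_index_appearances', 'nunique'),
            episode_distance=('master_episode_index', lambda s: s.max() - s.min()),
            season_distance=('season_index_appearances', lambda s: s.max() - s.min()),
        )
        .rename(columns={'character_name': 'character'})
    )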

I now have enough to build a very simple model. I will use the columns total_nr_episodes, total_nr_seasons, episode_distance and season_distance (I’ll call these the features from now on) to try to ‘predict’ is_dead. Before I get on to that, this is how the distributions looked, for both living and dead characters.

[Figure: random forest classifier histograms showing the distributions of the features for living and dead characters]

Random Forest Model

I’ll start with a random forest model. This is based on the decision tree classifier, which is pretty simple to understand. I split the dataset in two by choosing some feature (one of the columns used to build the model) and a critical value: all the rows with values less than the critical value go into one subset, and the rows with values greater than the critical value go into the other. I choose the feature and critical value to split on so that one subset is more likely to have is_dead being true, and the other more likely to have it false. I then take each subset in turn and repeat the process, again and again, until I have lots of small subsets. Hopefully each subset will end up with mostly the same value of is_dead.

The problem with this is that it’s hard to know when to stop splitting. If I split too much I end up with very small subsets which match the training data very well, but might not handle new data accurately (this is called overfitting). This can be overcome by building lots of trees, each one trained on a different, randomly selected subset of the rows. When I want to classify a new bit of data, each tree makes its own prediction and then they all ‘vote’ for which category the new item should be in (i.e. whether is_dead is true or false). Such a set of trees is called a random forest, and they’re pretty easy to use straight out of the box. I’ll use the random forest classifier provided by scikit-learn.

One problem here is that I’m trying to predict whether a character should be labeled as dead, but I already have that in the training set. So I’ll use a simple strategy: I’ll randomly split the dataset into two parts: a training set (containing 70% of the data) and a validation set (containing the remaining 30%). I’ll build the model using the training set, and then see how well it performs using the validation set. The code looks like this:

    import numpy as np
    import sklearn.ensemble
    import sklearn.model_selection

    features = ['total_nr_episodes', 'total_nr_seasons',
                'episode_distance', 'season_distance']
    target = 'is_dead'

    # Hold back 30% of the characters as a validation (test) set.
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
        aggregated[features], aggregated[target],
        test_size=0.3, random_state=np.random.randint(0, 1000))

    # Train the random forest on the remaining 70% and score it on the test set.
    classifier = sklearn.ensemble.RandomForestClassifier()
    classifier.fit(x_train, y_train)
    print(classifier.score(x_test, y_test))

That’s it! The data is split into a training and validation (or test) set using train_test_split, then I train or ‘fit’ the model, and lastly I see how well it has performed using the score method (which gives the fraction of the test set that is predicted correctly). The scores I get here vary depending on the training-test split, but are typically about 85%. Not too bad! But I can dig deeper and see what’s going on.

One thing to look at is the confusion matrix. Basically, I have four types of results: characters who are predicted to be dead and who are actually dead, characters who are predicted to be alive and who are actually alive, characters who are predicted to be dead but who are actually alive, and characters who are predicted to be alive but who are actually dead. Expressing these as percentages, I find that:

Percentage of death predictions that are accurate: 60%
Percentage of alive predictions that are accurate: 89%
Percentage of dead characters predicted to be dead: 44%
Percentage of alive characters predicted to be alive: 94%
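
(For reference, here’s a rough sketch of how percentages like these can be computed with scikit-learn’s confusion_matrix, using the classifier and test split from above; it’s my illustration rather than the exact code behind these numbers, which will in any case vary with the random split.)

    from sklearn.metrics import confusion_matrix

    y_pred = classifier.predict(x_test)
    # Rows are the true labels and columns the predictions, ordered [alive, dead]:
    # [[alive predicted alive, alive predicted dead],
    #  [dead predicted alive,  dead predicted dead ]]
    (tn, fp), (fn, tp) = confusion_matrix(y_test, y_pred, labels=[False, True])

    print('Death predictions that are accurate:  %.0f%%' % (100 * tp / (tp + fp)))
    print('Alive predictions that are accurate:  %.0f%%' % (100 * tn / (tn + fn)))
    print('Dead characters predicted to be dead: %.0f%%' % (100 * tp / (tp + fn)))
    print('Alive characters predicted alive:     %.0f%%' % (100 * tn / (tn + fp)))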

The predictions for characters being alive look very good. But only 44% of dead characters are actually predicted to be dead. That’s worse than choosing by flipping a coin! Can I do better?

Yes! One problem is that my dataset contains way more alive characters (506) than dead characters (143). So it’s likely that the model will be biased towards guessing that a character is alive. One way to fix this is to use down-sampling: I randomly throw away some of the alive characters in the dataset, so that the numbers of alive and dead characters are the same. Sadly this means I throw away some data. In general, there’s a trade-off between having a balanced set (often good) and having less data (usually bad). In this case the overall accuracy drops to 78%, which sounds worse. But here’s a look at the confusion matrix percentages:

Percentage of death predictions that are accurate: 87%
Percentage of alive predictions that are accurate: 73%
Percentage of dead characters predicted to be dead: 66%
Percentage of alive characters predicted to be alive: 90%

This is better at predicting that dead characters are actually dead, but I’ve paid a price in that the percentage of correct alive predictions has gone down.
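
(As an aside, here’s a minimal sketch of one way the down-sampling step could be done with pandas; this is my illustration rather than the exact code used above.)

    # Keep all the dead characters and an equally sized random sample of the
    # living ones, so that the two classes are balanced before training.
    dead = aggregated[aggregated.is_dead]
    alive = aggregated[~aggregated.is_dead].sample(n=len(dead), random_state=0)
    balanced = pd.concat([dead, alive]).sample(frac=1, random_state=0)  # shuffle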

So can I predict who might be more likely to die next season? One way might be to look at characters currently alive who are predicted to be dead. Actually, I can do a bit better. One thing I glossed over is that when the random forest votes on whether a character should be alive or dead, not only can I say which side wins, I can also look at how strong the vote is. Characters where more than 99% of trees voted for dead, but who are in fact alive, might be good candidates for dying next season. Some characters to watch are: Daario Naharis, Gregor Clegane, Meera Reed, Melisandre, Tormund Giantsbane and Podrick Payne.
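
(Here’s a sketch of how those vote strengths can be pulled out with predict_proba, which reports the fraction of trees voting for each class; again, this is illustrative rather than the exact code I used.)

    # predict_proba's columns follow classifier.classes_, which for a boolean
    # target is [False, True], so column 1 is the fraction of trees voting 'dead'.
    proba_dead = classifier.predict_proba(aggregated[features])[:, 1]
    candidates = aggregated[(~aggregated.is_dead) & (proba_dead > 0.99)]
    print(candidates.character.tolist())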

Where to Next?

There’s a lot more that can be done, and more information to extract. What if I wanted to predict when a character will die? Luckily there are other approaches I can use, and I’ll update you on them shortly!
