We are your data science workshop.

Four Questions to Consider Before Launching a Machine Learning Project

In our last post, we focused on the four problems and tasks that data science and machine learning are particularly well suited for. This is a good starting point for where you can focus your efforts. But there are still practical and business realities of your particular use case to consider. Specifically, why are you doing the project in the first place? What is the volume and quality of your data? And is there enough value in solving the problem to devote energy and resource to it?

In this post we take you through the questions we typically work through on data science projects before launching a machine learning project. We start with the why of the project, then move through discovery, implementation and last-mile delivery. This process isn’t always linear – in fact it’s almost never linear. But, in general, these are the steps that we go through.

1. Why?

Why is always a good place to start. Why are we doing this? What value do we think, or hope, it will deliver?

The “why” helps form the vision of what you want to achieve with your machine learning project. It makes it easier to communicate with other stakeholders, such as your CEO, your colleagues, or the data science team working on it. This is also when you sketch out what you think the final product will look like. Of course, this will change over the course of the machine learning project.

To use an example: you are facing a high rate of customer churn. You think there’s some evidence you can predict when this will happen. You think you can use this information to stop customers from leaving. The why is clear here: you want to lose fewer customers. But you should also think about how you’ll operationalise this: who the key stakeholders are (maybe account managers or customer success), where you will need data from (CRM and your product), and what a final tool might look like (could be something as simple as an email notification to the account manager to intervene).

Importantly, this is the time to focus on a minimum viable product (MVP). What is the smallest thing you can build to validate the tool? The example above is one approach: rather than implementing a full solution that integrates with your CRM, a simple email notification is quicker to set up and will get the model’s predictions into the hands of your team. They can give feedback on the accuracy of the model, and on whether the interventions work. You can always embed it further in your processes and improve it down the line, once you know it works.
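To make the email-notification MVP a little more concrete, here is a minimal Python sketch. It assumes a model has already produced a churn probability per customer; the `Customer` fields, the `draft_churn_alerts` helper and the 0.7 threshold are all hypothetical names and values chosen for illustration. It only drafts the message text and leaves the actual sending to whatever mail tool you already use.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    account_manager: str
    churn_probability: float  # hypothetical model output between 0 and 1


def draft_churn_alerts(customers, threshold=0.7):
    """Group at-risk customers by account manager and draft one alert each."""
    at_risk = {}
    for customer in customers:
        if customer.churn_probability >= threshold:
            at_risk.setdefault(customer.account_manager, []).append(customer)

    alerts = {}
    for manager, flagged in at_risk.items():
        # Highest risk first, so the most urgent accounts are at the top.
        flagged.sort(key=lambda c: c.churn_probability, reverse=True)
        lines = [
            f"- {c.name}: {c.churn_probability:.0%} predicted churn risk"
            for c in flagged
        ]
        alerts[manager] = "Customers to check in with:\n" + "\n".join(lines)
    return alerts


# Made-up example data, not real customers.
example_customers = [
    Customer("Acme Ltd", "alice@example.com", 0.85),
    Customer("Globex", "alice@example.com", 0.40),
    Customer("Initech", "bob@example.com", 0.72),
]
example_alerts = draft_churn_alerts(example_customers)
```

The threshold is the main design choice here: it trades off how many alerts account managers receive against how many at-risk customers are missed, and their feedback is what tells you where to set it.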


2. Is there enough good data?

One of the most common questions we get is ‘how much data do we need’. The answer we give is a bit unsatisfying: it depends. Quantity of data is important. But this goes hand in hand with the quality of the data.

First, let’s consider the quantity. Even if the data is good and provides some signal, you may not have enough of it to make meaningful predictions. If you rely solely on data you collect to train models, you solve this by collecting more data. That means more customers or more interactions with them. Until you have enough data, you can’t use machine learning.

There are ways of generating ‘synthetic’ data. This is artificially created data that mimics real data. Learn more on this at Towards Data Science.

Alternatively, if the data you require isn’t organisation specific, there are also public data repositories. Check out Google’s Dataset Search and Kaggle.


All of the above is caveated with whether the data is good to start with. Does the answer we are looking for live in the underlying data? And, is it in the right shape to make use of it for the machine learning project?

For the customer churn prediction model example: say the overwhelmingly greatest indicator of churn is country of residence. If we don’t collect that information from enough customers, then the model’s outputs will be weak.

Additionally, is the data structured so that we can make use of it? If it’s unstructured data, such as free text like customer comments, or if it doesn’t properly link customers with the data you have on them, then you can’t use the data in its current state. That work is likely to be a project in itself. In this case, it comes down to identifying what other information you need.

3. Is it worth it?

Does pursuing a long-term machine learning project, implementing it, and maintaining it justify the resource you put into it? Or are there more straightforward solutions to achieve the same result?

Often, this doesn’t mean that you should stop the project. It might mean that data science isn’t the right tool for the job. It could be a much simpler solution, like better reporting dashboards and data visualisation.

However, pursuing a machine learning project is often well worth the cost and resource that goes into it. Consider our Concentre project as an example, where our tool allowed them to reduce the time spent analysing 200 documents from 8 hours to 30 minutes.

4. What can we do with it?

A good machine learning project – or any project for that matter – considers the last-mile delivery. How will we get this model operational, and in the hands of stakeholders who will get value from it?

Consider our customer churn problem again. Our tool enables the customer success team to intervene and offer some perk or incentive to retain that customer. In the abstract, this sounds great. You build the tool, you hand it over to the customer success team, and then they get on with it.

However, this is where machine learning projects can fail. Embedding these tools into existing workflows or products is the most critical bit. In our example, you might build this tool and deploy it in the cloud somewhere, and the customer success team needs to navigate over there to find a list of soon-to-churn customers. But the customer success team spends all their time in a CRM or email. This additional step of navigating away means they won’t use it as much, which means the organisation isn’t able to get the maximum value from it.

That’s again why in the beginning it is important to define a minimum viable product or outcome – what’s the least amount of time and resource we can spend to prove that this tool solves a problem, and can be operationalised. Avoid spending time on figuring out how you can integrate the tool with your CRM, and instead make it much simpler. You can always extend it down the line.

While we have thus far presented a linear sequence, you should consider this question in the beginning, before diving into the data and value. What does an end-product potentially look like? How do we ensure it gets used? Why are we doing this in the first place and for whom?

But this linear sequence is essential. Inevitably when you dive into this project, the final output will vary from where you originally started. So, this last-mile delivery might look different from what you expected, and you may need to adapt based on technology or data constraints.

Summing Up

If you have a task or problem in front of you that is well suited for data science, then this framework is the next step. Start with why this project is worth pursuing, and what the simplest version of it could look like. Then, dive into the data you have and see whether there is enough of it, and whether it is telling you something meaningful. With this information in mind, make sure continuing to pursue this machine learning project justifies the costs and resources. Finally, make sure you get it operational in a way that your team or customers can make use of. This might look like the initial MVP you sketched out at the beginning. Or it could have drastically changed over the course of the project.

Modelling Eurovision voting

This is a follow-up to our previous blog on Eurovision voting, where we’ll explain how we modelled the objective quality of songs and voting biases between countries in Eurovision, and how we grouped the countries into blocks based on their biases. The source code can be found here. We’ve taken the data from a Kaggle competition, and sourced data for any missing years from Wikipedia.

The hierarchical model

The idea is that the voting outcome depends on both the inherent quality of the entry and the biases countries have for voting for each other. There are lots of possible ways of doing this, but ours is fairly simple and works quite well.

Let \(r_{c_ic_jy_k}\) denote the fraction of people in country \(c_i\) that voted for country \(c_j\) in the year \(y_k\). Note that \(\sum_{j=1}^Nr_{c_ic_jy_k} = 1\), so it is reasonable to model the vector \(\mathbf{r}_{c_iy_k} = (r_{c_ic_1y_k}, \dots, r_{c_ic_Ny_k})\) as following a Dirichlet distribution:

\[\mathbf{r}_{c_iy_k} \sim \operatorname{Dir}(\beta_{c_ic_1y_k}, \dots, \beta_{c_ic_Ny_k}).\]

We choose a model where the parameters \(\beta_{c_ic_jy_k}\) decompose as

\[\beta_{c_ic_jy_k} = \exp\bigl(\theta_{c_jy_k} + \phi_{c_ic_j}\bigr),\]

where \(\theta_{c_jy_k}\) captures the objective quality of the song from country \(c_j\) in the year \(y_k\), and \(\phi_{c_ic_j}\) captures the bias country \(c_i\) has in voting (or not voting) for country \(c_j\). Furthermore, we assume that the \(\theta_{c_jy_k}\)’s and \(\phi_{c_ic_j}\)’s are drawn from an (unknown) normal distribution:

\[\phi_{c_ic_j}, \theta_{c_jy_k}\sim N(\mu, \sigma).\]
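To see how the pieces of the model fit together, here is a minimal NumPy sketch that simulates vote fractions from it. This is only a forward simulation under made-up settings, not our fitting code: the number of countries and years, and the values of \(\mu\) and \(\sigma\), are hypothetical, and the function name `simulate_vote_fractions` is ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_countries, n_years = 5, 3  # hypothetical sizes, for illustration only
mu, sigma = 0.0, 1.0         # hypothetical values of the unknown hyperparameters

# theta[j, k]: quality of the song from country j in year k.
theta = rng.normal(mu, sigma, size=(n_countries, n_years))
# phi[i, j]: bias of voting country i towards country j.
phi = rng.normal(mu, sigma, size=(n_countries, n_countries))


def simulate_vote_fractions(i, k):
    """Draw the vote fractions of country i in year k from the model."""
    beta = np.exp(theta[:, k] + phi[i, :])  # Dirichlet parameters, all positive
    return rng.dirichlet(beta)


r = simulate_vote_fractions(i=0, k=1)
```

Each draw is a vector of fractions over all countries that sums to one, which is exactly the constraint that motivated the Dirichlet distribution in the first place.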

Note that we don’t actually have access to \(r_{c_ic_jy_k}\); we only have data on the number of points each country was awarded. But we make do with what we have and approximate \(r_{c_ic_jy_k}\) by

\[\frac{(\text{points awarded to country \(c_j\) by country \(c_i\) in the year \(y_k\)}) + \alpha}{(\text{total points awarded by country \(c_i\) in the year \(y_k\)}) + N\alpha},\]

where \(\alpha\) is a constant that we set to 0.1.
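The approximation is simple to compute. Here is a minimal NumPy sketch for a single year, assuming the points are stored in a matrix with one row per voting country and one column per receiving country; that layout, the function name `approximate_vote_fractions` and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def approximate_vote_fractions(points, alpha=0.1):
    """Smoothed vote fractions: (points + alpha) / (row total + N * alpha)."""
    points = np.asarray(points, dtype=float)
    n = points.shape[1]  # N, the number of countries that can receive points
    row_totals = points.sum(axis=1, keepdims=True)
    return (points + alpha) / (row_totals + n * alpha)


# Toy example with four receiving countries (not real Eurovision data).
toy_points = [
    [12, 10, 8, 0],
    [0, 12, 10, 8],
]
toy_fractions = approximate_vote_fractions(toy_points)
```

Adding \(\alpha\) to every entry keeps all fractions strictly positive, which the Dirichlet distribution requires, and adding \(N\alpha\) to the denominator keeps each row summing to one.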

It’s hard to say for definite whether this is a reasonable approximation without being able to actually see the voting data, but preferences often follow power laws, and the decreasing sequence of points 12, 10, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0, … at least follows a similar shape:

Voting/Powerlaw fit

It’s not perfect, but hopefully good enough. Note that we do completely miss out on any information about the tail, but we assume that this is mostly noise that doesn’t contribute much anyway.
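For readers who want to try a comparison like this themselves, here is one simple way to fit a power law to the non-zero points: a straight-line fit in log-log space with NumPy. This is a sketch under that assumption and not necessarily how the fit in the figure was produced.

```python
import numpy as np

# The non-zero Eurovision points, ordered from 1st to 10th preference.
points = np.array([12, 10, 8, 7, 6, 5, 4, 3, 2, 1], dtype=float)
ranks = np.arange(1, len(points) + 1)

# A power law points = c * rank**slope is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(ranks), np.log(points), 1)
fitted = np.exp(intercept) * ranks ** slope
```

Comparing `fitted` with `points` rank by rank gives a feel for where the power-law shape matches the points sequence and where it does not.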

Fitting the model

We fit the model using Stan, a programming language that is great for making Bayesian inferences. It uses Markov chain Monte Carlo methods to sample from the posterior distribution of the parameters, that is, the distribution of parameter values that best explains the responses. Stan is very powerful: all we really need to do is specify the model and pass in our data, and Stan does the rest!

As Stan uses Bayesian methods, it returns a sample of the distribution of your parameters, in our case consisting of 16 000 (paired) values for each parameter. In our previous analysis we simply took the means as point estimates for our parameters, but having the distribution lets us talk about the uncertainty of these estimates. For example, the point estimate for the objective quality \(\theta\) of the winner in 2015 (Sweden) is 1.63, and 1.41 for the runner-up (Russia). This, however, doesn’t tell us the full picture. Here’s a plot of the joint distribution of \(\theta\) for these entries:

Distribution of theta for Sweden and Russia in 2015

From this joint distribution we can calculate the probability that Sweden’s entry was objectively better than Russia’s entry as the proportion of samples above the blue line, and this turns out to be about 93%.
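In code, this calculation is just a comparison of paired samples. Here is a minimal NumPy sketch, assuming the blue line is the diagonal where the two values of \(\theta\) are equal. The function name `prob_first_is_better` is ours, and the samples below are synthetic stand-ins rather than our actual posterior draws.

```python
import numpy as np


def prob_first_is_better(theta_a, theta_b):
    """Share of paired posterior samples in which entry A beats entry B."""
    theta_a = np.asarray(theta_a, dtype=float)
    theta_b = np.asarray(theta_b, dtype=float)
    return float(np.mean(theta_a > theta_b))


# Tiny hand-made example: A beats B in three of the four paired draws.
small_example = prob_first_is_better([1.0, 2.0, 3.0, 4.0], [0.0, 3.0, 2.0, 1.0])

# Synthetic stand-in samples (NOT the real Stan output).
rng = np.random.default_rng(0)
fake_theta_a = rng.normal(1.6, 0.1, size=16_000)
fake_theta_b = rng.normal(1.4, 0.1, size=16_000)
fake_probability = prob_first_is_better(fake_theta_a, fake_theta_b)
```

The samples must stay paired, i.e. the n-th value of each parameter must come from the same draw, because the two parameters are generally correlated in the posterior.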

Finding the blocks

Looking at the bias terms \(\phi_{c_ic_j}\), we can attempt to group the countries into groups that are tightly connected, i.e. where there are positive biases within a group and neutral or negative biases between groups. We use a method based on Information Theoretic Co-Clustering, where we choose a clustering in which we lose as little mutual information as possible.

The basic idea can be described as follows: For each vote from country A to country B, take a marble and label it ‘from: country A, to: country B’. Now put all the marbles in a jar, and pick one at random. How much does knowing which country the vote was from tell you about who it was for? For example, Romania and Moldova almost always trade 12 points, so knowing that the vote was from Romania tells me there is a high probability that the vote was for Moldova. Mutual information gives us a quantitative value of this knowledge for the whole jar.

Now if we have clustered the countries into blocks, we can instead label the marbles ‘from: block A, to: block B’. We generally lose information by doing this: as we don’t know which country the vote was actually from, it’s harder to predict which block the vote was for. By finding the clustering that loses the least amount of information, we get the clustering that best represents the biases.
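As a rough illustration of the marble-jar idea, the Python sketch below computes the mutual information of a from/to table and compares how much of it survives under two candidate clusterings. The marble counts and block assignments are made up, and this is not the Information Theoretic Co-Clustering algorithm itself (which searches for the clustering); it only evaluates the quantity that the search tries to preserve.

```python
import numpy as np


def mutual_information(counts):
    """Mutual information (in bits) between the 'from' and 'to' labels."""
    p = counts / counts.sum()
    p_from = p.sum(axis=1, keepdims=True)
    p_to = p.sum(axis=0, keepdims=True)
    nonzero = p > 0  # empty cells contribute nothing
    return float(np.sum(p[nonzero] * np.log2(p[nonzero] / (p_from @ p_to)[nonzero])))


def merge_into_blocks(counts, blocks):
    """Relabel marbles by block: sum the counts over countries in each block."""
    n_blocks = max(blocks) + 1
    member = np.zeros((len(blocks), n_blocks))
    member[np.arange(len(blocks)), blocks] = 1.0
    return member.T @ counts @ member


# Made-up jar: six countries, where 0-2 and 3-5 favour each other.
counts = np.ones((6, 6))
counts[:3, :3] = 8.0
counts[3:, 3:] = 8.0
np.fill_diagonal(counts, 0.0)  # no country votes for itself

info_full = mutual_information(counts)
info_good = mutual_information(merge_into_blocks(counts, [0, 0, 0, 1, 1, 1]))
info_bad = mutual_information(merge_into_blocks(counts, [0, 1, 0, 1, 0, 1]))
```

The clustering that matches the true groups keeps far more of the mutual information than the one that cuts across them, so its information loss `info_full - info_good` is smaller.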

Below is a heatmap showing the probabilities of countries voting for each other, with our identified blocks separated by lines. We do see that the blocks capture voting behaviour fairly well: voting within the blocks is far more likely than between the blocks (with Lithuania and Georgia being a notable exception). Also, we can identify an “ex-Soviet block” within the “Eastern Europe” block, and a “Northern Europe” block within the “Western Europe” block, both highlighted in gray.

Voting blocks in Eurovision

How I Became a Data Scientist: With Jeremy

There are lots of people out there wondering how to transition from whatever field they’re in now into the exciting world of Data Science, so I thought that I’d throw my hat into the ring and describe how I went about becoming a data scientist. Check out this video to see more about Jeremy’s life as a Data Scientist.

Jeremy Mitchell, Data Scientist, Data Mettle

My Life as a Space Physicist

I started my career as a space physicist. My research was all about trying to figure out how astrophysical shock waves work. Basically, shock waves happen when you’ve got objects traveling faster than the speed of sound (or of some other wave), just like the sonic boom in front of a jet or the bow wave in front of a boat. The shock wave’s job is to slow the fluid down. On Earth, that’s easy: there are millions of collisions among the atoms and molecules in the air/water/whatever that can slow the fluid down. In space, there are (almost) no collisions, so where do the shock waves come from?

I don’t want to go into answering that too much here. Instead, I’ll talk a little bit about how I studied it. There’s a big shock wave in the solar wind between the Earth and the Sun (because the solar wind is moving so fast, and the Earth is blocking its way). I used a bunch of different spacecraft, each of which crossed over this shock wave from time to time. This meant that sometimes I could see what was happening on different parts of the shock wave at the same time, and see if there were any large scale effects.

The relevant part here is that I needed to get large data sets from the spacecraft, prepare the data, compare the different datasets, and then use them to build physical models. That’s pretty similar to what I do now! The important part here is all the work needed to carefully collect, understand, and calibrate the data. Once I’d done that, I could use the data to build physical and mathematical models. The final step is validating those models, often meaning the process starts again!

Becoming a Data Scientist

How did this help me become a data scientist? Easy. The process is almost exactly the same: I would gather, clean and understand the data, use the data to build models, and then validate the models. Of course, I was now building statistical or machine learning models for marketing or operations optimisation for a large supermarket, but the process was remarkably similar. And just as much fun!

So what skills helped me make the switch? I’d say these (in no particular order):

  1. Lots of programming experience.
  2. Mathematical and statistical modeling knowledge.
  3. Knowing how to handle large amounts of data.

Of course, there are lots of things that are very different too, so I’d add a fourth point:

  4. Being open to learning the ropes in a very new environment.

Although this can be a challenge, personally I found it one of the best parts of becoming a data scientist, and it was nice to learn that there are so many interesting problems out there to get our teeth stuck into!
