Four Questions to Consider Before Launching a Machine Learning Project
In our last post, we focused on the four problems and tasks that data science and machine learning are particularly well suited for. This is a good starting point for where you can focus your efforts. But there are still practical and business realities of your particular use case to consider. Specifically: why are you doing the project in the first place? What are the volume and quality of your data? And is there enough value in solving the problem to devote energy and resources to it?
In this post we take you through the process we typically run on data science projects: the questions to consider before launching a machine learning project. We start with the why of the project, then move through discovery, implementation and last-mile delivery. This process isn’t always linear – in fact it’s almost never linear. But, in general, these are the steps you go through.
1. Why?
Why is always a good place to start. Why are we doing this? What value do we think, or hope, it will deliver?
The “why” helps form the vision of what you want to achieve with your machine learning project. It makes it easier to communicate with other stakeholders, such as your CEO, your colleagues, or the data science team working on it. This is also when you sketch out what you think the final product will look like. Of course, this will change over the course of the machine learning project.
To use an example: you are facing a high rate of customer churn. You have some evidence that you can predict when this will happen, and you think you can use this information to stop customers from leaving. The why is clear here: you want to lose fewer customers. But you should also think about how you’ll operationalise this: who the key stakeholders are (maybe account managers or customer success), where you will need data from (your CRM and your product), and what a final tool might look like (it could be something as simple as an email notification prompting the account manager to intervene).
Importantly, this is the time to focus on a minimum viable product (MVP). What is the smallest thing you can build to validate the tool? The example above illustrates this approach: rather than implementing a full solution that integrates with your CRM, a simple email notification is quicker to set up and gets the model’s predictions into the hands of your team. They can give feedback on the accuracy of the model and on whether the interventions work. Once you know it works, you can embed it further into your processes and improve it down the line.
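To make this concrete, here is a minimal sketch of what that email-notification MVP could look like in Python. Everything here is an assumption for illustration: the `customers.csv` export, the saved `churn_model.joblib`, the 70% risk threshold and the email addresses are hypothetical, not a prescribed implementation.

```python
# Minimal MVP sketch: email churn-risk predictions to account managers.
# File names, the risk threshold and addresses are illustrative assumptions.
import smtplib
from email.message import EmailMessage

import joblib
import pandas as pd

customers = pd.read_csv("customers.csv")   # hypothetical CRM export
model = joblib.load("churn_model.joblib")  # hypothetical pre-trained model

# Score every customer: predicted probability of the 'churn' class.
features = customers.drop(columns=["customer_id"])
customers["churn_risk"] = model.predict_proba(features)[:, 1]

# Flag anyone above an (illustrative) 70% risk threshold.
at_risk = customers.loc[customers["churn_risk"] > 0.7,
                        ["customer_id", "churn_risk"]]

# Send the list as a plain-text email to the account managers.
msg = EmailMessage()
msg["Subject"] = f"{len(at_risk)} customers at risk of churning"
msg["From"] = "churn-model@example.com"
msg["To"] = "account-managers@example.com"
msg.set_content(at_risk.to_string(index=False))

with smtplib.SMTP("localhost") as server:  # assumes a local SMTP relay
    server.send_message(msg)
```

The point isn’t the specifics, it’s the scale: a scheduled script like this is a day’s work, which is a cheap way to find out whether the predictions are worth acting on.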
2. Is there enough good data?
One of the most common questions we get is ‘how much data do we need?’. The answer we give is a bit unsatisfying: it depends. Quantity of data is important, but it goes hand in hand with the quality of the data.
First, let’s consider quantity. Even if the data is good and provides some signal, you may not have enough of it to make meaningful predictions. If you rely solely on data you collect yourself to train models, you solve this by collecting more data. That means more customers or more interactions with them. Until you have enough data, you can’t use machine learning.
There are also ways of generating ‘synthetic’ data: data that is artificially created and mimics real data. Learn more on this at Towards Data Science.
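As a quick illustration of the mechanism (not a substitute for domain-realistic synthetic data), scikit-learn can generate a labelled synthetic dataset in a few lines; the parameters below are arbitrary assumptions:

```python
# Sketch: generate a labelled synthetic dataset with scikit-learn.
from sklearn.datasets import make_classification

# 1,000 synthetic 'customers' with 10 features, 2 of them carrying signal,
# and a 90/10 class split to mimic a typically imbalanced churn problem.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=2,
    weights=[0.9, 0.1],
    random_state=42,
)
print(X.shape, y.mean())  # (1000, 10) and roughly the positive-class rate
```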
Alternatively, if the data you require isn’t organisation specific, there are also public data repositories. Check out Google’s Dataset Search and Kaggle.
All of the above comes with a caveat: is the data good to start with? Does the answer we are looking for live in the underlying data? And is it in the right shape for us to make use of it in the machine learning project?
For the customer churn prediction model example: say the single strongest indicator of churn is country of residence. If we don’t collect that information from enough customers, then the model’s outputs will be weak.
Additionally, is the data structured so that we can make use of it? If it’s unstructured, such as free-text customer comments, or if it doesn’t properly link customers with the data you hold on them, then you can’t use it in its current state. Getting it into shape is likely to be a project in itself. In that case, it comes down to identifying what other information you need.
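To ground what ‘the right shape’ means for the churn example, here is a minimal sketch of a baseline model on properly linked, structured customer data. The CSV file and the column names (`country`, `tenure_months`, `monthly_spend`, `churned`) are hypothetical, and logistic regression is just one reasonable baseline:

```python
# Sketch: a baseline churn classifier on structured customer data.
# All column names and the CSV file are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("customers.csv")  # hypothetical, properly linked export
X = df[["country", "tenure_months", "monthly_spend"]]
y = df["churned"]  # 1 if the customer churned, 0 otherwise

# Fill in missing countries, then one-hot encode; impute numeric gaps.
country_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("country", country_pipe, ["country"]),
    ("numeric", SimpleImputer(strategy="median"),
     ["tenure_months", "monthly_spend"]),
])
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Note the imputation step: if country of residence really is the strongest signal and it’s missing for many customers, no amount of imputation will rescue the model. That’s the quality question in practice.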
3. Is it worth it?
Does pursuing a long-term machine learning project, implementing it, and maintaining it justify the resources you put into it? Or are there more straightforward solutions that achieve the same result?
Often, this doesn’t mean you should stop the project altogether. It might just mean that data science isn’t the right tool for the job, and that the answer is a much simpler solution, like better reporting dashboards and data visualisation.
Often, however, pursuing a machine learning project is well worth the cost and resources that go into it. Consider our Concentre project as an example, where our tool reduced the time spent analysing 200 documents from 8 hours to 30 minutes.
4. What can we do with it?
A good machine learning project – or any project for that matter – considers the last-mile delivery. How will we get this model operational, and in the hands of stakeholders who will get value from it?
Consider our customer churn problem again. Our tool enables the customer success team to intervene and offer some perk or incentive to retain that customer. In the abstract, this sounds great. You build the tool, you hand it over to the customer success team, and they get on with it.
However, this is where machine learning projects can fail. Embedding these tools into existing workflows or products is the most critical part. In our example, you might build this tool and deploy it in the cloud somewhere, so that the customer success team has to navigate over there to find a list of soon-to-churn customers. But the customer success team spends all their time in a CRM or email, and that additional step of navigating away means they won’t use the tool as much, which means the organisation won’t get the maximum value from it.
That’s again why it is important to define a minimum viable product or outcome at the beginning: what is the least amount of time and resource we can spend to prove that this tool solves a problem and can be operationalised? Avoid spending time upfront on figuring out how to integrate the tool with your CRM; start with something much simpler. You can always extend it down the line.
While we have thus far presented a linear sequence, you should consider this question at the beginning, before diving into the data and the value. What might an end product look like? How do we ensure it gets used? Why are we doing this in the first place, and for whom?
This is another reason the process is rarely strictly linear. Inevitably, once you dive into the project, the final output will differ from where you originally started. The last-mile delivery might look different from what you expected, and you may need to adapt based on technology or data constraints.
Summing Up
If you have a task or problem in front of you that is well suited for data science, then this framework is the next step. Start with why the project is worth pursuing and what the simplest version of it could look like. Then dive into the data you have and see whether there is enough of it and whether it is telling you something meaningful. With this information in mind, make sure that continuing to pursue the machine learning project justifies the costs and resources. Finally, make sure you get it operational in a way that your team or customers can make use of. The result might look like the initial MVP you sketched out at the beginning, or it could have changed drastically over the course of the project.