
Four Questions to Consider Before Launching a Machine Learning Project

In our last post, we focused on the four problems and tasks that data science and machine learning are particularly well suited for. This is a good starting point for where you can focus your efforts. But there are still practical and business realities of your particular use case to consider. Specifically: why are you doing the project in the first place? What is the volume and quality of your data? And is there enough value in solving the problem to devote energy and resources to it?

In this post we take you through a typical process we run on data science projects, covering the questions to consider before launching a machine learning project. We start with the why of the project, then move through discovery, implementation and last-mile delivery. This process isn’t always linear – in fact it’s almost never linear. But, in general, these are the steps you go through.

1. Why?

Why is always a good place to start. Why are we doing this? What value do we think, or hope, it will deliver?

The “why” helps form the vision of what you want to achieve with your machine learning project. It makes it easier to communicate with other stakeholders, such as your CEO, your colleagues, or the data science team working on it. This is also when you sketch out what you think the final product will look like. Of course, this will change over the course of the machine learning project.

To use an example: you are facing a high rate of customer churn. You think there’s some evidence you can predict when this will happen, and that you can use this information to stop customers from leaving. The why is clear here: you want to lose fewer customers. But you should also think about how you’ll operationalise this: who the key stakeholders are (maybe account managers or customer success), where you will need data from (your CRM and your product), and what a final tool might look like (it could be something as simple as an email notification prompting the account manager to intervene).

Importantly, this is the time to focus on a minimum viable product (MVP). What is the smallest thing you can build to validate the tool? The example above illustrates the approach: rather than implementing a full solution that integrates with your CRM, a simple email notification is quicker to set up and will get the model’s predictions into the hands of your team. They can give feedback on the accuracy of the model and on whether the interventions work. Once you know it works, you can embed it further in your processes and improve it down the line.

2. Is there enough good data?

One of the most common questions we get is ‘how much data do we need?’. The answer we give is a bit unsatisfying: it depends. Quantity of data is important, but it goes hand in hand with the quality of the data.

First, let’s consider quantity. Even if the data is good and provides some signal, you may not have enough of it to make meaningful predictions. If you rely solely on data you collect yourself to train models, you solve this by collecting more data. That means more customers, or more interactions with them. Until you have enough data, you can’t use machine learning.

There are also ways of generating ‘synthetic’ data, which is artificially created and mimics real data. Learn more on this at Towards Data Science.

Alternatively, if the data you require isn’t organisation-specific, there are also public data repositories. Check out Google’s Dataset Search and Kaggle.

All of the above is caveated by whether the data is good to start with. Does the answer we are looking for live in the underlying data? And is it in the right shape to make use of for the machine learning project?

Take the customer churn prediction example: say the single strongest indicator of churn is country of residence. If we don’t collect that information from enough customers, then the model’s outputs will be weak.

Additionally, is the data structured so that we can make use of it? If it’s unstructured data, such as free-text customer comments, or if it doesn’t properly link customers with the data you have on them, then you can’t use the data in its current state. That work is likely to be a project in itself. In this case, it comes down to identifying what other information you need.
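
To make these checks concrete, here is a minimal sketch of the kind of data audit we mean, using pandas. The file names and columns (customers.csv, product_usage.csv, churned, country, customer_id) are hypothetical stand-ins for whatever your CRM and product actually export.

```python
import pandas as pd

# Hypothetical customer extract from a CRM; names are illustrative.
customers = pd.read_csv("customers.csv")

# 1. Quantity: how many labelled examples (churned vs retained) do we have?
print(customers["churned"].value_counts())

# 2. Quality: how complete is each field we hope to learn from?
# If, say, 'country' is 60% missing, a model leaning on it will be weak.
print(customers.isna().mean().sort_values(ascending=False))

# 3. Structure: can we actually link customers to their product usage?
usage = pd.read_csv("product_usage.csv")
linked = customers.merge(usage, on="customer_id", how="left", indicator=True)
print(linked["_merge"].value_counts())  # 'left_only' rows have no usage data at all
```

If these checks come back poor, that data work is the first project, not the model.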

3. Is it worth it?

Does pursuing a long-term machine learning project, implementing it, and maintaining it justify the resources you put into it? Or are there more straightforward solutions that achieve the same result?

Often, this doesn’t mean that you should stop the project. It might mean that data science isn’t the right tool for the job, and that the answer is a much simpler solution, like better reporting dashboards and data visualisation.

However, pursuing a machine learning project is often well worth the cost and resource that goes into it. Consider our Concentre project as an example, where our tool reduced the time spent analysing 200 documents from 8 hours to 30 minutes.

4. What can we do with it?

A good machine learning project – or any project for that matter – considers the last-mile delivery. How will we get this model operational, and in the hands of stakeholders who will get value from it?

Consider our customer churn problem again. Our tool enables the customer success team to intervene and offer some perk or incentive to retain a customer at risk of leaving. In the abstract, this sounds great. You build the tool, you hand it over to the customer success team, and then they get on with it.

However, this is where machine learning projects can fail. Embedding these tools into existing workflows or products is the most critical part. In our example, you might build this tool and deploy it in the cloud somewhere, so that the customer success team needs to navigate over there to find a list of soon-to-churn customers. But the customer success team spends all their time in a CRM or email. That additional step of navigating away means they won’t use the tool as much, which means the organisation isn’t able to get the maximum value from it.

That is why, again, it is important at the beginning to define a minimum viable product or outcome: what’s the least amount of time and resource we can spend to prove that this tool solves a problem and can be operationalised? Avoid spending time on figuring out how you can integrate the tool with your CRM, and instead make it much simpler. You can always extend it down the line.

While we have thus far presented a linear sequence, you should consider this question in the beginning, before diving into the data and value. What does an end-product potentially look like? How do we ensure it gets used? Why are we doing this in the first place and for whom?

But the sequence isn’t set in stone. Inevitably, when you dive into the project, the final output will vary from where you originally started. So this last-mile delivery might look different from what you expected, and you may need to adapt based on technology or data constraints.

Summing Up

If you have a task or problem in front of you that is well suited to data science, then this framework is the next step. Start with why the project is worth pursuing, and what the simplest version of it could look like. Then dive into the data you have and see if there is enough of it, and whether it is telling you something meaningful. With this information in mind, make sure continuing to pursue the machine learning project justifies the costs and resources. Finally, make sure you get it operational in a way that your team or customers can make use of. The result might look like the initial MVP you sketched out at the beginning, or it could have drastically changed over the course of the project.

A Framework for the Use of Data Science in Your Company: Four Approaches

Data science, machine learning and AI are buzzy topics. Many leaders within companies are excited to explore these technologies to gain a competitive advantage, deliver more value to their users, or avoid getting left behind. In our experience, managers often know they want to make better use of data, or know they have a data problem. What is less clear is specifically how data science can be used in their organisation.

To help businesses think about the types of problems or tasks they might approach using data science, we’ve developed a general framework for how to look at opportunities to use it. It is meant to jumpstart a brainstorm about where data science might be applied within an organisation, and to ensure it delivers value.

1. Repetitive and time-consuming tasks

As a general rule, if you can teach a person, you can teach a machine. Machine learning and data science can be very well suited for repetitive and time-consuming tasks. This generally involves classifying large volumes of information by subject matter.

An example of this problem is our project with Concentre. They have a large volume of documents to check, and previously this was a completely human-powered endeavour. We effectively trained a machine to do this task for them, which dramatically cut the time they needed to spend checking these documents.
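
This isn’t Concentre’s actual pipeline, but as a flavour of what ‘teaching a machine’ to classify documents can look like, here is a minimal scikit-learn sketch. The documents and labels are toy stand-ins for a real, human-labelled training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: documents a person has already labelled by type.
documents = [
    "invoice for consulting services rendered in march",
    "monthly invoice attached for your records",
    "signed contract for the new engagement",
    "contract amendment and updated terms",
]
labels = ["invoice", "invoice", "contract", "contract"]

# Turn the text into word-frequency features, then fit a simple classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(documents, labels)

# The trained model can now label unseen documents automatically.
print(model.predict(["please find the invoice for april"]))  # ['invoice']
```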

Another great example of this comes from a cucumber farmer in Japan. Cucumbers need to be sorted by size and curvature once picked. This task fell to people. The cucumber farmer realised this is a perfect problem for image recognition and deep learning. So, he built a tool that took images of the cucumbers and automatically sorted them into groups based on those factors.

2. Error-prone and/or highly permutative tasks

Tightly related to the above, these are tasks that humans often make mistakes when doing. This might be due to the amount of data a human needs to process or consider to complete the task. Or, as above, it could be because the repetition leads to boredom and mistakes.

A great example of this from our past work is a major consultancy. Their large team of consultants had specific skills and experience and were located across the world. Their clients needed certain skills, over specific time periods, in specific locations. Assigning the right teams to the right jobs at the right times is an enormous undertaking, with countless factors to consider. You have to work through many permutations, with each change having knock-on effects for other assignments.

If you’ve planned a wedding, you might have experienced a similar problem when setting up your table assignments. You want to have the right people together at tables, and there are always pairs of people you can’t seat together: Uncle Joe doesn’t get along with Aunt Emma. Each change you make has a cascading effect on other tables, and it ends up taking far more time than you thought. Then you share it with the groom’s mother-in-law, she doesn’t like her table, and you start over. Machine learning is a perfect solution here. You could tag guests by personality type and group them together, keeping in mind the pairs that need to be kept apart, as in the sketch below.
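
As a toy illustration of how a machine chews through those permutations, here is a minimal local-search sketch in Python. The guests and constraints are invented, and a production system would use a proper optimisation or constraint-solving library, but the core idea of trying swaps and keeping improvements is the same.

```python
import random

# Invented guest list; 'forbidden' holds pairs who must not share a table.
guests = ["Joe", "Emma", "Sam", "Priya", "Liam", "Mia", "Noah", "Ava"]
forbidden = [("Joe", "Emma"), ("Sam", "Mia")]
num_tables = 2

def conflicts(assignment):
    # Count forbidden pairs currently seated at the same table.
    return sum(1 for a, b in forbidden if assignment[a] == assignment[b])

# Start from an arbitrary even split, then repeatedly try swapping two
# guests, undoing any swap that makes the seating worse.
assignment = {guest: i % num_tables for i, guest in enumerate(guests)}
for _ in range(1000):
    a, b = random.sample(guests, 2)
    before = conflicts(assignment)
    assignment[a], assignment[b] = assignment[b], assignment[a]
    if conflicts(assignment) > before:
        assignment[a], assignment[b] = assignment[b], assignment[a]  # undo

print(assignment, "conflicts:", conflicts(assignment))
```

A computer can try thousands of these rearrangements per second, which is exactly why the consultancy’s assignment problem suits a machine better than a spreadsheet and a headache.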

3. There is clarity around the problem

Don’t do data science for the sake of doing data science. There needs to be a complete understanding of the problem you are trying to solve. For the cucumber farmer in Japan, they needed to classify cucumbers by size and shape. For Concentre, they needed to validate files against the metadata.

Starting with a solution or technology you want to use is the wrong way around. You can’t just throw algorithms at something and hope it returns something useful for you. Data science, machine learning and AI are not magic solutions.

Wanting to use data science to improve marketing or sales is not enough information to start with. You should avoid top-down approaches like this, and instead work up from the problem to see if data science is the right tool for the job. An example problem for sales might be not having a good way of predicting which leads will convert into customers, which leads to inefficiency: salespeople waste time on the wrong leads when they could be nurturing the right ones. Now, this is an interesting problem to solve, and potentially one for data science. Which brings us to the next part of the framework.

4. It’s clear why solving the problem is valuable

This is also related to the previous point. Is the problem a big enough headache that it is worth spending time solving? Does it deliver enough value to users, or enough savings or increased revenue, to embark on a potentially uncertain and time-consuming journey? This needs to be clear and obvious to all stakeholders from the beginning, because it’s not just about kicking off the project, but also about ensuring everyone is bought into the final delivery of the data science product. For example, a product manager will need to see enough value in the proposed solution to prioritise getting it onto a roadmap. A sales manager will need to be clear on why it improves outcomes enough to ask their sales team to change their workflows to use it.

Using our previous sales example, a good way to quantify this is to calculate how much time is spent trying to qualify leads, and how effective sales teams currently are at it. Then you can try to understand how data science might improve accuracy and/or cut down on the team’s time spent researching leads.
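
To make the sales example concrete, here is a minimal lead-scoring sketch. Every feature and number below is hypothetical; the point is only that a model trained on past leads can rank new ones by their probability of converting, so salespeople spend their time on the most promising.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical leads: behaviour features plus whether they converted.
leads = pd.DataFrame({
    "pages_viewed":   [2, 15, 1, 22, 8, 3, 30, 5],
    "emails_opened":  [0, 4, 1, 6, 2, 0, 7, 1],
    "demo_requested": [0, 1, 0, 1, 0, 0, 1, 0],
    "converted":      [0, 1, 0, 1, 1, 0, 1, 0],
})

X, y = leads.drop(columns="converted"), leads["converted"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Score held-out leads: higher probabilities go to the top of the call list.
print(model.predict_proba(X_test)[:, 1])
```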

For Concentre, this was clear from the outset. They had consultants doing these checks manually for 8 hours per day, in which time a consultant could complete 20 checks and take the necessary next steps to rectify errors. With our solution, they were able to do the same amount of work in 30 minutes, because our tool took away the time-consuming portion of the task: checking the metadata against the file name. Now they can utilise those consultants on other projects for their clients.

Conclusion

This is meant to be a framework to help you start thinking about how you might approach solving problems in your organisation, and in particular which of those problems are well suited to data science. We are here as a resource to help you make use of this framework, so do reach out if you have questions or want to run through specifics for your organisation.

Data Science Terms: A Data Mettle Guide

As someone who joined a company that does data science for a living, I quickly realised that what I knew about the field was only the tip of the iceberg. There was and is a huge learning curve for me to overcome to understand the key data science terms and concepts. So, I’ve decided to take you on my learning journey as I learn and understand these concepts. We’ll keep this guide updated as new terms pop up, so check back in regularly or bookmark us.

Artificial Intelligence (AI)

In general, it is the ability of computers to mimic humans’ ability to perform tasks that require a complex set of thought processes.

Artificial intelligence is a bit of a nebulous term, and what constitutes AI evolves over time. In fact, there is a phenomenon called the “AI effect”: once a machine can reliably solve a particular problem, that problem is no longer considered to require intelligence and is thus removed from the definition of AI. Tesler’s theorem states that “AI is whatever hasn’t been done yet.”

Self-driving cars are a perfect example of artificial intelligence in action. The systems developed for autonomous vehicles must mimic the hundreds of decisions humans must make every minute while they drive.

AI is an umbrella term for many concepts within data science. For instance, machine learning (ML) is considered a subset of AI.

Clustering

Also called cluster analysis, clustering is an unsupervised learning technique that identifies patterns within a set of data based on common characteristics. An example would be to take poll data and use it to define personas and typologies of responses to the polling questions.

Watch our video on clustering and segmentation to learn more.

Data engineering

The role of data engineering is to make data accessible and usable within an organisation to whoever needs it. It might be for data scientists to train and run models, or analysts or marketers to make decisions on historical data/performance.

Read our blog on the difference between data science and data engineering.

Decision Tree

In the context of machine learning and data science, a decision tree is a series of yes/no questions asked of data to arrive at a prediction.
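
As a small, hedged illustration (the features and labels below are invented), scikit-learn can both fit a decision tree and print the yes/no questions it has learned:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [logins_per_month, support_tickets] -> 1 = churned, 0 = retained
X = [[1, 0], [2, 1], [20, 5], [25, 6], [3, 0], [18, 4]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Print the learned series of yes/no questions.
print(export_text(tree, feature_names=["logins_per_month", "support_tickets"]))
```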

Deep Learning

A subset of machine learning based on deep neural networks. Neural networks have been around for over 50 years, but it wasn’t until recently, when we gained enough computing power to use networks with many layers (‘deep’ networks), that they became ubiquitous and genuinely effective.

K-means Clustering

A type of clustering where the number of clusters the data is split into is pre-defined. The algorithm then finds a representative ‘centre’ for each cluster, and a point is assigned to the cluster whose centre it is closest to.

An example is separating fraudulent from non-fraudulent activity, where you can use prior data to identify the characteristics of activity you know to be fraudulent and of activity that was legitimate. You could then classify future activity by how closely it resembles the centre of each group.
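
Here is a minimal sketch of that idea with scikit-learn; the transaction numbers are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented transaction features: [amount, transactions per hour].
activity = np.array([
    [20, 1], [35, 2], [25, 1],        # looks like routine activity
    [900, 40], [950, 38], [880, 45],  # looks like unusual bursts
])

# Ask for k=2 clusters; the algorithm finds a 'centre' for each.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(activity)
print(kmeans.labels_)           # which cluster each point fell into
print(kmeans.cluster_centers_)  # the representative centre of each cluster

# A new transaction is assigned to the cluster whose centre it is closest to.
print(kmeans.predict([[905, 42]]))
```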

Machine Learning

The ability of computer systems to automatically learn and improve from experience without specific oversight or input by a human. Machine learning models access data on an ongoing basis and use it to learn for themselves, based on a goal pre-defined by the model’s developer.

Matching Data

When you are working with multiple data sets, there are often issues when matching records from two different datasets. For example, you might have customer data in several databases. There might be a unique identifier, such as an email address, that links records together, but this is not always the case. A way to overcome this is to build a statistical model that connects the records based on the points of commonality you do have in the data.

Watch our video on matching data sets to learn more.
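
A full record-linkage model is beyond a glossary entry, but here is a minimal sketch of the underlying idea using Python’s standard-library difflib: score how similar the corresponding fields are, combine the scores, and treat high-scoring pairs as the same entity. The records and weights are illustrative.

```python
from difflib import SequenceMatcher

# Records for (possibly) the same customer in two systems, with no shared ID.
crm_record     = {"name": "Jonathan Smith", "company": "Acme Ltd"}
billing_record = {"name": "Jon Smith",      "company": "ACME Limited"}

def similarity(a, b):
    # Ratio of matching characters between two strings, from 0.0 to 1.0.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Combine field-level similarities into a single match score.
score = (0.5 * similarity(crm_record["name"], billing_record["name"])
         + 0.5 * similarity(crm_record["company"], billing_record["company"]))

print(round(score, 2))  # pairs above a chosen threshold count as a match
```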

Neural Networks

Layers of algorithms inspired by biological neural systems, where data is passed through ‘synapses’ that trigger information flowing to the next layer of algorithms. For example, in image recognition, the neural network is fed pictures of an object to train on, such as a cat, and learns the characteristics that make up that object (fur, whiskers, etc.) so it can distinguish it from other objects.

NumPy

A software library for Python; the name stands for Numerical Python. NumPy supports large multidimensional arrays and matrices and lets users perform fast mathematical operations on them.
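
A brief taste of what that looks like (the numbers are arbitrary):

```python
import numpy as np

# A 2-D array (matrix) of numbers.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Vectorised operations apply to every element at once, with no Python loops.
print(matrix * 10)           # elementwise multiplication
print(matrix.mean(axis=0))   # column means: [2.5 3.5 4.5]
print(matrix @ matrix.T)     # matrix multiplication
```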

Optimisation

The process of achieving optimal results by reducing some element of risk or waste. An excellent example would be scheduling delivery routes for a large grocery delivery company. The goal is to deliver the most groceries with the fewest drivers in the shortest amount of time. However, there are countless complexities to account for: the drivers, the various routes they can take, traffic, and which drivers are scheduled for the day or on holiday. Computers are much more adept at this than humans because they can rapidly evaluate the enormous number of permutations required to find the optimal output.

Watch our video on optimisation to learn more.

pandas

A software library for Python, available for free online. It enables users to efficiently explore, clean and process their data before jumping into running machine learning applications on it.
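
A small sketch of typical exploration and cleaning steps, on an invented table:

```python
import pandas as pd

# A small table of customers (a DataFrame).
df = pd.DataFrame({
    "customer": ["Ana", "Ben", "Cara", "Dev"],
    "country":  ["PT", "UK", None, "UK"],
    "spend":    [120, 80, 200, None],
})

print(df.describe())                                    # summary statistics
df["spend"] = df["spend"].fillna(df["spend"].median())  # fill a missing value
df = df.dropna(subset=["country"])                      # drop rows missing a country
print(df.groupby("country")["spend"].mean())            # average spend per country
```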

Predictive Models

Models that use historical data to forecast future outcomes, such as predicting which customers are likely to churn.

Watch our video on predictive models to learn more.

Python

A programming language with an extensive library of third party software packages and tools available to extend its functionality, such as NumPy and pandas, which are commonly used by data scientists.

R

A programming language primarily used for statistical computing. It has a history of being used by academic statisticians, and many cutting edge statistical models are first implemented in R.

Random Forest

Also called random decision forests, an algorithm that relies on many decision trees running concurrently to arrive at the best possible prediction; essentially, it crowdsources a prediction from many decision trees. Each tree trains on a randomly picked subset of the data, allowing the trees to be built independently of each other. The idea is to avoid overfitting the model to the training data.
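
Continuing the invented churn data from the decision tree entry above, a sketch of the ‘voting’ in action:

```python
from sklearn.ensemble import RandomForestClassifier

# Same toy data as the decision tree example: 1 = churned, 0 = retained.
X = [[1, 0], [2, 1], [20, 5], [25, 6], [3, 0], [18, 4]]
y = [1, 1, 0, 0, 1, 0]

# 100 trees, each trained on a random subset of the data,
# then 'voting' on the final prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[2, 0]]))        # the majority vote
print(forest.predict_proba([[2, 0]]))  # the share of trees voting each way
```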

Segmentation

The practice of dividing customers or other data points into groups that share common characteristics; closely related to clustering.

Watch our video on clustering and segmentation to learn more.

Supervised vs. Unsupervised Learning

Supervised learning is where models are trained on labelled data, whereas unsupervised learning has no existing labels. For example, in a supervised setting, we might train a model on a collection of images labelled as car, bike or horse so it can distinguish between these objects. In an unsupervised setting, we would instead start with unlabelled images and let the model group them by common characteristics.

SQL

Short for Structured Query Language, SQL is a programming language for managing data held in a relational database. It allows users to query and update records in a database.
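
For a flavour, here is a tiny sketch that runs SQL from Python against a throwaway in-memory SQLite database; the table and values are invented:

```python
import sqlite3

# An in-memory database that disappears when the script ends.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.execute("INSERT INTO customers (name, country) VALUES ('Ana', 'PT'), ('Ben', 'UK')")

# Query records with SQL.
for row in conn.execute("SELECT name FROM customers WHERE country = 'UK'"):
    print(row)  # ('Ben',)

# Update a record with SQL.
conn.execute("UPDATE customers SET country = 'PT' WHERE name = 'Ben'")
```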

TensorFlow

Free, open-source software commonly used for machine learning applications such as neural networks. The Google Brain team developed TensorFlow for internal use, then made it open source in 2015.
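
As a minimal, hedged sketch of TensorFlow in use, here is the smallest possible Keras model, a single neuron, trained to recover the line y = 2x - 1 from five example points. The data and learning rate are arbitrary choices for illustration.

```python
import tensorflow as tf

# Training data for the line y = 2x - 1.
xs = tf.constant([[0.0], [1.0], [2.0], [3.0], [4.0]])
ys = tf.constant([[-1.0], [1.0], [3.0], [5.0], [7.0]])

# The smallest possible network: one neuron learning y = w*x + b.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1), loss="mse")
model.fit(xs, ys, epochs=500, verbose=0)

print(model.predict(tf.constant([[10.0]])))  # should land close to 19
```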
