How much data do I need to do machine learning?
If you were to make a cooking analogy for data science, the data scientist would be the chef. The models and algorithms are the recipes. That makes data the raw ingredients.
When a chef is ordering up raw ingredients, there are two main things they need to consider for a great dining experience. The first is that they have enough to feed everyone. The second is that the ingredients are of high quality. Three-star Michelin meals are built around the best ingredients, not just the right quantities.
Ok, what does this have to do with machine learning?
Thanks for humouring me to this point. The reason I bring this up is that so often, when approaching a new project, the question we get is “how much data do I need to do machine learning?” That is, of course, an important question. If you have 10 rows of data, that is not enough. If you have 10k, that probably is. However, as with cooking, it’s not just about quantity. Your ability to do machine learning that matters is tied directly to the quality of the data you are putting in.
Garbage in, Garbage out
No matter how good the chef, starting with tasteless strawberries, hothouse tomatoes, and sad lettuce will lead to a bad dish. Similarly, incomplete, unclean, or irrelevant data leads to bad outputs from a machine learning model.
It’s the old computer science adage: garbage in, garbage out. Bad data fed into even the best machine learning model leads to wrong predictions and outputs.
What is bad data?
There are a few main things we mean by ‘bad’ data. The data:
- Doesn’t tell you what you need it to
- Is missing, fragmented or not readily accessible
- Is unstructured
The data doesn’t tell you what you need it to
If your recipe is for chocolate cake but all you have is strawberries, well, then I have some bad news for you.
Similarly, if you are using a machine learning model to make a prediction, but the information required to make that prediction isn’t present in the data, you are not going to end up with a quality prediction. Say, for example, you run an e-commerce site and you want to show exactly the right product to whoever lands on your website. You might know things about your users like the country they are in, what browser they use, and what device they are on. But perhaps none of that information correlates with what product they are likely to buy.
Then it becomes a matter of identifying what data source does correlate with their product preferences. This can come from some intuition, or trial and error. And if you aren’t collecting this data at the necessary scale, then you need to find ways to do so.
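To make that concrete, here’s a minimal sketch of checking whether your features carry any signal about the target, using mutual information. The column names (country, browser, device, product_bought) and the file name are hypothetical stand-ins for the e-commerce example:

```python
# A minimal sketch: do these features tell us anything about what a
# user buys? Column and file names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("users.csv")  # hypothetical export of user data

X = pd.get_dummies(df[["country", "browser", "device"]])
y = df["product_bought"]

# Mutual information is ~0 for features that tell you nothing about
# the target; higher values suggest at least some predictive signal.
scores = pd.Series(
    mutual_info_classif(X, y, discrete_features=True), index=X.columns
)
print(scores.sort_values(ascending=False))
```

If every score sits near zero, that’s a hint that – like the strawberries – these ingredients simply can’t make this particular dish, and you need to go shopping for different data.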
The data is missing, fragmented or not readily accessible
Everyone’s been there: you are about to make waffles on a lazy Sunday morning. You pull up your recipe, grab all the ingredients, and get to work. Once you have your dry ingredients together, you start in with the eggs and milk. Then, of course, you realize you only have half as much milk as the recipe calls for. You try to stretch it – water probably works fine in its place, right? The waffles turn out terrible, and you ruin breakfast. This may or may not have happened to me recently.
The data science equivalent is having all the necessary ingredients – but with gaps in the data. Some entries are missing critical details – say, the age of a user, which you need to predict the right product to show them.
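In data terms, a first pass at spotting this usually starts with a missingness audit. Here’s a minimal sketch, assuming a pandas DataFrame with a hypothetical age column:

```python
# A minimal sketch of auditing gaps in a dataset. The file and the
# `age` column are hypothetical; adapt to your own schema.
import pandas as pd

df = pd.read_csv("users.csv")  # hypothetical export

# Share of missing values in each column
print(df.isna().mean().sort_values(ascending=False))

# One lossy option: drop rows missing the critical field...
complete = df.dropna(subset=["age"])

# ...or fill the gaps with a simple statistic, accepting some distortion.
df["age"] = df["age"].fillna(df["age"].median())
```

Whether you drop rows or fill the gaps is a judgment call – like substituting water for milk, imputation can quietly ruin the dish if the gaps aren’t random.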
Frequently this data exists somewhere. Like with the waffles example, you might have more milk in another fridge, or you might have a corner store a short walk away. But, if it isn’t where it needs to be, then you can’t make use of it.
The real-world example of this is an organization where data is collected on customers and stored across different databases and platforms. You might have some data in Google Analytics, your CRM, surveys, email marketing software, and so on. Separately, each of these products performs a useful function. However, your models will often require inputs from multiple sources, brought together in one place to be useful in making predictions.
This is a data engineering challenge – bringing data from these various sources and combining it into a single place where data scientists can first test and build models. Then, once the models are deployed, these data pipelines need to run continuously to feed them and produce the outputs you need.
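To make that “bring it together” step concrete, here’s a minimal sketch of joining two such sources on a shared key. The file names and columns (user_id, pages_viewed, plan) are hypothetical:

```python
# A minimal sketch of combining data from two sources on a shared key.
# File names and columns are hypothetical placeholders.
import pandas as pd

analytics = pd.read_csv("google_analytics_export.csv")  # e.g. user_id, pages_viewed
crm = pd.read_csv("crm_export.csv")                     # e.g. user_id, plan

# A left join keeps every analytics row and attaches CRM details where
# a matching user_id exists; unmatched rows surface as NaN to inspect.
combined = analytics.merge(crm, on="user_id", how="left")
print(combined.head())
```

In a real deployment this join would live in a scheduled pipeline rather than a one-off script, so the combined table stays fresh as new data arrives.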
You have unstructured data
Sometimes, maybe in a pandemic situation, you find yourself digging into the freezer to find those food items you saved months ago. By now they are caked in layers of frost. You see lots of things, but there is no order or structure.
This is the unstructured data problem. You have lots of it, you just don’t know what it is, and classifying it isn’t straightforward. Think customer feedback or other forms of free text. You can have terabytes of this data, but without some classification you can’t do much with it. The challenge then becomes finding a way to put structure around this data – for instance, classifying customer feedback as positive, neutral, or negative.
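As one illustration of putting structure around free text, here’s a minimal sketch of a sentiment classifier. The labelled examples are hypothetical – in practice you’d hand-label a sample of your own feedback first:

```python
# A minimal sketch of classifying free-text feedback as positive,
# neutral, or negative. Training examples here are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Love this product, works perfectly",
    "It's okay, nothing special",
    "Terrible experience, want a refund",
]
labels = ["positive", "neutral", "negative"]

# Bag-of-words features feeding a linear classifier: crude, but enough
# to turn unstructured text into a structured label.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["The support team was fantastic"]))
```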
How to rescue a meal gone wrong
The other day I found myself with a flavourless pineapple. What could be more disappointing? Well, I also had some tequila, triple sec, and lime. So I threw some pineapple chunks in the freezer and, once they were frozen, made frozen pineapple margaritas. The outcome was radically different from what I had originally intended, but I was able to work with what I had and produce a very tasty alternative.
Bad data doesn’t mean starting from scratch. In fact, our projects usually start off with cleaning and connecting up data. This often leads to interesting insights and a better understanding of what is possible with the ingredients available. Sometimes these exactly fit the vision of the original recipe, but in many cases we discover new and unanticipated insights along the way.
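That cleaning-and-exploring pass doesn’t need to be elaborate to surface insights. Here’s a minimal sketch of a first look, assuming a hypothetical combined.csv produced by a joining step like the one sketched earlier:

```python
# A minimal sketch of a first exploratory pass over cleaned data.
# The file name is a hypothetical placeholder.
import pandas as pd

combined = pd.read_csv("combined.csv")

print(combined.describe(include="all"))  # ranges, counts, obvious outliers
print(combined.nunique())                # cardinality of each column
print(combined.corr(numeric_only=True))  # quick scan for relationships
```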
Digestif?
To bring this all together – it’s essential to account not only for the amount of data you have but also for its quality. You may have lots of ingredients. But it’s impossible to cook a great meal if you start with bad ingredients, if they aren’t where they need to be, or if you aren’t sure what ingredients you have in the first place.
Similarly, asking how much data you need for machine learning is like asking how much salt you need to make a great meal. Instead, it’s about balancing the quantity of the data you have with its quality. Does it tell you what you need it to tell you? Is it in a place where you can make use of it? And do you know what it is?
However, all is not lost! You can often rescue bad raw ingredients through exploration and cleaning. Sometimes these lead to new recipes and combinations that you didn’t anticipate at first, or help you solve your problem in an unexpected way.
Once you have all these elements, then you are ready to cook.