Are we collecting the right data?
In previous posts, we addressed the keys to starting a data science project and the amount of data you need to do machine learning. In both, we highlighted that the question of ‘how much data do we need to do machine learning?’ often ignores the equally important question of ‘how good is the data in the first place?’
To better answer this question on data quality, there’s an additional question you’ll need to ask yourself early on: are we collecting the right data for the problem we are trying to solve? In this post, we share insights on how you can start to answer it. It’s best to avoid a scenario where you spend time and money collecting the requisite quantity of data for machine learning, only to find it doesn’t contain the essential information you need.
What do we mean by the ‘right data’?
Poor-quality data broadly falls into two buckets. The first is data that is not cleaned, or not readily available to data scientists. This problem is usually solved by cleaning and data engineering.
The second, and the focus of this post, is data that doesn’t contain the underlying information you need to solve the problem at hand.
As a practical example: say you are an enterprise software company, and you want to predict which website visitors are most likely to convert to paid users. You might collect all kinds of potentially useful information, such as geography, browser, referral source, and so on. However, after exploring the data, you find that none of these factors tells a story or helps you make a meaningful prediction. This isn’t the right data.
How do you then find it?
Start with a hypothesis
Here, the science part of data science comes into play. Start with some hypotheses for why specific data helps you make a meaningful prediction.
Using our previous example, you might come up with a few hypotheses. What are some things you believe contribute to someone converting to a paid user? Maybe the size of the organisation? Or perhaps their role at the company?
When developing hypotheses, it makes sense to spend time with the stakeholders who have the most intuition about the problem. In our paid user example, this could include the customer support or sales teams, who have probably developed their own mental models of what a strong lead looks like.
This guides you towards the data you need to collect to begin making this prediction. Then it becomes a matter of collecting it.
Squint at the data
Once you have some hypotheses about which data answers your question, and you’ve started collecting it, comes the less scientific part.
Early on, you’ll need to continually analyse the data in a non-scientific way and see whether it is telling you the story you expected, or perhaps a story you didn’t expect. At this stage, you won’t have enough records to do proper machine learning, or even to draw any statistically significant conclusion. But you can get a feel for the data using your instincts and experience.
In the earlier enterprise software example, you can eyeball correlations: is a bigger company more likely to convert? Or break down who is purchasing the product and check whether that aligns with your hypothesis.
When doing machine learning, it’s critical to ask yourself along the way whether you are collecting the right data. You can glean information from your early data collecting by relying on intuition and internal knowledge of what characteristics have some impact on answering the questions you want to answer.
Generate hypotheses about what data answers the problem for you, and collect that data. Then squint at the data and see whether, anecdotally, it is telling you what you expected. This by no means guarantees you are collecting the right data, but it is an additional signal that you are on the right track.
Finally, keep asking whether you have the right data. By continually considering this question, you can avoid spending time fitting increasingly complex models to data that was never going to answer your question in the first place.