
Useful Out-of-the-Box Machine Learning Models

With the growing popularity of data science, out-of-the-box machine learning models are becoming increasingly available in free, open-source packages. These models have already been trained on datasets that are often large and would therefore be time-consuming to train on from scratch. They are also packaged up in ways that are easy to use, simplifying the process of applying data science. While there is no replacement for customising a model to suit your objectives, these packages are useful for obtaining quick results and provide a gentle introduction to data science models. In this article, we touch on some of these tools and their applications to well-known machine learning problems.

Image classification using out-of-the-box machine learning models

Image classification is a machine learning task where an image is categorised into one of several categories by identifying the object in the image. It is easy for humans to recognise the same object shown against different backgrounds and in different colours, and it turns out that algorithms can also classify images to a reasonable degree of accuracy. Significant advances have been made in image classification, as well as other computer vision problems, with deep learning, and some of these models are available out of the box in Keras/TensorFlow.

As an example, the Inception-v3 model, trained on ImageNet, a database of over 14 million labelled images, is available as a tool for prediction. It is easy to set up, taking only a couple of lines of code before you’re ready to begin classifying your own images – simply read in the image and the model will return a category and a score. Inception-v3 and other image classification models can also be fine-tuned by keeping the weights of some layers fixed and tuning the weights of others by training on a new, more relevant dataset. This well-known procedure, transfer learning, customises the model to the new data, and is especially useful when the new dataset is small.
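To give a flavour of how little code is involved, here is a minimal sketch of classification with the pre-trained Inception-v3 model in Keras (assuming the tensorflow package is installed; the ImageNet weights are downloaded automatically on first use). A random array stands in for a real photo so the snippet is self-contained.

```python
# Out-of-the-box image classification with Keras' pre-trained InceptionV3.
import numpy as np
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)

model = InceptionV3(weights="imagenet")  # pre-trained on ImageNet

# In practice, load a real image; random pixels stand in here for illustration.
image = np.random.rand(1, 299, 299, 3) * 255.0  # InceptionV3 expects 299x299 input
predictions = model.predict(preprocess_input(image))

# decode_predictions maps the 1000-way output back to human-readable labels.
_, category, score = decode_predictions(predictions, top=1)[0][0]
print(category, score)
```

With a real photo in place of the random array, `category` is the predicted object class and `score` its confidence.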


Named Entity Recognition

The goal of Named Entity Recognition (NER) is to identify named entities within a chunk of text, such as the name of an organisation or a person, geographical locations, numerical units, dates and times, and so forth. This is useful for automatic information extraction, e.g. from articles, reports, or invoices, and saves the effort of having a human manually read through a large number of documents. NER algorithms can help to answer questions such as which named entities are mentioned most frequently, or to consistently pick out a monetary value within the text.

spaCy has a NER module available in several languages, as well as a range of text processing capabilities such as part-of-speech tagging, dependency parsing, and lemmatization. Installation of the package is straightforward, and no more than a few lines of code are required to begin extracting entities. spaCy v2.0 NER models use subword features and a deep convolutional neural network architecture, and the v3.0 models have been updated with transformer-based models. spaCy also allows training on new data to update the model and improve accuracy, and provides a component for rule-based entity matching where that approach is more convenient.

Sentiment Analysis using out-of-the-box machine learning models

Sentiment analysis is used for identifying the polarity of a piece of text, i.e., whether it is positive, negative or neutral. This is useful for monitoring customer and brand sentiment and analysing any text-based feedback. More recently, it has been used for analysing public sentiment towards the COVID-19 pandemic through social media and article headlines.

Hugging Face Transformers is a library containing a range of well-known transformer-based models that have obtained state-of-the-art results in a number of different natural language processing tasks. It includes both language models and task-specific models, and has a sentiment analysis model with a base architecture of DistilBERT (a smaller, faster version of the BERT language model) fine-tuned for the sentiment analysis downstream task on the SST-2 dataset. The model returns positive and negative labels (i.e., it excludes neutral) as well as a confidence score. Other options for an out-of-the-box sentiment analysis model are TextBlob, which has both a rules-based sentiment classifier and a Naive Bayes model trained on movie reviews, and Stanza, which has a classifier based on a convolutional neural network architecture and can handle English, German and Chinese texts. Both TextBlob and Stanza return a continuous score for polarity.
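As a sketch of the Transformers option (assuming the transformers package and a backend such as PyTorch are installed): with no model specified, the sentiment pipeline falls back to the DistilBERT model fine-tuned on SST-2 described above, which is downloaded on first use.

```python
# Out-of-the-box sentiment analysis with the Hugging Face pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # defaults to DistilBERT fine-tuned on SST-2
results = classifier(["This product is fantastic!", "The service was slow."])

for result in results:
    # Each result is a dict with a POSITIVE/NEGATIVE label and a confidence score.
    print(result["label"], result["score"])
```

Swapping in TextBlob or Stanza instead would return a continuous polarity score rather than a two-way label.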

Transfer learning using pre-trained models

While there is a lot of value in an out-of-the-box model, its accuracy may drop when applied directly to an unseen dataset. This usually depends on how similar the characteristics of the new dataset are to those of the dataset the model was trained on – the more similar they are, the more the model’s results can be relied on. If direct use of an out-of-the-box model is not sufficiently accurate and/or not directly relevant to the problem at hand, it may still be possible to apply transfer learning and fine-tuning, depending on package functionality. This involves using new data to build additional layers on top of an existing model, or to retrain the weights of existing layers while keeping the same model architecture.

The idea behind transfer learning is to take advantage of the model’s general training and transfer its knowledge to a different problem or data domain. In image classification, for example, the pre-trained models are usually trained on a large, general dataset with broad categories of objects. These pre-trained models can then be further trained to classify, say, only cats vs. dogs. Transfer learning is another way to make use of out-of-the-box machine learning models, and is an approach worth considering when data is scarce.
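The freeze-and-retrain recipe can be sketched in a few lines of Keras. Here we use MobileNetV2 as the pre-trained base purely for illustration (it is small to download), freeze its general-purpose ImageNet weights, and add a new binary head for the hypothetical cats-vs-dogs task; only the head's weights would be updated during training.

```python
# Transfer-learning sketch: frozen pre-trained base + new trainable head.
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(160, 160, 3))
base.trainable = False  # keep the general ImageNet weights fixed

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # new binary head: cat vs. dog
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(new_images, new_labels, epochs=5)  # train only the head on the new data
```

Because only the small head is trained, this works even when the new dataset is far too small to train a deep network from scratch.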

Are we collecting the right data?

In previous posts, we addressed the keys to starting a data science project and the amount of data you need to do machine learning. By focusing on the question of ‘how much data do we need to do machine learning?’ it’s possible to overlook the equally important question of ‘how good is the data in the first place?’

To better answer this question on data quality, there’s an additional question you’ll need to ask yourself early on. “Are we collecting the right data for the problem we are trying to solve?” In this post, we share insights on how you can start to try to answer this. It’s best to avoid a scenario where you are spending time and money getting the requisite quantity of data to do machine learning, only to find it doesn’t contain essential information you need.

What do we mean by the ‘right data’?

Poor quality data can broadly be categorised into two buckets. The first is data that is not cleaned, or not readily available to be used by data scientists. This problem is often solved by cleaning and data engineering.

The second, and the focus of this post, is when the data doesn’t contain the underlying information you need to solve the problem you are attempting to solve.

As a practical example: say you are an enterprise software company, and you want to predict which website visitors are most likely to convert to paid users. You might collect all kinds of useful information, such as user location, browser type, where they were referred from, and so on. However, after playing with the data, you find none of these factors tells a story or can help you make a meaningful prediction. This data isn’t the right data.

How do you then find it?

Start with a hypothesis

Here, the science part of data science comes into play. Start with some hypotheses for why specific data helps you make a meaningful prediction.

Using our previous example, you might come up with a few hypotheses. What are some things you believe contribute to someone converting to a paid user? Maybe the size of the organisation? Or perhaps, their role at the company?

When developing the hypothesis, it makes sense to spend time with the stakeholders who have the most intuition about this particular problem. In our paid user example, this could include customer support or sales team members who have probably developed their own mental models of what a strong lead looks like.

This guides you towards the data you need to collect to begin making this prediction. Then it becomes a matter of collecting it.

Squint at the data

Once you have some hypotheses about what data you think answers the question you need it to, and you’ve started collecting it, then comes the less scientific part.

Early on, you’ll need to continually analyse the data in a non-scientific way, and see if it is telling you the story you expected. Or perhaps it’s telling you a story you didn’t expect it to. At this stage, you won’t have enough records to do proper machine learning, or even to make any statistically significant conclusion. But, you can try to get a better feeling using your instincts and experience.

In the previous enterprise software example, you can monitor correlations – for instance, whether bigger companies are more likely to convert. Or, break down who is purchasing the product and check whether it aligns with your hypothesis.
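This kind of squinting takes only a few lines with pandas. The sketch below is entirely hypothetical: the column names and values are made up, and with this few records the result is an anecdote, not a statistically significant conclusion.

```python
# A quick, informal look at early conversion data (hypothetical values).
import pandas as pd

leads = pd.DataFrame({
    "company_size": [5, 12, 40, 250, 800, 30, 1200, 9],
    "converted":    [0,  0,  1,   1,   1,  0,    1, 0],
})

# Does conversion rate rise with company size, as hypothesised?
leads["size_band"] = pd.cut(leads["company_size"],
                            bins=[0, 50, 500, 10_000],
                            labels=["small", "medium", "large"])
rates = leads.groupby("size_band", observed=True)["converted"].mean()
print(rates)
```

If the rate climbs with company size, the data is starting to tell the story you expected; if it's flat, that's an early warning worth paying attention to.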

Summary

When doing machine learning, it’s critical to ask yourself along the way whether you are collecting the right data. You can glean information from your early data collecting by relying on intuition and internal knowledge of what characteristics have some impact on answering the questions you want to answer. 

Generate hypotheses about what data answers this problem for you and collect that data. Then, squint at the data and see if anecdotally it is telling you what you expected. This by no means guarantees you are collecting the right data. But it is an additional item that helps you predict if you are on the right track.

Finally, keep checking and keep answering this question of whether you are collecting the right data. You can avoid spending time trying to fit increasingly complex models to your data by continually considering whether you have the right data in the first place.

How much data do I need to do machine learning?

If you needed to make a cooking analogy for data science, then the data scientist is the chef. The models and algorithms are the recipes. That makes data the raw ingredients.

When a chef is ordering up raw ingredients, there are a couple of main things they need to consider for a great dining experience. The first is that they have enough to feed everyone. The second is that the ingredients are of high quality. Three-star Michelin meals are built around the best ingredients, not just putting the right quantity of ingredients together.

Ok, what does this have to do with machine learning?

Thanks for humouring me to this point. The reason I bring this up is that so often, the question we get when approaching a new project is “how much data do I need to do machine learning?” That is, of course, an important question. If you have 10 rows of data, that is not enough. If you have 10k rows, that probably is. However, as with cooking, it’s not just about quantity. Your ability to do machine learning that matters is tied directly to the quality of the data you are putting in.

Garbage in, Garbage out

No matter how good the chef, starting with tasteless strawberries, hothouse tomatoes, and sad lettuce will lead to a bad dish. Similarly, incomplete, unclean, or irrelevant data could mean producing bad outputs from a machine learning model.

It’s the old computer science adage: Garbage in, Garbage out. Bad data fed into the best machine learning models leads to wrong predictions and outputs.

What is bad data?

There are a few primary things we mean by ‘bad’ data: 

  1. Doesn’t tell you what you need it to
  2. Missing, fragmented or not readily accessible
  3. Unstructured

The data doesn’t tell you what you need it to

If your recipe calls for chocolate cake, but you only have strawberries, well then I have some bad news for you.

Similarly, if you are using a machine learning model to make a prediction, but the information required to make that prediction isn’t present in the data, you are not going to end up with a quality prediction. Say, for example, you are an e-commerce site, and you want to show the exact right product to whoever lands on your website. You might know information about your users like the country they are in, what browser they use, and what device they are on. But perhaps none of that information correlates with what product they are likely to buy.

Then it becomes a matter of identifying what data source does correlate with their product preferences. This can come from some intuition, or trial and error. And if you aren’t collecting this data at the necessary scale, then you need to find ways to do so.

The data is missing, fragmented or not readily accessible

Everyone’s been there: you are about to make waffles on a lazy Sunday morning. You pull up your recipe, grab all the ingredients, and get to work. By the time you have all your dry ingredients together, you start in with the eggs and milk. Then, of course, you realize, you only have half as much milk as you need for the recipe. You try to augment it – water probably works fine in its place, right? The waffles turn out terrible, and you ruin breakfast. This may or may not have happened to me recently.

The data science equivalent is having all the necessary ingredients – but there are gaps in the data. Some entries are missing the critical details – say the age of the user to predict the right product to show them. 

Frequently this data exists somewhere. Like with the waffles example, you might have more milk in another fridge, or you might have a corner store a short walk away. But, if it isn’t where it needs to be, then you can’t make use of it.

A real-world example of this is an organization where data is collected on customers and stored across different databases and platforms. You might have some data in Google Analytics, your CRM, surveys, email marketing software, and so on. Separately, each of these products performs a useful function. However, your models will often require inputs from multiple sources, brought together to be useful in making predictions.

This is a data engineering challenge – bringing data from these various sources and combining it into a single place for data scientists to first test and build models. Then, once deployed, these data pipelines need to run continuously to feed into the models and output what you need from them.
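At its simplest, the combining step is a join. The sketch below uses pandas with made-up sources and column names (a web analytics export and a CRM export) to show the idea; real pipelines add scheduling and validation around the same operation.

```python
# Joining customer records from two hypothetical sources into one table.
import pandas as pd

web = pd.DataFrame({          # e.g. an analytics export
    "customer_id": [1, 2, 3],
    "sessions": [4, 1, 9],
})
crm = pd.DataFrame({          # e.g. a CRM export
    "customer_id": [1, 2, 4],
    "plan": ["trial", "trial", "paid"],
})

# A left join keeps every analytics record; customers missing from the CRM
# surface as NaN, which is exactly the kind of gap worth investigating.
combined = web.merge(crm, on="customer_id", how="left")
print(combined)
```

Notice that customer 3 has sessions but no plan: the join itself reveals where the data is fragmented.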

You have unstructured data

Sometimes, maybe in a pandemic situation, you find yourself digging into the freezer to find those food items you saved months ago. By now they are caked in layers of frost. You see lots of things, but there is no order or structure.

This is the unstructured data problem. You have lots of it, you just don’t know what it is, and classifying it isn’t straightforward. For example, customer feedback or other forms of free text. You can have terabytes of this data, but without some classification you can’t do much with it. The challenge then becomes finding a way to put structure around this data – for instance, classifying customer feedback as positive, neutral, or negative.

How to rescue a meal gone wrong

The other day I found myself with a flavourless pineapple. What could be more disappointing? Well, I also had some tequila, triple sec and lime. So I threw some pineapple chunks in the freezer and, once frozen, made some frozen pineapple margaritas. The outcome was radically different from what I had originally intended, but I was able to work with what I had and produce a very tasty alternative.

Bad data doesn’t mean starting from scratch. In fact, usually, our projects start off with cleaning and connecting up data. This usually leads us to interesting insights and a better understanding of what is possible to do with the ingredients available. Sometimes these exactly fit with the vision of the original recipe, but in many cases, we discover new and unanticipated insights during this process.

Digestif?

To bring this all together – it’s essential to not only account for the amount of data you have but the quality of your data. You may have lots of ingredients. But, it’s impossible to cook a great meal if you start with bad ingredients, or ingredients aren’t where they need to be, or you aren’t sure what ingredients you have in the first place.

Similarly, asking how much data I need for machine learning is like asking how much salt I need to make a great meal. Instead, it’s about balancing the quantity of the data you have with the quality. Does it tell you what you need it to tell you? Is it in a place where you can make use of it? And do you know what it is? 

However, all is not lost! You can often rescue bad raw ingredients through exploration and cleaning. Sometimes these lead to new recipes and combinations that you didn’t anticipate at first, or help you solve your problem in an unexpected way.

Once you have all these elements, then you are ready to cook.
