We are your data science workshop.

What makes a good data science model?

A core part of data science is applying different techniques and algorithms to data in order to construct a mathematical model that fits it. Once we’ve done that, we need a way of knowing whether the solution we’ve come up with is sufficient. In other words, is it good enough? There are two ways to answer this.

The technically accurate model

The technical answer concerns quantitative measures designed to assess the predictive capability of a model or the goodness-of-fit of the model to data. Some models have derivable statistical theory for model validation. Others, which are much harder to interpret and analyse, often rely on testing the model’s predictions on a withheld part of the dataset to directly assess accuracy – a process known as cross-validation. There is also a range of accuracy metrics to choose from depending on what is required of the model. For example, precision and recall are two metrics for assessing binary classification models: precision measures how often the model is correct when it predicts a positive outcome, while recall measures how many of the positive cases the model is able to detect. The higher the model scores on these metrics, the better it is. These tools for quantitative model validation thus help to define a good model – a good model should score highly on the relevant quantitative metrics, and if more than one metric is relevant, it should score highly on all of them.
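To make precision and recall concrete, here is a minimal sketch in plain Python. The labels and predictions are made up purely for illustration; they are not from any real model.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative data: 4 actual positives, of which the model finds 2,
# plus 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.67, recall=0.50
```

Here the model is right two times out of the three positives it predicts (precision ≈ 0.67), but detects only two of the four actual positives (recall = 0.5) – the two metrics capture genuinely different failure modes.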

However, since data science is typically tied to a business context, what makes a good data science model should also be answered from a business point of view. This opens up a whole philosophical discussion of what “good” means. For the rest of this discussion, we will explain why the quantitative measures of model validation are in fact necessary but not sufficient.

The useful model

If the goal of the model is prediction – for example, forecasting into the future or predicting on a new, unseen dataset (also known as out-of-sample data) – then goodness-of-fit and cross-validation scores are only one part of the picture. The model must be able to generalise beyond its training sample, otherwise it will not meet its goal of prediction. The better the model can generalise, the better it is, since this ability indicates that it has captured some fundamentally true relationships within the data. The ability to predict on new out-of-sample data also depends on the assumption that the out-of-sample data distribution is similar to that of the data the model was trained on. In practice, this is something to consider before beginning a data science project; failure to do so can result in a model that satisfies all the quantitative checks but cannot actually be used by the business.

Next, consider a model that is to be used for decision-making by way of predicted class probabilities. Naturally, there are consequences when the model makes a bad prediction – false positives and false negatives. The default method of interpreting the probabilities produced by the model uses a threshold of 0.5: if the predicted probability is greater than 0.5, the data point is assigned to the class; if not, it is not. In practice, the choice of threshold is influenced by business requirements and is typically made by weighing the cost of a false positive against that of a false negative. Two relevant examples are fraud detection and email spam filtering. The default threshold of 0.5 might identify a certain number of fraud cases while other fraud cases go undetected. If the model is used to decide whether to initiate a fraud investigation, then we might prefer that as many fraud cases as possible are detected. An email spam filter, on the other hand, is more likely to lean conservatively the other way: we want the model to pick out fewer spam emails, and prefer the ones it does pick out to be definitely spam. A good model is one that solves the problem at hand, or more formally: a good model is one that is able to fulfil some sort of business objective.
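The trade-off described above can be seen by sweeping the threshold over a set of predicted probabilities. The probabilities and labels below are hypothetical, chosen only to illustrate the effect.

```python
# Hypothetical predicted probabilities and true labels for ten cases.
probs  = [0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.05]
labels = [1,    1,    1,    0,    1,    1,    0,    0,    0,    0]

def metrics_at(threshold):
    """Precision and recall when classifying as positive above `threshold`."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum((not p) and t for p, t in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.7):
    precision, recall = metrics_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold catches more positives (higher recall, the fraud-investigation preference), while raising it makes each positive prediction more trustworthy (higher precision, the spam-filter preference).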

One subtlety that comes into play here is that business objectives may be very specific to the business, or to the industry it belongs to – that is why data science models are often customised to suit those needs. Perhaps we have a daily forecasting model where overall prediction accuracy is important, but getting the forecast right on special holidays matters most. Or perhaps we have a product demand forecasting model where the company has the capacity to store unused stock, but would like to always meet demand; in that case, a model that tends to overpredict more often than it underpredicts is preferable to the other way round. Business objectives may also change over time. An interesting example comes from the start of the COVID-19 pandemic, when quantifying the impact of the pandemic on the business was a problem many businesses were interested in. Forecasting models already in place at the time would have become severely inaccurate, because the change in conditions violated the assumption that new data is distributed similarly to the data the model was trained on. However, by retrospectively taking the difference between model predictions and actual values, these models turned out to be quite useful for a different business objective: calculating the impact of the pandemic.

The model that ticks everyone’s boxes

Although business objectives are largely tied to the intended use of the model, there is sometimes an implied objective relating to the business, the industry and/or the stakeholders of the model. In data science, there is usually a choice between an explainable, easily interpreted model and a black-box model. For example, in linear and logistic regression, the coefficients found through the fitting process can be interpreted to tell us something about the relationships between the variables in the model. The same relationships would be much harder to understand if a neural network were used. This choice of model is very much tied to business objectives. Financial models in particular are often deeply scrutinised and regulated by the relevant regulatory bodies, so there is usually a preference for explainable models. Internal governance processes and stakeholder preferences are further reasons why one type of model might be preferred over the other. A good model is thus, also, one that successfully passes through governance processes.

The ethical model

Finally, in some cases there are objectives, often unspecified, that should perhaps be objectives of all models. The Australian Department of Industry, Science, Energy and Resources lists eight Artificial Intelligence Ethics Principles, designed to ensure that the use of AI models is “safe, secure and reliable”. Focusing on accuracy metrics and on building a model that is fit for purpose and that all stakeholders are willing to adopt is not enough. Models are more than an algorithm and an API. They are often embedded into a wider system and a decision-making process, and may be used by more than one part of an organisation. The construction of this wider system and its flow-on effects should be evaluated by both data scientists and business experts to prevent unintended harmful consequences. A good model is one that has been built with an awareness of the environment it operates in, with any identified ethical issues resolved.

 

Conclusion

The question of what makes a good model needs to start from the consideration that we’re solving business problems first and foremost. Statistics and algorithms are the tools of the trade, and while it is necessary to validate the model quantitatively via appropriate techniques, it is critical not to lose track of business objectives. A good model is built through successful collaboration between data scientists and business experts, and by ensuring that business objectives inform every choice in the model-building process.

 

Acknowledgements

I would like to thank Dr Nigel Clay for a wonderful discussion and debate on this topic, which inspired some of the key points in this blog post.

Useful Out-of-the-Box Machine Learning Models

With the growing popularity of data science, out-of-the-box machine learning models are becoming increasingly available in free, open-source packages. These models have already been trained on datasets that are often large and therefore time-consuming to train on. They are also packaged in ways that are easy to use, simplifying the process of applying data science. While there is no replacement for customising a model to suit your objectives, these packages are useful for obtaining quick results and provide a gentle, easy introduction to data science models. In this article, we touch on some of these tools and their applications to well-known machine learning problems.

Image classification using out-of-the-box machine learning models

Image classification is a machine learning task in which an image is assigned to one of several categories by identifying the object in the image. It is easy for humans to recognise the same object across different backgrounds and colours, and it turns out that algorithms can also classify images to a high degree of accuracy. Deep learning has brought significant advances in image classification and other computer vision problems, and some of these models are available out of the box in Keras/TensorFlow.

As an example, the inception-v3 model, trained on ImageNet, a database of over 14 million labelled images, is available as a tool for prediction. It is easy to set up, taking only a couple of lines of code before you’re ready to begin classifying your own images – simply read in an image and the model will return a category and a score. The inception-v3 and other image classification models can also be fine-tuned by keeping the weights of some layers fixed, and tuning the weights of other layers by training on a new, more relevant dataset. This well-known procedure, called transfer learning, customises the model to the new data and is especially useful when the new dataset is small.
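A minimal sketch of the "couple of lines" in question, assuming TensorFlow/Keras is installed. For a real use case you would load your own photo; here a random image stands in so the snippet runs end to end.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)

# Load the ImageNet-trained model (weights download on first use).
model = InceptionV3(weights="imagenet")

# inception-v3 expects 299x299 RGB input. A real photo would be loaded and
# resized here; a random array is used only as a runnable stand-in.
x = preprocess_input(np.random.rand(1, 299, 299, 3) * 255.0)

# decode_predictions maps the 1000 class probabilities back to readable labels.
preds = model.predict(x)
top3 = decode_predictions(preds, top=3)[0]
for _, label, score in top3:
    print(label, round(float(score), 3))
```

With an actual photo in place of the random array, the top label and its score are the category and confidence described above.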

The inception-v3 model managed to classify my golden retriever accurately!

Named Entity Recognition

With Named Entity Recognition (NER), the goal of the task is to identify named entities within a chunk of text, such as the names of organisations or people, geographical locations, numerical units, dates and times, and so forth. This is useful for automatic information extraction – e.g. from articles, reports or invoices – and saves a human the effort of manually reading through a large number of documents. NER algorithms can help to answer questions such as which named entities are mentioned most frequently, or to consistently pick out a monetary value within the text.

spaCy has an NER module available in several languages, as well as a range of text-processing capabilities such as part-of-speech tagging, dependency parsing and lemmatization. Installation of the package is straightforward, and no more than a few lines of code are required to begin extracting entities. spaCy v2.0 NER models use subword features and a deep convolutional neural network architecture, and the v3.0 models have been updated with transformer-based models. spaCy also has the functionality to allow training on new data to update the model and improve accuracy, as well as a component for rule-based entity matching where a rule-based approach is more convenient.

Sentiment Analysis using out-of-the-box machine learning models

Sentiment analysis is used to identify the polarity of a piece of text, i.e. whether it is positive, negative or neutral. This is useful for monitoring customer and brand sentiment and for analysing any text-based feedback. More recently, it has been used to analyse public sentiment towards the COVID-19 pandemic through social media and article headlines.

Hugging Face Transformers is a library containing a range of well-known transformer-based models that have obtained state-of-the-art results across a number of natural language processing tasks. It includes both language models and task-specific models, and has a sentiment analysis model with a base architecture of distilBERT (a smaller, faster version of the BERT language model) fine-tuned on the SST-2 dataset for the sentiment analysis downstream task. The model returns positive and negative labels (i.e. it excludes neutral) as well as a confidence score. Other options for an out-of-the-box sentiment analysis model are TextBlob, which has both a rules-based sentiment classifier and a naive Bayes model trained on movie reviews, and Stanza, whose classifier is based on a convolutional neural network architecture and can handle English, German and Chinese text. Both TextBlob and Stanza return a continuous score for polarity.
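With the Transformers library installed, the distilBERT sentiment model described above is available through a one-line pipeline (the model weights download on first use). The example sentences are invented customer-feedback snippets.

```python
from transformers import pipeline

# The default "sentiment-analysis" pipeline uses distilBERT fine-tuned
# on SST-2, as described above.
classifier = pipeline("sentiment-analysis")

results = classifier([
    "The support team resolved my issue quickly, great service!",
    "The product arrived damaged and nobody replied to my emails.",
])
for r in results:
    print(r["label"], round(r["score"], 3))
```

Each result is a POSITIVE or NEGATIVE label with a confidence score between 0 and 1; note there is no neutral class, so even bland text is forced into one of the two labels.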

Transfer learning using pre-trained models

While there is a lot of value in an out-of-the-box model, its accuracy may drop when applied directly to an unseen dataset. This usually depends on how similar the characteristics of the new dataset are to those of the dataset the model was trained on – the more similar they are, the more the model’s results can be relied on. If direct use of an out-of-the-box model is not sufficiently accurate or not directly relevant to the problem at hand, it may still be possible to apply transfer learning and fine-tuning, depending on package functionality. This involves using new data to build additional layers on top of an existing model, or to retrain the weights of existing layers while keeping the same model architecture.

The idea behind transfer learning is to take advantage of the model’s general training and transfer its knowledge to a different problem or a different data domain. In image classification, for example, the pre-trained models are usually trained on a large, general dataset with broad categories of objects. These pre-trained models can then be further trained to classify, say, only cats vs dogs. Transfer learning is another way to make use of out-of-the-box machine learning models, and is an approach worth considering when data is scarce.
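A minimal Keras sketch of this freeze-and-extend pattern, assuming TensorFlow is installed. The two-class head (cats vs dogs) is a hypothetical example; the training call is left commented out because it needs your own dataset.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Start from an ImageNet-trained base, dropping its 1000-class head.
base = keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))

# Freeze the pre-trained layers so their general features are kept.
base.trainable = False

# Add a small new head for a hypothetical binary task (e.g. cats vs dogs).
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# model.fit(train_ds, validation_data=val_ds, epochs=5)  # on your own data
```

Only the small new head is trained at first; optionally, some of the frozen layers can later be unfrozen and retrained at a low learning rate, which is the fine-tuning step described above.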

GPT-3: What is It?

Recently you might have come across GPT-3, and some of the cutting-edge tasks this innovation in artificial intelligence can carry out. If you are at all on tech Twitter, there is absolutely no way you avoided the hype.

Some of the things people have demonstrated so far with GPT-3 include dictating, in natural language, the layout of a webpage they wanted built and having the model code the actual front end of the website. Or using the model to generate a full article about why humans shouldn’t be worried about robots any time soon.

Like self-driving cars, or other promises of artificial intelligence, it’s easy to get caught up in the hype. So we wanted to share a bit more about GPT-3, what it is, and what it can do. And, perhaps, find a use case for your organization.

Background

GPT-3 is a project of OpenAI, a California based AI research laboratory. Initial backing for OpenAI came from Elon Musk, Peter Thiel, and other Silicon Valley technologists. The mission of the organization is “to ensure that artificial general intelligence benefits all of humanity.”

Among their projects is the Generative Pre-trained Transformer, or GPT. As the name suggests, the latest release is the third iteration of the model, and it is a massive leap forward from its predecessors – or, for that matter, from any other model from other labs. GPT-3 has 175 billion parameters. For context, this is 10 times the number of parameters of the next biggest model, Microsoft’s Turing NLG.

GPT-3 is a deep learning model that leverages neural networks. The model was trained mostly on the Common Crawl dataset, which is essentially data pulled from websites on the internet on a monthly basis. In addition, it used data from book transcripts, and even Wikipedia.

What Does it Do?

GPT-3 is a model that focuses on natural language. In the simplest terms, you can pose a question to it, and have it respond back like any human could. 

The most impactful use cases for GPT-3 are probably still being dreamed up by startups all over the world. But some of the obvious ways companies can use GPT-3 are chatbots, answering customer support questions, or even translating plain English into SQL queries or regular expressions.

GPT-3 can also write some code, including CSS and Python.

How To Get Access

OpenAI is providing API access to GPT-3 so anyone can build applications on top of it. You can join the waitlist to get access by filling out a form with the use case you have in mind.

Summary

There is certainly a lot of hype around GPT-3, and we’ve yet to see killer applications built on it. Like any new technology, it will follow the traditional hype cycle. But the technology is clearly a massive improvement on its predecessors and holds massive potential.
