What makes a good data science model?
A core part of data science is applying different techniques and algorithms to data, in order to construct a mathematical model that fits the data. Once we’ve done that, we need a way of knowing if the solution we’ve come up with is sufficient. In other words, is it good enough? There are two ways to answer this.
The technically accurate model
The technical answer is about quantitative measures designed to assess the predictive capability of a model or the goodness-of-fit of the model to the data. Some models have derivable statistical theory for model validation. Other models that are much harder to interpret and analyse often rely on testing the model's predictions on a withheld part of the dataset to directly assess accuracy. This is known as hold-out validation, and repeating it over several different splits of the data is known as cross-validation. There is also a range of accuracy metrics which can be used depending on what is required of the model. For example, precision and recall are two metrics used to assess binary classification models. Precision measures how often the model is correct when it predicts a positive outcome, and recall measures how many of the actual positive cases the model is able to detect. The higher the model scores on these metrics, the better the model is. These tools for quantitative model validation thus help to define a good model: a good model should score high on the relevant quantitative metrics. If there is more than one relevant metric, then a good model should score high on all of them.
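To make precision and recall concrete, here is a minimal sketch in plain Python. The labels and predictions are made up purely for illustration, and the function name precision_recall is our own rather than something from a particular library.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative data only: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy example the model is right for two of its three positive predictions, but finds only two of the four actual positives, so the two metrics tell different stories about the same model.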
However, since data science is typically tied to a business context, what makes a good data science model should also be answered from a business point of view. This opens up a whole philosophical discussion of what good means. For the rest of this discussion, we will explain why the quantitative measures of model validation are in fact necessary but not sufficient.
The useful model
If the goal of the model is prediction, for example, a prediction in the future or a prediction on a spatially new data set (also known as prediction on out-of-sample data), then goodness-of-fit to data and cross-validation scores are only one part of the picture. The model must be able to generalise beyond its training sample, otherwise it will not be able to meet its goal of prediction. The better the model can generalise, the better the model is, since this ability suggests that it has captured some fundamentally true relationships within the data. The ability to predict on new out-of-sample data also depends on the assumption that the distribution of the new data is similar to that of the data the model was trained on. In practice, this is something to be considered before beginning a data science project, and failure to do so can result in a model that satisfies all the quantitative checks but cannot actually be used by the business.
Next, consider a model that is to be used for decision-making by way of the prediction of class probabilities. Naturally, there are consequences when the model makes a bad prediction, whether a false positive or a false negative. The default method of interpreting probabilities produced by the model is to use a threshold of 0.5. If the probability is predicted to be greater than 0.5, then the data point is considered to be in the class, and if not, it is not in the class. In practice, the choice of this threshold is influenced by business requirements and is typically made by weighing up the cost of a false positive versus a false negative. Two relevant examples here are fraud prediction and email spam prediction. Choosing the default threshold of 0.5 might identify a certain number of fraud cases while other fraud cases go undetected. If the model is used for making a decision to initiate a fraud investigation, then we might prefer that as many fraud cases are detected as possible, which means lowering the threshold and accepting more false positives. An email spam filter, on the other hand, is more likely to lean conservatively to the other side. We want the model to pick out fewer spam emails, and prefer the ones that are picked out to be definitely spam, which means raising the threshold. A good model is the one that solves the problem at hand, or more formally: a good model is one that is able to fulfil some sort of business objective.
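As a rough sketch of how the threshold could be chosen from costs rather than left at 0.5, the example below scores a handful of candidate thresholds by the total cost of the mistakes they produce. The probabilities, labels, candidate thresholds and cost values are all invented for illustration, and the helper names total_cost and best_threshold are our own.

```python
def total_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at a given threshold."""
    cost = 0.0
    for actual, prob in zip(y_true, probs):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and actual == 0:
            cost += cost_fp  # flagged a case that was not in the class
        elif predicted == 0 and actual == 1:
            cost += cost_fn  # missed a case that was in the class
    return cost


def best_threshold(y_true, probs, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda th: total_cost(y_true, probs, th, cost_fp, cost_fn))


# Illustrative model outputs and true labels (1 = fraud, or 1 = spam).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
candidates = [0.15, 0.30, 0.50, 0.65, 0.80]

# Fraud: a missed case is assumed to cost ten times as much as a false alarm.
fraud_threshold = best_threshold(y_true, probs, cost_fp=1, cost_fn=10, candidates=candidates)

# Spam: losing a genuine email is assumed to cost ten times as much as letting spam through.
spam_threshold = best_threshold(y_true, probs, cost_fp=10, cost_fn=1, candidates=candidates)

print(fraud_threshold, spam_threshold)
```

The same model and the same predictions lead to a threshold below 0.5 for the fraud costs and above 0.5 for the spam costs. In a real project the costs would come from the business, and the threshold would be chosen on validation data rather than on eight hand-picked points.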
One subtlety that comes into play here is that business objectives may be very specific to the business, or to the industry it belongs to, which is why data science models are often customised to suit those needs. Perhaps we have a daily forecasting model where overall prediction accuracy is important, but getting the forecast right on special holidays is especially important. Or, perhaps we have a product demand forecasting model where the company has the capacity to store unused stock, but would like to always be able to meet demand. In this case, a model that tends to overpredict more often than it underpredicts is preferable to the other way round. Business objectives may also be modified over time. An interesting example is the start of the COVID-19 pandemic, when quantifying the impact of the pandemic on the business was a problem that many businesses were interested in. Forecasting models that were already in place at this time would have become severely inaccurate, because the change in conditions violated the assumption that the new data's distribution is similar to that of the data the model was trained on. However, by retrospectively taking the difference between model predictions and true values, these models were actually quite useful for a business objective of calculating the impact of the pandemic.
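One way to express the demand forecasting preference above is an error measure that penalises underprediction more heavily than overprediction. The sketch below is only an illustration: the 3-to-1 weighting, the demand figures and the function name asymmetric_loss are all assumptions, and a real weighting would come from the actual cost of a stock-out versus the cost of storage.

```python
def asymmetric_loss(actual, forecast, under_weight=3.0, over_weight=1.0):
    """Mean absolute error in which underprediction is weighted more than overprediction."""
    total = 0.0
    for a, f in zip(actual, forecast):
        error = f - a
        if error >= 0:
            total += over_weight * error      # overprediction: unused stock to store
        else:
            total += under_weight * (-error)  # underprediction: unmet demand
    return total / len(actual)


# Illustrative demand and two forecasts with identical absolute errors.
demand = [100, 120, 90, 110]
forecast_over = [110, 130, 100, 120]   # always 10 units too high
forecast_under = [90, 110, 80, 100]    # always 10 units too low

loss_over = asymmetric_loss(demand, forecast_over)
loss_under = asymmetric_loss(demand, forecast_under)
print(loss_over, loss_under)
```

A symmetric metric such as mean absolute error would rate these two forecasts as equally good, while the asymmetric version prefers the one that overpredicts, which matches the stated business objective.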
The model that ticks everyone’s boxes
Although business objectives are largely tied to the intended use of the model, there is sometimes an implied objective relating to the business, the industry and/or the stakeholders of the model. In data science, there is usually a choice between a model that is explainable and easily interpreted, and a black-box model. For example, in linear and logistic regression, the coefficients found through the fitting process can be interpreted to tell us something about the relationships between variables in the model. The same relationships would be harder to understand if a neural network model were used. This choice of model is very much tied to business objectives. Financial models in particular are often deeply scrutinised and regulated by relevant regulatory bodies, and thus there is usually a preference for models that are explainable. Internal governance processes and stakeholder preferences are also reasons why one might be preferred over the other. A good model is thus also one that successfully passes through governance processes.
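As a small illustration of what interpreting coefficients can look like, consider logistic regression. Its coefficients are on the log-odds scale, so exponentiating a coefficient gives the factor by which the odds of the positive class are multiplied for each one-unit increase in that feature, holding the other features fixed. The feature names and coefficient values below are hypothetical and do not come from any real fitted model.

```python
import math

# Hypothetical fitted logistic regression coefficients (log-odds scale).
coefficients = {
    "late_payments": math.log(2.0),  # chosen so that the odds ratio is 2
    "years_as_customer": -0.1,
}


def odds_ratios(coefs):
    """Convert log-odds coefficients into odds ratios."""
    return {name: math.exp(value) for name, value in coefs.items()}


ratios = odds_ratios(coefficients)
for name, ratio in ratios.items():
    print(f"{name}: each extra unit multiplies the odds by {ratio:.2f}")
```

Under these made-up values, each additional late payment doubles the odds of the positive outcome, while each extra year as a customer reduces them slightly. A statement like this is easy to explain to a regulator or stakeholder in a way that the weights of a neural network are not.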
The ethical model
Finally, there are objectives, often unspecified, that should arguably be objectives of all models. The Australian Department of Industry, Science, Energy and Resources lists 8 Artificial Intelligence Ethics principles, designed to ensure that the use of AI models is “safe, secure and reliable”. Focusing on accuracy metrics, and on building a model that is fit for purpose and that all the stakeholders are willing to adopt, is not enough. Models are more than an algorithm and an API. They are often embedded into a wider system and a decision-making process, and may be used by more than one part of an organisation. The construction of this wider system and its flow-on effects should be evaluated by both data scientists and business experts to prevent unintended harmful consequences. A good model is one that has been built with an awareness of the environment it is operating in, and with any identified ethical issues resolved.
The question of what makes a good model needs to start from the consideration that we’re solving business problems first and foremost. Statistics and algorithms are tools of the trade, and while it is necessary to validate the model quantitatively via appropriate techniques, it is critical not to lose track of business objectives. A good model is the result of successful collaboration between the data scientist and business experts, and of ensuring that all business objectives are reflected in every choice made during the model-building process.
I would like to thank Dr Nigel Clay for a wonderful discussion and debate on this topic, which inspired some of the key points in this blog post.