
## Useful Out-of-the-Box Machine Learning Models

With the growing popularity of data science, out-of-the-box machine learning models are becoming increasingly available in free, open-source packages. These models have already been trained on datasets that are often large and therefore time-consuming to train on. They are also packaged up in ways that are easy to use, thus simplifying the process of applying data science. While there is no replacing customising a model to suit your objectives, these packages are useful for obtaining quick results and provide a gentle introduction to data science models. In this article, we touch on some of these tools and their applications to well-known machine learning problems.

## Image classification using out-of-the-box machine learning models

Image classification is a machine learning task where an image is assigned to one of several categories by identifying the object in the image. It is easy for humans to recognise the same object shown in different backgrounds and different colours, and it turns out that it’s also possible for algorithms to classify images to a useful degree of accuracy. Significant advances have been made in image classification as well as other computer vision problems with deep learning, and some of these are available as out-of-the-box machine learning models in Keras/TensorFlow.

As an example, the Inception-v3 model, trained on ImageNet, a database of over 14 million labelled images, is available as a tool for prediction. It is easy to set up, only taking a couple of lines of code before you’re ready to begin classifying your own images – simply read in the image and the model will return a category and a score. Inception-v3 and other image classification models can also be fine-tuned by keeping the weights of some of the layers fixed, and tuning the weights of other layers by training on a new, more relevant dataset. This is a well-known transfer learning procedure for customising the model to the new data. It is especially useful when the new dataset is small.

## Named Entity Recognition

With Named Entity Recognition (NER), the goal is to identify named entities within a chunk of text, such as the name of an organisation or a person, geographical locations, numerical units, dates and times, and so forth. This is useful for automatic information extraction, e.g. in articles, reports, or invoices, and saves the effort of having a human manually read through a large number of documents. NER algorithms can help to answer questions such as which named entities are mentioned most frequently, or to consistently pick out a monetary value within the text.

spaCy has an NER module available in several languages, as well as a range of text processing capabilities such as part-of-speech tagging, dependency parsing and lemmatization. Installation of the package is straightforward and no more than a few lines of code are required to begin extracting entities. spaCy v2.0 NER models consist of subword features and a deep convolutional neural network architecture, and the v3.0 models have been updated with transformer-based models. spaCy also allows training on new data to update the model and improve accuracy, and has a component for rule-based entity matching for cases where a rule-based approach is more convenient.

## Sentiment Analysis using out-of-the-box machine learning models

Sentiment analysis is used for identifying the polarity of a piece of text, i.e. whether it is positive, negative or neutral. This is useful for monitoring customer and brand sentiment, and for analysing any text-based feedback. More recently, it has been used for analysing public sentiment towards the COVID-19 pandemic through social media or article headlines.

Hugging Face Transformers is a library containing a range of well-known transformer-based models that have obtained state-of-the-art results in a number of different natural language processing tasks. It includes both language models and task-specific models, and has a sentiment analysis model with a base architecture of DistilBERT (a smaller, faster version of the BERT language model), which is fine-tuned for the sentiment analysis downstream task using the SST-2 dataset. The model returns positive and negative labels (i.e. it excludes neutral) as well as a confidence score. Other options for an out-of-the-box sentiment analysis model are TextBlob, which has both a rules-based sentiment classifier and a Naive Bayes model trained on movie reviews, and Stanza, which has a classifier based on a convolutional neural network architecture and can handle English, German and Chinese texts. TextBlob's default rules-based analyser returns a continuous polarity score, whereas Stanza returns a discrete class (negative, neutral or positive).

## Transfer learning using pre-trained models

While there is a lot of value in an out-of-the-box model, its accuracy may not be the same when applied directly to an unseen dataset. This usually depends on how similar the characteristics of the new dataset are to those of the dataset the model was trained on – the more similar they are, the more the model’s results can be relied on. If direct use of an out-of-the-box model is not sufficiently accurate and/or not directly relevant to the problem at hand, it may still be possible to apply transfer learning and fine-tuning, depending on package functionality. This involves using new data to build additional layers on top of an existing model, or to retrain the weights of existing layers while keeping the same model architecture.

The idea behind transfer learning is to take advantage of the model’s general training, and transfer its knowledge to a different problem or different data domain. In image classification, for example, the pre-trained models are usually trained on a large, general dataset with broad categories of objects. These pre-trained models can then be further trained on a narrower task, for example classifying cats vs dogs only. Transfer learning is another way to make use of out-of-the-box machine learning models, and is an approach worth considering when data is scarce.

## Data Mettle: New office, new team member

It’s been an exciting couple of months here at Data Mettle, and we wanted to give you a quick update.

## We’ve gone global

We are very excited to announce we’ve officially gone global! We’ve expanded our reach to the other side of the globe and are pleased that our new Australian office in Perth is open.

Over the past couple of months, we’ve been developing our presence in Perth and getting to know the digital challenges here. The technology and business landscape in the city is exciting, and the weather isn’t bad either.

## Allow myself to introduce… myself

I should probably introduce myself – I’m Matt, and I recently joined Data Mettle as COO. It’s an incredibly exciting opportunity for me to work with the team here, and dive into the world of data science. My background is in product management and operations, with an interest in technology and startups.

Best of all, I worked on a project with the team at Data Mettle before I joined. My first-hand experience of their expertise, professionalism and the quality of their work made joining a no-brainer. I am personally very excited about the experience and knowledge I’m going to gain by working with the team.

Some of the key areas I’ll be focusing on are:

• Outreach
• Product development
• Customer engagement

I’ll be working out of the London office. You can reach me at matt@datamettle.com if you want to grab a coffee.

Finally, we are launching a quarterly newsletter. In it, we will address a range of topics in the world of data science, including news, research, and resources. It is geared towards technical and non-technical readers alike, to help you stay on the cutting edge of what is going on in this rapidly evolving field.

## Modelling Eurovision voting

This is a follow-up to our previous blog on Eurovision voting, where we’ll explain how we modelled the objective quality of songs and voting biases between countries in Eurovision, and how we grouped the countries into blocks based on their biases. The source code can be found here. We’ve taken the data from a Kaggle competition, and sourced data for any missing years from Wikipedia.

## The hierarchical model

The idea is that the voting outcome is dependent on both the inherent quality of the entry and the biases countries have for voting for each other. There are lots of possible ways of modelling this, but ours is fairly simple and works quite well.

Let $$r_{c_ic_jy_k}$$ denote the fraction of people in country $$c_i$$ that voted for country $$c_j$$ in the year $$y_k$$. Note that $$\sum_{j=1}^Nr_{c_ic_jy_k} = 1$$, so it is reasonable to model the vector $$\mathbf{r}_{c_iy_k} = (r_{c_ic_1y_k}, \dots, r_{c_ic_Ny_k})$$ as following a Dirichlet distribution:

$\mathbf{r}_{c_iy_k} \sim \operatorname{Dir}(\beta_{c_ic_1y_k}, \dots, \beta_{c_ic_Ny_k}).$

We choose a model where the parameters $$\beta_{c_ic_jy_k}$$ decompose as

$\beta_{c_ic_jy_k} = \exp\bigl(\theta_{c_jy_k} + \phi_{c_ic_j}\bigr),$

where $$\theta_{c_jy_k}$$ captures the objective quality of the song from country $$c_j$$ in the year $$y_k$$, and $$\phi_{c_ic_j}$$ captures the bias country $$c_i$$ has in voting (or not voting) for country $$c_j$$. Furthermore, we assume that the $$\theta_{c_jy_k}$$’s and $$\phi_{c_ic_j}$$’s are drawn from an (unknown) normal distribution:

$\phi_{c_ic_j}, \theta_{c_jy_k}\sim N(\mu, \sigma).$
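To make the decomposition concrete, here is a minimal Python sketch for a single voting country in a single year. The country labels and the values of `theta` and `phi_from_voter` are made up for illustration and are not fitted values. The expected vote shares rely on the standard fact that the mean of a Dirichlet distribution is each parameter divided by the sum of all parameters.

```python
import math

def dirichlet_params(theta, phi_from_voter):
    """Return beta_j = exp(theta_j + phi_ij) for each candidate country j."""
    return {c: math.exp(theta[c] + phi_from_voter[c]) for c in theta}

def expected_shares(beta):
    """Mean of Dir(beta): each parameter divided by the sum of all parameters."""
    total = sum(beta.values())
    return {c: b / total for c, b in beta.items()}

# Hypothetical song qualities for three candidate countries in one year.
theta = {"A": 1.0, "B": 1.0, "C": 0.0}
# Hypothetical biases of one voting country towards each candidate.
phi_from_voter = {"A": 0.5, "B": 0.0, "C": -0.5}

beta = dirichlet_params(theta, phi_from_voter)
shares = expected_shares(beta)

for country in sorted(shares):
    print(country, round(shares[country], 3))
```

In this toy setting A and B have equally good songs, but the positive bias towards A gives it a larger expected share of the votes.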

Note that we don’t actually have access to $$r_{c_ic_jy_k}$$; we only have data on the number of points each country was awarded. But we make do with what we have and approximate $$r_{c_ic_jy_k}$$ by

$r_{c_ic_jy_k} \simeq \frac{(\text{points awarded to country } c_j \text{ by country } c_i \text{ in the year } y_k) + \alpha}{(\text{total points awarded by country } c_i \text{ in the year } y_k) + N\alpha},$

where $$\alpha$$ is a constant that we set to 0.1.
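As a quick illustration of this approximation, the sketch below applies it to the points handed out by one hypothetical voting country. The function name and the `country_i` labels are ours, and the example assumes twelve candidate countries, two of which receive no points.

```python
def approximate_vote_shares(points, alpha=0.1):
    """Approximate vote fractions from the points one country awarded in one year.

    points maps each candidate country to the points it received.
    Adding alpha keeps every share strictly positive, which a Dirichlet
    distribution requires.
    """
    n = len(points)
    total = sum(points.values())
    return {c: (p + alpha) / (total + n * alpha) for c, p in points.items()}

# Hypothetical ballot: the ten scoring places plus two countries with no points.
scores = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0]
points = {f"country_{i}": s for i, s in enumerate(scores)}

shares = approximate_vote_shares(points)
print(round(sum(shares.values()), 6))  # the shares sum to one by construction
```

Countries that received no points all end up with the same small share, which is why the approximation carries no information about the tail.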

It’s hard to say for definite whether this is a reasonable approximation without being able to actually see the voting data, but preferences often follow power laws, and the decreasing sequence of points 12, 10, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0, … at least follows a similar shape:

It’s not perfect, but hopefully good enough. Note that we do completely miss out on any information about the tail, but we assume that this is mostly noise that doesn’t contribute much anyway.

## Fitting the model

We fit the model using Stan, a programming language that is great for making Bayesian inferences. It uses Markov chain Monte Carlo methods to sample from the posterior distribution of the parameters given the observed responses. Stan is very powerful: all we really need to do is specify the model and pass in our data, and Stan does the rest!

As Stan uses Bayesian methods, it returns a sample of the distribution of your parameters, in our case consisting of 16 000 (paired) values for each parameter. In our previous analysis we simply took the means as point estimates for our parameters, but having the distribution lets us talk about the uncertainty of these estimates. For example, the point estimate for objective quality $$\theta$$ of the winner in 2015 (Sweden) is 1.63, and 1.41 for the runner-up (Russia). This, however, doesn’t tell us the full picture. Here’s a plot for the joint distribution of $$\theta$$ for these entries:

From this joint distribution we can calculate the probability that Sweden’s entry was objectively better than Russia’s entry as the proportion of samples above the blue line, and this turns out to be about 93%.
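With paired posterior draws in hand, this probability is just a count. The sketch below shows the calculation on synthetic stand-in draws: the real values would come from the Stan fit, and the spread of 0.1 and the independence of the two sets of draws are assumptions made purely so that the example runs.

```python
import random

def prob_greater(samples_a, samples_b):
    """Fraction of paired draws in which a exceeds b."""
    if len(samples_a) != len(samples_b):
        raise ValueError("samples must be paired, draw by draw")
    wins = sum(a > b for a, b in zip(samples_a, samples_b))
    return wins / len(samples_a)

# Synthetic stand-ins for the posterior draws (NOT the real Stan output).
# The means match the point estimates quoted above; the standard deviation
# of 0.1 is an arbitrary assumption.
rng = random.Random(0)
theta_sweden = [rng.gauss(1.63, 0.1) for _ in range(16000)]
theta_russia = [rng.gauss(1.41, 0.1) for _ in range(16000)]

p = prob_greater(theta_sweden, theta_russia)
print(f"P(Sweden better than Russia) is roughly {p:.2f}")
```

Keeping the draws paired matters: the two parameters are generally correlated in the posterior, so comparing them draw by draw uses the joint distribution rather than the two marginals.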

## Finding the blocks

Looking at the bias terms $$\phi_{c_ic_j}$$, we can attempt to group the countries into groups that are tightly connected, i.e. where there are positive biases within a group and neutral or negative biases between groups. We use a method based on Information Theoretic Co-Clustering, choosing the clustering that loses as little mutual information as possible.

The basic idea can be described as follows: For each vote from country A to country B, take a marble and label it ‘from: country A, to: country B’. Now put all the marbles in a jar, and pick one at random. How much does knowing which country the vote was from tell you about who it was for? For example, Romania and Moldova almost always trade 12 points, so knowing that the vote was from Romania tells me there is a high probability that the vote was for Moldova. Mutual information gives us a quantitative value of this knowledge for the whole jar.

Now if we have clustered the countries into blocks, we can instead label the marbles ‘from: block A, to: block B’. We generally lose information by doing this: as we don’t know which country the vote was actually from, it’s harder to predict which block the vote was for. By finding the clustering that loses the least amount of information, we get the clustering that best represents the biases.
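The marble picture translates directly into a few lines of Python. This is a toy sketch rather than the co-clustering algorithm itself: it only scores two hand-picked clusterings of six made-up countries, where each country votes once for each of the other two countries in its own group, and it assumes the same blocks are used for the ‘from’ and ‘to’ labels.

```python
import math
from collections import Counter

def mutual_information(marbles):
    """Mutual information (in bits) between the 'from' and 'to' labels."""
    n = len(marbles)
    joint = Counter(marbles)
    from_counts = Counter(f for f, _ in marbles)
    to_counts = Counter(t for _, t in marbles)
    mi = 0.0
    for (f, t), count in joint.items():
        # p(f, t) / (p(f) * p(t)) simplifies to count * n / (count_f * count_t)
        mi += (count / n) * math.log2(count * n / (from_counts[f] * to_counts[t]))
    return mi

def relabel(marbles, block_of):
    """Replace country labels on each marble by block labels."""
    return [(block_of[f], block_of[t]) for f, t in marbles]

# Toy jar: two groups of three hypothetical countries voting within their group.
groups = [("A", "B", "C"), ("D", "E", "F")]
marbles = [(f, t) for g in groups for f in g for t in g if f != t]

good = {"A": "X", "B": "X", "C": "X", "D": "Y", "E": "Y", "F": "Y"}
bad = {"A": "X", "B": "X", "D": "X", "C": "Y", "E": "Y", "F": "Y"}

mi_full = mutual_information(marbles)
mi_good = mutual_information(relabel(marbles, good))
mi_bad = mutual_information(relabel(marbles, bad))

print("information lost by good clustering:", round(mi_full - mi_good, 3))
print("information lost by bad clustering:", round(mi_full - mi_bad, 3))
```

The clustering that matches the true voting groups loses less information than the one that mixes them, which is the quantity the search over clusterings tries to minimise.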

Below is a heatmap showing the probabilities of countries voting for each other, with our identified blocks separated by lines. The blocks capture voting behaviour fairly well: voting within the blocks is far more likely than between the blocks (with Lithuania and Georgia being a notable exception). Also, we can identify an “ex-Soviet block” within the “Eastern Europe” block, and a “Northern Europe” block within the “Western Europe” block, both highlighted in grey.