
Useful Out-of-the-Box Machine Learning Models

With the growing popularity of data science, out-of-the-box machine learning models are becoming increasingly available in free, open-source packages. These models have already been trained on datasets that are often large and therefore time-consuming to train on. They are also packaged in ways that are easy to use, simplifying the process of applying data science. While there is no substitute for customising a model to suit your objectives, these packages are useful for obtaining quick results and provide a gentle, easy introduction to data science models. In this article, we touch on some of these tools and their applications to well-known machine learning problems.

Image classification using out-of-the-box machine learning models

Image classification is a machine learning task where an image is categorised into one of several categories by identifying the object in the image. It is easy for humans to recognise the same object shown against different backgrounds and in different colours, and it turns out that algorithms can also classify images with a reasonable degree of accuracy. Deep learning has driven significant advances in image classification and other computer vision problems, and some of these models are available out of the box in Keras/TensorFlow.

As an example, the Inception-v3 model, trained on ImageNet, a database of over 14 million labelled images, is available as a tool for prediction. It is easy to set up, taking only a couple of lines of code before you’re ready to begin classifying your own images – simply read in an image and the model will return a category and a score. Inception-v3 and other image classification models can also be fine-tuned by keeping the weights of some layers fixed and tuning the weights of other layers by training on a new, more relevant dataset. This procedure, known as transfer learning, customises the model to the new data and is especially useful when the new dataset is small.
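The steps above can be sketched in a few lines of Python. This is a minimal illustration assuming TensorFlow/Keras is installed; a random array stands in for a real photo here, so substitute your own image in practice.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)

# Load the pre-trained model (downloads the ImageNet weights on first use)
model = InceptionV3(weights="imagenet")

# A random 299x299 RGB array stands in for a real photo in this sketch;
# replace it with your own image loaded and resized to 299x299.
x = preprocess_input(np.random.uniform(0, 255, (1, 299, 299, 3)))

preds = model.predict(x)              # shape (1, 1000): one score per ImageNet class
top3 = decode_predictions(preds, top=3)[0]
for _, label, score in top3:
    print(f"{label}: {score:.3f}")
```

In practice you would load an image file, resize it to 299×299 (the input size Inception-v3 expects), and pass it through the same `preprocess_input` step before predicting.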


Named Entity Recognition

With Named Entity Recognition (NER), the goal is to identify named entities within a chunk of text, such as the names of organisations or people, geographical locations, numerical units, dates and times, and so forth. This is useful for automatic information extraction, e.g. from articles, reports, or invoices, and saves a human the effort of manually reading through a large number of documents. NER algorithms can help to answer questions such as which named entities are mentioned most frequently, or to consistently pick out a monetary value within the text.

spaCy has a NER module available in several languages, as well as a range of text processing capabilities such as part-of-speech tagging, dependency parsing, and lemmatisation. Installation of the package is straightforward, and no more than a few lines of code are required to begin extracting entities. spaCy v2.0 NER models use subword features and a deep convolutional neural network architecture, and the v3.0 models have been updated with transformer-based models. spaCy also allows training on new data to update the model and improve accuracy, and provides a component for rule-based entity matching where that approach is more convenient.

Sentiment Analysis using out-of-the-box machine learning models

Sentiment analysis is used for identifying the polarity of a piece of text, i.e., whether it is positive, negative or neutral. This is useful for monitoring customer and brand sentiment and analysing any other text-based feedback. More recently, it has been used for analysing public sentiment towards the COVID-19 pandemic through social media posts and article headlines.

Hugging Face transformers is a library containing a range of well-known transformer-based models that have obtained state-of-the-art results in a number of natural language processing tasks. It includes both language models and task-specific models, and has a sentiment analysis model with a DistilBERT base architecture (a smaller, faster version of the BERT language model) fine-tuned for the sentiment analysis downstream task on the SST-2 dataset. The model returns positive and negative labels (i.e., it excludes neutral) as well as a confidence score. Other options for an out-of-the-box sentiment analysis model are TextBlob, which has both a rule-based sentiment classifier and a naive Bayes model trained on movie reviews, and Stanza, which has a classifier based on a convolutional neural network architecture and can handle English, German and Chinese texts. Both TextBlob and Stanza return a continuous score for polarity.
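A quick sketch of the transformers sentiment pipeline, which by default downloads the DistilBERT model fine-tuned on SST-2 described above (assuming the transformers library and a backend such as PyTorch are installed; the sample sentences are invented):

```python
from transformers import pipeline

# The default "sentiment-analysis" pipeline uses DistilBERT fine-tuned on SST-2
classifier = pipeline("sentiment-analysis")

results = classifier(["I love this product!", "The service was terrible."])
for r in results:
    # Each result is a dict with a POSITIVE/NEGATIVE label and a confidence score
    print(r["label"], round(r["score"], 3))
```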

Transfer learning using pre-trained models

While there is a lot of value in an out-of-the-box model, its accuracy may not be the same when applied directly to an unseen dataset. This usually depends on how similar the characteristics of the new dataset are to those of the dataset the model was trained on – the more similar they are, the more the model’s results can be relied on. If direct use of an out-of-the-box model is not sufficiently accurate and/or not directly relevant to the problem at hand, it may still be possible to apply transfer learning and fine-tuning, depending on package functionality. This involves using new data to build additional layers on top of an existing model, or to retrain the weights of existing layers while keeping the same model architecture.

The idea behind transfer learning is to take advantage of the model’s general training and transfer its knowledge to a different problem or data domain. In image classification, for example, the pre-trained models are usually trained on a large, general dataset with broad categories of objects. These pre-trained models can then be further trained to classify, for example, only cats vs dogs. Transfer learning is another way to make use of out-of-the-box machine learning models, and is an approach worth considering when data is scarce.
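The cats-vs-dogs example could be set up in Keras roughly as follows. This is a sketch under stated assumptions: the ImageNet-trained Inception-v3 base is frozen, and the two-class head and the commented-out training call are placeholders for your own dataset.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

# Pre-trained base without its ImageNet classification head
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze the general-purpose feature layers

# New trainable head for the target task (2 classes: cats vs dogs)
model = models.Sequential([
    base,
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# model.fit(new_images, new_labels, epochs=5)  # train only the new head on your data
```

Freezing the base keeps the general visual features learned from ImageNet fixed, so only the small new head needs to be trained – which is exactly why this approach works well when the new dataset is small.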

GPT-3: What is It?

Recently you might have come across GPT-3, and some of the cutting-edge tasks this innovation in artificial intelligence can carry out. If you spend any time on tech Twitter, there is absolutely no way you avoided the hype.

Some of the things people have demonstrated so far with GPT-3 include describing the layout of a webpage they wanted built using natural language and having the model code the actual front end of the website, or using the model to generate a full article about why humans shouldn’t be worried about robots any time soon.

Like self-driving cars, or other promises of artificial intelligence, it’s easy to get caught up in the hype. So we wanted to share a bit more about GPT-3, what it is, and what it can do. And, perhaps, find a use case for your organization.


GPT-3 is a project of OpenAI, a California-based AI research laboratory. Initial backing for OpenAI came from Elon Musk, Peter Thiel, and other Silicon Valley technologists. The mission of the organization is “to ensure that artificial general intelligence benefits all of humanity.”

Among their projects is the Generative Pre-trained Transformer, or GPT. As the name suggests, the latest release is the third iteration of the model, and is a massive leap forward from its predecessors – or any other model from other labs, for that matter. GPT-3 has 175 billion parameters. For context, this is 10 times the number of parameters of the next biggest model, Microsoft’s Turing NLG.

GPT-3 is a deep learning model built on neural networks. It was trained mostly on the Common Crawl dataset, which is essentially data pulled from websites across the internet on a monthly basis. In addition, it used data from books and Wikipedia.

What Does it Do?

GPT-3 is a model that focuses on natural language. In the simplest terms, you can pose a question to it and have it respond the way a human would.

The most impactful use cases for GPT-3 are probably still being dreamed up by startups all over the world. But some of the obvious ways companies can use GPT-3 are for chatbots, answering customer support questions, or even translating plain English into SQL queries or regular expressions.

GPT-3 can also write some code, including CSS and Python.

How To Get Access

OpenAI is providing API access to GPT-3 so anyone can build applications on top of it. You can join the waitlist to get access by filling out a form with the use case you have in mind.


There is certainly a lot of hype around GPT-3, and we’ve yet to see killer applications built on it. Like any new technology, it will follow the traditional hype cycle. But the technology is clearly a massive improvement on its predecessors and holds enormous potential.

Onboarding a data scientist

Hiring a new data scientist into your team can be a very exciting time. The right candidate can provide new insights to your organisation, automate time-consuming tasks, and help transform decision making to become more data driven. As your new team member’s start date approaches, you might start to think about how best to onboard them. In other words, what can you do, as a manager, to help them get up to speed and start adding value to your team?


Integrating a new data scientist into your organisation may not be straightforward for several reasons:

  • You may not understand enough of what they do to know what they need.
  • The role itself is often open-ended and flexible.
  • Data scientists’ backgrounds range widely, from engineering and mathematics to computer science, and vary considerably in prior experience.
  • Their day-to-day work may differ depending on a number of factors, such as whether the company has more or less structured processes, whether it is a consultancy or a product company, and whether the pace of work is fast or slow.

In this article, we provide a wide range of suggestions to design an onboarding process that considers the work environment and the data scientist’s background. There are obvious things one should do, such as introducing them to the immediate and wider team and setting up any system accesses they require. These are steps that you are likely to take for any new starter, and we will not cover them in this article. We focus on onboarding actions that would help a data scientist specifically.

Have them design their own training

Chances are that you have hired someone you think is smart, has programming skills and prior experience in solving problems using data. But they may not be familiar with the specific problems in your industry, or maybe they haven’t been using the specific modelling technique that is commonly used in your industry, or perhaps they have previously coded in a different programming language. These are not showstoppers to them performing well in the role, and can easily be addressed with a good onboarding/training process.

By nature of the role, many data scientists either have a research background or are experienced with some form of research. What this means is that they should be able to identify the gaps in their knowledge, and effectively look for ways to learn the things they don’t know. Have them take charge of their training. This will cater for their individual background and prior experience. For example, someone with a lot of programming experience might want to spend less time learning new packages and software, and focus more on learning new mathematical concepts. Likewise, someone with a strong statistics or mathematics background might want to spend more time on programming material. Furthermore, they may already have a preference for their approach to learning new skills – some people learn best by doing, some people prefer reading conceptual material, and others benefit more from watching video courses.

Learning is most effective when sufficiently spaced out. If it is feasible as part of the onboarding process, suggest that your data scientist spend some time every day on a training resource of their choice. This could be books, research articles, video lectures, industry workshops, and industry documentation, for example.

Assign them a small first project

Since a data scientist’s job will involve a fair amount of programming, a good onboarding activity is to give them a small, easy programming task. Consider whether or not to choose a task with time constraints attached. There are advantages and disadvantages to both, and the choice will depend on the company’s situation. If the work environment is more fast-paced, then giving them a task that fits into the team’s day-to-day work will be immediately useful. The time constraints will mimic the real work they are expected to perform, and get them up to speed on doing this work. If their work is not as urgent, then you might prefer to give them sufficient time to learn not just the specific task, but also any peripheral knowledge. This will allow them flexibility in their learning, to focus on best practices instead of just rushing to ‘get the job done’.

Examples of small projects are:

  • Perform an analysis to obtain insight on a section of company data
  • Build a simple dashboard using data from the company’s database
  • Write a short piece of code that fits into software your company owns – for example, adding a new feature, modifying a small section of the code to make it more efficient, or reframing it for a different purpose
  • Follow documentation to execute a piece of software your company owns

Integrate your data scientist into the business

Introducing your data scientist to key subject matter experts across the business is essential – these will be the people they may go back to again and again to obtain domain information essential to their analysis. You can do this through formal or informal channels. Examples of formal channels include stakeholder meetings and any discussions involving core business strategy, the day-to-day running of the business, factors that impact profit and loss, and the types of decision making involved. This will give them context for their work and how it fits in with the company’s overall strategy. Informal discussions are sometimes the most efficient form of knowledge transfer; you could organise a chat over lunch with the relevant stakeholders to facilitate this.

While understanding how the business operates is helpful to the new data scientist, be mindful that they need to spend their time on other areas as well, and try not to overwhelm them with too much business information at once.

Communicate, communicate, communicate

At the start, it is important to communicate the expectations of the role, the type of problems you want them to solve, and the available resources in the company. This will help them to determine how best to get up to speed, to set learning objectives for themselves, and gather the resources to work towards your goals. If you assign them a task and there are deadlines to meet, make sure this is communicated clearly too. On the other hand, if you would like them to be free to spend their initial weeks on general upskilling, ensure they know this too. Make sure you are both aligned on a project plan to avoid rework down the track.

It’s entirely possible that the role will evolve, or you might change your mind on what you want them to work on. That’s okay too, as long as you keep them in the loop, and include them in these discussions. Data science is an interdisciplinary field and your data scientist should be adaptable.

Set up an environment for data science

Do you have the right environment set up for your data scientist? It is important to discuss from the very beginning what kind of tools and software they will need, and what resources you currently have. This will help them to figure out what’s achievable and what’s not. Whether or not the resources you currently have are sufficient depends on the end goals you have in mind for the data science project.

One-off analyses and proof-of-concept models will most likely not require any complex set-up. However, imagine, for example, that the end goal is to build a predictive model that automatically updates itself, which is then integrated into the business and made available to key people on a dashboard. In this scenario, you may want to consider the technology required to achieve this, and whether you might like to purchase cloud computing services or dashboard software. Also, if there is going to be more than one person working on the same set of code, then it is typically necessary to have version control software. If your company doesn’t already own a database, you may want to consider developing this alongside the data science project, especially if you envision having to make use of much more complex data in the future.

Start these discussions early, and plan as much as you can on choosing these initial systems, as it will be much harder to switch once you have set things up a certain way. Your vision and constraints will help your data scientist plan their workflow accordingly.


To summarise, this article outlines some approaches on how to design an onboarding process for a data scientist. Always communicate with your data scientist, as they may have their own thoughts on technical training, any resources they require from you, and how best to work towards a data science goal. In turn, as a manager with a lot of experience in your industry, you can help to provide context and domain information to your data scientist, and connect them to key stakeholders in your company. Providing your data scientist with the right environment and resources will ensure they are set up for success.
