Data Science Terms: A Data Mettle Guide

As someone who joined a company that does data science for a living, I quickly realised that what I knew about the field was only the tip of the iceberg. There was and is a huge learning curve for me to overcome to understand the key data science terms and concepts. So, I’ve decided to take you on my learning journey as I learn and understand these concepts. We’ll keep this guide updated as new terms pop up, so check back in regularly or bookmark us.

Artificial Intelligence (AI)

In general, AI is the ability of computers to mimic the human ability to perform tasks that require complex thought processes.

Artificial intelligence is a somewhat nebulous term, and what constitutes AI evolves over time. There is a phenomenon called the “AI effect”: once a machine has been properly trained to solve a particular problem, that problem is no longer regarded as requiring intelligence and so drops out of the definition of AI. Tesler’s theorem puts it this way: “AI is whatever hasn’t been done yet.”

Self-driving cars are a perfect example of artificial intelligence in action. The systems developed for autonomous vehicles must mimic the hundreds of decisions humans must make every minute while they drive.

AI is an umbrella term for many concepts within data science. For instance, you would consider machine learning (ML) a subset of AI.

Clustering

Also called cluster analysis, an unsupervised learning technique that identifies patterns within a set of data based on common characteristics. An example of clustering would be to take poll data and start to define personas and typologies of responses to the polling questions.

Watch our video on clustering and segmentation to learn more.

Data engineering

The role of data engineering is to make data accessible and usable within an organisation to whoever needs it. That might be data scientists who need to train and run models, or analysts and marketers who make decisions based on historical data and performance.

Read our blog on the difference between data science and data engineering.

Decision Tree

In the context of machine learning and data science, a decision tree is a series of yes/no questions asked of data to arrive at a prediction.
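To make the "series of yes/no questions" concrete, here is a minimal hand-written sketch. The questions, field names and outcomes are all made up for illustration; in practice an algorithm learns the questions from data rather than a person typing them in.

```python
def predict_purchase(customer):
    """Walk a tiny decision tree of yes/no questions (all hypothetical)."""
    # Question 1: did the customer visit the site in the last week?
    if customer["visited_last_week"]:
        # Question 2: is there anything in their basket?
        if customer["items_in_basket"] > 0:
            return "likely to buy"
        return "browsing"
    # Question 3: did they at least open our last email?
    if customer["opened_last_email"]:
        return "browsing"
    return "unlikely to buy"


prediction = predict_purchase(
    {"visited_last_week": True, "items_in_basket": 2, "opened_last_email": False}
)
```

Each path from the first question to an answer is one branch of the tree.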

Deep Learning

A subset of machine learning based on deep neural networks. Neural networks have been around for over 50 years, but only recently have we had enough computing power to use networks with many layers (hence “deep”). That is when they became both ubiquitous and really good.

K-means Clustering

A type of clustering in which the user defines in advance the number of clusters to split the data into. The algorithm then finds a representative ‘centre’ for each cluster, and each point is assigned to the cluster whose centre it is closest to.

An example is distinguishing fraudulent activity from non-fraudulent activity, where you can use prior data to identify the characteristics of activity you know to be fraudulent and of activity that was legitimate. You could then assess future activity by how closely it resembles the characteristics of the centre of each group.
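The "find the centres, then assign each point to the nearest one" loop can be sketched in a few lines of plain Python. This is a toy version under simplifying assumptions: the points and starting centres are invented, the number of passes is fixed, and the function name k_means is my own. Real libraries usually choose starting centres at random and stop once nothing changes.

```python
from math import dist


def k_means(points, centres, iterations=10):
    """Alternate between assigning points to centres and moving the centres."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for point in points:
            nearest = min(range(len(centres)), key=lambda i: dist(point, centres[i]))
            clusters[nearest].append(point)
        # Update step: move each centre to the average of its points
        # (an empty cluster keeps its old centre).
        centres = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centre
            for cluster, centre in zip(clusters, centres)
        ]
    return centres, clusters


# Two obvious groups of made-up 2D points, and two starting guesses (k = 2).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centres, clusters = k_means(points, centres=[(0.0, 0.0), (10.0, 10.0)])
```

The number of starting centres passed in is the pre-defined number of clusters.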

Machine Learning

The ability of computer systems to automatically learn and improve from experience without specific oversight or input by a human. Machine learning models access data on an ongoing basis and use it to learn for themselves, based on a pre-defined goal of the developer of that model.

Matching Data

When you are working with multiple data sets, there are often issues when matching records between them. For example, you might have customer data in several databases. There might be a unique identifier, such as an email address, that links the records together. However, this is not always the case. One way to overcome this is to build a statistical model that connects the records based on the points of commonality you do have in the data.
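As a rough illustration of matching on "points of commonality", the sketch below scores how alike two records are using text similarity from Python's standard difflib module. It is a simple weighted score rather than a full statistical model, and the records, field names and weights are all invented for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-1 score for how alike two strings are, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(record_a, record_b, weights):
    """Weighted average of field-by-field similarity (weights are assumptions)."""
    return sum(
        weight * similarity(record_a[field], record_b[field])
        for field, weight in weights.items()
    )


# One record from a hypothetical CRM and two candidates from a billing system.
crm_record = {"name": "Jonathan Smith", "postcode": "SW1A 1AA"}
billing_records = [
    {"name": "Jon Smith", "postcode": "SW1A 1AA"},
    {"name": "Joan Smythe", "postcode": "M1 2AB"},
]
weights = {"name": 0.6, "postcode": 0.4}

best_match = max(billing_records, key=lambda r: match_score(crm_record, r, weights))
```

In a real project you would also set a minimum score below which no match is accepted.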

Watch our video on matching data sets to learn more.

Neural Networks

Layers of algorithms that are inspired by biological neural systems, where data is sent through ‘synapses’ that pass information on to the next layer of algorithms. For example, in image recognition, the neural network will be fed pictures of an object to train on, such as a cat, and will then learn the characteristics that make up that object (fur, whiskers, etc.) to distinguish it from other objects.

NumPy

A software library for Python. The name stands for Numerical Python. NumPy supports large multidimensional arrays and matrices and lets users perform mathematical operations on them.
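Here is a small taste of what that looks like, assuming NumPy is installed. The numbers are arbitrary.

```python
import numpy as np

# A 2x2 matrix and a vector, stored as NumPy arrays.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([10.0, 20.0])

doubled = matrix * 2                # multiply every element at once
product = matrix @ vector           # matrix-vector multiplication
column_means = matrix.mean(axis=0)  # average of each column
```

The point is that the operations apply to whole arrays, with no loops to write.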

Optimisation

The process of achieving optimal results by reducing some element of risk or waste. An excellent example of this would be scheduling delivery driver routes for a large grocery delivery company. The goal would be to deliver the most groceries with the fewest drivers and in the shortest amount of time. However, there are countless complexities to account for: the various routes drivers can take, traffic, and which drivers are scheduled for the day or are on holiday. Computers are much more adept at this kind of problem than humans because they can quickly run through all the various permutations required to find the optimal output.
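To show what "running all the permutations" means, here is a toy sketch for a single driver with three drops. The depot, the stop names and their coordinates are invented, and straight-line distance stands in for real road travel time. Trying every order only works for a handful of stops; real routing problems need cleverer search methods.

```python
from itertools import permutations
from math import dist

# Hypothetical map: a depot and three delivery stops on a grid.
depot = (0, 0)
stops = {"A": (2, 0), "B": (2, 2), "C": (0, 2)}


def route_length(order):
    """Total distance of leaving the depot, visiting stops in order, and returning."""
    path = [depot] + [stops[name] for name in order] + [depot]
    return sum(dist(here, there) for here, there in zip(path, path[1:]))


# Try every possible visiting order and keep the shortest.
best_order = min(permutations(stops), key=route_length)
best_length = route_length(best_order)
```

With three stops there are only six orders to check, but the count grows very quickly as stops are added.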

Watch our video on optimisation to learn more.

pandas

A software library for Python, available for free online. It enables users to efficiently explore, clean and process their data before jumping into running machine learning applications on it.
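The explore, clean and process steps might look something like this, assuming pandas is installed. The table of orders is made up for the example.

```python
import pandas as pd

# A small, messy table of hypothetical orders.
orders = pd.DataFrame(
    {
        "customer": ["ann", "bob", "ann", None],
        "amount": [20.0, 35.5, None, 12.0],
    }
)

# Explore: summary statistics for the numeric columns.
summary = orders.describe()

# Clean: drop rows with no customer, and treat missing amounts as zero.
clean = orders.dropna(subset=["customer"]).fillna({"amount": 0.0})

# Process: total spend per customer.
totals = clean.groupby("customer")["amount"].sum()
```

Whether a missing amount should really count as zero is a judgment call that depends on your data.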

Predictive Models

Watch our video on predictive models to learn more.

Python

A programming language with an extensive library of third party software packages and tools available to extend its functionality, such as NumPy and pandas, which are commonly used by data scientists.

R

A programming language primarily used for statistical computing. It has a history of being used by academic statisticians, and many cutting edge statistical models are first implemented in R.

Random Forest

Also called random decision forests, an algorithm that relies on many decision trees working together to arrive at the best possible prediction. Essentially, it is crowdsourcing a prediction from many decision trees. Each of these decision trees trains on a randomly picked subset of the data, allowing them to be built separately from each other. The idea behind this is to avoid overfitting the model to the training data.
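The crowdsourcing idea can be sketched with a deliberately tiny "forest". In this toy version each tree asks just one yes/no question, every tree trains on its own random sample of the data, and the forest takes a majority vote. The data and function names are invented, and a real random forest grows deeper trees and also randomises which features each tree may use.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-question tree: 'is x above the threshold?'."""
    by_label = {0: [], 1: []}
    for x, label in sample:
        by_label[label].append(x)
    if not by_label[0] or not by_label[1]:
        # Only one class was drawn, so this tree always answers with it.
        only = 0 if by_label[0] else 1
        return lambda x: only
    # Put the split halfway between the two class averages.
    mean0 = sum(by_label[0]) / len(by_label[0])
    mean1 = sum(by_label[1]) / len(by_label[1])
    threshold = (mean0 + mean1) / 2
    high = 1 if mean1 > mean0 else 0
    return lambda x: high if x > threshold else 1 - high


def train_forest(data, n_trees=25, seed=0):
    """Train each tree on its own random sample, drawn with replacement."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]


def predict(forest, x):
    """Majority vote across all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Made-up data: small values are class 0, large values are class 1.
data = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
forest = train_forest(data)
low_prediction = predict(forest, 2)
high_prediction = predict(forest, 9)
```

Because each tree sees slightly different data, one odd tree gets outvoted by the rest.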

Segmentation

Watch our video on clustering and segmentation to learn more.

Supervised vs. Unsupervised Learning

Supervised learning is where models are trained on labelled data, whereas unsupervised learning works with data that has no existing labels. For example, in a supervised learning environment, we might train a model on a collection of images labelled as car, bike or horse so that it learns to distinguish between these objects. In an unsupervised learning environment, we would instead start with unlabelled images and let the model group them by common characteristics.

SQL

SQL, which stands for Structured Query Language, is a programming language for managing data held in a relational database. SQL allows users to query and update records in a database.
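Here is a short sketch of querying and updating records, run from Python using the sqlite3 module that ships with it. The customers table and its contents are invented, and the database lives only in memory.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ann", "London"), ("Bob", "Leeds")],
)

# Update a record: Bob has moved.
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Bristol", "Bob"))

# Query the records back, sorted by name.
rows = conn.execute("SELECT name, city FROM customers ORDER BY name").fetchall()
conn.close()
```

The SQL itself is the text inside the quotes; Python is only passing it to the database.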

TensorFlow

Free software commonly used for machine learning applications such as neural networks. The Google Brain team developed TensorFlow for internal use, and then made it open source in 2015.

Five Good Uses of Data Science in Products

There was a period, not all that long ago, when startups pitched themselves first as a machine learning or artificial intelligence company, using these technologies to solve complex problems and provide a unique user experience. Now, data science methodologies are so ubiquitous that for many new companies and products in specific sectors, to even think about not leveraging them would be heretical.

We all interact with data science daily in the products we use. Like any well-implemented product feature, it blends in seamlessly with the user experience. As a user, you don’t need to know what technology is running in the background of the products you use. You want them to solve your headaches, or provide you joy.

Here is our list of five good uses of machine learning and data science in products.

Ocado

The thing that always put me off shopping for food online is that there is a flow to a supermarket or grocery store. You walk through the various aisles, and the food on the shelves speaks to you, catches your attention, makes you think of a recipe that you want to try. You may start with a list, but you always end up finding something new that you want to try out.

Ocado is one of the leading employers of data scientists and engineers (in fact our data scientists Jeremy and Johan hail from Ocado). AI and machine learning underpin all of Ocado, including factory layout, driver logistics, customer feedback analysis, responding to customer complaints, and the shopping experience. Ocado technology also helps users to navigate through their shopping more efficiently, having the right next product suggested to them to help them get their shopping done better and quicker. Or, more cynically, so you buy more.

Smart Compose in Gmail

I am a nervous emailer. I’ll often write something and go over it three or four times, changing tiny details, because it doesn’t sound right to me. That all changed when Smart Compose came around. Somehow, the machine predicting what I should say gave me more confidence to say it.

While that might not be the exact use case or problem to be solved when they started building the product, it does make it one of my favourite features of G Suite. I’d imagine for many power users, and people who live in their inbox, it presents a considerable amount of time savings.

When I first came across this feature, I thought the UI would be a bit awkward, as you have to hit tab to utilise the suggestion. However, in my experience, it fits in quite nicely with how I type. And now as I tap this blog post draft out in Google Docs I wonder when they will bring this to other parts of the G Suite.

The tech behind Smart Compose is pretty impressive. There are many challenges the Google team needed to overcome, including speed (it needs to suggest quicker than people can type after all), scale (providing the right predictions for a given user), and reducing bias in the suggestions.

It uses neural networks to take into account contexts, such as email subject and prior correspondence, and predict what the next phrase might be. They have an excellent blog writeup here on the technology.

Face Grouping in Google Photos/Other Photo Services

This post might give me away as a Google product power user. I love the face grouping in Google Photos. It makes finding the right picture of people, in a sea of the millions of photos we all have on our phones, super quick. I am always impressed by how well it groups people, particularly with my kids. The technology can connect their newborn photos with them as a toddler, even as I struggle to remember “is that Frankie or Archie in this one?” It can also distinguish my cat from the many other cat photos I have on my phone (don’t ask).

This facial recognition technology is used across products and features within Google, and Google allows developers to deploy the technology in their own products, for instance, with the Firebase ML Kit.

Spotify Song Recommendations

I recently switched from the Google Play streaming service to Spotify (see, I can use non-Google products). One of the reasons it took me so long to do so was the headache of having to build a whole new library of music in Spotify. I didn’t want to go through it all and follow my favourite artists. What really surprised me when I made the move was how quickly, and how little data was actually required for Spotify to fairly accurately understand my musical tastes and actually start suggesting to me artists and songs that I frequently listened to on Google Play.

There are a few technologies and techniques Spotify uses to predict your musical tastes and create your tailored playlists. The first is collaborative filtering, which makes recommendations to you based on crossover with other listeners who have similar preferences. The second is natural language processing (NLP): Spotify scours the internet and tags songs based on how frequently they are mentioned alongside other artists and songs. The third is raw audio processing, which recommends similar songs based on similar tempos, keys and time signatures (more on these methodologies here).
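Collaborative filtering is the easiest of the three to sketch. The toy example below is my own illustration, not Spotify's implementation: the listeners, songs and play counts are invented, and it uses cosine similarity to decide whose tastes are closest to yours before suggesting songs you have not played.

```python
from math import sqrt

# Hypothetical play counts per listener.
plays = {
    "you": {"song_a": 5, "song_b": 3},
    "alice": {"song_a": 4, "song_b": 3, "song_c": 5},
    "bob": {"song_d": 5, "song_b": 1},
}


def cosine(u, v):
    """Cosine similarity between two listeners' play counts (0 = nothing in common)."""
    shared = set(u) & set(v)
    dot = sum(u[song] * v[song] for song in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(user):
    """Rank songs the user has not played, weighted by listener similarity."""
    scores = {}
    for other, their_plays in plays.items():
        if other == user:
            continue
        weight = cosine(plays[user], their_plays)
        for song, count in their_plays.items():
            if song not in plays[user]:
                scores[song] = scores.get(song, 0.0) + weight * count
    return sorted(scores, key=scores.get, reverse=True)


suggestions = recommend("you")
```

Here alice's listening overlaps with yours far more than bob's does, so her favourite unheard song ranks first.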

Wealthfront ‘roboadvisor’

Financial services is an area ripe for the application of machine learning and other data science techniques. The vast amounts of available data, along with the inefficiencies, fraud, waste and high fees, make it particularly exciting as a wave of financial technology startups turns the space on its head.

My favourite consumer application in this area thus far is Wealthfront. It automatically builds users a balanced portfolio of exchange-traded funds based on risk profile. It even rebalances your portfolio for you to maximise efficiency. They have also released new features to help with financial planning, such as helping set budgets for when you want to buy a house, start a family, make large purchases, even plan to take an extended holiday. It plugs in all financial accounts you have, your current portfolio and risk preferences, and market data to help you prepare.

Wealthfront’s model allows more consumers to have access to financial planning, advice and portfolio management for significantly lower fees. Previously you would have to pay financial advisors to help you budget, and generally, they require clients to have a minimum net worth. To manage a balanced portfolio, you’d have to do it yourself, pay fees to whichever account manager you had, remember to rebalance your portfolio, and change it as your risk profile changed. Instead, automation, data and machine learning help you accomplish all this at a fraction of the cost.
