Data Science Terms: A Data Mettle Guide
As someone who joined a company that does data science for a living, I quickly realised that what I knew about the field was only the tip of the iceberg. There was, and still is, a huge learning curve to overcome in understanding the key data science terms and concepts. So I’ve decided to share my learning journey and explain these concepts as I get to grips with them. We’ll keep this guide updated as new terms pop up, so check back regularly or bookmark us.
Artificial Intelligence (AI)
In general, artificial intelligence is the ability of computers to mimic humans in performing tasks that require a complex set of thought processes.
Artificial intelligence is a bit of a nebulous term, and what constitutes AI evolves over time. There is a phenomenon called the “AI effect”: once a machine can reliably solve a particular problem, that problem is no longer considered to require intelligence and is removed from the definition of AI. Tesler’s theorem puts it neatly: “AI is whatever hasn’t been done yet.”
Self-driving cars are a perfect example of artificial intelligence in action. The systems developed for autonomous vehicles must mimic the hundreds of decisions humans must make every minute while they drive.
AI is an umbrella term for many concepts within data science. For instance, machine learning (ML) is considered a subset of AI.
Clustering
Also called cluster analysis, an unsupervised learning technique that identifies patterns within a set of data based on common characteristics. An example of clustering would be to take poll data and start to define personas and typologies of responses to the polling questions.
Watch our video on clustering and segmentation to learn more.
Data Engineering
The role of data engineering is to make data accessible and usable within an organisation by whoever needs it. That might be data scientists training and running models, or analysts and marketers making decisions based on historical data and performance.
Read our blog on the difference between data science and data engineering.
Decision Tree
In the context of machine learning and data science, a decision tree is a series of yes/no questions asked of data to arrive at a prediction.
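To make the “series of yes/no questions” idea concrete, here is a toy hand-built decision tree in Python; the loan scenario, feature names and thresholds are invented purely for illustration:

```python
# A toy decision tree as nested yes/no questions: should we approve a
# small loan? Features and thresholds are made up for illustration.

def predict(income, has_defaulted):
    """Walk a hand-built decision tree and return a yes/no prediction."""
    if has_defaulted:          # Question 1: any previous default?
        return "reject"
    if income >= 30000:        # Question 2: income above a threshold?
        return "approve"
    return "reject"

print(predict(45000, False))  # approve
print(predict(45000, True))   # reject
```

In practice the questions and thresholds are learned from data rather than written by hand, but the prediction process is exactly this walk from the root question down to a leaf.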
Deep Learning
A subset of machine learning based on deep neural networks. Neural networks have been around for over 50 years, but only recently have we had enough computing power to train networks with many layers (hence “deep”), which is when they became truly ubiquitous and effective.
K-means Clustering
A type of clustering where the user pre-defines the number of clusters to split the data into. The algorithm then finds a representative ‘centre’ for each cluster, and each point is assigned to the cluster whose centre it is closest to.
An example is separating fraudulent from non-fraudulent activity: you can use prior data to identify the characteristics of activity you know to be fraudulent and activity that was legitimate. You could then classify future activity by how closely it resembles the centre of each group.
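As a minimal sketch of how the assign-then-update loop works, here is a tiny one-dimensional k-means in plain Python (illustrative only; real implementations such as scikit-learn’s are far more robust):

```python
# Minimal 1-D k-means sketch: alternate between assigning each point to
# its nearest centre and moving each centre to the mean of its cluster.

def kmeans_1d(points, centres, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centres, clusters = kmeans_1d(points, centres=[0.0, 5.0])
print(centres)  # [1.5, 10.5] -- one centre per natural group
```

The two starting centres drift towards the two obvious groups in the data; with real data the same loop runs in many dimensions.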
Machine Learning (ML)
The ability of computer systems to automatically learn and improve from experience without specific oversight or input by a human. Machine learning models access data on an ongoing basis and use it to learn for themselves, based on a pre-defined goal of the developer of that model.
Matching Data Sets
When you are working with multiple data sets, there are often issues matching records from two different datasets. For example, you might have customer data in several databases. There might be a unique identifier, such as an email address, that links two records together, but this is not always the case. A way to overcome this is to build a statistical model that connects the records based on the points of commonality you do have in the data.
Watch our video on matching data sets to learn more.
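A very rough sketch of the idea, using simple string similarity from Python’s standard library rather than a full statistical model (the customer names are invented):

```python
# Toy fuzzy record matching: link names across two datasets by string
# similarity, using difflib from the standard library. Real record
# linkage models weigh many fields, not just one name string.
from difflib import SequenceMatcher

customers_a = ["Jane Smith", "Robert Jones", "Priya Patel"]
customers_b = ["Jayne Smith", "Priya Patel", "Rob Jones"]

def best_match(name, candidates):
    """Return the candidate most similar to `name` (ratio in 0..1)."""
    return max(candidates, key=lambda c: SequenceMatcher(None, name, c).ratio())

for name in customers_a:
    print(name, "->", best_match(name, customers_b))
```

Even with spelling variations (“Jane” vs “Jayne”, “Robert” vs “Rob”), the similarity score links the right pairs; a production system would combine several such signals and a threshold for when to declare a match.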
Neural Networks
Layers of algorithms inspired by biological neural systems, where data is passed through ‘synapses’ that trigger information onto the next layer of algorithms. For example, in image recognition, a neural network will be fed pictures of an object to train on, such as a cat, and will learn the characteristics that make up that object (fur, whiskers, etc.) to distinguish it from other objects.
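At its smallest, a “network” can be a single artificial neuron. The sketch below trains one neuron (a perceptron) to reproduce the logical OR function; it is a toy illustration of learning from examples, not a deep network:

```python
# A single artificial neuron (perceptron) trained on the OR function.
# It repeatedly nudges its weights whenever its prediction is wrong.

def step(x):
    return 1 if x > 0 else 0

# Training data: inputs and target outputs for logical OR.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [0.0, 0.0]   # weights, one per input
b = 0.0          # bias
lr = 0.1         # learning rate

for _ in range(20):  # a few passes over the data
    for (x1, x2), target in data:
        pred = step(w[0] * x1 + w[1] * x2 + b)
        error = target - pred
        # Nudge weights in the direction that reduces the error.
        w[0] += lr * error * x1
        w[1] += lr * error * x2
        b += lr * error

print([step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in data])  # [0, 1, 1, 1]
```

A deep network stacks many layers of such units and uses gradient-based training, but the core loop — predict, measure the error, adjust the weights — is the same.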
NumPy
A software library for Python whose name stands for Numerical Python. NumPy supports large multidimensional arrays and matrices and allows users to perform mathematical operations on them.
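A brief example of the kind of array operations NumPy provides:

```python
# NumPy in a nutshell: whole-array maths without explicit loops.
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])       # a 2x3 matrix

print(m * 2)          # element-wise multiplication
print(m.sum(axis=0))  # column sums: [5 7 9]
print(m.T.shape)      # shape of the transpose: (3, 2)
```

Operating on whole arrays at once is both more concise and much faster than looping over Python lists, which is why NumPy underpins most of the Python data science stack.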
Optimisation
The process of achieving optimal results by reducing some element of risk or waste. An excellent example would be scheduling delivery routes for a large grocery delivery company. The goal is to deliver the most groceries with the fewest drivers in the shortest amount of time. However, there are countless complexities to account for: the drivers available, the various routes they can take, traffic, and which drivers are scheduled for the day or on holiday. Computers are much more adept than humans at optimising for this problem because they can quickly run through the many permutations required to find the optimal output.
Watch our video on optimisation to learn more.
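For a toy version of the routing problem, the brute-force sketch below tries every ordering of three delivery stops and picks the shortest round trip; the stops and distances are invented for illustration:

```python
# Brute-force route optimisation: try every ordering of delivery stops
# and keep the shortest total distance. Fine for 3 stops; real routing
# problems need smarter algorithms because permutations explode.
from itertools import permutations

# Symmetric distances between a depot "D" and three stops.
dist = {
    ("D", "A"): 4, ("D", "B"): 2, ("D", "C"): 5,
    ("A", "B"): 1, ("A", "C"): 3, ("B", "C"): 6,
}

def d(a, b):
    return dist.get((a, b)) or dist.get((b, a))

def route_length(stops):
    """Total distance: depot -> stops in order -> back to depot."""
    path = ["D", *stops, "D"]
    return sum(d(path[i], path[i + 1]) for i in range(len(path) - 1))

best = min(permutations(["A", "B", "C"]), key=route_length)
print(best, route_length(best))  # ('B', 'A', 'C') 11
```

With 3 stops there are only 6 orderings; with 20 stops there are over 2 quintillion, which is exactly why these problems are handed to optimisation algorithms rather than exhaustive search.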
Pandas
A software library for Python, available for free online. It enables users to efficiently explore, clean and process their data before jumping into running machine learning applications on it.
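A short sketch of the explore-and-clean workflow pandas enables (the table is invented):

```python
# A small pandas cleaning pipeline: drop incomplete and duplicate rows
# before any analysis or modelling.
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Ann"],
    "spend": [10.0, None, 7.5, 10.0],
})

clean = (df.dropna()            # drop rows with missing values
           .drop_duplicates()   # drop exact duplicate rows
           .reset_index(drop=True))
print(clean)
```

From four messy rows, only the one complete, unique record survives; chaining steps like this is the idiomatic pandas style for data preparation.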
Predictive Models
Watch our video on predictive models to learn more.
Python
A programming language with an extensive library of third party software packages and tools available to extend its functionality, such as NumPy and pandas, which are commonly used by data scientists.
R
A programming language primarily used for statistical computing. It has a history of being used by academic statisticians, and many cutting-edge statistical models are first implemented in R.
Random Forest
Also called random decision forests, an algorithm that relies on several decision trees running concurrently to arrive at the best possible prediction. Essentially, it crowdsources a prediction from many decision trees. Each tree trains on a randomly picked subset of the data, allowing the trees to be built independently of each other. The idea is to avoid overfitting the model to the training data.
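The sketch below illustrates the core idea with very simple one-question “trees” (threshold stumps) instead of full decision trees: each is trained on a random bootstrap sample, and their votes are combined by majority; the data is invented:

```python
# Random-forest idea in miniature: train several one-rule "trees"
# (threshold stumps) on random subsets of the data, then combine their
# predictions by majority vote.
import random

random.seed(0)

# Each sample: (feature value, label). Labels: 0 for small values, 1 for large.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]

def train_stump(sample):
    """Pick the threshold that best separates the labels in this subset."""
    best_t, best_acc = 0, -1
    for t in range(11):
        acc = sum((x >= t) == bool(y) for x, y in sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bootstrap sampling: each stump sees a different random subset.
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(7)]

def forest_predict(x):
    votes = sum(x >= t for t in stumps)
    return 1 if votes > len(stumps) / 2 else 0

print(forest_predict(2), forest_predict(8))  # 0 1
```

Because each stump sees a slightly different sample, their errors differ, and the majority vote is more stable than any single stump — the same reasoning that makes a forest of full decision trees resist overfitting.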
Segmentation
Watch our video on clustering and segmentation to learn more.
Supervised vs. Unsupervised Learning
Supervised learning is where models are trained on labelled data, whereas unsupervised learning works with unlabelled data. For example, in a supervised learning environment, we might run a model on a collection of images labelled as car, bike or horse to train a model to distinguish between these objects. In an unsupervised learning environment, we would instead start with unlabelled images and allow the model to group them by common characteristics.
SQL
Short for Structured Query Language, SQL is a programming language for managing data held in a relational database. It allows users to query and update records in that database.
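A minimal example using SQLite via Python’s built-in sqlite3 module (the table and data are invented):

```python
# A small SQL session against an in-memory SQLite database:
# create a table, insert rows, then query it with GROUP BY.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ann", "Leeds"), ("Bob", "York"), ("Cat", "Leeds")])

# Query: how many customers per city?
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Leeds', 2), ('York', 1)]
conn.close()
```

The same SELECT/GROUP BY statements work, with minor dialect differences, against production databases such as PostgreSQL or MySQL.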
TensorFlow
Free software commonly used for machine learning applications such as neural networks. The Google Brain team developed TensorFlow for internal use, then made it open source in 2015.