The post What makes a good data science model? appeared first on _datamettle.

The technical answer is about quantitative measures designed to assess the predictive capability of a model, or the goodness-of-fit of data to a model. Some models have derivable statistical theory for model validation. Other models that are much harder to interpret and analyse often rely on testing the model’s predictions on a withheld part of the dataset to directly assess accuracy; this is known as cross-validation. There is also a range of accuracy metrics which can be used depending on what is required of the model. For example, precision and recall are two metrics for assessing the accuracy of binary classification models: precision measures how often the model is correct when it predicts a positive outcome, and recall measures how many of the positive cases the model is able to detect. The higher the model scores on these metrics, the better the model is. These tools for quantitative model validation thus help to define a good model – a good model should score highly on the relevant quantitative metrics, and if more than one metric is relevant, it should score highly on all of them.
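As a concrete illustration, both metrics can be computed directly from the confusion counts. The sketch below uses made-up labels and predictions purely for illustration:

```python
# Precision and recall for a binary classifier, computed from
# made-up true labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # how often a predicted positive is correct
recall = tp / (tp + fn)     # how many true positives were detected

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here the model is right three times out of four when it flags a positive (precision 0.75), but finds only three of the five actual positives (recall 0.60).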

However, since data science is typically tied to a business context, the question of *what makes a good data science model* should also be answered from a business point of view. This opens up a whole philosophical discussion of what ‘good’ means. For the rest of this discussion, we will explain why the quantitative measures of model validation are necessary but not sufficient.

If the goal of the model is prediction – for example, a prediction in the future, or a prediction on a spatially new data set (also known as prediction on out-of-sample data) – then goodness-of-fit to data and cross-validation scores are only one part of the picture. The model must be able to generalise beyond its training sample, otherwise it will not be able to meet its goal of prediction. The better the model can generalise, the better the model is, since this ability to generalise indicates having captured some fundamentally true relationships within the data. The ability to predict on new out-of-sample data also depends on the assumption that the new out-of-sample data distribution is similar to the data the model has been trained on. In practice, this is something to be considered prior to beginning a data science project, and failure to do so will result in a model that satisfies all the quantitative checks but cannot actually be used by the business.
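One cheap sanity check on that similar-distribution assumption is to compare summary statistics of a feature in the training data against the incoming data. The sketch below is a toy heuristic with made-up numbers (the function name, tolerance and data are all illustrative), not a formal statistical test:

```python
import statistics

def drift_report(train, new, tol=0.25):
    """Crude drift check: flag a feature if the new data's mean has
    shifted by more than `tol` training standard deviations.
    (A toy heuristic for illustration, not a formal test.)"""
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    shift = abs(statistics.mean(new) - mu) / sigma
    return {"train_mean": mu, "new_mean": statistics.mean(new),
            "shift_in_sd": shift, "drifted": shift > tol}

train_ages = [34, 41, 29, 50, 38, 45, 31, 42]
new_ages   = [61, 58, 65, 70, 59, 66, 63, 68]   # clearly shifted
print(drift_report(train_ages, new_ages))
```

A check like this run before deployment (and periodically afterwards) catches exactly the scenario above: a model that passed every quantitative check but is now being asked to predict on data it has never seen the likes of.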

Next, consider a model that is to be used for decision-making by way of the prediction of class probabilities. Naturally, there are consequences for the model making a bad prediction – false positives and false negatives. The default method of interpreting probabilities produced by the model is to use a threshold of 0.5: if the probability is predicted to be greater than 0.5, then the data point is considered to be in the class, and if not, it is not in the class. The choice of this threshold is influenced by business requirements and is typically made by weighing up the cost of a false positive versus a false negative. Two relevant examples here are fraud prediction and email spam prediction. Choosing the default threshold of 0.5 might identify a certain number of fraud cases while other fraud cases go undetected. If the model is used for making a decision to initiate a fraud investigation, then we might prefer that as many fraud cases as possible are detected. An email spam filter, on the other hand, is more likely to lean conservatively to the other side. We want the model to pick out fewer spam emails, and prefer the ones that are picked out to be definitely spam. A good model is one that solves the problem at hand, or more formally: a good model is one that is able to fulfil some sort of business objective.
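To make the trade-off visible, one can sweep the threshold over the model’s predicted probabilities and watch precision and recall move in opposite directions. The probabilities and labels below are made up for illustration:

```python
# Sweep the decision threshold over made-up predicted probabilities
# to see how precision and recall trade off against each other.
probs  = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

def precision_recall(threshold):
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.25, 0.5, 0.75):
    prec, rec = precision_recall(threshold)
    print(f"threshold={threshold}: precision={prec:.2f}, recall={rec:.2f}")
```

A low threshold behaves like the fraud-investigation case (catch as many positives as possible, at the cost of more false alarms); a high threshold behaves like the spam filter (only flag what is almost certainly spam).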

One subtlety that comes into play here is that business objectives may be very specific to the business, or to the industry it belongs to – that is why data science models are often customised to suit those needs. Perhaps we have a daily forecasting model where overall prediction accuracy is important, but getting the forecast right on special holidays is especially important. Or, perhaps we have a product demand forecasting model where the company has the capacity to store unused stock, but would like to always be able to meet demand. In this case, a model that tends to overpredict more often than it underpredicts is preferable to the other way round. Business objectives may also be modified over time. An interesting example comes from the start of the COVID-19 pandemic, when quantifying the impact of the pandemic on the business was a problem that many businesses were interested in. Forecasting models that were already in place during this time would have become severely inaccurate due to the change in conditions, which violated the assumption that the new data’s distribution is similar to the data the model has been trained on. However, by retrospectively taking the difference between model predictions and true values, these models were actually pretty useful for the business objective of calculating the impact of the pandemic.
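A preference for overprediction can be encoded directly in the training objective. One common way is the quantile (pinball) loss – with a quantile above 0.5, underprediction is penalised more heavily than overprediction of the same size. A minimal sketch with a made-up demand figure:

```python
def pinball_loss(actual, forecast, q=0.8):
    """Quantile (pinball) loss. With q > 0.5, underprediction
    (forecast below actual) costs more than overprediction of
    the same size, nudging a model fitted against this loss
    towards overstocking rather than running out."""
    diff = actual - forecast
    return q * diff if diff >= 0 else (q - 1) * diff

demand = 100
print(f"{pinball_loss(demand, 90):.1f}")   # underpredict by 10 -> 8.0
print(f"{pinball_loss(demand, 110):.1f}")  # overpredict by 10 -> 2.0
```

Missing ten units of demand costs four times as much as holding ten units of excess stock, so a forecaster trained on this loss learns the asymmetry the business wants.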

Although business objectives are largely tied to the intended use of the model, there is sometimes an implied objective relating to the business, the industry and/or the stakeholders of the model. In data science, there is usually a choice between a model that is explainable and easily interpreted, and a black-box model. For example, in linear and logistic regression, the coefficients found through the fitting process can be interpreted to tell us something about the relationships between variables in the model. The same relationships would be harder to understand if a neural network model were used. This choice of model is very much tied to business objectives. Financial models in particular are often deeply scrutinised and regulated by the relevant regulatory bodies, and thus there is usually a preference for models that are explainable. Internal governance processes and stakeholder preferences are also reasons why one type might be preferred over the other. A good model is thus, also, one that successfully passes through governance processes.

Finally, in some cases, there are objectives, often unspecified, that perhaps should in fact be objectives of all models. The Australian Department of Industry, Science, Energy and Resources lists 8 Artificial Intelligence Ethics Principles, designed to ensure that the use of AI models is “safe, secure and reliable”. Focusing on accuracy metrics and on building a model that is fit for purpose and that all the stakeholders are willing to adopt is not enough. Models are more than an algorithm and an API. They are often embedded into a wider system, a decision-making process, and may be used by more than one part of an organisation. The construction of this wider system and its flow-on effects should be evaluated by both data scientists and business experts to prevent unintended harmful consequences. A good model is one that has been built with an awareness of the environment it is operating in, with any identified ethical issues resolved.

The question of what makes a good model needs to start from the consideration that we’re solving business problems first and foremost. Statistics and algorithms are tools of the trade, and while it is necessary to validate the model quantitatively via appropriate techniques, it is critical not to lose track of business objectives. A good model is built as a result of successful collaboration between the data scientist and business experts, and by ensuring that business objectives inform every choice in the model building process.

__Acknowledgements__

*I would like to thank **Dr Nigel Clay** for a wonderful discussion and debate on this topic, which inspired some of the key points in this blog post.*

The post Useful Out-of-the-Box Machine Learning Models appeared first on _datamettle.

Image classification is a machine learning task where an image is categorised into one of several categories by identifying the object in the image. It is easy for humans to recognise the same object shown in different backgrounds and different colours, and it turns out that it’s also possible for algorithms to classify images up to a certain level of accuracy. Significant advances have been made in image classification, as well as in other computer vision problems, with deep learning, and some of these models are available out of the box in keras/tensorflow.

As an example, the inception-v3 model, trained on ImageNet, a database of over 14 million labelled images, is available as a tool for prediction. It is easy to set up, taking only a couple of lines of code before you’re ready to begin classifying your own images – simply read in the image and the model will return a category and a score. The inception-v3 and other image classification models can also be fine-tuned by keeping the weights of some of the layers fixed, and tuning the weights of other layers by training on a new, more relevant dataset. This is a well-known transfer learning procedure for customising the model to the new data, and it is especially useful when the new dataset is small.

With Named Entity Recognition (NER), the goal of the task is to identify named entities within a chunk of text, such as the name of an organisation or a person, geographical locations, numerical units, dates and times, and so forth. This is useful for automatic information extraction, e.g. in articles, reports, or invoices, and saves the effort of having a human manually read through a large number of documents. NER algorithms can help to answer questions such as which named entities are mentioned most frequently, or to consistently pick out a monetary value within the text.

spaCy has a NER module available in several languages, as well as a range of text processing capabilities such as part-of-speech tagging, dependency parsing, lemmatisation, etc. Installation of the package is straightforward, and no more than a few lines of code are required to begin extracting entities. spaCy v2.0 NER models use subword features and a deep convolutional neural network architecture, and the v3.0 models have been updated with transformer-based models. spaCy also has the functionality to allow training on new data to update the model and improve accuracy, as well as a component for rule-based entity matching where a rule-based approach is more convenient.

Sentiment analysis is used for identifying the polarity of a piece of text, i.e. whether it is positive, negative or neutral. This is useful for monitoring customer and brand sentiment, or analysing any text-based feedback. More recently, it has been used for analysing public sentiment towards the COVID-19 pandemic through social media or article headlines.

Hugging Face transformers is a library containing a range of well-known transformer-based models that have obtained state-of-the-art results in a number of different natural language processing tasks. It includes both language models and task-specific models, and has a sentiment analysis model with a base architecture of DistilBERT (a smaller, faster version of the BERT language model) which is fine-tuned for the sentiment analysis downstream task using the SST-2 dataset. The model returns positive and negative labels (i.e. it excludes neutral) as well as a confidence score. Other options for an out-of-the-box sentiment analysis model are TextBlob, which has both a rule-based sentiment classifier and a naive Bayes model trained on movie reviews, and Stanza, which has a classifier based on a convolutional neural network architecture and can handle English, German and Chinese text. Both TextBlob and Stanza return a continuous score for polarity.

While there is a lot of value in an out-of-the-box model, its accuracy may not hold up when applied directly to an unseen dataset. This usually depends on how similar the characteristics of the new dataset are to the dataset the model was trained on – the more similar they are, the more the model’s results can be relied on. If direct use of an out-of-the-box model is not sufficiently accurate and/or not directly relevant to the problem at hand, it may still be possible to apply transfer learning and fine-tuning, depending on package functionality. This involves using new data to build additional layers on top of an existing model, or to retrain the weights of existing layers while keeping the same model architecture.

The idea behind transfer learning is to take advantage of the model’s general training, and transfer its knowledge to a different problem or a different data domain. In image classification, for example, the pre-trained models are usually trained on a large, general dataset with broad categories of objects. These pre-trained models can then be further trained to classify, say, cats vs dogs only. Transfer learning is another way to make use of out-of-the-box machine learning models, and is an approach worth considering when data is scarce.
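The frozen-base-plus-trainable-head idea can be illustrated without any deep learning framework at all. In the toy sketch below, a fixed random feature extractor stands in for a pre-trained network, and only a small logistic “head” is fitted on top of it – the data, the base, and the learning rate are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Frozen base": a fixed feature extractor standing in for the early
# layers of a pre-trained network. Its weights are never updated.
W_base = rng.normal(size=(4, 8)) * 0.5

def base(x):
    return np.tanh(x @ W_base)

# New task: made-up data where the label depends on the first feature.
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)

# Trainable "head": logistic regression on the frozen features,
# fitted by plain gradient descent. Only w and b are updated.
feats = base(X)
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(feats @ w + b)))
    w -= 0.1 * feats.T @ (p - y) / len(y)
    b -= 0.1 * (p - y).mean()

acc = ((1 / (1 + np.exp(-(feats @ w + b))) > 0.5) == y).mean()
print(f"training accuracy of the new head: {acc:.2f}")
```

In a real setting, keras or transformers supplies the pre-trained base, but the principle is the same: freeze most layers and retrain a small head on the new, smaller dataset.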

The post GPT-3: What is It? appeared first on _datamettle.

Some of the things people have demonstrated thus far with GPT-3 are dictating the layout of a webpage they wanted built using natural language and having the model code the actual front end of the website. Or, using the model to generate a full article about how humans shouldn’t be worried about robots any time soon.

Like self-driving cars, or other promises of artificial intelligence, it’s easy to get caught up in the hype. So we wanted to share a bit more about GPT-3, what it is, and what it can do. And, perhaps, find a use case for your organization.

GPT-3 is a project of OpenAI, a California-based AI research laboratory. Initial backing for OpenAI came from Elon Musk, Peter Thiel, and other Silicon Valley technologists. The mission of the organization is “to ensure that artificial general intelligence benefits all of humanity.”

Among their projects is the Generative Pre-trained Transformer, or GPT. As the name suggests, the latest release is the third iteration of the model, and it is a massive leap forward from its predecessors – or any other model from other labs, for that matter. GPT-3 has 175 billion parameters. For context, this is 10 times the number of parameters of the next biggest model, Microsoft’s Turing NLG.

GPT-3 is a deep learning model that leverages neural networks. The model was trained mostly on the Common Crawl dataset, which is essentially data pulled from websites on the internet on a monthly basis. In addition, it used data from book transcripts, and even Wikipedia.

GPT-3 is a model that focuses on natural language. In the simplest terms, you can pose a question to it, and have it respond back like any human could.

The most impactful use cases for GPT-3 are probably still being dreamed up by startups all over the world. But some of the obvious ways companies can use GPT-3 are for chatbots, answering customer support questions, or even translating plain English into SQL queries or regular expressions.

GPT-3 can also write some code, including CSS and Python.

OpenAI is providing API access to GPT-3 so anyone can build applications on top of it. You can join the waitlist to get access by filling out a form with the use case you have in mind.

There is certainly a lot of hype around GPT-3, and we’ve yet to see killer applications built on it. Like any new technology, it will follow the traditional hype cycle. But the technology is clearly a massive improvement on its predecessors and holds massive potential.

The post Onboarding a data scientist appeared first on _datamettle.

Integrating a new data scientist into your organisation may not be straightforward for several reasons:

- You don’t understand enough of what they do to know what they need.
- The role itself is often more open and flexible.
- The data scientist’s background can range widely, from engineering and mathematics to computer science, and their prior experience can be quite varied.
- Their day-to-day work may differ depending on a number of factors, such as whether the company has more or less structured processes, whether it is a consultancy or a product company, and whether the pace of work is fast or slow.

In this article, we provide a wide range of suggestions to design an onboarding process that considers the work environment and the data scientist’s background. There are obvious things one should do, such as introducing them to the immediate and wider team and setting up any system accesses they require. These are steps that you are likely to take for any new starter, and we will not cover them in this article. We focus on onboarding actions that would help a data scientist specifically.

Chances are that you have hired someone you think is smart, has programming skills and prior experience in solving problems using data. But they may not be familiar with the specific problems in your industry, or maybe they haven’t been using the specific modelling technique that is commonly used in your industry, or perhaps they have previously coded in a different programming language. These are not showstoppers to them performing well in the role, and can easily be addressed with a good onboarding/training process.

By nature of the role, many data scientists either have a research background or are experienced with some form of research. What this means is that they should be able to identify the gaps in their knowledge, and effectively look for ways to learn the things they don’t know. Have them take charge of their training. This will cater for their individual background and prior experience. For example, someone with a lot of programming experience might want to spend less time learning new packages and software, and focus more on learning new mathematical concepts. Likewise, someone with a strong statistics or mathematics background might want to spend more time on programming material. Furthermore, they may already have a preference for their approach to learning new skills – some people learn best by doing, some people prefer reading conceptual material, and others benefit more from watching video courses.

Learning is most effective when sufficiently spaced out. If it is feasible as part of the onboarding process, suggest that your data scientist spend some time every day on a training resource of their choice. This could be books, research articles, video lectures, industry workshops, and industry documentation, for example.

Since a data scientist’s job will involve a fair amount of programming, a good onboarding activity is to give them a small, easy programming task. Consider whether or not to choose a task that has time constraints associated with it. There are advantages and disadvantages to both, and the choice will depend on the company’s situation. If the work environment is more fast-paced, then giving them a task that fits into the team’s day-to-day work will be immediately useful. The time constraints will mimic the real work they are expected to perform, and get them up to speed on doing this work. If their work is not as urgent, then you might prefer to give them sufficient time to learn not just the specific task, but also any peripheral knowledge. This will allow them flexibility in their learning, to focus on best practices instead of just rushing to ‘get the job done’.

Examples of small projects are:

- Perform an analysis to obtain insight on a section of company data
- Build a simple dashboard using data from the company’s database
- Write a short piece of code that fits into software that your company owns, for example, adding a new feature. Or modifying a small section of the code to make it more efficient, or to reframe it for a different purpose
- Following documentation to execute a piece of software your company owns

Introducing your data scientist to key subject matter experts across the business is essential – these will be the people they may go back to again and again to obtain domain information essential to their analysis. You can do this through formal or informal channels. Examples of formal channels would be including them in stakeholder meetings, and in any discussions involving core business strategy, the day-to-day running of the business, factors that impact profit and loss, and the types of decision-making involved. This will allow them to gain context for their work and how it fits in with the company’s overall strategy. Informal discussions are sometimes the most efficient form of knowledge transfer. You could organise a chat over lunch with the relevant stakeholders to facilitate this.

While understanding how the business operates is helpful to the new data scientist, be mindful that they need to spend their time on other areas as well, and try not to overwhelm them with too much business information at once.

At the start, it is important to communicate the expectations of the role, the type of problems you want them to solve, and the available resources in the company. This will help them to determine how best to get up to speed, to set learning objectives for themselves, and gather the resources to work towards your goals. If you assign them a task and there are deadlines to meet, make sure this is communicated clearly too. On the other hand, if you would like them to be free to spend their initial weeks on general upskilling, ensure they know this too. Make sure you are both aligned on a project plan to avoid rework down the track.

It’s entirely possible that the role will evolve, or you might change your mind on what you want them to work on. That’s okay too, as long as you keep them in the loop, and include them in these discussions. Data science is an interdisciplinary field and your data scientist should be adaptable.

Do you have the right environment set up for your data scientist? It is important to discuss from the very beginning what kind of tools and software they will need, and what resources you currently have. This will help them to figure out what’s achievable and what’s not. Whether or not the resources you currently have are sufficient depends on the end goals you have in mind for the data science project.

One-off analyses and proof-of-concept models will most likely not require any complex set-up. However, imagine, for example, that the end goal is to build a predictive model that automatically updates itself, and then to have it integrated into the business and made available to key people on a dashboard. In this scenario, you may want to consider the technology you require in order to achieve this, and whether you might like to purchase cloud computing services or dashboard software. Also, if there is going to be more than one person working on the same set of code, then it is typically necessary to have version control software. If your company doesn’t already own a database, you may want to consider developing this alongside the data science project, especially if you envision having to make use of much more complex data in the future.

Start these discussions early, and plan as much as you can on choosing these initial systems, as it will be much harder to switch once you have set things up a certain way. Your vision and constraints will help your data scientist plan their workflow accordingly.

To summarise, this article outlines some approaches on how to design an onboarding process for a data scientist. Always communicate with your data scientist, as they may have their own thoughts on technical training, any resources they require from you, and how best to work towards a data science goal. In turn, as a manager with a lot of experience in your industry, you can help to provide context and domain information to your data scientist, and connect them to key stakeholders in your company. Providing your data scientist with the right environment and resources will ensure they are set up for success.

The post Are we collecting the right data? appeared first on _datamettle.

To better answer this question on data quality, there’s an additional question you’ll need to ask yourself early on: “Are we collecting the right data for the problem we are trying to solve?” In this post, we share insights on how you can start to answer this. It’s best to avoid a scenario where you spend time and money getting the requisite quantity of data to do machine learning, only to find it doesn’t contain the essential information you need.

Poor quality data can broadly be categorised into two buckets. The first is data that is not cleaned, or not readily available to be used by data scientists. This problem is often solved by cleaning and data engineering.

The second, and the focus of this post, is when the data doesn’t contain the underlying information you need to solve the problem at hand.

As a practical example: say you are an enterprise software company, and you want to predict which website visitors are most likely to convert to paid users. You might collect all kinds of useful information, such as user location, browser type, where they were referred from, and so on. However, after playing with the data, you find none of these factors tells a story or can help you make a meaningful prediction. This data isn’t the right data.

How do you then find it?

Here, the science part of data science comes into play. Start with some hypotheses for why specific data helps you make a meaningful prediction.

Using our previous example, you might come up with a few hypotheses. What are some things you believe contribute to someone converting to a paid user? Maybe the size of the organisation? Or perhaps, their role at the company?

When developing the hypothesis, it makes sense to spend time with the stakeholders who have the most intuition about this particular problem. In our paid user example, this could include customer support or sales team members who have probably developed their own mental models of what a strong lead looks like.

This guides you towards the data you need to collect to begin making this prediction. Then it becomes a matter of collecting it.

Once you have some hypotheses about what data you think answers the question you need it to, and you’ve started collecting it, then comes the less scientific part.

Early on, you’ll need to continually analyse the data in a non-scientific way, and see if it is telling you the story you expected. Or perhaps it’s telling you a story you didn’t expect it to. At this stage, you won’t have enough records to do proper machine learning, or even to make any statistically significant conclusion. But, you can try to get a better feeling using your instincts and experience.

In the previous enterprise software example, you can monitor correlations – for instance, whether bigger companies are more likely to convert. Or, break down who is purchasing the product and whether that aligns with your hypothesis.
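A quick correlation check needs nothing more than the standard library. The sketch below computes a Pearson correlation from scratch on made-up lead records (company size versus whether the lead converted); the numbers are purely illustrative:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up leads: company size (employees) vs. whether they converted.
sizes     = [5, 12, 40, 80, 150, 300, 500, 1000]
converted = [0, 0,  0,  1,   0,   1,   1,    1]
print(f"correlation = {pearson(sizes, converted):.2f}")
```

With only a handful of records a number like this proves nothing, but a clearly positive value is an early hint that the company-size hypothesis is worth pursuing, while a value near zero suggests looking elsewhere.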

When doing machine learning, it’s critical to ask yourself along the way whether you are collecting the right data. You can glean information from your early data collecting by relying on intuition and internal knowledge of what characteristics have some impact on answering the questions you want to answer.

Generate hypotheses about what data answers this problem for you and collect that data. Then, squint at the data and see if anecdotally it is telling you what you expected. This by no means guarantees you are collecting the right data. But it is an additional item that helps you predict if you are on the right track.

Finally, keep checking and keep answering this question of whether you are collecting the right data. You can avoid spending time trying to fit increasingly complex models to your data by continually considering whether you have the right data in the first place.

The post How much data do I need to do machine learning? appeared first on _datamettle.

When a chef is ordering up raw ingredients, there are a couple of main things they need to consider for a great dining experience. The first is that they have enough to feed everyone. The second is that the ingredients are of high quality. Three-star Michelin meals are built around the best ingredients, not just putting the right quantity of ingredients together.

Thanks for humouring me to this point. The reason I bring this up is that so often, the question we get when approaching a new project is “how much data do I need to do machine learning?” That is, of course, an important question. If you have 10 rows of data, that is not enough. If you have 10k rows, that probably is. However, as with cooking, it’s not just about quantity. Your ability to do machine learning that matters is tied directly to the quality of the data you are putting in.

No matter how good the chef, starting with tasteless strawberries, hothouse tomatoes, and sad lettuce will lead to a bad dish. Similarly, incomplete, unclean, or irrelevant data could mean producing bad outputs from a machine learning model.

It’s the old computer science adage: garbage in, garbage out. Bad data fed into the best machine learning models leads to wrong predictions and outputs.

There are a few primary things we mean by ‘bad’ data:

- Doesn’t tell you what you need it to
- Missing, fragmented or not readily accessible
- Unstructured

If your recipe calls for chocolate cake, but you only have strawberries, well then I have some bad news for you.

Similarly, if you are using a machine learning model to make a prediction, but the information required to make that prediction isn’t present in the data, you are not going to end up with a quality prediction. Say, for example, you are an e-commerce site, and you want to show the exact right product to whoever lands on your website. You might know information about your users like the country they are in, what browser they use, and what device they are on. But perhaps none of that information correlates with what product they are likely to buy.

Then it becomes a matter of identifying what data source *does* correlate with their product preferences. This can come from some intuition, or trial and error. And if you aren’t collecting this data at the necessary scale, then you need to find ways to do so.

Everyone’s been there: you are about to make waffles on a lazy Sunday morning. You pull up your recipe, grab all the ingredients, and get to work. By the time you have all your dry ingredients together, you start in with the eggs and milk. Then, of course, you realize you only have half as much milk as you need for the recipe. You try to improvise – water probably works fine in its place, right? The waffles turn out terrible, and you ruin breakfast. This may or may not have happened to me recently.

The data science equivalent is having all the necessary ingredients – but there are gaps in the data. Some entries are missing the critical details – say the age of the user to predict the right product to show them.

Frequently this data exists somewhere. Like with the waffles example, you might have more milk in another fridge, or you might have a corner store a short walk away. But, if it isn’t where it needs to be, then you can’t make use of it.

The real-world example of this is organizations where data is collected on customers and stored across different databases and platforms. You might have some data in Google Analytics, your CRM, surveys, email marketing software, and so on. Separately, each of these products performs a useful function. However, your models will often require inputs from multiple sources, brought together to be useful in making predictions.

This is a data engineering challenge – bringing data from these various sources and combining it into a single place where data scientists can first test and build models. Then, once deployed, these data pipelines need to run continuously to feed the models and output what you need from them.

Sometimes, maybe in a pandemic situation, you find yourself digging into the freezer to find those food items you saved months ago. By now they are caked in layers of frost. You see lots of things, but there is no order or structure.

This is the unstructured data problem. You have lots of data, you just don’t know what it is, and classifying it isn’t straightforward. For example, it’s customer feedback or other forms of free text. You can have terabytes of this data, but without some classification of it you can’t do much with it. The challenge then becomes finding a way to put structure around this data – for instance, classifying customer feedback as positive, neutral, or negative.
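As a toy illustration of putting structure around free text, here is a minimal keyword-lexicon tagger in Python. The word lists are invented for this sketch – a real system would use a trained NLP model rather than hand-picked keywords:

```python
# Hypothetical keyword lexicons, for illustration only.
POSITIVE = {"great", "love", "excellent", "happy", "fast"}
NEGATIVE = {"terrible", "hate", "slow", "broken", "disappointed"}

def classify(feedback):
    """Tag a piece of free-text feedback as positive, neutral, or negative."""
    words = feedback.lower().replace(",", " ").replace(".", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("Great product, I love it"))                      # positive
print(classify("Delivery was slow and the box arrived broken"))  # negative
```

Crude as it is, even a rule like this turns an unusable pile of text into something you can count, chart, and feed into other models.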

The other day I found myself with a flavourless pineapple. What could be more disappointing? Well, I also had some tequila, triple sec and lime. So I threw some pineapple chunks in the freezer and once frozen made some frozen pineapple margaritas. The outcome was radically different than what I had originally intended, but I was able to work with what I had and produce a very tasty alternative.

Bad data doesn’t mean starting from scratch. In fact, usually, our projects start off with cleaning and connecting up data. This usually leads us to interesting insights and a better understanding of what is possible to do with the ingredients available. Sometimes these exactly fit with the vision of the original recipe, but in many cases, we discover new and unanticipated insights during this process.

To bring this all together – it’s essential to account not only for the amount of data you have but also for its quality. You may have lots of ingredients. But it’s impossible to cook a great meal if you start with bad ingredients, if ingredients aren’t where they need to be, or if you aren’t sure what ingredients you have in the first place.

Similarly, asking how much data you need for machine learning is like asking how much salt you need to make a great meal. Instead, it’s about balancing the quantity of the data you have with its quality. Does it tell you what you need it to tell you? Is it in a place where you can make use of it? And do you know what it is?

However, all is not lost! You can often rescue bad raw ingredients through exploration and cleaning. Sometimes these lead to new recipes and combinations that you didn’t anticipate at first, or help you solve your problem in an unexpected way.

Once you have all these elements, then you are ready to cook.

The post How much data do I need to do machine learning? appeared first on _datamettle.


Anyone who has recruited senior data scientists knows it’s a complex role to hire for – even for a team of senior data scientists. Candidates need skills across coding and engineering, as well as statistics.

To get a sense of a candidate’s maths abilities, we’ve taken to asking them the birthday problem question.

If you aren’t familiar: the birthday problem, or birthday paradox, concerns the probability that at least two people in a room have the same birthday. The paradox comes from the fact that you reach a 50 per cent likelihood that two people share a birthday with just 23 people in the room. With 70 people you get to a 99.9% likelihood.

So, during our interview process, we ask the candidate to work through this problem by simply asking, “If you have N people in a room, how likely are they to share a birthday with each other?”

Stated otherwise: given \(n\) people in a room, what is the probability \(P(A_n)\) that two or more of them have the same birthday? We like this problem since it shouldn’t take much longer than 15-20 minutes to solve given a reasonable background in statistics, along with a few helpful pointers from us.

It does have one big flaw, though: \(P(A_n)\) is very complicated to calculate directly. Instead, the trick to solving the problem is to look at the complement \(P(A’_n)\): what is the probability that **no** two people share a birthday? This is reasonably straightforward to calculate, and \(P(A_n)\) is then readily obtained as \(P(A_n) = 1 – P(A’_n)\).

Very few of the candidates we interviewed actually thought of this trick; most instead attempted to plow ahead solving the problem as stated. But once they got stuck we dropped this hint, and it generally led to a breakthrough.

Now that our round of interviews is complete, we wanted to explore this problem in greater detail and share the underlying mathematics behind the solution.

We’ll explain the solution with the standard trick, but we’ll also attempt to solve it without considering the trick, which turns out to be a very interesting (and complex) combinatorial problem in its own right.

Note: we assume all days are equally probable, and ignore leap years.
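Under these assumptions, a quick Monte Carlo simulation lets you verify the headline numbers empirically before doing any maths. A minimal Python sketch (the function name, seed and trial count are our own choices):

```python
import random

def shared_birthday_prob(n, trials=100_000, seed=0):
    """Estimate, by simulation, the probability that at least two of n
    people share a birthday (365 equally likely days, no leap years)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(n)]
        # Fewer distinct days than people means some birthday is shared.
        if len(set(birthdays)) < n:
            hits += 1
    return hits / trials

print(shared_birthday_prob(23))  # close to 0.5
```

With 23 people the estimate lands very near the famous 50% mark; bumping `n` to 70 pushes it past 99.9%.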

Most candidates need a bit of help getting started, so we usually suggest considering the situation for two people, \(P(A_2)\). Let’s call them Alice and Bob.

The probability that Alice and Bob have the same birthday is fairly straightforward to calculate. Simply take Alice’s birthday as given, then the probability of Bob also having the same birthday is \(1/365\).

However, to set things up for what follows, we’ll express this probability slightly differently. First, if we consider Alice in isolation, ignoring Bob, her birthday can fall on any day of the year, so the probability of her having a unique birthday (ignoring Bob for now) is \(365/365\). Now Bob’s birthday has to fall on the same day as Alice’s, and the probability of that is \(1/365\), which gives us

\[

P(A_2) = \frac{365}{365}\cdot\frac{1}{365}.

\]

Now let’s move to the case of 3 people, \(P(A_3)\). Meet Carol.

At this point, a few candidates blurt out “1/365²”, which is usually a warning sign that they are at a loss for how to proceed. This is where things start getting complicated. We need to carefully consider the various possible cases (in the pictures to the left people stacked on top of each other share the same birthday):

I. They all have different birthdays,

II. Alice & Bob share the same birthday, while Carol has a different birthday,

III. Alice & Carol share the same birthday, while Bob has a different birthday,

IV. Bob & Carol share the same birthday, while Alice has a different birthday,

V. Alice, Bob and Carol all have the same birthday.

As these are disjoint events, we can see that

\begin{align}

P(A_3) &= P(\textrm{either event II-V occurs}) \\

&= P(\textrm{event II occurs}) \ + \\

&\phantom{==} P(\textrm{event III occurs}) \ + \\

&\phantom{==} P(\textrm{event IV occurs}) \ + \\

&\phantom{==} P(\textrm{event V occurs}).

\end{align}

We can also note that the probabilities of events II, III and IV are all equal, so we can simplify this further:

\begin{align}

P(A_3) &= P(\textrm{either event II-V occurs}) \\

&= 3P(\textrm{event II occurs}) \ + \\

&\phantom{==} P(\textrm{event V occurs}).

\end{align}

However, this is where it starts getting a bit hairy, and generalising this approach for more people quickly gets very complex (as we’ll see in a bit). Instead, at this point we indicate that perhaps there’s a better way, which usually leads the candidate to realise that

\[

P(\textrm{either event II-V occurs}) = P(\textrm{event I does not occur}) = 1 – P(\textrm{event I occurs}),

\]

i.e. \(P(A_3) = 1 – P(A’_3)\).

So, let’s turn our attention to \(P(A’_3)\), that is, Alice, Bob and Carol all have different birthdays. Following the same approach as when calculating \(P(A_2)\), we have that

- Alice’s birthday can fall on any day of the year (probability \(365/365\)),
- Bob’s birthday has to fall on any day other than Alice’s birthday (probability \(364/365\)), and
- Carol’s birthday has to fall on any day other than Alice’s and Bob’s birthday (probability \(363/365\)).

So all in all,

\[

P(A’_3) = \frac{365}{365}\cdot\frac{364}{365}\cdot\frac{363}{365}.

\]

This line of reasoning is easily extended to \(n\) people, which leads to

\[

P(A’_n) = \frac{365}{365}\cdot\frac{364}{365}\cdots\frac{365-n+1}{365} = \frac{365!}{(365-n)!\cdot 365^n},

\]

and finally we arrive at the answer:

\[

P(A_n) = 1\, -\, P(A’_n) = 1\, -\, \frac{365!}{(365-n)!\cdot 365^n}.

\]
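This closed form is cheap to evaluate numerically. A small Python sketch, computing the falling product incrementally rather than via huge factorials (the function name is ours):

```python
from math import prod

def p_shared(n):
    """P(A_n) = 1 - 365! / ((365 - n)! * 365^n), for 1 <= n <= 365."""
    p_all_distinct = prod((365 - k) / 365 for k in range(n))
    return 1 - p_all_distinct

print(round(p_shared(23), 4))  # 0.5073 -- the famous 50% crossover
print(round(p_shared(70), 4))  # 0.9992
```

Twenty-three is indeed the smallest \(n\) for which the probability exceeds one half.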

How do we calculate this *without* using the trick? If we stick to the case where \(n=3\) for a moment, it’s fairly straightforward to calculate \(P(\textrm{event II occurs})\) and \(P(\textrm{event V occurs})\), following the same argument as above. For event II, we have that:

- Alice’s birthday can fall on any day of the year (probability \(365/365\)),
- Bob’s birthday has to fall on Alice’s birthday (probability \(1/365\)), and
- Carol’s birthday has to fall on any day other than Alice’s and Bob’s joint birthday (probability \(364/365\)),

so \(P(\textrm{event II occurs}) = 364/365^2\). For \(P(\textrm{event V occurs})\), they all need to have the same birthday, and the chance of this happening is

\(1/365^2\). So all in all,

\[

P(A_3) = 3\frac{364}{365^2} + \frac{1}{365^2}.

\]

Note that we should have \(P(A_3) + P(A’_3) = 1\), which is indeed true. For those interested, this is shown in the Appendix below.

Let’s turn to the general case of \(n\) people, where the number of cases quickly gets out of hand. For illustration, in the case of 4 people (introducing Dan) there are 15 different cases (again, people stacked on top of each other share the same birthday):

With 5 people there are 52 cases, with 6 people there are 203 cases and so on, growing very rapidly. With 23 people there are over \(4\cdot 10^{16}\) cases!
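These counts are the Bell numbers – the number of ways to split a set of \(n\) people into shared-birthday groups – and they are easy to compute with the Bell triangle. A short Python sketch (the function name is ours):

```python
def bell_numbers(n_max):
    """Return [B_0, B_1, ..., B_{n_max}] computed via the Bell triangle."""
    bell = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # each following entry adds the entry above it.
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bell.append(row[0])
    return bell

b = bell_numbers(23)
print(b[4], b[5], b[6])  # 15 52 203
print(b[23])             # over 4 * 10**16 cases for 23 people
```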

Note that the 15 events for 4 people come in 5 distinct “shapes”:

- No-one shares their birthday with anyone else (1 event),
- Two people share their birthday, and the other two have their own birthdays (6 events),
- Two people share their birthday, and the other two also share their birthday (3 events),
- Three people share their birthday, and the remaining person does not (4 events),
- All four people have the same birthday (1 event).

Two key observations are that all events of the same shape have the same probability, and that a shape is conveniently captured by the mathematical concept of a partition of a number \(n\).

A partition of a number \(n\) is simply a way of writing it as a sum of positive integers:

\[

n = a_1 + a_2 + \cdots + a_k.

\]

For example, there are five partitions of the number four,

\begin{align}

4 &= 1 + 1 + 1 + 1, \\

4 &= 2 + 1 + 1, \\

4 &= 2 + 2, \\

4 &= 3 + 1, \\

4 &= 4,

\end{align}

each corresponding to a specific shape above (e.g. \(2 + 1 + 1\) corresponds to the second case, where two people share their birthday and the other two have distinct birthdays). Partitions are usually denoted by the Greek letter \(\lambda\), where \(\lambda\vdash n\) means that \(\lambda\) is a specific partition of the number \(n\). If we let \(N(\lambda)\) denote the number of events of shape \(\lambda\), let \(P(\lambda)\) denote the probability of a specific event of shape \(\lambda\) happening, and \(\max(\lambda)\) be the largest term in the partition, we can write \(P(A_n)\) as

\[

P(A_n) = \mathop{\sum_{\lambda\vdash n}}_{\max(\lambda) \geq 2} N(\lambda)P(\lambda).

\qquad\qquad\textrm{(1)}

\]

Note that the sum runs over all partitions except \(n = 1 + \cdots + 1\), which corresponds to the unique event of everyone having separate birthdays.

We begin by working out \(P(\lambda)\), as it’s the easier of the two to handle. As it turns out, \(P(\lambda)\) depends only on the “length” of the partition, i.e. the number of distinct days covered by birthdays. Instead of proving this rigorously we’ll show it by example. Let’s look at two different cases that cover two days. First consider Alice and Bob sharing birthdays, and Carol and Dan sharing birthdays:

- Alice’s birthday can fall on any day of the year (probability \(365/365\)),
- Bob’s birthday has to fall on Alice’s birthday (probability \(1/365\)), and
- Carol’s birthday has to fall on any day other than Alice’s and Bob’s joint birthday (probability \(364/365\)),
- Dan’s birthday has to fall on Carol’s birthday (probability \(1/365\)),

so the probability of this happening is:

\[

\frac{365}{365}\cdot\frac{1}{365}\cdot\frac{364}{365}\cdot\frac{1}{365}

=

\frac{365\cdot 364}{365^4}.

\]

Now consider Alice, Bob and Carol sharing birthdays, with Dan having his birthday on a different day:

- Alice’s birthday can fall on any day of the year (probability \(365/365\)),
- Bob’s birthday has to fall on Alice’s birthday (probability \(1/365\)),
- Carol’s birthday has to fall on Alice’s and Bob’s shared birthday (probability \(1/365\)), and
- Dan’s birthday has to fall on any day apart from Alice’s, Bob’s and Carol’s shared birthday (probability \(364/365\)),

so the probability of this happening is:

\[

\frac{365}{365}\cdot\frac{1}{365}\cdot\frac{1}{365}\cdot\frac{364}{365}

=

\frac{365\cdot 364}{365^4}.

\]

Note in particular that the left-hand sides of both expressions contain the same factors, just in a different order. Whenever we add someone sharing their birthday with an earlier person, we add a factor \(1/365\), and whenever we add someone who has a “new” birthday we add a factor:

\[

\frac{365 – \textrm{number of days already covered}}{365}.

\]

So, in general, if an event for \(n\) people covers \(k\) days, the probability of it occurring is

\[

\frac{365!}{(365-k)!365^n}.

\]

If we let \(\operatorname{length}(\lambda)\) denote the length of \(\lambda\), this means that

\[

P(\lambda) = \frac{365!}{(365-\operatorname{length}(\lambda))!\,365^n}.\qquad\qquad\textrm{(2)}

\]

Now all that is left is to figure out what \(N(\lambda)\) is. We can think of this as asking “how many ways can we fill the shape with people?”. The multinomial coefficients tell us exactly that! For a partition \(\lambda = a_1 + \cdots + a_k\) they are defined as

\[

\binom{n}{\lambda} = \binom{n}{a_1, \dots, a_k} = \frac{n!}{a_1!\cdots a_k!}.

\]

For example, for the partition \(4 = 2 + 2\) we have

\[

\binom{4}{2, 2} = \frac{4!}{2!\cdot 2!} = \frac{1\cdot 2\cdot 3\cdot 4}{1\cdot 2\cdot 1\cdot 2} = 6,

\]

so there are 6 ways to fill in the corresponding shape:

Hm, this doesn’t seem right – did we not have 3 cases of this shape above? Indeed we did! This way of “filling in the shape” actually counts the same event several times. Note that the first and fourth both correspond to Alice and Bob sharing birthdays, and Carol and Dan sharing birthdays; the two days are simply swapped. Each event in the top row is the same as the event below it. So, how do we fix this?

The problem is that the filling-in method distinguishes between days in a way that we don’t want. We don’t care which specific days people’s birthdays fall on. Whenever we have two or more days with the same number of people sharing birthdays on those days, we count that as one case, whereas the corresponding multinomial coefficient counts all possible ways we can permute those days. To deal with this, we need to consider the lengths of runs of equal terms in a partition. For example, the partition

\[

7 = 2 + 2 + 1 + 1 + 1 = 2 \cdot 2 + 3 \cdot 1

\]

has one run of length \(2\) (the two 2’s) and one run of length \(3\) (the three 1’s).

To formalise this, for a partition \(\lambda \vdash n\) we can collect equal terms and write it as

\[

n = c_1b_1 + \cdots + c_mb_m,

\]

where \(b_1 > b_2 > \cdots > b_m\). Then the numbers \(c_1\), \(c_2\), …, \(c_m\) are the lengths of all the runs of equal terms in our partition. We can now define \(s(\lambda)\), which counts the number of ways we can swap days in a shape while still being in the same event, as

\[

s(\lambda) = c_1!\cdots c_m!

\]

With all this machinery in place, we’re finally in a position to write out the full formula for \(P(A_n)\). As \(s(\lambda)\) accounts for the “double counting” of the multinomial coefficients, we have that

\[

N(\lambda) = \binom{n}{\lambda} / s(\lambda). \qquad\qquad\textrm{(3)}

\]

Substituting the expressions (2) and (3) for \(P(\lambda)\) and \(N(\lambda)\) into Equation (1) above gives us

\[

P(A_n) = \mathop{\sum_{\lambda\vdash n}}_{\max(\lambda) \geq 2} \binom{n}{\lambda}\cdot\frac{365!}{(365-\operatorname{length}(\lambda))!365^ns(\lambda)}.

\]

Lovely, isn’t it?
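For the sceptical, the partition sum can be evaluated directly and compared with the complement trick. A Python sketch using exact rational arithmetic (the function names are our own):

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def partitions(n, max_part=None):
    """Yield all partitions of n as non-increasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def p_shared_partitions(n):
    """P(A_n) via the partition sum: sum over shapes of N(lambda) * P(lambda)."""
    total = Fraction(0)
    for lam in partitions(n):
        if max(lam) < 2:
            continue                      # skip 1 + 1 + ... + 1 (all distinct)
        k = len(lam)                      # length(lambda): distinct days covered
        multinomial = Fraction(factorial(n))
        for a in lam:
            multinomial /= factorial(a)   # the multinomial coefficient n! / (a_1! ... a_k!)
        s = 1
        for run_length in Counter(lam).values():
            s *= factorial(run_length)    # s(lambda): permutations of equal-sized days
        p_event = Fraction(factorial(365), factorial(365 - k) * 365**n)
        total += multinomial / s * p_event
    return total

# Agrees exactly with the trick solution for small n:
for n in range(2, 8):
    trick = 1 - Fraction(factorial(365), factorial(365 - n) * 365**n)
    assert p_shared_partitions(n) == trick
print("partition formula matches the trick for n = 2..7")
```

Exact equality of the two fractions, not just agreement to a few decimals, is a reassuring check that Equations (1)-(3) fit together.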

Since we’ve done all this work, it’s worth noting that, as the number of partitions of a number \(n\) grows like \(e^{c\sqrt{n}}\) for a constant \(c\), the computational cost of evaluating this formula grows super-polynomially in \(n\). This should be contrasted with the linear complexity of the “trick” solution. On the other hand, this more complex solution is easy to adapt to answer more elaborate questions, like “what is the probability that no *three* people share the same birthday?”.

The birthday problem isn’t intuitive right away – the idea that only 23 people need to be present for a coin flip’s chance that two of them share the same birthday. The solution to the paradox is even less intuitive. If you’ve read this far, though, you’ll surely nail your next interview for a data scientist position at Data Mettle.

And next time you go to a birthday party with more than 70 people just remember, you are probably forgetting to say happy birthday to at least one person.

Appendix: the fact that \(P(A_3) + P(A’_3) = 1\) can be seen by rearranging the terms:

\begin{align}

P(A_3) + P(A’_3)

&=

3\frac{1}{365}\cdot\frac{364}{365} + \frac{1}{365^2} + \frac{364}{365}\cdot\frac{363}{365} \\

&=

\frac{3\cdot 364 + 1 + 364\cdot 363}{365^2} \\

&=

\frac{364 + 1 + 2\cdot 364 + 363\cdot 364}{365^2} \\

&=

\frac{365 + (2 + 363)\cdot 364}{365^2} \\

&=

\frac{365\cdot 1 + 365\cdot 364}{365^2} \\

&=

\frac{365^2}{365^2} \\

&=

1.

\end{align}
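The same rearrangement can be confirmed with exact rational arithmetic, as a quick sanity check:

```python
from fractions import Fraction

# P(A_3) = 3 * 364/365^2 + 1/365^2, and P(A'_3) = (364/365) * (363/365).
p_a3 = 3 * Fraction(364, 365**2) + Fraction(1, 365**2)
p_a3_complement = Fraction(364, 365) * Fraction(363, 365)
print(p_a3 + p_a3_complement)  # 1
```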

Icons made by Freepik from www.flaticon.com.

The post The Birthday Problem the hard way appeared first on _datamettle.


In this post we take you through a typical process we run on data science projects, covering the questions to consider before launching a machine learning project. We start with the why of the project, then move through discovery, implementation and last-mile delivery. This process isn’t always linear – in fact it’s almost never linear. But, in general, these are the steps we go through.

Why is always a good place to start. Why are we doing this? What value do we think, or hope, it will deliver?

The “why” helps form the vision of what you want to achieve with your machine learning project. It makes it easier to communicate with other stakeholders, such as your CEO, your colleagues, or the data science team working on it. This is also when you sketch out what you *think* the final product will look like. Of course, this will change over the course of the machine learning project.

To use an example: you are facing a high rate of customer churn. You think there’s some evidence you can predict when this will happen, and that you can use this information to stop customers from leaving. The why is clear here: you want to lose fewer customers. But you should also think about how you’ll operationalise this: who the key stakeholders are (maybe account managers or customer success), where you will need data from (your CRM and your product), and what a final tool might look like (it could be something as simple as an email notification prompting the account manager to intervene).

Importantly, this is the time to focus on a minimum viable product (MVP). What is the smallest thing you can build to validate the tool? The example above is one approach: rather than implementing a full solution that integrates with your CRM, a simple email notification is quicker to set up and gets the model’s predictions into the hands of your team. They can give feedback on the accuracy of the model and on whether the interventions work. You can then embed it further into your processes, and improve it once you know it works.

One of the most common questions we get is ‘how much data do we need?’. The answer we give is a bit unsatisfying: it depends. Quantity of data is important, but it goes hand in hand with the quality of the data.

First, let’s consider the quantity. Even if the data is good and provides some signal, you may not have enough of it to make meaningful predictions. If you rely solely on data you collect yourself to train models, you solve this by collecting more data – more customers, or more interactions with them. Until you have enough data, you can’t use machine learning.

There are also ways of generating ‘synthetic’ data – artificially created data that mimics real data. Learn more on this at Towards Data Science.

Alternatively, if the data you require isn’t organisation specific, there are also public data repositories. Check out Google’s Dataset search, and Kaggle.

All of the above is caveated with whether the data is *good* to start with. Does the answer we are looking for live in the underlying data? And, is it in the right shape to make use of it for the machine learning project?

For the customer churn prediction model example: say the single greatest indicator of churn is country of residence. If we don’t collect that information from enough customers, then the model’s outputs will be weak.

Additionally, is the data structured so that we can make use of it? If it’s unstructured data, such as free text like customer comments, or if it doesn’t properly link customers with the data you have on them, then you can’t use the data in its current state. That work is likely to be a project in itself. In this case, it comes down to identifying what other information you need.

Does pursuing a long term machine learning project, implementing it, and maintaining it justify the resource you put into it? Or, are there more straightforward solutions to achieve the same result?

Often, this doesn’t mean that you should stop the project altogether. It might just mean that data science isn’t the right tool for the job, and that a much simpler solution, like better reporting dashboards and data visualisation, will do.

However, pursuing a machine learning project is often well worth the cost and resource that goes into it. Consider our Concentre project as an example, where our tool reduced the time spent analysing 200 documents from 8 hours to 30 minutes.

A good machine learning project – or any project for that matter – considers the last-mile delivery. How will we get this model operational, and in the hands of stakeholders who will get value from it?

Consider our customer churn problem again. Our tool enables the customer success team to intervene and offer some perk or incentive to retain that customer. In the abstract, this sounds great. You build the tool, you hand it over to the customer support team, and then they get on with it.

However, this is where machine learning projects can fail. Embedding these tools into existing workflows or products is the most critical bit. In our example, you might build this tool and deploy it in the cloud somewhere, and the customer success team needs to navigate over there to find a list of soon-to-churn customers. However, the customer success team spends all their time in a CRM or email, and this additional step of navigating away means they won’t use it as much, which means the organisation isn’t able to get the maximum value from it.

That’s again why it is important to define a minimum viable product or outcome at the beginning – what’s the least amount of time and resource we can spend to prove that this tool solves a problem and can be operationalised? Avoid spending time on figuring out how you can integrate the tool with your CRM, and instead make it much simpler. You can always extend it down the line.

While we have thus far presented a linear sequence, you should consider this question at the beginning, before diving into the data and value. What does an end product potentially look like? How do we ensure it gets used? Why are we doing this in the first place, and for whom?

But this linear sequence isn’t set in stone. Inevitably, when you dive into the project, the final output will vary from where you originally started. So this last-mile delivery might look different from what you expected, and you may need to adapt based on technology or data constraints.

If you have a task or problem in front of you that is well suited for data science, then this framework is the next step. Start with why this project is worth pursuing, and what the simplest version of it could look like. Then, dive into the data you have and see whether there is enough of it and whether it is telling you something meaningful. With this information in mind, make sure that continuing to pursue the machine learning project justifies the costs and resources. Finally, make sure you get it operational in a way that your team or customers can make use of. This might look like the initial MVP you sketched out at the beginning, or it could have drastically changed over the course of the project.

The post Four Questions to Consider Before Launching a Machine Learning Project appeared first on _datamettle.


In order to help businesses think about the types of problems or tasks they might approach using data science, we’ve developed a general framework for how to look at opportunities to use data science. This is meant to jumpstart a brainstorm as to where data science might be applied within an organisation, and to ensure it delivers value.

As a general rule, if you can teach a person, you can teach a machine. Machine learning and data science can be very well suited for repetitive and time-consuming tasks. This generally involves classifying large volumes of information by subject matter.

An example of this problem is our project with Concentre. They have a large volume of documents to check, and previously this was a completely human-powered endeavour. We effectively trained a machine to do this task for them, which cut the time they needed to spend checking these documents down dramatically.

Another great example of this comes from a cucumber farmer in Japan. Cucumbers need to be sorted by size and curvature once picked. This task fell to people. The cucumber farmer realised this is a perfect problem for image recognition and deep learning. So, he built a tool that took images of the cucumbers and automatically sorted them into groups based on those factors.

Tightly related to the above, these are tasks that humans often make mistakes when doing. This might be due to the amount of data a human needs to process or consider to complete the task. Or it could be because, as above, the repetition leads to boredom and mistakes.

A great example of this in our past work is a major consultancy. Their large team of consultants had specific skills and experience and were located across the world. Their clients needed certain skills, over specific time periods in specific locations. Assigning the right teams for the right jobs at the right times is an enormous undertaking, with countless factors to consider. You have to do many permutations, each change having knock-on effects for other assignments.

If you’ve had a wedding, you might have experienced a similar problem when setting up your table assignments. You want to have the right people together at tables, and there are always pairs of people you can’t seat together. Uncle Joe doesn’t get along with Aunt Emma. Each change you make leads to a cascading effect on other tables, and it ends up taking way more time than you thought. Then you share it with the groom’s mother-in-law, she doesn’t like her table, and you start over. Machine learning is a perfect fit here: you could tag participants by personality type and group them together, keeping in mind pairs that need to be kept apart.

Don’t do data science for the sake of doing data science. There needs to be a complete understanding of the problem you are trying to solve. For the cucumber farmer in Japan, they needed to classify cucumbers by size and shape. For Concentre, they needed to validate files against the metadata.

Starting with a solution or technology you want to use is the wrong way around. You can’t just throw algorithms at something and hope it returns something useful for you. Data science, machine learning and AI are not magic solutions.

Wanting to use data science to improve marketing or sales is not enough information to start with. You should avoid top-down approaches like this, and instead work up from the problem to see if data science is the right tool for the job. An example problem for sales might be not having a good way of predicting which leads will convert into customers, which leads to inefficiency and salespeople wasting time on the wrong leads when they could be nurturing the right ones. Now, this is an interesting problem to solve, and potentially one for the use of data science. Which brings us to the next part of the framework.

This is also related to the previous point. Is the problem a big enough headache that it is worth spending time solving? Does it deliver enough value to users, or enough savings or increased revenue, to embark on a potentially uncertain and time-consuming journey? This needs to be clear and obvious to all stakeholders from the beginning, because it’s not just about kicking off the project, but also ensuring everyone is bought into the final delivery of the data science product. For example, a product manager will need to see enough value in the proposed solution to prioritise getting it into a roadmap. A sales manager will need to be clear on why it improves outcomes enough to ask their sales team to change their workflows to use it.

Using our previous sales example, a good way to quantify this is to try to calculate how much time is spent qualifying leads, and how effective sales teams currently are at this. Then, you can try to better understand how the use of data science might improve accuracy, and/or cut down on the team’s time spent researching leads.

For Concentre, this was clear from the outset. They had consultants doing these checks manually for 8 hours per day, in which time a consultant could complete 20 checks and do the necessary next steps in rectifying errors. With our solution, they were able to do the same amount of work in 30 minutes because our tool took away the time-consuming portion of the task in checking the metadata against the file name. Now they can utilise those consultants on other projects for their clients.

This is meant to be a framework to help you start to think about how you might approach solving problems in your organisation, and in particular which of those problems are well suited for the use of data science. We are here as a resource to help you make use of this framework so do reach out if you have questions or want to run through specifics in your organisation.

The post A Framework for the Use of Data Science in Your Company: Four Approaches appeared first on _datamettle.

The post Perth Data Science and Tech Scene: A newcomer’s view appeared first on _datamettle.

It’s exciting how much the Perth tech ecosystem has evolved. We started mapping it out ourselves and thought it would be useful to share with our audience!

There are several Perth data science groups and resources. If you are interested in learning about data science, check out these meetups.

A WA Government initiative in partnership with Curtin University, the purpose of this hub is to support the growth of data science in WA. They have a collection of resources on events, courses and projects.

meetup.com/Perth-Machine-Learning-Group

Over 2000 strong, this group meets twice weekly and is for anyone looking to learn to code machine learning/deep learning applications.

With over 600 members, this Meetup group brings together people who are passionate about data and analytics. Unfortunately, it looks like they have gone a bit quiet of late.

Perth has several accelerators and co-working spaces. Along with WeWork, there are some local accelerators/office space providers.

Co-working space that also runs programmes for startups. They host Perth Startup Weekend, as well as the Plus Eight accelerator.

A community of technologists working to bring cutting-edge technology to industry. Organisations post the challenges they need solved, and the Unearthed platform matches them with AI and machine learning experts and teams who have the technology or expertise to help.

An organisation with the mission of growing the technology sector in Perth and across Western Australia. They have a collection of resources and run regular events.

meetup.com/Morning-Startup-Perth

With over 4000 members, this group hosts regular meetups (and sometimes even in the evening) to discuss all things startups.

Perth has had its share of successful startups from a range of sectors, including agriculture, health, edtech and robotics.

A farm management platform that brings together all the information a farmer needs. They raised an A$9m Series D last year, bringing their total raised to A$20.5m.

Australia’s largest platform for connecting users with healthcare practitioners, from GPs to dentists. HealthEngine has raised a total of A$37.8m to date, most recently in April 2017 from high-profile investor Sequoia.

If you’ve recently been at uni, you may have used Moodle. Launched in Perth in the early 2000s (so not a startup anymore), Moodle has over 50% market share in Europe, Latin America and Oceania.

Looking to raise money for your startup? There are a few investors based in Perth to check out.

Angel network based in Perth. Active since 2010, they’ve backed over 20 founders. Sign up for their newsletter and stay up to date on pitch nights and events.

A $40m fund established with support from the government’s Innovation Investment Fund. Yuuwa’s investment focus is on biotech and software.

ventnor.com.au/venture-capital

A VC based in Perth, with a focus on mining & resources, tech, biotech/health, finance, blockchain, renewables, transport and industrial.

Do you have any Perth data science, tech or startup resources we should know about? Let us know!
