We are your data science workshop.

How much data do I need to do machine learning?

If you needed to make a cooking analogy for data science, then the data scientist is the chef. The models and algorithms are the recipes. That makes data the raw ingredients.

When a chef is ordering up raw ingredients, there are a couple of main things they need to consider for a great dining experience. The first is that they have enough to feed everyone. The second is that the ingredients are of high quality. Three-star Michelin meals are built around the best ingredients, not just putting the right quantity of ingredients together.

Ok, what does this have to do with machine learning

Thanks for humouring me to this point. The reason I bring this up is that so often, the question we get is “how much data do I need to do machine learning” when approaching a new project. That is, of course, an important question. If you have 10 rows of data, that is not enough. If you have 10k that probably is. However, as with cooking, it’s not just about quantity. Your ability to do machine learning that matters is tied directly to the quality of the data you are putting in.

Garbage in, Garbage out

No matter how good the chef, starting with tasteless strawberries, hothouse tomatoes, and sad lettuce will lead to a bad dish. Similarly, incomplete, unclean, or irrelevant data could mean producing bad outputs from a machine learning model.

It’s the old computer science adage: Garbage in, Garbage out. Bad data inputted into the best machine learning models lead to wrong predictions and outputs.

What is bad data?

There are a few primary things we mean by ‘bad’ data: 

  1. Doesn’t tell you what you need it to
  2. Missing, fragmented or not readily accessible
  3. Unstructured

The data doesn’t tell you what you need it to

If your recipe calls for chocolate cake, but you only have strawberries, well then I have some bad news for you.

Similarly, if you are using a machine learning model to make a prediction, but the information required to make that prediction isn’t present in the data. You are not going to end up with a quality prediction. Say, for example, you are an e-commerce site, and you want to show the exact right product to whoever lands on your website. You might know information on your users like the country they are in, what browser they use, and what device they are on. But perhaps none of that information correlates with what product they are likely to buy.

Then it becomes a matter of identifying what data source does correlate with their product preferences. This can come from some intuition, or trial and error. And if you aren’t collecting this data and the necessary scale, then you need to find ways to do so.

The data is missing, fragmented or not readily accessible

Everyone’s been there: you are about to make waffles on a lazy Sunday morning. You pull up your recipe, grab all the ingredients, and get to work. By the time you have all your dry ingredients together, you start in with the eggs and milk. Then, of course, you realize, you only have half as much milk as you need for the recipe. You try to augment it – water probably works fine in its place, right? The waffles turn out terrible, and you ruin breakfast. This may or may not have happened to me recently.

The data science equivalent is having all the necessary ingredients – but there are gaps in the data. Some entries are missing the critical details – say the age of the user to predict the right product to show them. 

Frequently this data exists somewhere. Like with the waffles example, you might have more milk in another fridge, or you might have a corner store a short walk away. But, if it isn’t where it needs to be, then you can’t make use of it.

The real-world example of this are organizations where data is collected on customers and stored across different databases and platforms. You might have some data in Google Analytics, your CRM, surveys, email marketing software, and so on. Separately, each of these products performs a useful function. However, your models will often require inputs from multiple sources, brought together to be useful in making predictions.

This is a data engineering challenge – bringing data from these various sources and combining into a single place for data scientists to first test and build models. Then, once deployed, these data pipelines need to run continuously to feed into the models and output what you need from them.

You have unstructured data

Sometimes, maybe in a pandemic situation, you find yourself digging into the freezer to find those food items you saved months ago. By now they are caked in layers of frost. You see lots of things, but there is no order or structure.

This is the unstructured data problem. You have lots of it, you just don’t know what it is, and classifying it isn’t straightforward. For example, it’s customer feedback or other forms of free text. You can have terabytes of this data, but without some classification of that data then you can’t do much with it. The challenge then becomes finding a way to put structure around this data – for instance, classifying customer feedback as positive, neutral, or negative. 

How to rescue a meal gone wrong

The other day I found myself with a flavourless pineapple. What could be more disappointing? Well, I also had some tequila, triple sec and lime. So I threw some pineapple chunks in the freezer and once frozen made some frozen pineapple margaritas. The outcome was radically different than what I had originally intended, but I was able to work with what I had and produce a very tasty alternative.

Bad data doesn’t mean starting from scratch. In fact, usually, our projects start off with cleaning and connecting up data. This usually leads us to interesting insights and a better understanding of what is possible to do with the ingredients available. Sometimes these exactly fit with the vision of the original recipe, but in many cases, we discover new and unanticipated insights during this process.

Digestif ?

To bring this all together – it’s essential to not only account for the amount of data you have but the quality of your data. You may have lots of ingredients. But, it’s impossible to cook a great meal if you start with bad ingredients, or ingredients aren’t where they need to be, or you aren’t sure what ingredients you have in the first place.

Similarly, asking how much data I need for machine learning is like asking how much salt I need to make a great meal. Instead, it’s about balancing the quantity of the data you have with the quality. Does it tell you what you need it to tell you? Is it in a place where you can make use of it? And do you know what it is? 

However, all is not lost! You can often rescue bad raw ingredients through exploration and cleaning. Sometimes these lead to new recipes and combinations that you didn’t anticipate at first, or help you solve your problem in an unexpected way.

Once you have all these elements, then you are ready to cook.

The Birthday Problem the hard way

We recently completed a round of interviews for a senior data scientist to join the Data Mettle team (welcome aboard Ying!)

Anyone who has recruited senior data scientists knows it’s a complex role to hire for. Even for a team of senior data scientists. Candidates need to have skills across coding and engineering, as well as statistics.

To get a sense of a candidate’s maths abilities, we’ve taken to asking them the birthday problem question.

If you aren’t familiar: the birthday problem, or birthday paradox, addresses the probability that any two people in a room will have the same birthday. The paradox comes from the fact that you reach 50 per cent likelihood two people will share a birthday with just 23 people in a room. With 70 people you get to 99.9% likelihood.

So, during our interview process, we ask the candidate to work through this problem by simply asking, “If you have N people in a room, how likely are they to share a birthday with each other”?

Stated otherwise, Given \(n\) people in a room, what is the probability \(P(A_n)\) that two or more of them have the same birthday? We like this problem since it shouldn’t take much longer than 15-20 minutes to solve given a reasonable background in statistics, along with some helping pointers from us.

I does have one big flaw though, and that is that \(P(A_n)\) is very complicated to calculate directly. Instead, the trick to solving the problem is to look at the complement \(P(A’_n)\): What is the probability that no two people share their birthday? This is reasonably straightforward to calculate, and \(P(A_n)\) is readily calculated from \(P(A’_n)\) as \(P(A_n) = 1 – P(A’_n)\).

Very few candidates that we interviewed actually thought of this trick, and instead attempt to plow ahead solving it as stated. But, once they get stuck we drop this hint and it generally leads to a breakthrough.

Now that our round of interviews are complete, we wanted to explore this problem in greater detail, and share the underlying mathematics that helps us solve the birthday problem.

We’ll explain the solution with the standard trick, but we’ll also attempt to solve it without considering the trick, which turns out to be a very interesting (and complex) combinatorial problem in its own right.

Note: we assume all days equally probable, and no leap year

The Standard Candidate Solution

Most candidates need a bit of help getting started, so we usually suggest considering the situation for two people, \(P(A_2)\). Let’s call them Alice and Bob.

Alice & Bob

The probability that Alice and Bob have the same birthday is fairly straightforward to calculate. Simply take Alice’s birthday as given, then the probability of Bob also having the same birthday is \(1/365\).

However, to set things up for what follows, we’ll express this probability slightly differently. First if we consider Alice in isolation, ignoring Bob, her birthday can fall on any day of the year, so the probability of her having a unique birthday (ignoring Bob for now) is \(365/365\). Now Bob’s birthday has to fall on the same day as Alice’s, and the probability for that is \(1/365\), which gives us
\[
P(A_2) = \frac{365}{365}\cdot\frac{1}{365}.
\]

Now let’s move to the case of 3 people, \(P(A_3)\). Meet Carol.

Carol

At this point, a few candidates blurt out “1/365²”, which is usually a warning sign that they are at a loss for how to proceed. This is where things start getting complicated. We need to carefully consider the various possible cases (in the pictures to the left people stacked on top of each other share the same birthday):

I. They all have different birthdays,

II. Alice & Bob share the same birthday, while Carol has a different birthday,

III. Alice & Carol share the same birthday, while Bob has a different birthday,

IV. Bob & Carol share the same birthday, while Alice has a different birthday,

V. Alice, Bob and Carol all have the same birthday.

As these are disjoint events, we can see that

\begin{align}
P(A_3) &= P(\textrm{either event II-V occurs}) \\
&= P(\textrm{event II occurs}) \ + \\
&\phantom{==} P(\textrm{event III occurs}) \ + \\
&\phantom{==} P(\textrm{event IV occurs}) \ + \\
&\phantom{==} P(\textrm{event V occurs}).
\end{align}

We can also note that the probability of either event II, III or IV happening are all equal, so we can simplify this further:

\begin{align}
P(A_3) &= P(\textrm{either event II-V occurs}) \\
&= 3P(\textrm{event II occurs}) \ + \\
&\phantom{==} P(\textrm{event V occurs}).
\end{align}

However, this is where it starts getting a bit hairy, and generalising this approach for more people quickly gets very complex (as we’ll see in a bit). Instead, at this point we indicate that perhaps there’s a better way, which usually leads the candidate to realise that

\[
P(\textrm{either event II-IV occurs}) = P(\textrm{event I does not occur}) = 1 – P(\textrm{event I occurs}),
\]

i.e. \(P(A_3) = 1 – P(A’_3)\).

So, let’s turn our attention to \(P(A’_3)\), that is, Alice, Bob and Carol all have different birthdays. Following the same approach as when calculating \(P(A_2)\), we have that

  • Alice’s birthday can fall on any day of the year (probability \(365/365\)),
  • Bob’s birthday has to fall on any day other than Alice’s birthday (probability \(364/365\)), and
  • Carol’s birthday has to fall on any day other than Alice’s and Bob’s birthday (probability \(363/365\)).

So all in all,
\[
P(A’_3) = \frac{365}{365}\cdot\frac{364}{365}\cdot\frac{363}{365}.
\]

This line of reasoning is easily extended to n people, which leads to

\[
P(A’_n) = \frac{365}{365}\cdot\frac{364}{365}\cdots\frac{365-n+1}{365} = \frac{365!}{(365-n)!\cdot 365^n},
\]

and finally we arrive at the answer:

\[
P(A_n) = 1\, -\, P(A’_n) = 1\, -\, \frac{365!}{(365-n)!\cdot 365^n}.
\]

The hard way

How do we calculate this without using the trick? If we stick to the case where \(n=3\) for a moment, it’s fairly straightforward to calculate \(P(\textrm{event II occurs})\) and \(P(\textrm{event V occurs})\), following the same argument as above. For event II, we have that:

  • Alice’s birthday can fall on any day of the year (probability \(365/365\)),
  • Bob’s birthday has to fall on Alice’s birthday (probability \(1/365\)), and
  • Carol’s birthday has to fall on any day other than Alice’s and Bob’s joint birthday (probability \(364/365\)),

so \(P(\textrm{event III occurs}) = 364/365^2\). For \(P(\textrm{event V occurs})\), they all need to have the same birthday, and the chance of this happening is
\(1/365^2\). So all in all,
\[
P(A_3) = 3\frac{364}{365^2} + \frac{1}{365^2}.
\]

Note that we should have \(P(A_3) + P(A_3′) = 1\), which is indeed true. For those interested, this is shown in the Appendix below.

Let’s turn to the general case of n people. Now the number of cases quickly gets very complicated. For illustration, in the case of 4 people (introducing Dan) there are 15 different cases (again, people stacked on top of each other share the same birthday):

With 5 people there’s 52 cases, with 6 people there’s 203 cases and so on, growing very rapidly. With 23 people there are over \(4\cdot 10^{16}\) cases!

Note that the 15 events for 4 people come if 5 distinct “shapes”:

  • No-one shares their birthday with anyone else (1 event),
  • Two people share their birthday, and the other two have their own birthdays (6 events),
  • Two people share their birthday, and the other two also share their birthday (3 events),
  • Three people share their birthday, and the remaining do not (4 events),
  • All four people have the same birthday (1 event).

Two key observations are that all events of the same shape have the same probability, and that a shape is conveniently captured by the mathematical concept of a partition of a number \(n\).

Partitions and multinomial coefficients

A partition of a number \(n\) is simply one way of writing it as a sum of positive numbers:
\[
n = a_1 + a_2 + \cdots + a_k.
\]
For example, there are five partitions of the number four,
\begin{align}
4 &= 1 + 1 + 1 + 1, \\
4 &= 2 + 1 + 1, \\
4 &= 2 + 2, \\
4 &= 3 + 1, \\
4 &= 4,
\end{align}
each corresponding to a specific shape above (i.e. \(2 + 1 + 1\) corresponds to the third case, where two people share their birthday, and the other two have distinct birthdays). Partitions are usually denoted by the Greek letter \(\lambda\), where \(\lambda\vdash n\) means that \(\lambda\) is a specific partition of the number \(n\). If we let \(N(\lambda)\) denote the number of events of shape \(\lambda\), let \(P(\lambda)\) denote the probability of a specific event of shape \(\lambda\) happening, and \(\max(\lambda)\) be the maximum term in the partition, we can write \(P(A_n)\) as
\[
P(A_n) = \mathop{\sum_{\lambda\vdash n}}_{\max(\lambda) \geq 2} N(\lambda)P(\lambda).
\qquad\qquad\textrm{(1)}
\]
Note that the sum runs over all partitions except \(n = 1 + \cdots + 1\), which corresponds to the unique event of everyone having separate birthdays.

We begin by working out \(P(\lambda)\), as it’s the easier to handle. As it turns out, \(P(\lambda)\) depends only on the “length” of the partition, i.e. the number of days covered by birthdays. Instead of proving this rigorously we’ll show it by example. Let’s look at two different cases that cover two days. First consider Alice and Bob sharing birthdays, and Carol and Dan sharing birthdays:

  • Alice’s birthday can fall on any day of the year (probability \(365/365\)),
  • Bob’s birthday has to fall on Alice’s birthday (probability \(1/365\)), and
  • Carol’s birthday has to fall on any day other than Alice’s and Bob’s joint birthday (probability \(364/365\)),
  • Dan’s birthday has to fall on Carol’s birthday (probability \(1/365\)),

so the probability of this happening is:
\[
\frac{365}{365}\cdot\frac{1}{365}\cdot\frac{364}{365}\cdot\frac{1}{365}
=
\frac{365\cdot 364}{365^4}.
\]
Now consider Alice, Bob and Carol sharing birthdays, with Dan having his birthday on a different day:

  • Alice’s birthday can fall on any day of the year (probability \(365/365\)),
  • Bob’s birthday has to fall on Alice’s birthday (probability \(1/365\)),
  • Carol’s birthday has to fall on Alice’s and Bob’s shared birthday (probability \(1/365\)), and
  • Dan’s birthday has to fall on any day apart from Alice’s, Bob’s and Carol’s shared birthday (probability \(364/365\)),

so the probability of this happening is:
\[
\frac{365}{365}\cdot\frac{1}{365}\cdot\frac{1}{365}\cdot\frac{364}{365}
=
\frac{365\cdot 364}{365^4}.
\]
Note in particular that the left hand side of both expressions have the same factors, just in a different order. Whenever we add someone sharing their birthday with the previous person, we add a factor \(1/365\), and whenever we add someone who has a “new” birthday we add a factor:

\[
\frac{365 – \textrm{number of days already covered}}{365}.
\]
So, in general, if an event covers \(k\) days, the probability of it occurring is
\[
\frac{365!}{(365-k)!365^n}.
\]
If we let \(\operatorname{length}(\lambda)\) denote the length of \(\lambda\), this means that
\[
P(\lambda) = \frac{365!}{(365-\operatorname{length}(\lambda)!365^n}.\qquad\qquad\textrm{(2)}
\]

Now all that is left is to figure out what \(N(\lambda)\) is. We can think of this as “how many ways can we fill the shape with people?”. The multinomial coefficients tell us exactly that! For a partition \(\lambda = a_1 + \cdots + a_k\) they are defined as
\[
\binom{n}{\lambda} = \binom{n}{a_1, \dots, a_k} = \frac{n!}{a_1!\cdots a_k!}.
\]

For example, for the partition \(4 = 2 + 2\) we have
\[
\binom{4}{2, 2} = \frac{4!}{2!\cdot 2!} = \frac{1\cdot 2\cdot 3\cdot 4}{1\cdot 2\cdot 1\cdot 2} = 6,
\]
so there are 6 ways to fill in the corresponding shape:

Hm, this doesn’t seem right, do we not have 3 cases of this shape above? Indeed we do! This way of “filling in the shape” actually counts the same event several times. Note that the first and fourth both correspond to Alice and Bob sharing birthdays, and Carol and Dan sharing birthdays, the two days are simply swapped. Each event in the top row is the same as the event below it. So, how do we fix this?

The problem is that the filling in method distinguishes between days in a way that we don’t want to do. We don’t care which specific day peoples birthday fall on. Whenever we have two or more days that have the same number of people sharing birthdays on those days, we count that as one case, whereas the corresponding multinomial coefficient counts all possible ways we can permute those days. To deal with this, we need to consider the length of runs of equal terms in a partition. For example, the partition
\[
7 = 2 + 2 + 1 + 1 + 1 = 2 \cdot 2 + 3 \cdot 1
\]
has one run of length \(2\) (the two 2’s) and one run of length \(3\) (the three 1’s).

To formalise this, for a permutation \(\lambda \vdash n\) we can collect equal terms and write it as
\[
n = c_1b_1 + \cdots + c_mb_m,
\]
where \(b_1 > b_2 > \cdots > b_m\). Then the numbers \(c_1\), \(c_2\), …, \(c_m\) are the lengths of all the runs of equal terms in our partition. We can now define \(s(\lambda)\), which counts the number of ways we can swap days in a shape while still being in the same event, as
\[
s(\lambda) = c_1!\cdots c_m!
\]

With all this machinery in place we’re finally in a place to write out the full formula for \(P(A_n)\). As \(s(\lambda)\) accounts for the “double counting” of the multinomial coefficients, we have that
\[
N(\lambda) = \binom{n}{\lambda} / s(\lambda). \qquad\qquad\textrm{(3)}
\]
Substituting the expressions (2) and (3) for \(P(\lambda)\) and \(N(\lambda)\) into Equation (1) above gives us
\[
P(A_n) = \mathop{\sum_{\lambda\vdash n}}_{\max(\lambda) \geq 2} \binom{n}{\lambda}\cdot\frac{365!}{(365-\operatorname{length}(\lambda))!365^ns(\lambda)}.
\]

Lovely, isn’t it?

Since we’ve done all this work, it’s worth noting that, as the number of partitions of a number \(n\) grows with \(\tilde e^{\sqrt{n}}\), the computational complexity of calculating this formula is exponential in \(n\). This should be contrasted with the linear complexity of the “trick” solution. On the other hand, this more complex solution is easy to adapt to answer more complex questions, like “what is the probability that no three people share the same birthday”.

Finally

The birthday problem isn’t intuitive right away – the idea that only 23 people need to be present to have a coinflip’s chance that two people share the same birthday. Solving this paradox is even less intuitive. If you’ve read this far though, you’ll surely nail your next interview for a data scientist position at Data Mettle.

And next time you go to a birthday party with more than 70 people just remember, you are probably forgetting to say happy birthday to at least one person.

Appendix

If you’ve made it this far, The fact that \(P(A_3) + P(A_3′) = 1\) can be seen by rearranging the terms:
\begin{align}
P(A_3) + P(A_3′)
&=
3\frac{1}{365}\cdot\frac{364}{365} + \frac{1}{365^2} + \frac{364}{365}\cdot\frac{363}{365} \\
&=
\frac{3\cdot 364 + 1 + 364\cdot 363}{365^2} \\
&=
\frac{364 + 1 + 2\cdot 364 + 363\cdot 364}{365^2} \\
&=
\frac{365 + (2 + 363)\cdot 364}{365^2} \\
&=
\frac{365\cdot 1 + 365\cdot 364}{365^2} \\
&=
\frac{365^2}{365^2} \\
&=
1.
\end{align}

Icons made by Freepik from www.flaticon.com.

_data journey

Your data can tell you a lot about your customer's journey. Our services can provide you with the information and tools that you need to match your services to customers.