
## Onboarding a data scientist

Hiring a new data scientist into your team can be a very exciting time. The right candidate can provide new insight to your organisation, automate time consuming tasks, and help to transform decision making to become more data driven. As your new team member’s start date approaches, you might start to think about how to best onboard them. In other words, what can you do, as a manager, to help them get up to speed and start adding value to your team?

# Challenges

Integrating a new data scientist into your organisation may not be straightforward for several reasons:

• You don’t understand enough of what they do to know what they need.
• The role itself is often more open and flexible.
• The data scientist’s background can range widely, from engineering and mathematics to computer science, and their prior experience can be quite varied.
• Their day to day work may be different depending on a number of factors, such as whether the company has more or less structured processes, whether it is a consultancy or product company, and whether the pace of work is fast or slow.

In this article, we provide a wide range of suggestions to design an onboarding process that considers the work environment and the data scientist’s background. There are obvious things one should do, such as introducing them to the immediate and wider team and setting up any system accesses they require. These are steps that you are likely to take for any new starter, and we will not cover them in this article. We focus on onboarding actions that would help a data scientist specifically.

# Have them design their own training

Chances are that you have hired someone you think is smart, has programming skills and prior experience in solving problems using data. But they may not be familiar with the specific problems in your industry, or maybe they haven’t been using the specific modelling technique that is commonly used in your industry, or perhaps they have previously coded in a different programming language. These are not showstoppers to them performing well in the role, and can easily be addressed with a good onboarding/training process.

By nature of the role, many data scientists either have a research background or are experienced with some form of research. What this means is that they should be able to identify the gaps in their knowledge, and effectively look for ways to learn the things they don’t know. Have them take charge of their training. This will cater for their individual background and prior experience. For example, someone with a lot of programming experience might want to spend less time learning new packages and software, and focus more on learning new mathematical concepts. Likewise, someone with a strong statistics or mathematics background might want to spend more time on programming material. Furthermore, they may already have a preference for their approach to learning new skills – some people learn best by doing, some people prefer reading conceptual material, and others benefit more from watching video courses.

Learning is most effective when sufficiently spaced out. If it is feasible as part of the onboarding process, suggest that your data scientist spend some time every day on a training resource of their choice. This could be books, research articles, video lectures, industry workshops, and industry documentation, for example.

# Assign them a small first project

Since a data scientist’s job will involve a fair amount of programming, a good onboarding activity is to give them a small easy programming task. Consider whether to choose a task that has time constraints associated with it or not. There are advantages and disadvantages to both, and the choice will depend on the company’s situation. If the work environment is more fast paced, then giving them a task that fits into the team’s day-to-day work will be immediately useful. The time constraints will mimic the real work they are expected to perform, and get them up to speed on doing this work. If their work is not as urgent, then you might prefer to give them sufficient time to learn not just the specific task, but also any peripheral knowledge. This will allow them flexibility in their learning, to focus on best practices instead of just rushing to ‘get the job done’.

Examples of small projects are:

• Perform an analysis to obtain insight on a section of company data
• Build a simple dashboard using data from the company’s database
• Write a short piece of code that fits into software that your company owns, for example, adding a new feature, or modifying a small section of the code to make it more efficient or to repurpose it
• Follow documentation to execute a piece of software your company owns

Introducing your data scientist to key subject matter experts across the business is essential – these will be the people they may go back to again and again to obtain domain information essential to their analysis. You can do this through formal or informal channels. Examples of formal channels would be including them in stakeholder meetings, and any discussions involving core business strategy, day-to-day running of the business, factors that impact on profit and loss and the types of decision making involved. This will allow them to gain a context of their work and how it fits in with the company’s overall strategy. Informal discussions are sometimes the most efficient form of knowledge transfer. You could organise a chat over lunch with the relevant stakeholders to facilitate this.

While understanding how the business operates is helpful to the new data scientist, be mindful that they need to spend their time on other areas as well, and try not to overwhelm them with too much business information at once.

# Communicate, communicate, communicate

At the start, it is important to communicate the expectations of the role, the type of problems you want them to solve, and the available resources in the company. This will help them to determine how best to get up to speed, to set learning objectives for themselves, and gather the resources to work towards your goals. If you assign them a task and there are deadlines to meet, make sure this is communicated clearly too. On the other hand, if you would like them to be free to spend their initial weeks on general upskilling, ensure they know this too. Make sure you are both aligned on a project plan to avoid rework down the track.

It’s entirely possible that the role will evolve, or you might change your mind on what you want them to work on. That’s okay too, as long as you keep them in the loop, and include them in these discussions. Data science is an interdisciplinary field and your data scientist should be adaptable.

# Set up an environment for data science

Do you have the right environment set up for your data scientist? It is important to discuss from the very beginning what kind of tools and software they will need, and what resources you currently have. This will help them to figure out what’s achievable and what’s not. Whether or not the resources you currently have are sufficient depends on the end goals you have in mind for the data science project.

One-off analyses and proof-of-concept models will most likely not require any complex setup. However, imagine, for example, that the end goal is to build a predictive model that automatically updates itself, and then to have it integrated into the business and made available to key persons on a dashboard. In this scenario, you may want to consider the technology you require in order to achieve this, and whether you might like to purchase cloud computing services or dashboard software. Also, if there is going to be more than one person working on the same set of code, then it is typically necessary to have version control software. If your company doesn’t already own a database, you may want to consider developing this alongside the data science project, especially if you envision having to make use of much more complex data in the future.

Start these discussions early, and plan as much as you can on choosing these initial systems, as it will be much harder to switch once you have set things up a certain way. Your vision and constraints will help your data scientist plan their workflow accordingly.

# Conclusion

To summarise, this article outlines some approaches on how to design an onboarding process for a data scientist. Always communicate with your data scientist, as they may have their own thoughts on technical training, any resources they require from you, and how best to work towards a data science goal. In turn, as a manager with a lot of experience in your industry, you can help to provide context and domain information to your data scientist, and connect them to key stakeholders in your company. Providing your data scientist with the right environment and resources will ensure they are set up for success.

## Are we collecting the right data?

In previous posts, we addressed the keys to starting a data science project, and the amount of data you need to do machine learning. By focusing on the question of ‘how much data do we need to do machine learning?’ it’s possible to overlook the equally important question of ‘how good is the data in the first place?’

To better answer this question on data quality, there’s an additional question you’ll need to ask yourself early on. “Are we collecting the right data for the problem we are trying to solve?” In this post, we share insights on how you can start to try to answer this. It’s best to avoid a scenario where you are spending time and money getting the requisite quantity of data to do machine learning, only to find it doesn’t contain essential information you need.

# What do we mean by the ‘right data’?

Having poor quality data can broadly be categorised into two buckets. The first is that the data is not cleaned, or readily available to be used by data scientists. This problem is often solved by cleaning and data engineering.

The second, and the focus of this post, is when the data doesn’t contain the underlying information you need to solve the problem you are attempting to solve.

As a practical example: say you are an enterprise software company, and you want to predict which website visitors are most likely to convert to paid users. You might collect all kinds of useful information, such as user location, browser type, where they were referred from, and so on. However, after playing with the data, you find none of these factors tells a story or can help you make a meaningful prediction. This data isn’t the right data.

How do you then find it?

Here, the science part of data science comes into play. Start with some hypotheses for why specific data helps you make a meaningful prediction.

Using our previous example, you might come up with a few hypotheses. What are some things you believe contribute to someone converting to a paid user? Maybe the size of the organisation? Or perhaps, their role at the company?

When developing the hypothesis, it makes sense to spend time with the stakeholders who have the most intuition about this particular problem. In our paid user example, this could include customer support or sales team members who have probably developed their own mental models of what a strong lead looks like.

This guides you towards the data you need to collect to begin making this prediction. Then it becomes a matter of collecting it.

# Squint at the data

Once you have some hypotheses about what data you think answers the question you need it to, and you’ve started collecting it, then comes the less scientific part.

Early on, you’ll need to continually analyse the data in a non-scientific way, and see if it is telling you the story you expected. Or perhaps it’s telling you a story you didn’t expect it to. At this stage, you won’t have enough records to do proper machine learning, or even to make any statistically significant conclusion. But, you can try to get a better feeling using your instincts and experience.

In the previous enterprise software example, you can monitor correlations: for instance, whether bigger companies really are more likely to convert. Or break down who is purchasing the product and check whether it aligns with your hypothesis.

# Summary

When doing machine learning, it’s critical to ask yourself along the way whether you are collecting the right data. You can glean information from your early data collecting by relying on intuition and internal knowledge of what characteristics have some impact on answering the questions you want to answer.

Generate hypotheses about what data answers this problem for you and collect that data. Then, squint at the data and see if anecdotally it is telling you what you expected. This by no means guarantees you are collecting the right data. But it is an additional item that helps you predict if you are on the right track.

Finally, keep checking and keep answering this question of whether you are collecting the right data. You can avoid spending time trying to fit increasingly complex models to your data by continually considering whether you have the right data in the first place.

## The Birthday Problem the hard way

We recently completed a round of interviews for a senior data scientist to join the Data Mettle team (welcome aboard Ying!)

Anyone who has recruited senior data scientists knows it’s a complex role to hire for. Even for a team of senior data scientists. Candidates need to have skills across coding and engineering, as well as statistics.

To get a sense of a candidate’s maths abilities, we’ve taken to asking them the birthday problem question.

If you aren’t familiar: the birthday problem, or birthday paradox, addresses the probability that any two people in a room will have the same birthday. The paradox comes from the fact that you reach 50 per cent likelihood two people will share a birthday with just 23 people in a room. With 70 people you get to 99.9% likelihood.

So, during our interview process, we ask the candidate to work through this problem by simply asking, “If you have N people in a room, how likely are they to share a birthday with each other?”

Stated otherwise: given $$n$$ people in a room, what is the probability $$P(A_n)$$ that two or more of them have the same birthday? We like this problem since it shouldn’t take much longer than 15-20 minutes to solve given a reasonable background in statistics, along with some helping pointers from us.

It does have one big flaw though, and that is that $$P(A_n)$$ is very complicated to calculate directly. Instead, the trick to solving the problem is to look at the complement $$P(A'_n)$$: what is the probability that no two people share their birthday? This is reasonably straightforward to calculate, and $$P(A_n)$$ is readily calculated from $$P(A'_n)$$ as $$P(A_n) = 1 - P(A'_n)$$.

Very few of the candidates we interviewed actually thought of this trick; most instead attempted to plow ahead and solve it as stated. But once they got stuck, we dropped this hint and it generally led to a breakthrough.

Now that our round of interviews is complete, we wanted to explore this problem in greater detail, and share the underlying mathematics that helps us solve the birthday problem.

We’ll explain the solution with the standard trick, but we’ll also attempt to solve it without considering the trick, which turns out to be a very interesting (and complex) combinatorial problem in its own right.

Note: we assume that all days are equally probable and that there are no leap years.

## The Standard Candidate Solution

Most candidates need a bit of help getting started, so we usually suggest considering the situation for two people, $$P(A_2)$$. Let’s call them Alice and Bob.

The probability that Alice and Bob have the same birthday is fairly straightforward to calculate. Simply take Alice’s birthday as given, then the probability of Bob also having the same birthday is $$1/365$$.

However, to set things up for what follows, we’ll express this probability slightly differently. First if we consider Alice in isolation, ignoring Bob, her birthday can fall on any day of the year, so the probability of her having a unique birthday (ignoring Bob for now) is $$365/365$$. Now Bob’s birthday has to fall on the same day as Alice’s, and the probability for that is $$1/365$$, which gives us
$P(A_2) = \frac{365}{365}\cdot\frac{1}{365}.$

Now let’s move to the case of 3 people, $$P(A_3)$$. Meet Carol.

At this point, a few candidates blurt out “1/365²”, which is usually a warning sign that they are at a loss for how to proceed. This is where things start getting complicated. We need to carefully consider the various possible cases (in the pictures to the left people stacked on top of each other share the same birthday):

I. They all have different birthdays,

II. Alice & Bob share the same birthday, while Carol has a different birthday,

III. Alice & Carol share the same birthday, while Bob has a different birthday,

IV. Bob & Carol share the same birthday, while Alice has a different birthday,

V. Alice, Bob and Carol all have the same birthday.

As these are disjoint events, we can see that

\begin{align}
P(A_3) &= P(\textrm{either event II-V occurs}) \\
&= P(\textrm{event II occurs}) \ + \\
&\phantom{==} P(\textrm{event III occurs}) \ + \\
&\phantom{==} P(\textrm{event IV occurs}) \ + \\
&\phantom{==} P(\textrm{event V occurs}).
\end{align}

We can also note that the probabilities of events II, III and IV are all equal, so we can simplify this further:

\begin{align}
P(A_3) &= P(\textrm{either event II-V occurs}) \\
&= 3P(\textrm{event II occurs}) \ + \\
&\phantom{==} P(\textrm{event V occurs}).
\end{align}

However, this is where it starts getting a bit hairy, and generalising this approach for more people quickly gets very complex (as we’ll see in a bit). Instead, at this point we indicate that perhaps there’s a better way, which usually leads the candidate to realise that

$P(\textrm{either event II-V occurs}) = P(\textrm{event I does not occur}) = 1 - P(\textrm{event I occurs}),$

i.e. $$P(A_3) = 1 - P(A'_3)$$.

So, let’s turn our attention to $$P(A'_3)$$, that is, the probability that Alice, Bob and Carol all have different birthdays. Following the same approach as when calculating $$P(A_2)$$, we have that

• Alice’s birthday can fall on any day of the year (probability $$365/365$$),
• Bob’s birthday has to fall on any day other than Alice’s birthday (probability $$364/365$$), and
• Carol’s birthday has to fall on any day other than Alice’s and Bob’s birthday (probability $$363/365$$).

So all in all,
$P(A'_3) = \frac{365}{365}\cdot\frac{364}{365}\cdot\frac{363}{365}.$

This line of reasoning is easily extended to n people, which leads to

$P(A'_n) = \frac{365}{365}\cdot\frac{364}{365}\cdots\frac{365-n+1}{365} = \frac{365!}{(365-n)!\cdot 365^n},$

and finally we arrive at the answer:

$P(A_n) = 1\, -\, P(A'_n) = 1\, -\, \frac{365!}{(365-n)!\cdot 365^n}.$
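As a quick numerical check of this formula, here is a minimal Python sketch. The function name `p_shared_birthday` and the `days` parameter are our own illustrative choices, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

DAYS = 365  # assumption from the text: equally likely days, no leap year


def p_shared_birthday(n: int, days: int = DAYS) -> Fraction:
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = Fraction(1)
    for i in range(n):
        # The (i+1)-th person must avoid the i days already taken.
        p_all_distinct *= Fraction(days - i, days)
    return 1 - p_all_distinct


for n in (2, 3, 23, 70):
    print(n, float(p_shared_birthday(n)))
```

Running it for 23 and 70 people should reproduce the figures quoted at the start of the post: just over one half, and just over 99.9%.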

## The hard way

How do we calculate this without using the trick? If we stick to the case where $$n=3$$ for a moment, it’s fairly straightforward to calculate $$P(\textrm{event II occurs})$$ and $$P(\textrm{event V occurs})$$, following the same argument as above. For event II, we have that:

• Alice’s birthday can fall on any day of the year (probability $$365/365$$),
• Bob’s birthday has to fall on Alice’s birthday (probability $$1/365$$), and
• Carol’s birthday has to fall on any day other than Alice’s and Bob’s joint birthday (probability $$364/365$$),

so $$P(\textrm{event II occurs}) = 364/365^2$$. For $$P(\textrm{event V occurs})$$, they all need to have the same birthday, and the chance of this happening is $$1/365^2$$. So all in all,
$P(A_3) = 3\frac{364}{365^2} + \frac{1}{365^2}.$

Note that we should have $$P(A_3) + P(A'_3) = 1$$, which is indeed true. For those interested, this is shown in the Appendix below.
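For readers who prefer to let a computer do the checking, the sketch below compares the case-by-case expression for three people with a brute-force enumeration. Enumerating all $$365^3$$ triples is wasteful, so we assume a hypothetical 7-day “year” for the enumeration; the function names are ours.

```python
from fractions import Fraction
from itertools import product


def p_a3_by_cases(days: int) -> Fraction:
    """P(A_3) as 3 * P(event II) + P(event V), for a year of `days` days."""
    p_event_ii = Fraction(days - 1, days**2)
    p_event_v = Fraction(1, days**2)
    return 3 * p_event_ii + p_event_v


def p_a3_by_enumeration(days: int) -> Fraction:
    """P(A_3) by listing every possible triple of birthdays."""
    triples = list(product(range(days), repeat=3))
    shared = sum(1 for t in triples if len(set(t)) < 3)
    return Fraction(shared, len(triples))


# A hypothetical 7-day "year" keeps the enumeration tiny (7**3 = 343 triples).
print(p_a3_by_cases(7), p_a3_by_enumeration(7))

# For the real 365-day year, add the complement and check the total.
p_complement = Fraction(364 * 363, 365**2)
print(p_a3_by_cases(365) + p_complement)
```

The two printed values for the 7-day year should agree, and the last line should print exactly 1.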

Let’s turn to the general case of n people. Now the number of cases quickly gets very complicated. For illustration, in the case of 4 people (introducing Dan) there are 15 different cases (again, people stacked on top of each other share the same birthday):

With 5 people there are 52 cases, with 6 people there are 203 cases and so on, growing very rapidly. With 23 people there are over $$4\cdot 10^{16}$$ cases!
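These case counts are the Bell numbers, which count the ways of splitting a set of labelled people into groups. A minimal sketch using the Bell triangle recurrence (the function name `bell_number` is ours) reproduces them:

```python
def bell_number(n: int) -> int:
    """Number of ways to split n labelled people into groups (Bell number)."""
    row = [1]  # first row of the Bell triangle
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the neighbour from the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


for people in (4, 5, 6, 23):
    print(people, bell_number(people))
```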

Note that the 15 events for 4 people come in 5 distinct “shapes”:

• No-one shares their birthday with anyone else (1 event),
• Two people share their birthday, and the other two have their own birthdays (6 events),
• Two people share their birthday, and the other two also share their birthday (3 events),
• Three people share their birthday, and the remaining person does not (4 events),
• All four people have the same birthday (1 event).

Two key observations are that all events of the same shape have the same probability, and that a shape is conveniently captured by the mathematical concept of a partition of a number $$n$$.

## Partitions and multinomial coefficients

A partition of a number $$n$$ is simply one way of writing it as a sum of positive numbers:
$n = a_1 + a_2 + \cdots + a_k.$
For example, there are five partitions of the number four,
\begin{align}
4 &= 1 + 1 + 1 + 1, \\
4 &= 2 + 1 + 1, \\
4 &= 2 + 2, \\
4 &= 3 + 1, \\
4 &= 4,
\end{align}
each corresponding to a specific shape above (e.g. $$2 + 1 + 1$$ corresponds to the second shape, where two people share their birthday, and the other two have distinct birthdays). Partitions are usually denoted by the Greek letter $$\lambda$$, where $$\lambda\vdash n$$ means that $$\lambda$$ is a specific partition of the number $$n$$. If we let $$N(\lambda)$$ denote the number of events of shape $$\lambda$$, let $$P(\lambda)$$ denote the probability of a specific event of shape $$\lambda$$ happening, and $$\max(\lambda)$$ be the maximum term in the partition, we can write $$P(A_n)$$ as
$P(A_n) = \mathop{\sum_{\lambda\vdash n}}_{\max(\lambda) \geq 2} N(\lambda)P(\lambda). \qquad\qquad\textrm{(1)}$
Note that the sum runs over all partitions except $$n = 1 + \cdots + 1$$, which corresponds to the unique event of everyone having separate birthdays.
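To make the index set of this sum concrete, here is one possible way to enumerate partitions in Python. It is only a sketch: the generator `partitions` and its `largest` argument are our own naming, and each partition is represented as a non-increasing tuple.

```python
from typing import Iterator, Optional, Tuple


def partitions(n: int, largest: Optional[int] = None) -> Iterator[Tuple[int, ...]]:
    """Yield every partition of n as a non-increasing tuple of positive ints."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    # Pick the first (largest) term, then partition what is left
    # using terms no bigger than the one just picked.
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


all_shapes = list(partitions(4))
print(all_shapes)

# The sum in Equation (1) skips the all-ones partition.
print([lam for lam in all_shapes if max(lam) >= 2])
```

For $$n = 4$$ this lists the five partitions above, four of which have $$\max(\lambda) \geq 2$$.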

We begin by working out $$P(\lambda)$$, as it’s the easier of the two to handle. As it turns out, $$P(\lambda)$$ depends only on the “length” of the partition, i.e. the number of days covered by birthdays. Instead of proving this rigorously we’ll show it by example. Let’s look at two different cases that cover two days. First consider Alice and Bob sharing birthdays, and Carol and Dan sharing birthdays:

• Alice’s birthday can fall on any day of the year (probability $$365/365$$),
• Bob’s birthday has to fall on Alice’s birthday (probability $$1/365$$),
• Carol’s birthday has to fall on any day other than Alice’s and Bob’s joint birthday (probability $$364/365$$), and
• Dan’s birthday has to fall on Carol’s birthday (probability $$1/365$$),

so the probability of this happening is:
$\frac{365}{365}\cdot\frac{1}{365}\cdot\frac{364}{365}\cdot\frac{1}{365} = \frac{365\cdot 364}{365^4}.$
Now consider Alice, Bob and Carol sharing birthdays, with Dan having his birthday on a different day:

• Alice’s birthday can fall on any day of the year (probability $$365/365$$),
• Bob’s birthday has to fall on Alice’s birthday (probability $$1/365$$),
• Carol’s birthday has to fall on Alice’s and Bob’s shared birthday (probability $$1/365$$), and
• Dan’s birthday has to fall on any day apart from Alice’s, Bob’s and Carol’s shared birthday (probability $$364/365$$),

so the probability of this happening is:
$\frac{365}{365}\cdot\frac{1}{365}\cdot\frac{1}{365}\cdot\frac{364}{365} = \frac{365\cdot 364}{365^4}.$
Note in particular that the left-hand sides of both expressions have the same factors, just in a different order. Whenever we add someone sharing their birthday with a previous person, we add a factor $$1/365$$, and whenever we add someone who has a “new” birthday we add a factor:

$\frac{365 - \textrm{number of days already covered}}{365}.$
So, in general, if an event covers $$k$$ days, the probability of it occurring is
$\frac{365!}{(365-k)!365^n}.$
If we let $$\operatorname{length}(\lambda)$$ denote the length of $$\lambda$$, this means that
$P(\lambda) = \frac{365!}{(365-\operatorname{length}(\lambda))!365^n}.\qquad\qquad\textrm{(2)}$

Now all that is left is to figure out what $$N(\lambda)$$ is. We can think of this as “how many ways can we fill the shape with people?”. The multinomial coefficients tell us exactly that! For a partition $$\lambda = a_1 + \cdots + a_k$$ they are defined as
$\binom{n}{\lambda} = \binom{n}{a_1, \dots, a_k} = \frac{n!}{a_1!\cdots a_k!}.$

For example, for the partition $$4 = 2 + 2$$ we have
$\binom{4}{2, 2} = \frac{4!}{2!\cdot 2!} = \frac{1\cdot 2\cdot 3\cdot 4}{1\cdot 2\cdot 1\cdot 2} = 6,$
so there are 6 ways to fill in the corresponding shape:

Hm, this doesn’t seem right, do we not have 3 cases of this shape above? Indeed we do! This way of “filling in the shape” actually counts the same event several times. Note that the first and fourth both correspond to Alice and Bob sharing birthdays, and Carol and Dan sharing birthdays, the two days are simply swapped. Each event in the top row is the same as the event below it. So, how do we fix this?

The problem is that the filling in method distinguishes between days in a way that we don’t want to do. We don’t care which specific day people’s birthdays fall on. Whenever we have two or more days that have the same number of people sharing birthdays on those days, we count that as one case, whereas the corresponding multinomial coefficient counts all possible ways we can permute those days. To deal with this, we need to consider the length of runs of equal terms in a partition. For example, the partition
$7 = 2 + 2 + 1 + 1 + 1 = 2 \cdot 2 + 3 \cdot 1$
has one run of length $$2$$ (the two 2’s) and one run of length $$3$$ (the three 1’s).

To formalise this, for a partition $$\lambda \vdash n$$ we can collect equal terms and write it as
$n = c_1b_1 + \cdots + c_mb_m,$
where $$b_1 > b_2 > \cdots > b_m$$. Then the numbers $$c_1$$, $$c_2$$, …, $$c_m$$ are the lengths of all the runs of equal terms in our partition. We can now define $$s(\lambda)$$, which counts the number of ways we can swap days in a shape while still being in the same event, as
$s(\lambda) = c_1!\cdots c_m!$

With all this machinery in place we’re finally in a position to write out the full formula for $$P(A_n)$$. As $$s(\lambda)$$ accounts for the “double counting” of the multinomial coefficients, we have that
$N(\lambda) = \binom{n}{\lambda} / s(\lambda). \qquad\qquad\textrm{(3)}$
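As a sanity check on Equation (3), the sketch below computes $$N(\lambda)$$ for the five partitions of four and confirms that the counts match the 1, 6, 3, 4 and 1 events listed earlier. The helper names (`multinomial`, `s`, `n_events`) are ours, and a partition is passed in as a tuple of its terms.

```python
from collections import Counter
from math import factorial
from typing import Tuple


def multinomial(parts: Tuple[int, ...]) -> int:
    """n! / (a_1! * ... * a_k!) for the partition n = a_1 + ... + a_k."""
    result = factorial(sum(parts))
    for a in parts:
        result //= factorial(a)
    return result


def s(parts: Tuple[int, ...]) -> int:
    """Product of c! over the run lengths c of equal terms."""
    result = 1
    for run_length in Counter(parts).values():
        result *= factorial(run_length)
    return result


def n_events(parts: Tuple[int, ...]) -> int:
    """Equation (3): number of distinct events of the given shape."""
    return multinomial(parts) // s(parts)


for shape in [(1, 1, 1, 1), (2, 1, 1), (2, 2), (3, 1), (4,)]:
    print(shape, multinomial(shape), s(shape), n_events(shape))
```

The counts add up to 15, the total number of cases for four people.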
Substituting the expressions (2) and (3) for $$P(\lambda)$$ and $$N(\lambda)$$ into Equation (1) above gives us
$P(A_n) = \mathop{\sum_{\lambda\vdash n}}_{\max(\lambda) \geq 2} \binom{n}{\lambda}\cdot\frac{365!}{(365-\operatorname{length}(\lambda))!365^ns(\lambda)}.$

Lovely, isn’t it?
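To convince ourselves that the sum over partitions agrees with the trick solution, we can evaluate both with exact fractions. This is only a sketch under the assumptions of the post (365 equally likely days); all function names are ours, and the partition generator is just one of several ways to enumerate partitions.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

DAYS = 365  # equally likely days, no leap year


def partitions(n, largest=None):
    """Yield every partition of n as a non-increasing tuple."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def n_events(parts):
    """Equation (3): multinomial coefficient divided by s(lambda)."""
    denominator = 1
    for a in parts:
        denominator *= factorial(a)
    for run_length in Counter(parts).values():
        denominator *= factorial(run_length)
    return factorial(sum(parts)) // denominator


def p_event(parts, days=DAYS):
    """Equation (2): probability of one specific event of this shape."""
    n, k = sum(parts), len(parts)
    numerator = 1
    for i in range(k):
        numerator *= days - i  # builds days! / (days - k)!
    return Fraction(numerator, days**n)


def p_shared_hard_way(n, days=DAYS):
    """Sum N(lambda) * P(lambda) over partitions with a term >= 2."""
    return sum(
        (n_events(lam) * p_event(lam, days)
         for lam in partitions(n) if max(lam, default=0) >= 2),
        Fraction(0),
    )


def p_shared_trick(n, days=DAYS):
    """The standard solution: one minus the all-distinct probability."""
    p_distinct = Fraction(1)
    for i in range(n):
        p_distinct *= Fraction(days - i, days)
    return 1 - p_distinct


for n in (3, 4, 10, 23):
    print(n, p_shared_hard_way(n) == p_shared_trick(n),
          float(p_shared_hard_way(n)))
```

Each line should report that the two approaches agree exactly.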

Since we’ve done all this work, it’s worth noting that, as the number of partitions of a number $$n$$ grows roughly like $$e^{c\sqrt{n}}$$ for a constant $$c$$, the cost of evaluating this formula grows exponentially in $$\sqrt{n}$$, which is faster than any polynomial in $$n$$. This should be contrasted with the linear complexity of the “trick” solution. On the other hand, this more complex solution is easy to adapt to answer more complex questions, like “what is the probability that no three people share the same birthday”.
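As an illustration of that last point, the sketch below answers the “no three people share a birthday” question by summing over only those partitions whose terms are all at most 2. It is our own adaptation of the formula rather than something derived in the post, and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def partitions(n, largest=None):
    """Yield the partitions of n whose terms are all at most `largest`."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def shape_probability(parts, days=365):
    """N(lambda) * P(lambda): total probability of all events of one shape."""
    n, k = sum(parts), len(parts)
    denominator = 1
    for a in parts:
        denominator *= factorial(a)
    for run_length in Counter(parts).values():
        denominator *= factorial(run_length)
    n_events = factorial(n) // denominator
    covered = 1
    for i in range(k):
        covered *= days - i  # builds days! / (days - k)!
    return n_events * Fraction(covered, days**n)


def p_no_three_share(n, days=365):
    """Probability that no birthday is shared by three or more people."""
    return sum(
        (shape_probability(lam, days) for lam in partitions(n, largest=2)),
        Fraction(0),
    )


for n in (3, 23, 50):
    print(n, float(p_no_three_share(n)))
```

For three people the result is $$1 - 1/365^2$$, matching the probability of event V worked out earlier.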

## Finally

The birthday problem isn’t intuitive right away – the idea that only 23 people need to be present to have a coinflip’s chance that two people share the same birthday. Solving this paradox is even less intuitive. If you’ve read this far though, you’ll surely nail your next interview for a data scientist position at Data Mettle.

And next time you go to a birthday party with more than 70 people just remember, you are probably forgetting to say happy birthday to at least one person.

## Appendix

If you’ve made it this far, here is why $$P(A_3) + P(A'_3) = 1$$. It can be seen by rearranging the terms:
\begin{align}
P(A_3) + P(A'_3)
&=
3\frac{1}{365}\cdot\frac{364}{365} + \frac{1}{365^2} + \frac{364}{365}\cdot\frac{363}{365} \\
&=
\frac{3\cdot 364 + 1 + 364\cdot 363}{365^2} \\
&=
\frac{364 + 1 + 2\cdot 364 + 363\cdot 364}{365^2} \\
&=
\frac{365 + (2 + 363)\cdot 364}{365^2} \\
&=
\frac{365\cdot 1 + 365\cdot 364}{365^2} \\
&=
\frac{365^2}{365^2} \\
&=
1.
\end{align}

Icons made by Freepik from www.flaticon.com.