Modelling Eurovision voting
This is a follow-up to our previous blog post on Eurovision voting. Here we’ll explain how we modelled the objective quality of songs and the voting biases between countries in Eurovision, and how we grouped the countries into blocks based on those biases. The source code can be found here. We’ve taken the data from a Kaggle competition, and sourced data for any missing years from Wikipedia.
The hierarchical model
The idea is that the voting outcome depends on both the inherent quality of the entry and the biases countries have for voting for each other. There are many possible ways of modelling this, but ours is fairly simple and works quite well.
Let \(r_{c_ic_jy_k}\) denote the fraction of people in country \(c_i\) that voted for country \(c_j\) in the year \(y_k\). Note that \(\sum_{j=1}^Nr_{c_ic_jy_k} = 1\), so it is reasonable to model the vector \(\mathbf{r}_{c_iy_k} = (r_{c_ic_1y_k}, \dots, r_{c_ic_Ny_k})\) as following a Dirichlet distribution:
\[\mathbf{r}_{c_iy_k} \sim \operatorname{Dir}(\beta_{c_ic_1y_k}, \dots, \beta_{c_ic_Ny_k}).\]
We choose a model where the parameters \(\beta_{c_ic_jy_k}\) decompose as
\[\beta_{c_ic_jy_k} = \exp\bigl(\theta_{c_jy_k} + \phi_{c_ic_j}\bigr),\]
where \(\theta_{c_jy_k}\) captures the objective quality of the song from country \(c_j\) in the year \(y_k\), and \(\phi_{c_ic_j}\) captures the bias country \(c_i\) has in voting (or not voting) for country \(c_j\). Furthermore, we assume that the \(\theta_{c_jy_k}\)’s and \(\phi_{c_ic_j}\)’s are drawn from an (unknown) normal distribution:
\[
\phi_{c_ic_j}, \theta_{c_jy_k}\sim N(\mu, \sigma).
\]
Note that we don’t actually have access to \(r_{c_ic_jy_k}\); we only have data on the number of points each country was awarded. But we make do with what we have and approximate \(r_{c_ic_jy_k}\) by
\[
r_{c_ic_jy_k}
\simeq
\frac{(\text{points awarded to country } c_j \text{ by country } c_i \text{ in the year } y_k) + \alpha}{(\text{total points awarded by country } c_i \text{ in the year } y_k) + N\alpha},
\]
where \(\alpha\) is a small smoothing constant (a pseudo-count that keeps every fraction strictly positive), which we set to 0.1.
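As a concrete illustration, here is a minimal Python sketch of this smoothing step. The helper function and example numbers are our own, not taken from the original source code.

```python
import numpy as np

ALPHA = 0.1  # the smoothing constant from the formula above

def approximate_fractions(points, alpha=ALPHA):
    """Approximate the voting fractions r for one voting country in one year.

    `points` is a length-N array of the points the voting country awarded
    to every competing country (0 for countries outside its top ten).
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Add the pseudo-count alpha to every country so no fraction is exactly
    # zero, then normalise so that the fractions sum to one.
    return (points + alpha) / (points.sum() + n * alpha)

# Example: a voter ranking 10 of 26 entries with the usual 12, 10, 8, ..., 1 points.
points = np.zeros(26)
points[:10] = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]
r = approximate_fractions(points)
print(r[:3], r.sum())  # fractions for the top three entries; the total is 1.0
```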
It’s hard to say for certain whether this is a reasonable approximation without being able to see the actual voting data, but preferences often follow power laws, and the decreasing sequence of points 12, 10, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0, … at least follows a similar shape:
It’s not perfect, but hopefully good enough. Note that we completely miss out on any information about the tail, but we assume that this is mostly noise that doesn’t contribute much anyway.
Fitting the model
We fit the model using Stan, a probabilistic programming language that is great for Bayesian inference. It uses Markov chain Monte Carlo methods to draw samples from the posterior distribution of the parameters given the observed votes. Stan is very powerful: all we really need to do is specify the model, then pass in our data and Stan does the rest!
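For illustration, here is a simplified sketch of how the model above could be written in Stan and fitted from Python with cmdstanpy. This is our own reconstruction, not the original source code: among other things it assumes every country competes every year and ignores the fact that a country cannot vote for itself.

```python
import cmdstanpy

# A simplified Stan version of the hierarchical model described above.
STAN_MODEL = """
data {
  int<lower=1> N;                        // number of countries
  int<lower=1> Y;                        // number of years
  int<lower=1> V;                        // number of (voter, year) observations
  array[V] int<lower=1, upper=N> voter;  // index of the voting country
  array[V] int<lower=1, upper=Y> year;   // index of the year
  array[V] simplex[N] r;                 // approximated voting fractions
}
parameters {
  real mu;                               // hyperparameters of the shared normal,
  real<lower=0> sigma;                   // left with flat hyperpriors in this sketch
  matrix[N, Y] theta;                    // objective quality of each entry
  matrix[N, N] phi;                      // bias of row-country towards column-country
}
model {
  to_vector(theta) ~ normal(mu, sigma);
  to_vector(phi) ~ normal(mu, sigma);
  for (v in 1:V) {
    // beta_{ij} = exp(theta_j + phi_{ij}) for the voting country of observation v
    vector[N] beta = exp(col(theta, year[v]) + phi[voter[v]]');
    r[v] ~ dirichlet(beta);
  }
}
"""

with open("eurovision.stan", "w") as f:
    f.write(STAN_MODEL)

model = cmdstanpy.CmdStanModel(stan_file="eurovision.stan")
# `data` would be a dict matching the data block, e.g.
# {"N": 26, "Y": 10, "V": 260, "voter": [...], "year": [...], "r": [...]}.
# fit = model.sample(data=data, chains=4, iter_sampling=4000)
```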
As Stan uses Bayesian methods, it returns a sample from the posterior distribution of the parameters, in our case consisting of 16 000 (paired) values for each parameter. In our previous analysis we simply took the means as point estimates for our parameters, but having the full distribution lets us talk about the uncertainty of these estimates. For example, the point estimate for the objective quality \(\theta\) of the winner in 2015 (Sweden) is 1.63, and 1.41 for the runner-up (Russia). This, however, doesn’t tell the full picture. Here’s a plot of the joint distribution of \(\theta\) for these two entries:
From this joint distribution we can calculate the probability that Sweden’s entry was objectively better than Russia’s entry as the proportion of samples above the blue line, and this turns out to be about 93%.
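To make that concrete, here is a small Python sketch of the calculation, continuing from the hypothetical fit above; the index values for Sweden, Russia and 2015 are placeholders.

```python
import numpy as np

# `fit` is the (hypothetical) CmdStanPy fit from the sketch above. The indices
# below are placeholders for wherever Sweden, Russia and 2015 end up in the data.
theta = fit.stan_variable("theta")   # shape: (number of draws, N, Y)
sweden, russia, y2015 = 0, 1, 0      # hypothetical indices

# The probability that Sweden's entry was objectively better is the proportion
# of paired posterior samples in which its quality parameter is the larger one.
p_sweden_better = np.mean(theta[:, sweden, y2015] > theta[:, russia, y2015])
print(f"P(Sweden's entry better than Russia's) ≈ {p_sweden_better:.0%}")
```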
Finding the blocks
Looking at the bias terms \(\phi_{c_ic_j}\), we can attempt to group the countries into groups that are tightly connected, i.e. where there are positive biases within a group and neutral or negative biases between groups. We use a method based on Information Theoretic Co-Clustering, choosing the clustering that loses as little mutual information as possible.
The basic idea can be described as follows: For each vote from country A to country B, take a marble and label it ‘from: country A, to: country B’. Now put all the marbles in a jar, and pick one at random. How much does knowing which country the vote was from tell you who it was for? For example, Romania and Moldova almost always trade 12 points, so knowing that the vote was from Romania tells me there is a high probability that the vote was for Moldova. Mutual information gives us a quantitative measure of this knowledge for the whole jar.
Now if we have clustered the countries into blocks, we can instead label the marbles ‘from: block A, to: block B’. We generally lose information by doing this: as we don’t know which country the vote was actually from, it’s harder to predict which block the vote was for. By finding the clustering that loses the least amount of information, we get the clustering that best represents the biases.
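Below is a small Python sketch of the mutual-information calculation behind this, with a made-up six-country example. It only evaluates given clusterings; the actual co-clustering method also needs a search over candidate clusterings, which is omitted here.

```python
import numpy as np

def mutual_information(counts):
    """Mutual information (in bits) of a joint 'from, to' distribution given as counts."""
    p = counts / counts.sum()
    p_from = p.sum(axis=1, keepdims=True)
    p_to = p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log2(p / (p_from * p_to))
    return np.nansum(terms)  # treat 0 * log(0) terms as zero

def block_mutual_information(counts, blocks):
    """Mutual information after relabelling each marble 'from: block, to: block'."""
    n_blocks = max(blocks) + 1
    coarse = np.zeros((n_blocks, n_blocks))
    for i, bi in enumerate(blocks):
        for j, bj in enumerate(blocks):
            coarse[bi, bj] += counts[i, j]
    return mutual_information(coarse)

# Toy example: six countries in two friendly groups of three. Each country gives
# 10 points to the other members of its own group and 1 point to everyone else.
votes = np.full((6, 6), 1.0)
np.fill_diagonal(votes, 0.0)
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                votes[i, j] = 10.0

full_mi = mutual_information(votes)
good = block_mutual_information(votes, [0, 0, 0, 1, 1, 1])  # groups kept intact
bad = block_mutual_information(votes, [0, 0, 1, 0, 1, 1])   # groups mixed up
print(full_mi, good, bad)  # the intact grouping loses far less information
```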
Below is a heatmap showing the probabilities of countries voting for each other, with our identified blocks separated by lines. We do see that the blocks capture voting behaviour fairly well: voting within the blocks is far more likely than between the blocks (with Lithuania and Georgia being a notable exception). We can also identify an “ex-Soviet block” within the “Eastern Europe” block, and a “Northern Europe” block within the “Western Europe” block, both highlighted in grey.