Stochastic Variational Inference in the TrueSkill Model
Julia Assignment 2, version 2
STA414/STA2104 and CSC412/CSC2506
The goal of this assignment is to get you familiar with the basics of Bayesian inference in large models with continuous latent variables, and the basics of stochastic variational inference.
Background. We’ll implement a variant of the TrueSkill model, a player ranking system for competitive games originally developed for Halo 2. It is a generalization of the Elo rating system used in chess.
0.1 Model definition
We’ll consider a slightly simplified version of the original TrueSkill model. We assume that each player has a true, but unknown, skill zi ∈ R. We use N to denote the number of players.
The prior. The prior over each player’s skill is a standard normal distribution, and all players’ skills are a priori independent.
The likelihood. For each observed game, the probability that player A beats player B, given the players’ skills zA and zB, is

p(A beats B | zA, zB) = σ(zA − zB),   where   σ(y) = 1 / (1 + exp(−y)).
There can be more than one game played between a pair of players, and in this case the outcome of each game is independent given the players’ skills. We use M to denote the number of games.
The data. The data will be an array of game outcomes. Each row contains a pair of player indices. The first index in each pair is the winner of the game, the second index is the loser. If there were M games played, then the array has shape M × 2.
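For example, under this encoding a hypothetical array for M = 3 games could look like:

    # Player 1 beat player 2 twice; player 3 beat player 1 once.
    games = [1 2;
             1 2;
             3 1]    # M × 2: column 1 = winner, column 2 = loser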
1 Implementing the model [10 points]
- [2 points] Implement a function log_prior that computes the log of the prior over all players’ skills. Specifically, given a K × N array where each row is a setting of the skills for all N players, it returns a K × 1 array, where each row contains a scalar giving the log-prior for that set of skills.
- [3 points] Implement a function logp_a_beats_b that, given a pair of skills za and zb, evaluates the log-likelihood that the player with skill za beat the player with skill zb under the model detailed above. To ensure numerical stability, use the function log1pexp, which computes log(1 + exp(x)) in a numerically stable way. This function is provided by StatsFuns.jl and imported already, and is also available in Python’s numpy.
- [3 points] Assuming all game outcomes are i.i.d. conditioned on all players’ skills, implement a function all_games_log_likelihood that takes a batch of player skills zs and a collection of observed games games, and gives a batch of log-likelihoods for those observations. Specifically, given a K × N array where each row is a setting of the skills for all N players, and an M × 2 array of game outcomes, it returns a K × 1 array, where each row contains a scalar giving the log-likelihood of all games for that set of skills. Hint: You should be able to write this function without using for loops, although you might want to start that way to make sure what you’ve written is correct. If A is an array of integers, you can index the corresponding entries of another matrix B for every entry in A by writing B[A].
- [2 points] Implement a function joint_log_density which combines the log-prior and the log-likelihood of the observations to give log p(z1, z2, . . . , zN, all game outcomes). A sketch of one possible implementation of all four functions follows this list.
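For concreteness, here is one possible shape for these four functions — a minimal sketch rather than the official solution. It assumes log1pexp comes from StatsFuns.jl (as in the starter code) and follows the K × N and M × 2 array conventions described above:

    using StatsFuns: log1pexp  # numerically stable log(1 + exp(x))

    # zs is K × N (each row is one setting of all N players' skills);
    # returns a K × 1 array of log densities under the standard normal prior.
    log_prior(zs) = -0.5 .* sum(zs .^ 2, dims = 2) .- 0.5 * size(zs, 2) * log(2π)

    # log σ(za - zb) = -log(1 + exp(-(za - zb))) = -log1pexp(zb - za)
    logp_a_beats_b(za, zb) = -log1pexp.(zb .- za)

    # zs is K × N, games is M × 2 ([winner loser] per row); returns K × 1.
    function all_games_log_likelihood(zs, games)
        zw = zs[:, games[:, 1]]                      # K × M skills of winners
        zl = zs[:, games[:, 2]]                      # K × M skills of losers
        return sum(logp_a_beats_b(zw, zl), dims = 2)
    end

    joint_log_density(zs, games) = log_prior(zs) .+ all_games_log_likelihood(zs, games)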
2 Examining the posterior for only two players and toy data [10 points]
To get a feel for this model, we’ll first consider the case where we only have 2 players, A and B. We’ll examine how the prior and likelihood interact when conditioning on different sets of games.
Provided in the starter code is a function skillcontour! which evaluates a given function on a grid of zA and zB values and plots its isocontours. There is also a function plot_line_equal_skill!. We have included an example of how to use these functions.
We also provide a function two_player_toy_games which produces toy data for two players. For example, two_player_toy_games(5, 3) produces a dataset where player A wins 5 games and player B wins 3 games.
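For reference, a hypothetical implementation of this helper (the starter code’s actual version may differ):

    # Hypothetical version: each row is one game stored as [winner loser],
    # with player A (index 1) listed as the winner p1_wins times.
    two_player_toy_games(p1_wins, p2_wins) =
        vcat(repeat([1 2], p1_wins), repeat([2 1], p2_wins))

With this convention, two_player_toy_games(5, 3) returns an 8 × 2 integer array.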
- [2 points] For two players A and B, plot the isocontours of the joint prior over their skills. Also plot the line of equal skill, zA = zB. Hint: you’ve already implemented the log of the prior density.
- [2 points] Plot the isocontours of the likelihood function. Also plot the line of equal skill, zA = zB.
- [2 points] Plot the isocontours of the joint posterior over zA and zB given that player A beat player B in one match. Since the contours don’t depend on the normalization constant, you can simply plot the isocontours of the log of the joint distribution p(zA, zB, A beat B). Also plot the line of equal skill, zA = zB.
- [2 points] Plot the isocontours of the joint posterior over zA and zB given that 10 matches were played, and player A beat player B all 10 times. Also plot the line of equal skill, zA = zB.
- [2 points] Plot the isocontours of the joint posterior over zA and zB given that 20 matches were played, and each player beat the other 10 times. Also plot the line of equal skill, zA = zB.
For all plots, label both axes.
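As a heavily hedged usage example — assuming Plots.jl is loaded as in the starter code, that skillcontour!’s function argument receives a length-2 collection of skills (zA, zB), and building on the hypothetical joint_log_density above; check the starter code’s included example for the exact signature:

    # Hypothetical usage: posterior contours after A beats B once.
    toy_games = two_player_toy_games(1, 0)
    density(z) = exp(joint_log_density(reshape(collect(z), 1, 2), toy_games)[1])
    plot(title = "Joint posterior, A beat B once", xlabel = "zA", ylabel = "zB")
    skillcontour!(density)
    plot_line_equal_skill!()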
3 Stochastic Variational Inference on Two Players and Toy Data [18 points]
One nice thing about a Bayesian approach is that it separates the model specification from the approximate inference strategy. The original TrueSkill paper from 2007 used message passing. Carl Rasmussen’s assignment uses Gibbs sampling, a form of Markov chain Monte Carlo. We’ll use gradient-based stochastic variational inference, which wasn’t invented until around 2014.
In this question we will optimize an approximate posterior distribution with stochastic variational inference to approximate the true posterior.
- [5 points] Implement a function elbo which computes an unbiased estimate of the evidence lower bound (ELBO). As discussed in class, the ELBO is equal to the negative KL divergence between an approximate posterior qφ(z|data) and the true posterior p(z|data), plus an unknown constant (the log evidence). Use a fully factorized Gaussian distribution for qφ(z|data). This estimator takes the following arguments:
- params, the parameters φ of the approximate posterior qφ(z|data).
- logp, a function equal to the log of the true posterior plus a constant. This function must take a batch of samples of z. If we have N players, we can consider B samples from the joint over all players’ skills. This batch of samples zs will be an array with dimensions (N, B).
- num_samples, the number of samples to draw.
This function should return a single scalar. Hint: you will need to use the reparameterization trick when sampling zs. A sketch of this estimator appears at the end of this section.
- [2 points] Write a loss function called neg_toy_elbo that takes variational distribution parameters and an array of game outcomes, and returns the negative ELBO estimate computed with 100 samples.
- [5 points] Write an optimization function called fit_toy_variational_dist which takes initial variational parameters and the evidence. It will perform a number of iterations of gradient descent, where each iteration will:
- Compute the gradient of the loss with respect to the parameters using automatic differentiation.
- Update the parameters by taking an lr-scaled step in the direction of the descending gradient.
- Report the loss with the new parameters (using @info or print statements).
- On the same set of axes, plot the target distribution in red and the variational approximation in blue.
Return the parameters resulting from training.
- [2 points] Initialize variational distribution parameters and optimize them to approximate the joint where we observe player A winning 1 game. Report the final loss. Also plot the optimized variational approximation contours (in blue) and the target distribution (in red) on the same axes.
- [2 points] Initialize variational distribution parameters and optimize them to approximate the joint where we observe player A winning 10 games. Report the final loss. Also plot the optimized variational approximation contours (in blue) and the target distribution (in red) on the same axes.
- [2 points] Initialize variational distribution parameters and optimize them to approximate the joint where we observe player A winning 10 games and player B winning 10 games. Report the final loss. Also plot the optimized variational approximation contours (in blue) and the target distribution (in red) on the same axes.
For all plots, label both axes.
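To make the estimator and optimizer concrete, here is a minimal sketch of one way the pieces could fit together — not the official solution. It assumes params is a tuple (µ, logσ) of length-N vectors parameterizing the fully factorized Gaussian, that joint_log_density follows the K × N convention from question 1, and that Zygote.jl is the automatic-differentiation backend; per-iteration plotting is omitted for brevity:

    using Statistics: mean
    using Zygote: gradient  # assumed AD backend

    # Unbiased ELBO estimate via the reparameterization trick.
    function elbo(params, logp, num_samples)
        μ, logσ = params
        ϵ = randn(length(μ), num_samples)    # N × B standard normal noise
        zs = μ .+ exp.(logσ) .* ϵ            # reparameterized samples, N × B
        # log q(z) under the factorized Gaussian, one value per sample (1 × B)
        logq = sum(-logσ .- 0.5 .* ϵ .^ 2 .- 0.5 * log(2π), dims = 1)
        return mean(logp(zs) .- logq)        # E_q[log p(z, data) - log q(z)]
    end

    # Negative ELBO with 100 samples; adapts joint_log_density's K × N
    # convention to the (N, B) batches used here.
    neg_toy_elbo(params; games) =
        -elbo(params, zs -> joint_log_density(zs', games)', 100)

    # Plain gradient descent on the negative ELBO.
    function fit_toy_variational_dist(init_params, toy_evidence; num_itrs = 200, lr = 1e-2)
        μ, logσ = init_params
        for i in 1:num_itrs
            gμ, glogσ = gradient((m, s) -> neg_toy_elbo((m, s); games = toy_evidence), μ, logσ)
            μ, logσ = μ .- lr .* gμ, logσ .- lr .* glogσ
            @info "iteration $i" loss = neg_toy_elbo((μ, logσ); games = toy_evidence)
        end
        return (μ, logσ)
    end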
4 Approximate inference conditioned on real data [24 points]
Load the dataset from tennis_data.mat, which contains two matrices:
- W is a 107 × 1 matrix whose i’th entry is the name of player i.
- G is a 1801 × 2 matrix of game outcomes (actually tennis matches), one row per game. The first column contains the indices of the players who won; the second column contains the indices of the players who lost.
Compute the following using your code from the earlier questions in the assignment, but conditioning on the tennis match outcomes:
- [1 point] For any two players i and j, p(zi, zj | all games) is always proportional to p(zi, zj, all games). In general, are the isocontours of p(zi, zj | all games) the same as those of p(zi, zj | games between i and j)? That is, do the games between other players besides i and j provide information about the skills of players i and j? A simple yes or no suffices.
Hint: One way to answer this is to draw the graphical model for three players, i, j, and k, and the results of games between all three pairs, and then examine conditional independencies. If you do this, there’s no need to include the graphical models in your assignment.
- [5 points] Write a new optimization function fit_variational_dist like the one from the previous question, except that it does not plot anything. Initialize a variational distribution and fit it to the joint distribution with all the observed tennis games from the dataset. Report the final negative ELBO estimate after optimization.
- [2 points] Plot the approximate mean and variance of all players, sorted by mean skill. For example, in Julia you can use: perm = sortperm(means); plot(means[perm], yerror=exp.(logstd[perm])). There’s no need to include the names of the players.
- [2 points] List the names of the 10 players with the highest mean skill under the variational posterior.
- [3 points] Plot the joint approximate posterior over the skills of Roger Federer and Rafael Nadal, using the approximate posterior that you fit above with fit_variational_dist.
- [5 points] Derive the exact probability, under a factorized Gaussian over two players’ skills, that one player has higher skill than the other, as a function of the two means and variances over their skills. Express your answer in terms of the cumulative distribution function of a one-dimensional Gaussian random variable.
Hint 1: Use the linear change of variables yA, yB = zA − zB, zB. What does the line of equal skill look like after this transformation?
Hint 2: If X ∼ N (µ, Σ), then AX ∼ N (Aµ, AΣAᵀ), where A is a linear transformation.
Hint 3: Marginalization in Gaussians is easy: if X ∼ N (µ, Σ), then the i’th element of X has marginal distribution Xi ∼ N (µi, Σii).
- [2 points] Using the formula from the previous part, compute the exact probability under your approximate posterior that Roger Federer has higher skill than Rafael Nadal. Then estimate it using simple Monte Carlo with 10000 examples, again using your approximate posterior.
- [2 points] Using the same formula, compute the probability that Roger Federer is better than the player with the lowest mean skill. Compute this quantity exactly, and then estimate it using simple Monte Carlo with 10000 examples, again using your approximate posterior.