Maximum Likelihood as minimising KL Divergence

Jessica Yung · Machine Learning

Sometimes you come across connections that are simple and beautiful. Here’s one of them!

What the terms mean

Maximum likelihood is a common approach to estimating the parameters of a model. For example, the parameters could be the coefficients \theta_i in the linear regression model x_2 = \theta_1 + \theta_2 x_1 + \epsilon, where \epsilon\sim N(0,\sigma_e^2) is Gaussian noise (i.e. it’s random).

Here we choose the parameter values \theta that maximise the likelihood p(\mathbf{x}|\theta), i.e. the probability of the data \mathbf{x} given that the model parameters are set to a particular value \theta.

That is, we choose

\theta_{MLE} = \arg\max_{\theta} p(\mathbf{x}|\theta).
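To make this concrete, here is a minimal sketch of maximum likelihood for the linear regression model above. The setup (scipy.optimize.minimize, the data-generating code, all variable names) is my own illustrative choice; with Gaussian noise of known variance, maximising the likelihood amounts to minimising the sum of squared residuals:

```python
import numpy as np
from scipy.optimize import minimize

# Simulate data from the model x_2 = theta_1 + theta_2 * x_1 + noise,
# with true parameters (1.5, 2.0) and Gaussian noise of variance 1.
rng = np.random.default_rng(0)
x1 = rng.uniform(-3, 3, size=100)
x2 = 1.5 + 2.0 * x1 + rng.normal(0, 1, size=100)

def neg_log_likelihood(theta):
    # -log p(x|theta) up to an additive constant: for unit-variance
    # Gaussian noise this is half the sum of squared residuals.
    residuals = x2 - (theta[0] + theta[1] * x1)
    return 0.5 * np.sum(residuals ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(theta_mle)  # should be close to the true parameters (1.5, 2.0)
```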

The KL Divergence measures the dissimilarity between two probability distributions:

D_{KL}(p||q) = E_{\mathbf{x}\sim p(\mathbf{x})}[\log p(\mathbf{x})-\log q(\mathbf{x})]

It’s not symmetric (D_{KL}(p||q) \ne D_{KL}(q||p) in general), which is why it’s called a divergence rather than a distance.
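To see both the definition and the asymmetry in action, here is a hand-rolled sketch (the function and the two distributions are made up for the example):

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p||q) = E_{x~p}[log p(x) - log q(x)] for discrete p, q
    # (assumes p and q put non-zero mass on the same outcomes)
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * (np.log(p) - np.log(q)))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ~0.1838
print(kl_divergence(q, p))  # ~0.1920 -- a different value
```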

The Connection: Maximum Likelihood as minimising KL Divergence

It turns out that the parameters \theta_{MLE} that maximise the likelihood are precisely those that minimise the KL divergence between the empirical distribution \hat{p}_{\text{data}} and the model distribution p_{\text{model}}.

This is nice because it links two important concepts in machine learning. (Another cool connection is justifying the use of mean-squared error in linear regression by linking it to maximum likelihood.)

Here’s the proof:

\begin{aligned}  \theta_{\text{min KL}} &= \arg\min_{\theta} D_{KL}(\hat{p}_{\text{data}} || p_{\text{model}}) \\  &= \arg\min_{\theta} E_{\mathbf{x}\sim{\hat{p}_{\text{data}}}}[\log \hat{p}_\text{data}(\mathbf{x})-\log p_{\text{model}}(\mathbf{x}|\theta)]  \end{aligned}

But E_{\mathbf{x}\sim{\hat{p}_{\text{data}}}}[\log \hat{p}_\text{data}(\mathbf{x})] is independent of the model parameters \theta, so we can drop it from the objective:

\begin{aligned}  \theta_{\text{min KL}} &= \arg\min_{\theta} - E_{\mathbf{x}\sim{\hat{p}_{\text{data}}}}[\log p_{\text{model}}(\mathbf{x}|\theta)]  \end{aligned}

We can turn this negated argmin into an argmax by dropping the minus sign. Moreover, the empirical distribution \hat{p}_{\text{data}} places probability mass \frac{1}{N} on each of the N datapoints, so the expectation is exactly the average log likelihood of the data:

\begin{aligned}  \theta_{\text{min KL}} &= \arg\max_{\theta} \frac{1}{N}\sum_{i=1}^N\log p_{\text{model}}(\mathbf{x_i}|\theta) \\  &= \theta_{MLE}  \end{aligned}

(If the datapoints are i.i.d. (independent and identically distributed) draws from the true data-generating distribution p_{\text{data}}, then by the Law of Large Numbers this average converges to E_{\mathbf{x}\sim p_{\text{data}}}[\log p_{\text{model}}(\mathbf{x}|\theta)] as the number of datapoints N tends to infinity, so the same argument links maximum likelihood to minimising the KL divergence from the true distribution.)
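We can also check the equivalence numerically. The sketch below is my own illustrative setup, assuming a simple Bernoulli model and a grid search over \theta: it computes both the KL divergence from the empirical distribution and the average log likelihood, and the same \theta optimises both.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.binomial(1, 0.3, size=200)  # i.i.d. samples from Bernoulli(0.3)

# Empirical distribution over the outcomes {0, 1}
# (assumes both outcomes appear at least once in the sample).
p_hat = np.array([np.mean(data == 0), np.mean(data == 1)])

thetas = np.linspace(0.01, 0.99, 99)
kl = np.empty_like(thetas)
avg_ll = np.empty_like(thetas)
for i, t in enumerate(thetas):
    p_model = np.array([1 - t, t])              # Bernoulli(theta) over {0, 1}
    kl[i] = np.sum(p_hat * (np.log(p_hat) - np.log(p_model)))
    avg_ll[i] = np.mean(np.log(p_model[data]))  # (1/N) sum_i log p(x_i|theta)

# The KL minimiser and the likelihood maximiser are the same theta,
# namely the sample mean of the data (the Bernoulli MLE).
print(thetas[np.argmin(kl)], thetas[np.argmax(avg_ll)])
```

The two objectives differ only by the \theta-independent term E_{\mathbf{x}\sim\hat{p}_{\text{data}}}[\log \hat{p}_{\text{data}}(\mathbf{x})], which is why their optima coincide exactly, not just in the limit.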

Aside: We could actually have left the expression for the maximum likelihood estimator in the form of an expectation, but it’s usually written as a sum or a product.
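Concretely, all three forms define the same estimator, since \log is monotonically increasing and the expectation under \hat{p}_{\text{data}} is just the sum scaled by the constant \frac{1}{N}:

\begin{aligned}  \theta_{MLE} &= \arg\max_{\theta} \prod_{i=1}^N p(\mathbf{x_i}|\theta) \\  &= \arg\max_{\theta} \sum_{i=1}^N \log p(\mathbf{x_i}|\theta) \\  &= \arg\max_{\theta} E_{\mathbf{x}\sim\hat{p}_{\text{data}}}[\log p(\mathbf{x}|\theta)]  \end{aligned}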

The natural question to ask is then: what do we get if we minimise D_{KL}(p_{\text{model}}||\hat{p}_{\text{data}}) instead? I’ll leave that to you. 🙂

References:

  • Goodfellow, I., Bengio, Y. and Courville, A., Deep Learning, Ch. 5: Machine Learning Basics (pp. 128–129)