# Maximum Likelihood as minimising KL Divergence

Sometimes you come across connections that are simple and beautiful. Here’s one of them!

### What the terms mean

Maximum likelihood is a common approach to estimating parameters of a model. An example of model parameters could be the coefficients $\theta_i$ in a linear regression model $\hat{x_2} = \theta_1 + \theta_2 x_1 + \epsilon$, where $\epsilon\sim N(0,\sigma_e^2)$ is Gaussian noise (i.e. it’s random).

Here we choose parameter values $\theta$ that maximise the likelihood $p(\mathbf{x}|\theta)$, i.e. the probability of the data given the model parameters are set to a certain value $\theta$.

That is, we choose

$\theta_{MLE} = \arg\max p(\mathbf{x}|\theta)$.

The KL Divergence measures the dissimilarity between two probability distributions:

$D_{KL}(p||q) = E_{\mathbf{x}\sim p(\mathbf{x})}[\log p(\mathbf{x})-\log q(\mathbf{x})]$

It’s not symmetric ($D_{KL}(p||q) \ne D_{KL}(q||p)$) which is why it’s called a divergence and not a distance.

### The Connection: Maximum Likelihood as minimising KL Divergence

It turns out that  the parameters that maximise the likelihood $\mathbf{\theta}_{MLE}$ are precisely those that minimise the KL divergence between the empirical distribution $\hat{p}_{\text{data}}$ and the model distribution $p_{\text{model}}$.

This is nice because it links two important concepts in machine learning. (Another cool connection is justifying using mean-squared error in linear regression by linking it with maximum likelihood.)

Here’s the proof:

\begin{aligned} \theta_{\text{min KL}} &= \arg\min_{\theta} D_{KL}(\hat{p}_{\text{data}} || p_{\text{model}}) \\ &= \arg\min_{\theta} E_{\mathbf{x}\sim{\hat{p}_{\text{data}}}}[\log \hat{p}_\text{data}(\mathbf{x})-\log p_{\text{model}}(\mathbf{x})] \end{aligned}

But $E_{\mathbf{x}\sim{\hat{p}_{\text{data}}}}\log \hat{p}_\text{data}(\mathbf{x})$ is independent of the model parameters $\theta$, so we can take it out of our expression:

\begin{aligned} \theta_{\text{min KL}} &= \arg\min_{\theta} - E_{\mathbf{x}\sim{\hat{p}_{\text{data}}}}[\log p_{\text{model}}(\mathbf{x}|\theta)] \end{aligned}

We can turn this negative argmin into an argmax. If the datapoints $x$ are i.i.d. (independent and identically distributed), by the Law of Large Numbers, we have

\begin{aligned} \theta_{\text{min KL}}&= \arg\max_{\theta} \lim\limits_{N\to\infty}\frac{1}{N}\sum_{i=1}^N\log(p(\mathbf{x_i}|\theta)) \\ &= \theta_{MLE} \end{aligned}

as the number of datapoints $N$ tends to infinity.

Aside: We could actually have left the expression for the maximum likelihood estimator in the form of an expectation, but it’s usually seen as a sum or a product.

The natural question to ask is then what do we get if we minimise $D_{KL}(p_{\text{model}}||\hat{p}_{\text{data}})$? I’ll leave that to you. 🙂

References:

• Deep Learning Ch. 5 Machine Learning Basics (p128-129)