*Sometimes you come across connections that are simple and beautiful. Here’s one of them!*

### What the terms mean

**Maximum likelihood** is a common approach to estimating the parameters of a model. An example of model parameters could be the coefficients $latex \theta$ in a linear regression model $latex y = \theta^T\mathbf{x} + \epsilon$, where $latex \epsilon$ is Gaussian noise (i.e. it’s random).

Here we choose parameter values that maximise the likelihood $latex p(\mathbf{X}|\theta)$, i.e. the probability of the data $latex \mathbf{X}$ given that the model parameters are set to a certain value $latex \theta$.

That is, we choose

$latex \theta_{MLE} = \arg\max_{\theta} p(\mathbf{X}|\theta)$.
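As a concrete sketch (the dataset, the grid of candidate values, and the `log_likelihood` helper below are all hypothetical, not from the post), here is maximum likelihood estimation of a Gaussian mean by grid search:

```python
import numpy as np

# Hypothetical example: data assumed drawn from a Gaussian with unknown mean.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)

def log_likelihood(theta, x, sigma=1.0):
    # Sum of log N(x_i | theta, sigma^2) over the datapoints.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - theta)**2 / (2 * sigma**2))

# Grid search over candidate values of theta.
thetas = np.linspace(0.0, 4.0, 401)
theta_mle = thetas[np.argmax([log_likelihood(t, data) for t in thetas])]

# For a Gaussian with known variance, the MLE of the mean is the sample mean,
# so theta_mle should land (up to grid resolution) on data.mean().
print(theta_mle, data.mean())
```

In practice one would maximise analytically or with gradient methods rather than a grid, but the grid makes the "choose the $latex \theta$ with the highest likelihood" idea explicit.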

The **KL Divergence** measures the dissimilarity between two probability distributions $latex p$ and $latex q$:

$latex D_{KL}(p||q) = E_{\mathbf{x}\sim p}[\log p(\mathbf{x}) - \log q(\mathbf{x})]$

It’s not symmetric ($latex D_{KL}(p||q) \neq D_{KL}(q||p)$), which is why it’s called a divergence and not a distance.
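The asymmetry is easy to check numerically. In this minimal sketch (the distributions `p` and `q` and the `kl_divergence` helper are illustrative assumptions), we compute the divergence in both directions for two discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), for discrete p, q
    # with full support (no zero entries).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

# Two hypothetical discrete distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.1, 0.3, 0.6]

print(kl_divergence(p, q))  # differs from the reverse direction
print(kl_divergence(q, p))
```

Note also that $latex D_{KL}(p||p) = 0$: the divergence vanishes exactly when the two distributions agree.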

**The Connection: Maximum Likelihood as minimising KL Divergence**

It turns out that the parameters that maximise the likelihood are precisely those that minimise the KL divergence between the empirical distribution $latex \hat{p}_{\text{data}}$ and the model distribution $latex p_{\text{model}}$.

This is nice because it links two important concepts in machine learning. (Another cool connection is justifying using mean-squared error in linear regression by linking it with maximum likelihood.)

Here’s the proof:

$latex
\begin{aligned}
\theta_{\text{min KL}} &= \arg\min_{\theta} D_{KL}(\hat{p}_{\text{data}} || p_{\text{model}}) \\
&= \arg\min_{\theta} E_{\mathbf{x}\sim{\hat{p}_{\text{data}}}}[\log \hat{p}_{\text{data}}(\mathbf{x})-\log p_{\text{model}}(\mathbf{x})]
\end{aligned}
$

But $latex \log \hat{p}_{\text{data}}(\mathbf{x})$ is independent of the model parameters $latex \theta$, so we can take it out of our expression:

$latex
\begin{aligned}
\theta_{\text{min KL}} &= \arg\min_{\theta} -E_{\mathbf{x}\sim{\hat{p}_{\text{data}}}}[\log p_{\text{model}}(\mathbf{x}|\theta)]
\end{aligned}
$

Minimising the negative of an expression is the same as maximising the expression, so we can turn this argmin into an argmax. If the datapoints are i.i.d. (independent and identically distributed), then by the Law of Large Numbers the expectation equals the limit of the sample average of the log-likelihood:

$latex
\begin{aligned}
\theta_{\text{min KL}} &= \arg\max_{\theta} \lim\limits_{N\to\infty}\frac{1}{N}\sum_{i=1}^N\log(p(\mathbf{x_i}|\theta)) \\
&= \theta_{MLE}
\end{aligned}
$

as the number of datapoints tends to infinity.
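The equivalence can also be checked numerically. In this sketch (the Bernoulli setup, the grid, and both helper functions are assumptions for illustration), minimising the KL divergence from the empirical distribution to the model and maximising the average log-likelihood select the same parameter value:

```python
import numpy as np

# Hypothetical check: Bernoulli data with unknown success probability theta.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=500)

# Empirical distribution over the two outcomes {0, 1}.
p_hat = np.array([np.mean(x == 0), np.mean(x == 1)])

thetas = np.linspace(0.01, 0.99, 99)

def kl_to_model(theta):
    # D_KL(p_hat || p_model) for a Bernoulli(theta) model.
    p_model = np.array([1 - theta, theta])
    return np.sum(p_hat * (np.log(p_hat) - np.log(p_model)))

def avg_log_likelihood(theta):
    # (1/N) * sum_i log p(x_i | theta).
    return np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

theta_kl = thetas[np.argmin([kl_to_model(t) for t in thetas])]
theta_mle = thetas[np.argmax([avg_log_likelihood(t) for t in thetas])]

print(theta_kl, theta_mle)
```

The two objectives differ only by the entropy term $latex E[\log \hat{p}_{\text{data}}]$, which is constant in $latex \theta$, so they pick out exactly the same grid point, and both land on the sample frequency of 1s, which is the Bernoulli MLE.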

*Aside: We could actually have left the expression for the maximum likelihood estimator in the form of an expectation, but it’s usually seen as a sum or a product.*

The natural question to ask is then: what do we get if we minimise the reverse divergence, $latex D_{KL}(p_{\text{model}} || \hat{p}_{\text{data}})$? I’ll leave that to you. 🙂

**References:**

- Goodfellow, Bengio and Courville, *Deep Learning*, Ch. 5 Machine Learning Basics (pp. 128-129)