*Sometimes you come across connections that are simple and beautiful. Here’s one of them!*

### What the terms mean

**Maximum likelihood** is a common approach to estimating the parameters of a model. An example of model parameters could be the coefficients $\theta$ in a linear regression model $y = \theta^\top x + \epsilon$, where $\epsilon$ is Gaussian noise (i.e. it’s random).

Here we choose parameter values that maximise the likelihood $p(X \mid \theta)$, i.e. the probability of the data $X$ given that the model parameters are set to a certain value $\theta$.

That is, we choose

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \, p(X \mid \theta).$$
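As a concrete sketch (my own toy example, not from the original post): for Gaussian data with known variance, we can brute-force the likelihood-maximising mean over a grid, and it lands on the sample mean, which is the known closed-form MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)  # true mean is 2.0

def log_likelihood(mu, x, sigma=1.0):
    # log p(X | mu) = sum_i log N(x_i; mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Grid search for the mu that maximises the likelihood
mus = np.linspace(0.0, 4.0, 401)
mle = mus[np.argmax([log_likelihood(mu, data) for mu in mus])]

# For a Gaussian with known variance, the MLE of the mean
# is exactly the sample mean
print(mle, data.mean())
```

The grid search recovers the sample mean (up to the grid spacing), matching the closed-form answer.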

The **KL Divergence** measures the dissimilarity between two probability distributions $P$ and $Q$:

$$D_{\text{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right].$$

It’s not symmetric ($D_{\text{KL}}(P \,\|\, Q) \neq D_{\text{KL}}(Q \,\|\, P)$), which is why it’s called a divergence and not a distance.
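A quick numerical illustration of the asymmetry (the two discrete distributions here are arbitrary toy values of my own choosing):

```python
import numpy as np

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = [0.5, 0.3, 0.2]
Q = [0.1, 0.4, 0.5]

# The two directions give different values
print(kl(P, Q), kl(Q, P))
```

Both directions are non-negative, but swapping the arguments changes the value, so KL is not a metric.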

**The Connection: Maximum Likelihood as minimising KL Divergence**

It turns out that the parameters that maximise the likelihood are precisely those that minimise the KL divergence between the empirical distribution $\hat{p}_{\text{data}}$ and the model distribution $p_{\text{model}}(x; \theta)$.

This is nice because it links two important concepts in machine learning. (Another cool connection is justifying using mean-squared error in linear regression by linking it with maximum likelihood.)

Here’s the proof:

$$\arg\min_\theta D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}) = \arg\min_\theta \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x; \theta)\right]$$

But $\mathbb{E}_{x \sim \hat{p}_{\text{data}}}[\log \hat{p}_{\text{data}}(x)]$ is independent of the model parameters $\theta$, so we can take it out of our expression:

$$= \arg\min_\theta \, -\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x; \theta)\right]$$

We can turn this negative argmin into an argmax:

$$= \arg\max_\theta \, \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x; \theta)\right]$$

The expectation under the empirical distribution is just an average over the $m$ datapoints, so this equals

$$\arg\max_\theta \, \frac{1}{m} \sum_{i=1}^m \log p_{\text{model}}(x^{(i)}; \theta) = \arg\max_\theta \, \log \prod_{i=1}^m p_{\text{model}}(x^{(i)}; \theta),$$

which is exactly the maximum likelihood estimator. If the datapoints are i.i.d. (independent and identically distributed), then by the Law of Large Numbers we also have

$$\frac{1}{m} \sum_{i=1}^m \log p_{\text{model}}(x^{(i)}; \theta) \to \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_{\text{model}}(x; \theta)\right]$$

as the number of datapoints $m$ tends to infinity.
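To sanity-check this equivalence numerically, here is a toy Bernoulli example (my own sketch, not from the post): grid-searching for the parameter that minimises the negative log-likelihood and for the one that minimises $D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}})$ picks out exactly the same value.

```python
import numpy as np

rng = np.random.default_rng(1)
flips = rng.random(500) < 0.7   # i.i.d. Bernoulli data, true p = 0.7
p_hat = flips.mean()            # empirical frequency of heads

ps = np.linspace(0.01, 0.99, 99)  # candidate model parameters

def neg_log_likelihood(p):
    # -log p(data | p) for i.i.d. Bernoulli flips
    heads = flips.sum()
    tails = len(flips) - heads
    return -(heads * np.log(p) + tails * np.log(1 - p))

def kl_to_model(p):
    # D_KL(empirical distribution || Bernoulli(p))
    emp = np.array([1 - p_hat, p_hat])
    mod = np.array([1 - p, p])
    return float(np.sum(emp * np.log(emp / mod)))

best_mle = ps[np.argmin([neg_log_likelihood(p) for p in ps])]
best_kl = ps[np.argmin([kl_to_model(p) for p in ps])]
print(best_mle, best_kl)
```

The two grid searches select the same parameter, because the KL objective differs from the (scaled) negative log-likelihood only by a constant that doesn’t depend on $p$.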

*Aside: We could actually have left the expression for the maximum likelihood estimator in the form of an expectation, but it’s usually seen as a sum or a product.*

The natural question to ask is then: what do we get if we minimise the reverse divergence $D_{\text{KL}}(p_{\text{model}} \,\|\, \hat{p}_{\text{data}})$? I’ll leave that to you. 🙂

**References:**

- Goodfellow, Bengio & Courville, *Deep Learning*, Ch. 5 Machine Learning Basics (pp. 128-129)