*MSE is a commonly used error metric. But is there a principled justification for it?*

In this post we show that minimising the mean-squared error (MSE) is not just vaguely intuitive, but emerges from maximising the likelihood of a linear Gaussian model.

#### Defining the terms

**Linear Gaussian Model**

Assume the data is described by the linear model $latex \mathbf{y} = \mathbf{Xw} + e$, where $latex e \sim N(0, \sigma^2_e)$. Assume $latex \sigma^2_e$ is known and the datapoints are i.i.d. (independent and identically distributed).

*Note: the notation $latex e \sim N(0, \sigma^2_e)$ means that we are describing the distribution of $latex e$, and that it is distributed as a Gaussian with mean $latex 0$ and variance $latex \sigma^2_e$.*

Recall the **likelihood** is the probability of the data given the parameters of the model, in this case the weights on the features, $latex \mathbf{w}$.
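
Before the proof, it may help to see such a model concretely. Here is a minimal simulation sketch (the dimensions, noise level, and weight values are our own assumptions, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 100, 3                     # number of datapoints, number of features
sigma_e = 0.5                     # known noise standard deviation
w_true = np.array([2.0, -1.0, 0.5])

X = rng.normal(size=(N, D))              # design matrix
e = rng.normal(0.0, sigma_e, size=N)     # i.i.d. Gaussian noise
y = X @ w_true + e                       # y = Xw + e
```

Each $latex y_i$ is then a Gaussian centred at $latex \mathbf{x_i w}$ with variance $latex \sigma^2_e$, which is exactly the setup the proof below relies on.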

**Proof**

The log likelihood of our model is $latex \log p(\mathbf{y}|\mathbf{X, w})$. Since the datapoints are i.i.d. and the noise is Gaussian (i.e. normally distributed), this is just a sum of Gaussian log densities:

$latex

\begin{aligned}

\log p(\mathbf{y}|\mathbf{X, w}) &= \sum_{i=1}^N \log N(y_i;\mathbf{x_iw},\sigma^2_e) \\

&= \sum_{i=1}^N \log \frac{1}{\sqrt{2\pi\sigma^2_e}}\exp \left(-\frac{(y_i - \mathbf{x_iw})^2}{2\sigma^2_e}\right) \\

&= -\frac{N}{2}\log 2\pi\sigma^2_e - \sum_{i=1}^N \frac{(y_i-\mathbf{x_iw})^2}{2\sigma^2_e}

\end{aligned}

$

where $latex N$ is the number of datapoints.
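
The collapsed form above is easy to sanity-check numerically: summing per-point Gaussian log densities should give the same number. A sketch (the data and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
sigma_e = 0.8
X = rng.normal(size=(N, 2))
w = np.array([1.5, -0.7])
y = X @ w + rng.normal(0.0, sigma_e, size=N)

resid = y - X @ w

# Sum of per-point Gaussian log densities (second line of the derivation)
ll_sum = np.sum(-0.5 * np.log(2 * np.pi * sigma_e**2) - resid**2 / (2 * sigma_e**2))

# Collapsed closed form (last line of the derivation)
ll_closed = -N / 2 * np.log(2 * np.pi * sigma_e**2) - np.sum(resid**2) / (2 * sigma_e**2)

assert np.isclose(ll_sum, ll_closed)
```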

The first term and the positive factor $latex 1/(2\sigma^2_e)$ do not depend on $latex \mathbf{w}$, so

$latex

\begin{aligned}

\mathbf{w}_{MLE} &= \arg\max_{\mathbf{w}} -\sum_{i=1}^N (y_i-\mathbf{x_iw})^2 \\

&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^N (y_i-\mathbf{x_iw})^2 \\

&= \arg\min_{\mathbf{w}} \text{MSE}_{\text{train}}

\end{aligned}

$

That is, **the parameters chosen to maximise the likelihood are exactly those chosen to minimise the mean-squared error.**
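
The equivalence can be checked numerically: fitting by least squares and fitting by maximising the Gaussian log-likelihood should land on the same weights. A hedged sketch using NumPy and SciPy (all data and names are our own assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N = 200
sigma_e = 0.3
X = rng.normal(size=(N, 2))
w_true = np.array([0.8, -1.2])
y = X @ w_true + rng.normal(0.0, sigma_e, size=N)

# Minimise the MSE directly: ordinary least squares
w_mse, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximise the likelihood by minimising the negative log-likelihood
def neg_log_likelihood(w):
    resid = y - X @ w
    return N / 2 * np.log(2 * np.pi * sigma_e**2) + np.sum(resid**2) / (2 * sigma_e**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x

# Both routes recover (numerically) the same weights
assert np.allclose(w_mse, w_mle, atol=1e-4)
```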

#### Even more connections!

There are other nice connections between measures we use and principled methods: L1 regularisation is analogous to doing Bayesian inference with a Laplacian prior, and L2 regularisation is analogous to using a Gaussian (i.e. normally distributed) prior.

**L1 regularisation** is adding a penalty term proportional to the absolute value of the weights (e.g. $latex \lambda \sum_j |w_j|$), whereas **L2 regularisation** is adding a penalty term proportional to the squared value of the weights (e.g. $latex \lambda \sum_j w_j^2$). The numbers 1 and 2 correspond to the power of the weights used. You can see plots of the Gaussian (normal) and Laplacian priors below.
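
The L2 case can be made concrete: under a Gaussian prior $latex N(0, \sigma^2_w I)$ on the weights, the MAP estimate is ridge regression with penalty $latex \lambda = \sigma^2_e/\sigma^2_w$ (this follows by writing out the log posterior and rescaling by $latex 2\sigma^2_e$). A sketch, with data and values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 100, 3
sigma_e, sigma_w = 0.5, 1.0        # noise std, Gaussian prior std on weights
lam = sigma_e**2 / sigma_w**2      # implied L2 penalty strength

X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, sigma_e, size=N)

# MAP estimate under the Gaussian prior: the ridge regression closed form
w_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Verify w_map minimises ||y - Xw||^2 + lam * ||w||^2: its gradient vanishes
grad = -2 * X.T @ (y - X @ w_map) + 2 * lam * w_map
assert np.allclose(grad, 0.0, atol=1e-6)
```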

**References and related articles**

- *Deep Learning* by Ian Goodfellow, Yoshua Bengio and Aaron Courville, Ch. 5 "Machine Learning Basics", pp. 130-131: Maximum Likelihood as minimising KL-Divergence (another nice connection)