*MSE is a commonly used error metric. But is it justified in principle?*

In this post we show that minimising the mean-squared error (MSE) is not just something vaguely intuitive, but emerges from maximising the likelihood under a linear Gaussian model.

#### Defining the terms

**Linear Gaussian Model**

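Assume the data is described by the linear model $y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Assume $\sigma^2$ is known and the datapoints are i.i.d. (independent and identically distributed).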

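*Note: the notation $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ means that we are describing the distribution of $\epsilon_i$, and that it is distributed as $\mathcal{N}(0, \sigma^2)$, i.e. a normal distribution with mean 0 and variance $\sigma^2$.*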

Recall that the **likelihood** is the probability of the data given the parameters of the model, in this case the weights on the features, $\mathbf{w}$.
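To make the definition concrete, here is a minimal sketch that evaluates the likelihood for a single-feature model of the form y = w * x + noise. The dataset, the value of `sigma` and the function names are made up for illustration. It also checks that the log of the product of densities equals the sum of the log densities, which is what makes the log likelihood convenient to work with.

```python
import math

# Tiny illustrative dataset (made-up numbers), one feature per datapoint.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]
sigma = 1.0  # known noise standard deviation (assumed)


def gaussian_density(value, mean, std):
    """Density of N(mean, std**2) evaluated at value."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def likelihood(w):
    """p(data | w): product of per-datapoint densities (datapoints are independent)."""
    return math.prod(gaussian_density(y, w * x, sigma) for x, y in zip(xs, ys))


def log_likelihood(w):
    """Log of the same quantity, written as a sum of per-datapoint log densities."""
    return sum(math.log(gaussian_density(y, w * x, sigma)) for x, y in zip(xs, ys))


# A weight that fits the data well should be more likely than one that fits poorly.
print(likelihood(2.0) > likelihood(1.0))
# The log of the product matches the sum of the logs (up to floating-point error).
print(math.isclose(math.log(likelihood(2.0)), log_likelihood(2.0)))
```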

**Proof**

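The log likelihood of our model is

$$\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i, \mathbf{w})$$

But since the noise is Gaussian (i.e. normally distributed) with variance $\sigma^2$, each $y_i$ is normally distributed around $\mathbf{w}^\top \mathbf{x}_i$, so the log likelihood is just

$$\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2} \right) \right] = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$

where $N$ is the number of datapoints.

So, since the first term and the positive factor $\frac{1}{2\sigma^2}$ do not depend on $\mathbf{w}$,

$$\hat{\mathbf{w}}_{\text{ML}} = \arg\max_{\mathbf{w}} \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 \right] = \arg\min_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 = \arg\min_{\mathbf{w}} \text{MSE}$$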

That is, **the parameters chosen to maximise the likelihood are exactly those chosen to minimise the mean-squared error.**
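This result can be checked numerically. The sketch below assumes a single weight, no intercept, and synthetic data. The values of `w_true` and `sigma`, the grid of candidate weights and all names are illustrative choices. It scores the same candidates by Gaussian log likelihood and by MSE, then compares both winners with the closed-form least-squares weight.

```python
import math
import random

random.seed(0)

# Hypothetical 1-D dataset: y = w_true * x + Gaussian noise (illustrative values).
w_true, sigma = 2.0, 0.5
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [w_true * x + random.gauss(0.0, sigma) for x in xs]


def mse(w):
    """Mean-squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def log_likelihood(w):
    """Gaussian log likelihood of the data with known noise variance sigma**2."""
    n = len(xs)
    sq_err = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - sq_err / (2 * sigma**2)


# Evaluate both criteria on the same grid of candidate weights.
candidates = [i / 1000 for i in range(1000, 3001)]  # 1.000, 1.001, ..., 3.000
w_max_likelihood = max(candidates, key=log_likelihood)
w_min_mse = min(candidates, key=mse)

# Closed-form least-squares solution for a single weight and no intercept.
w_least_squares = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(w_max_likelihood, w_min_mse, round(w_least_squares, 3))
```

The two grid searches should pick the same candidate, and that candidate should sit within one grid step of the closed-form solution. The value of `sigma` only shifts and rescales the log likelihood, so changing it should not move the maximiser.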

#### Even more connections!

There are other nice connections between measures we use and principled methods: L1 regularisation corresponds to doing Bayesian (maximum a posteriori) estimation with a Laplacian prior on the weights, and L2 regularisation corresponds to using a Gaussian (i.e. normally distributed) prior.

**L1 regularisation** is adding a penalty term proportional to the absolute value of the weights (e.g. $\lambda \sum_j |w_j|$), whereas **L2 regularisation** is adding a penalty term proportional to the squared value of the weights (e.g. $\lambda \sum_j w_j^2$). The numbers 1 and 2 correspond to the power of $w_j$ used. You can see plots of the Gaussian (normal) and Laplacian priors below.
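The link between the penalties and the priors can be sketched numerically. The example assumes a single weight, a known noise standard deviation `sigma`, and a penalised objective of the form "sum of squared errors plus lambda times the penalty". The prior widths `tau` and `b`, the data and the names are all illustrative. Multiplying the negative log posterior by 2 * sigma**2 gives lambda = sigma**2 / tau**2 for a Gaussian prior with standard deviation `tau`, and lambda = 2 * sigma**2 / b for a Laplace prior with scale `b`.

```python
import math
import random

random.seed(1)

# Hypothetical 1-D data: y = w_true * x + Gaussian noise (illustrative values only).
w_true, sigma = 2.0, 0.5
xs = [random.uniform(-1.0, 1.0) for _ in range(50)]
ys = [w_true * x + random.gauss(0.0, sigma) for x in xs]


def squared_error(w):
    """Sum of squared errors of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))


def neg_log_posterior(w, log_prior):
    """-(log likelihood + log prior), dropping terms that do not depend on w."""
    return squared_error(w) / (2 * sigma**2) - log_prior(w)


tau = 0.3  # standard deviation of the Gaussian prior on w (assumed)
b = 0.3    # scale of the Laplace prior on w (assumed)


def gaussian_log_prior(w):
    return -(w**2) / (2 * tau**2) - 0.5 * math.log(2 * math.pi * tau**2)


def laplace_log_prior(w):
    return -abs(w) / b - math.log(2 * b)


# Penalty strengths implied by each prior after scaling by 2 * sigma**2.
lambda_l2 = sigma**2 / tau**2
lambda_l1 = 2 * sigma**2 / b

candidates = [i / 1000 for i in range(0, 3001)]  # 0.000, 0.001, ..., 3.000

w_map_gaussian = min(candidates, key=lambda w: neg_log_posterior(w, gaussian_log_prior))
w_l2 = min(candidates, key=lambda w: squared_error(w) + lambda_l2 * w**2)

w_map_laplace = min(candidates, key=lambda w: neg_log_posterior(w, laplace_log_prior))
w_l1 = min(candidates, key=lambda w: squared_error(w) + lambda_l1 * abs(w))

print(w_map_gaussian, w_l2)
print(w_map_laplace, w_l1)
```

Each pair should agree up to the grid resolution: the most probable weight under the Gaussian prior matches the L2-penalised fit, and the most probable weight under the Laplace prior matches the L1-penalised fit.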

**References and related articles**

*Deep Learning* by Ian Goodfellow, Yoshua Bengio and Aaron Courville. Ch. 5 Machine Learning Basics, p130-131: Maximum Likelihood as minimising KL-Divergence (another nice connection)