MSE is a commonly used error metric. But is it justified on principled grounds?
In this post we show that minimising the mean-squared error (MSE) is not just vaguely intuitive, but emerges from maximising the likelihood under a linear Gaussian model.
Defining the terms
Linear Gaussian Model
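Assume the data is described by the linear model $y_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Assume $\sigma^2$ is known and the datapoints are i.i.d. (independent and identically distributed).
Note: the notation $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ means that we are describing the distribution of $\epsilon_i$, and that it is distributed as $\mathcal{N}(0, \sigma^2)$, i.e. a Gaussian with mean $0$ and variance $\sigma^2$.
Recall the likelihood is the probability of the data given the parameters of the model, in this case the weights on the features, $\mathbf{w}$.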
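The log likelihood of our model is
$$\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}),$$
where the sum follows from the datapoints being i.i.d.
But since the noise is Gaussian (i.e. normally distributed), each $y_i$ is distributed as $\mathcal{N}(\mathbf{w}^T \mathbf{x}_i, \sigma^2)$, so the log likelihood is just
$$\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2,$$
where $N$ is the number of datapoints, $\mathbf{w}$ are the weights and $\sigma^2$ is the known noise variance.
The first term does not depend on $\mathbf{w}$, and the second term is $-\frac{N}{2\sigma^2}$ times the MSE, $\frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$. Maximising the log likelihood over $\mathbf{w}$ is therefore the same as minimising the MSE.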
That is, the parameters chosen to maximise the likelihood are exactly those chosen to minimise the mean-squared error.
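To make the equivalence concrete, here is a minimal numerical sketch using NumPy. The synthetic data, the dimensions, the value of `sigma` and the names `mse` and `log_likelihood` are all assumptions made for illustration. It checks that the Gaussian log likelihood equals a constant minus $\frac{N}{2\sigma^2}$ times the MSE, and that the least-squares weights score at least as well as nearby perturbed weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y_i = w^T x_i + eps_i, with eps_i ~ N(0, sigma^2).
# All values here are illustrative assumptions.
N, D = 200, 3
sigma = 0.5  # noise standard deviation, assumed known
w_true = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)


def mse(w):
    """Mean-squared error of the linear prediction X @ w."""
    return np.mean((y - X @ w) ** 2)


def log_likelihood(w):
    """Gaussian log likelihood of y given X and w, with known sigma."""
    residuals = y - X @ w
    return (-0.5 * N * np.log(2 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2 * sigma**2))


# Least-squares solution, i.e. the minimiser of the MSE.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# The log likelihood is a constant minus N / (2 sigma^2) times the MSE.
const = -0.5 * N * np.log(2 * np.pi * sigma**2)
gap = log_likelihood(w_ls) - (const - N * mse(w_ls) / (2 * sigma**2))

# Random perturbations of w_ls should never increase the log likelihood.
perturbed = [w_ls + 0.1 * rng.normal(size=D) for _ in range(100)]
ls_is_best = all(log_likelihood(w) <= log_likelihood(w_ls) for w in perturbed)

print("identity gap:", gap)
print("least-squares weights maximise the likelihood:", ls_is_best)
```

The perturbation check is only a spot check, not a proof. The identity between the two objectives is what carries the argument.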
Even more connections!
There are other nice connections between measures we use and principled methods: L1 regularisation corresponds to maximum a posteriori (MAP) estimation with a Laplace prior on the weights, and L2 regularisation corresponds to MAP estimation with a Gaussian (i.e. normally distributed) prior.
L1 regularisation is adding a penalty term proportional to the absolute value of the weights (e.g. $\lambda \sum_j |w_j|$), whereas L2 regularisation is adding a penalty term proportional to the squared value of the weights (e.g. $\lambda \sum_j w_j^2$), where $\lambda$ sets the strength of the penalty. The numbers 1 and 2 correspond to the power of $w_j$ used. You can see plots of the Gaussian (normal) and Laplace priors below.
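One way to see where these penalties come from: the negative log of a zero-mean Gaussian density is a squared term plus a constant, and the negative log of a zero-mean Laplace density is an absolute-value term plus a constant. The sketch below checks this numerically. The function names, the scale parameters `tau` and `b`, and the example weight vectors are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_neg_log_prior(w, tau):
    """-log of independent N(0, tau^2) densities, summed over the weights."""
    return np.sum(w**2 / (2 * tau**2) + 0.5 * np.log(2 * np.pi * tau**2))


def laplace_neg_log_prior(w, b):
    """-log of independent Laplace(0, b) densities, summed over the weights."""
    return np.sum(np.abs(w) / b + np.log(2 * b))


# Two arbitrary weight vectors (illustrative values).
w_a = np.array([0.5, -1.0, 2.0])
w_b = np.array([-0.3, 0.0, 1.2])
tau, b = 2.0, 0.5

# Taking differences cancels the constants, leaving only the penalty terms.
gauss_diff = gaussian_neg_log_prior(w_a, tau) - gaussian_neg_log_prior(w_b, tau)
lap_diff = laplace_neg_log_prior(w_a, b) - laplace_neg_log_prior(w_b, b)

l2_diff = np.sum(w_a**2) - np.sum(w_b**2)
l1_diff = np.sum(np.abs(w_a)) - np.sum(np.abs(w_b))

# Gaussian prior <-> L2 penalty with weight 1 / (2 tau^2).
print(np.isclose(gauss_diff, l2_diff / (2 * tau**2)))
# Laplace prior <-> L1 penalty with weight 1 / b.
print(np.isclose(lap_diff, l1_diff / b))
```

Under these assumptions the regularisation strength is set by the prior's scale: a narrower prior (smaller `tau` or `b`) gives a heavier penalty.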
References and related articles
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, Ch. 5 "Machine Learning Basics", pp. 130-131
- Maximum Likelihood as minimising KL-Divergence (another nice connection)