MSE as Maximum Likelihood

Jessica Yung | Machine Learning

MSE is a commonly used error metric. But is there a principled justification for it?

In this post we show that minimising the mean-squared error (MSE) is not just something vaguely intuitive, but emerges from maximising the likelihood on a linear Gaussian model.

Defining the terms

Linear Gaussian Model
Assume the data is described by the linear model \mathbf{y} = \mathbf{Xw} + \epsilon, where \epsilon_i \sim N(\epsilon_i; 0,\sigma^2_e). Assume \sigma^2_e is known and the datapoints are i.i.d. (independent and identically distributed).
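As a concrete illustration, here is a minimal sketch of sampling data from this linear Gaussian model. The particular values of N, D, \sigma_e, and the true weights are arbitrary choices, not from the post:

```python
import numpy as np

# Minimal sketch: simulate data from the linear Gaussian model
# y = Xw + eps, with eps_i ~ N(0, sigma_e^2), i.i.d., and sigma_e known.
# All specific values below are illustrative.
rng = np.random.default_rng(0)

N, D = 100, 3          # number of datapoints, number of features
sigma_e = 0.5          # known noise standard deviation
w_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(N, D))
eps = rng.normal(0.0, sigma_e, size=N)   # Gaussian noise
y = X @ w_true + eps                     # observed targets
```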

Note: the notation N(\epsilon_i; 0, \sigma^2_e) means that we are describing the distribution of \epsilon_i, and that it is distributed as N(0,\sigma^2_e).

Estimates vs ground truth for our model with 1-dimensional x. MSE is the mean of the squared residuals.

Recall the likelihood is the probability of the data given the parameters of the model, in this case the weights on the features, \mathbf{w}.


The log likelihood of our model is

\log p(\mathbf{y}|\mathbf{X, w}) = \sum_{i=1}^N \log p(y_i | \mathbf{x_i, w})

Since the noise \epsilon is Gaussian (i.e. normally distributed), the log likelihood is just

\begin{aligned}  \log p(\mathbf{y}|\mathbf{X, w}) &= \sum_{i=1}^N \log N(y_i;\mathbf{x_iw},\sigma^2_e) \\  &= \sum_{i=1}^N \log \frac{1}{\sqrt{2\pi\sigma^2_e}}\exp\left(-\frac{(y_i - \mathbf{x_iw})^2}{2\sigma^2_e}\right) \\  &= -\frac{N}{2}\log 2\pi\sigma^2_e - \sum_{i=1}^N \frac{(y_i-\mathbf{x_iw})^2}{2\sigma^2_e}  \end{aligned}
where N is the number of datapoints.
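As a quick sanity check on this algebra, the sum of per-datapoint Gaussian log densities should equal the closed form in the last line above. A minimal sketch (all numbers illustrative):

```python
import numpy as np

# Sketch: verify that summing per-datapoint Gaussian log densities matches
# the closed-form log likelihood. All specific values are illustrative.
rng = np.random.default_rng(1)
N, D = 50, 2
sigma_e = 0.3
w = np.array([0.7, -1.2])
X = rng.normal(size=(N, D))
y = X @ w + rng.normal(0.0, sigma_e, size=N)

resid = y - X @ w   # residuals y_i - x_i w

# Per-datapoint Gaussian log density, summed over i
ll_sum = sum(
    -0.5 * np.log(2 * np.pi * sigma_e**2) - r**2 / (2 * sigma_e**2)
    for r in resid
)
# Closed form: -(N/2) log(2 pi sigma_e^2) - sum_i r_i^2 / (2 sigma_e^2)
ll_closed = -N / 2 * np.log(2 * np.pi * sigma_e**2) - np.sum(resid**2) / (2 * sigma_e**2)
```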

Maximising over \mathbf{w}, we can drop the first term (it does not depend on \mathbf{w}) and the positive constant factor \frac{1}{2\sigma^2_e}; flipping the sign turns the maximisation into a minimisation, and dividing by N does not change the minimiser:

\begin{aligned}  \mathbf{w}_{MLE} &= \arg\max_{\mathbf{w}} - \sum_{i=1}^N (y_i-\mathbf{x_iw})^2 \\  &= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^N (y_i-\mathbf{x_iw})^2 \\  &= \arg\min_{\mathbf{w}} \text{MSE}_{\text{train}}  \end{aligned}

That is, the parameters \mathbf{w} chosen to maximise the likelihood are exactly those chosen to minimise the mean-squared error.
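We can see this equivalence numerically. The sketch below uses a 1-dimensional weight and a grid of candidate values (all specific numbers illustrative): the candidate that maximises the log likelihood is exactly the one that minimises the training MSE.

```python
import numpy as np

# Sketch: for a 1-D weight, the grid point maximising the log likelihood
# is the same grid point minimising the training MSE.
rng = np.random.default_rng(2)
N = 200
sigma_e = 0.4
x = rng.normal(size=N)
y = 1.5 * x + rng.normal(0.0, sigma_e, size=N)   # true weight 1.5 (illustrative)

ws = np.linspace(0.0, 3.0, 601)                  # candidate weights
resid = y[None, :] - ws[:, None] * x[None, :]    # residuals for every candidate
mse = (resid**2).mean(axis=1)
loglik = -N / 2 * np.log(2 * np.pi * sigma_e**2) - (resid**2).sum(axis=1) / (2 * sigma_e**2)

w_mle = ws[np.argmax(loglik)]
w_mse = ws[np.argmin(mse)]
```

Since the log likelihood is a decreasing affine function of the sum of squared residuals, the two argmaxes must coincide exactly, not just approximately.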

Even more connections!

There are other nice connections between measures we use and principled methods: L1 regularisation corresponds to MAP (maximum a posteriori) estimation with a Laplacian prior on the weights, and L2 regularisation to MAP estimation with a Gaussian (i.e. normally distributed) prior.

L1 regularisation is adding a penalty term proportional to the absolute value of the weights (e.g. \min \text{MSE} + \lambda|\mathbf{w}|), whereas L2 regularisation is adding a penalty term proportional to the squared value of the weights (e.g. \min \text{MSE} + \lambda\mathbf{w^Tw}). The numbers 1 and 2 correspond to the power of \mathbf{w} used. You can see plots of the Gaussian (normal) and Laplacian priors below.
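The correspondence comes from taking the negative log of each prior density: up to additive constants that do not depend on \mathbf{w}, the Laplacian prior yields the L1 penalty and the Gaussian prior yields the L2 penalty. A minimal sketch, where `b` and `tau` are illustrative prior scale parameters:

```python
import numpy as np

# Sketch: negative log Laplacian/Gaussian prior densities recover the
# L1/L2 penalties up to w-independent constants. `b` and `tau` are
# illustrative prior scales, and `w` is an arbitrary weight vector.
w = np.array([0.5, -1.5, 2.0])
b = 1.0      # Laplacian prior scale
tau = 1.0    # Gaussian prior standard deviation

# Laplacian density (1/2b) exp(-|w_j|/b):
# negative log = |w_j|/b + log(2b) per coordinate  ->  L1 penalty
neg_log_laplace = np.sum(np.abs(w)) / b + w.size * np.log(2 * b)

# Gaussian density (1/sqrt(2 pi tau^2)) exp(-w_j^2 / (2 tau^2)):
# negative log = w_j^2/(2 tau^2) + const per coordinate  ->  L2 penalty
neg_log_gauss = np.sum(w**2) / (2 * tau**2) + w.size * 0.5 * np.log(2 * np.pi * tau**2)

l1_penalty = np.sum(np.abs(w))
l2_penalty = np.sum(w**2)
```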

Plots of a Gaussian and a Laplacian prior. Credits: Austin Rochford.
