In this post, we’ll talk about (1) what normalising data is, (2) why you might want to do it, and (3) how you can do it (with examples).
Background: The Mystery of the Horrifically Inaccurate Model
Let me tell you a story. Once upon a time, I trained a few models to classify traffic signs for Udacity’s Self-Driving Car Nanodegree. I started by copying neural networks from TensorFlow tutorials and adapting them to my image set. The tutorials described their 91% accuracy on handwritten digits as embarrassing, so I figured that my models, though trained on a different dataset, would still manage above 50% accuracy. I’d seen people on the student Slack channels get over 95% accuracy with fairly straightforward models.
Guess what accuracy my model had? 5-6%.
FIVE TO SIX PERCENT. It was in that range after 2 iterations and after over 120 iterations.
Something was wrong.
Fortunately, I had a mini-project later on where I trained models to classify traffic signs using Keras. In that project, I (1) normalised the data and (2) did not use convolutions (a type of neural network layer we’ll get to later). And guess what! I got an accuracy of over 60% in two iterations. So maybe – just maybe – normalising the data would be a quick fix!
Preview: WELCOME TO THE ZOMBIE APOCALYPSE. The original image in the top left looks more normal, but the transformed image on the right is the normalised one. 😉
What does normalising the data mean?
Normalisation scales all numeric variables into the range [0,1]. You can implement this with X' = (X − X_min) / (X_max − X_min).
- Disadvantage: If you have outliers in your dataset (e.g. one datapoint with value 10,000 when all the others are between 0 and 100), normalising your data will scale most of the data to a very small interval. Most datasets have outliers.
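Here’s a minimal sketch of min-max normalisation in NumPy, using made-up numbers to show the outlier problem described above:

```python
import numpy as np

def min_max_normalise(X):
    """Scale values into the range [0, 1]."""
    return (X - np.min(X)) / (np.max(X) - np.min(X))

# Well-behaved data: the values spread across the whole [0, 1] interval.
X = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
print(min_max_normalise(X))  # [0.   0.25 0.5  0.75 1.  ]

# One outlier (10,000) squashes all the other points into [0, 0.01].
X_outlier = np.array([0.0, 25.0, 50.0, 75.0, 100.0, 10000.0])
print(min_max_normalise(X_outlier))
```

With the outlier present, the first five points all end up between 0 and 0.01 – exactly the “very small interval” problem.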
Another common preprocessing technique is standardisation.
Standardisation transforms your data to have a mean (average) of zero and a variance of one. You can implement this with X' = (X − μ) / σ, where μ is the mean of the data and σ is the standard deviation.
- Variance is the standard deviation squared. The standard deviation measures how spread out the data are around the mean (average).
- Disadvantage: Unlike with normalisation, your transformed data aren’t bounded to a fixed interval.
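A quick sketch of standardisation, again with made-up numbers, to show the mean-zero, variance-one property and the lack of a fixed bound:

```python
import numpy as np

def standardise(X):
    """Transform data to mean 0 and variance 1."""
    return (X - np.mean(X)) / np.std(X)

X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X_std = standardise(X)

print(np.mean(X_std))  # ~0
print(np.var(X_std))   # ~1
# The result is not confined to [0, 1]:
print(X_std.min(), X_std.max())
```

Here the standardised values run from about −1.41 to 1.41, and with different data they could be far larger in magnitude.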
Why do we care about standard deviations?
When data are normally (read: prettily) distributed, about two-thirds (roughly 68%) of the data lies within one standard deviation of the mean.
The normal distribution is special because, no matter how your data is distributed (so long as it has finite variance), the means of large samples of it are approximately normally distributed (the Central Limit Theorem). It’s like magic.
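You can see both claims in a small simulation – a sketch using randomly generated data, with a fixed seed so the numbers are reproducible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-thirds rule: for normally distributed data, ~68% of points
# lie within one standard deviation of the mean.
normal_data = rng.normal(loc=0.0, scale=1.0, size=100_000)
within_one_sd = np.mean(np.abs(normal_data) < 1.0)
print(within_one_sd)  # ≈ 0.68

# Central Limit Theorem: uniform data looks nothing like a bell curve,
# but the means of many samples of it are approximately normal.
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)
# A Uniform(0, 1) variable has mean 0.5 and variance 1/12, so the
# standard deviation of a 50-point sample mean is sqrt(1/12) / sqrt(50).
within = np.mean(np.abs(sample_means - 0.5) < np.sqrt(1 / 12) / np.sqrt(50))
print(within)  # ≈ 0.68 again – the sample means behave like a normal distribution
```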
Aside 1: If you’ve worked with normal distributions, you might think that normalisation is the same thing as standardisation, because standardising is how you get the Z-statistic. I did.
Aside 2: A normal distribution is also called a Gaussian distribution. The shape is referred to as a bell curve.
Why might we normalise the data?
How do we normalise (or standardise) the data?
We just translate the formulae given above into code:
# Standardise input (images still in colour)
X_train_std = (X_train - np.mean(X_train)) / np.std(X_train)
X_test_std = (X_test - np.mean(X_test)) / np.std(X_test)

# Normalise input (images still in colour)
X_train_norm = (X_train - np.mean(X_train)) / (np.max(X_train) - np.min(X_train))
X_test_norm = (X_test - np.mean(X_test)) / (np.max(X_test) - np.min(X_test))
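We can sanity-check what these transformations do with a stand-in for the image data (random 8-bit pixel values – the real X_train is the traffic sign dataset). Note that this “normalise” variant subtracts the mean, so the values span an interval of width 1 centred near zero rather than landing exactly in [0, 1]:

```python
import numpy as np

# Stand-in for the real image data: random 8-bit pixel values.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(100, 32, 32, 3)).astype(np.float64)

X_train_std = (X_train - np.mean(X_train)) / np.std(X_train)
X_train_norm = (X_train - np.mean(X_train)) / (np.max(X_train) - np.min(X_train))

# Standardised data: mean ~0, variance ~1, values unbounded.
print(np.mean(X_train_std), np.var(X_train_std))

# Mean-normalised data: mean ~0, and the total range is exactly 1.
print(np.max(X_train_norm) - np.min(X_train_norm))  # 1.0
```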
Sometimes standardising images can transform them from readable to humanly unreadable:
But at other times, normalising images brings out features we wouldn’t have been able to see otherwise.
You will also notice that the normalised representation of this image (top right) is different from the standardised representation. The normalised version is humanly readable but there is little contrast in the sign, whereas the standardised version has much more contrast. These are things you’d want to consider when choosing between normalisation and standardisation for preprocessing.
Bonus: Here’s the function plot_norm_images I used to quickly plot the normalised and un-normalised images next to each other.
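The original helper isn’t reproduced here, so this is a hypothetical reconstruction of what a plot_norm_images-style function might look like – the signature, figure layout, and display rescaling are all my assumptions, not the author’s code:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def plot_norm_images(X, index=0):
    """Hypothetical sketch: show one image next to its normalised version."""
    # Mean normalisation, matching the snippet above.
    normed = (X - np.mean(X)) / (np.max(X) - np.min(X))
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(6, 3))
    ax1.imshow(X[index].astype(np.uint8))
    ax1.set_title("Original")
    # Shift/rescale into [0, 1] purely so imshow can display the floats.
    display = (normed[index] - np.min(normed)) / (np.max(normed) - np.min(normed))
    ax2.imshow(display)
    ax2.set_title("Normalised")
    for ax in (ax1, ax2):
        ax.axis("off")
    return fig

# Example with random stand-in "images":
images = np.random.default_rng(0).integers(0, 256, size=(5, 32, 32, 3)).astype(float)
fig = plot_norm_images(images, index=0)
```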
So time for the big reveal. Did normalising the data save the model?
No. After 15 epochs (iterations), my model still had an accuracy of only 5.9%.
Okay, let’s try altering the network architecture next.
PS: We will compare the performance of un-normalised vs normalised data input to models later on, so stay tuned!