Sometimes your data will contain invalid values such as NaN, often because data was lost or could not be collected. There are two ways of handling them:
- Delete the datapoint
- Estimate the value of the datapoint
The first option – deleting your data – may be better if the number of anomalous datapoints is tiny and when the estimate of the value of the datapoint is likely to be inaccurate.
The second option – estimating the value of the datapoint – may be preferable if, say, a fifth of your rows are missing a value for the
Age column but have data for 10 other columns so you don’t want to discard those rows completely. We will be introducing some noise into the model with these estimates, but if the estimates are reasonable our model may still be more accurate than if we neglected
Age or the rows where there was no value for
First, we should check if there are any
NaNs or infinite values in the data.
import numpy as np
# Are there any NaNs in the data? Want this to return False.
# Are all the datapoints finite? Want this to return True.
Second, if there are NaNs, we should find the location of those NaNs.
# Prints rows that have NaNs in them
# For each cell
for i in range(0,number_of_rows):
for j in range(0,number_of_cols):
# If the cell value is NaN
# Print the row
Ways of predicting values
Now that we have found the rows with NaNs, we need to either delete those rows or replace the NaNs with estimates.
There are two straightforward methods:
- Mean: average of the values. sum(x)/len(x) where x is an array of datapoints.
- Median: the middle of the values. x[int(len(x)/2)]
Which one you use depends on the distribution. It is usually safer to use the median. This is particularly preferred if:
- The distribution of the data is skewed (not symmetric).
- You want to guard against outliers.
Implementing in code
In the Titanic dataset, many of the
Age values were given as
NaN. We decided to use the median age for each gender within each passenger class as a proxy. Below, we replace the
NaNs with our new estimates. The full code can be found here.
# Find median ages for each gender in each passenger class
import numpy as np
# There are two genders and three passenger classes in this dataset.
# So we create a 2 by 3 matrix to store the median values.
# Create a 2 by 3 matrix of zeroes
median_ages = np.zeros((2,3))
# For each cell in the 2 by 3 matrix
for i in range(0,2):
for j in range(0,3):
# Set the value of the cell to be the median of all `Age` values
# matching the criterion 'Corresponding gender and Pclass',
# leaving out all NaN values
median_ages[i,j] = df[ (df['Gender'] == i) & \
(df['Pclass'] == j+1)]['Age'].dropna().median()
# Create new column AgeFill to put values into.
# This retains the state of the original data.
df['AgeFill'] = df['Age']
df[ df['Age'].isnull()][['Age', 'AgeFill', 'Gender', 'Pclass']].head(10)
# Put our estimates into NaN rows of new column AgeFill.
# df.loc is a purely label-location based indexer for selection by label.
for i in range(0, 2):
for j in range(0, 3):
# Locate all cells in dataframe where `Gender` == i, `Pclass` == j+1
# and `Age` == null.
# Replace them with the corresponding estimate from the matrix.
df.loc[ (df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j+1),\
'AgeFill'] = median_ages[i,j]
It may be useful to know whether the Age was originally missing later on. There may even be systematic reasons as to why that piece of information would be missing that we could use in our model. E.g. maybe those people did not fill in certain forms before boarding the Titanic, which might tell us something about their behaviour.
# Create a feature that records whether the Age was originally missing
df['AgeIsNull'] = pd.isnull(df['Age']).astype(int)