Sometimes your data will contain invalid values such as NaN, often because data was lost or could not be collected. There are two ways of handling them:

- Delete the datapoint
- Estimate the value of the datapoint

The first option – deleting your data – may be better if the number of anomalous datapoints is tiny and the estimated value of the datapoint is likely to be inaccurate.

The second option – estimating the value of the datapoint – may be preferable if, say, a fifth of your rows are missing a value for the `Age` column but have data for 10 other columns, so you don’t want to discard those rows completely. We will be introducing some noise into the model with these estimates, but if the estimates are reasonable our model may still be more accurate than if we neglected `Age` or the rows where there was no value for `Age`.
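The two options can be sketched in pandas. This is a minimal illustration on a made-up frame (the column names and values here are hypothetical, not from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row is missing Age
df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 58.0],
                   'Fare': [7.25, 8.05, 53.1, 26.55]})

# Option 1: delete the datapoint (drop rows where Age is NaN)
dropped = df.dropna(subset=['Age'])

# Option 2: estimate the value (here, fill with the median of the known ages)
filled = df.copy()
filled['Age'] = filled['Age'].fillna(df['Age'].median())

print(len(dropped))                    # one row fewer
print(filled['Age'].isnull().sum())    # no NaNs remain
```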

In this post I will discuss handling NaNs in the Titanic Dataset on Kaggle. The full Jupyter notebook can be found here.

##### Finding NaNs

First, we should check whether there are any `NaN`s or infinite values in the data.

```python
import numpy as np

# Are there any NaNs in the data? Want this to return False.
np.any(np.isnan(data))

# Are all the datapoints finite? Want this to return True.
np.all(np.isfinite(data))
```

Second, if there are NaNs, we should find the location of those NaNs.

```python
# Prints rows that have NaNs in them
# For each cell
for i in range(0, number_of_rows):
    for j in range(0, number_of_cols):
        # If the cell value is NaN
        if np.isnan(test_df.iloc[i, j]):
            # Print the row
            print(test_df[i:i+1])
```
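Looping over every cell works, but is slow for large frames. A more idiomatic pandas sketch that selects the same rows in one line (using a small hypothetical frame for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in row 1
test_df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                        'b': [4.0, 5.0, 6.0]})

# Select all rows that contain at least one NaN
rows_with_nans = test_df[test_df.isnull().any(axis=1)]
print(rows_with_nans)
```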

##### Ways of predicting values

Now that we have found the rows with NaNs, we need to either delete those rows or replace the NaNs with estimates.

There are two straightforward methods:

- **Mean**: the average of the values: `sum(x)/len(x)`, where `x` is an array of datapoints.
- **Median**: the middle of the values: `x[int(len(x)/2)]`, where `x` has been sorted first.

Which one you use depends on the distribution. It is usually safer to use the **median**. This is particularly preferred if:

- The distribution of the data is **skewed** (not symmetric).
- You want to guard against **outliers**.
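A quick sketch of why the median is more robust. The numbers here are toy values, not from the Titanic data:

```python
import numpy as np

# Ages with one outlier (a data-entry error, say)
ages = np.array([20, 22, 25, 27, 900])

print(np.mean(ages))    # pulled far upwards by the outlier
print(np.median(ages))  # unaffected: 25.0
```

The single bad value drags the mean up to nearly 200, while the median stays at a plausible age.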

##### Implementing in code

In the Titanic dataset, many of the `Age` values were given as `NaN`. We decided to use the median age for each gender within each passenger class as a proxy. Below, we replace the `NaN`s with our new estimates. The full code can be found here.

```python
# Find median ages for each gender in each passenger class
import numpy as np

# There are two genders and three passenger classes in this dataset.
# So we create a 2 by 3 matrix to store the median values.

# Create a 2 by 3 matrix of zeroes
median_ages = np.zeros((2, 3))

# For each cell in the 2 by 3 matrix
for i in range(0, 2):
    for j in range(0, 3):
        # Set the value of the cell to be the median of all `Age` values
        # matching the criterion 'Corresponding gender and Pclass',
        # leaving out all NaN values
        median_ages[i, j] = df[(df['Gender'] == i) & \
                               (df['Pclass'] == j + 1)]['Age'].dropna().median()

# Create new column AgeFill to put values into.
# This retains the state of the original data.
df['AgeFill'] = df['Age']
df[df['Age'].isnull()][['Age', 'AgeFill', 'Gender', 'Pclass']].head(10)

# Put our estimates into NaN rows of new column AgeFill.
# df.loc is a purely label-location based indexer for selection by label.
for i in range(0, 2):
    for j in range(0, 3):
        # Locate all cells in dataframe where `Gender` == i, `Pclass` == j+1
        # and `Age` == null.
        # Replace them with the corresponding estimate from the matrix.
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j + 1),
               'AgeFill'] = median_ages[i, j]
```
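The nested loops can also be written more compactly with `groupby`/`transform`, which computes the per-group medians and broadcasts them back to each row in one step. A sketch, assuming the same `Gender`/`Pclass`/`Age` columns (the frame below is a hypothetical miniature):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame
df = pd.DataFrame({'Gender': [0, 0, 1, 1, 0],
                   'Pclass': [1, 1, 2, 2, 1],
                   'Age':    [30.0, np.nan, 20.0, 22.0, 40.0]})

# Median age per (Gender, Pclass) group, broadcast back to each row,
# then used only where Age is missing
df['AgeFill'] = df['Age'].fillna(
    df.groupby(['Gender', 'Pclass'])['Age'].transform('median'))
```

Both approaches should produce the same `AgeFill` values; the loop version makes the per-group medians easier to inspect, while the `transform` version scales to any number of groups without changing the code.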

It may be useful later on to know whether `Age` was originally missing. There may even be systematic reasons why that piece of information is missing that we could use in our model. For example, maybe those people did not fill in certain forms before boarding the Titanic, which might tell us something about their behaviour.

```python
import pandas as pd

# Create a feature that records whether the Age was originally missing
df['AgeIsNull'] = pd.isnull(df['Age']).astype(int)
df.head()
```

**Relevant Links**: