Handling NaNs in your Data: the Titanic Dataset

Jessica Yung, Data Science

Sometimes your data will contain invalid values such as NaN, often because data was lost or could not be collected. There are two ways of handling them:

  1. Delete the datapoint
  2. Estimate the value of the datapoint

The first option – deleting your data – may be better if the number of affected datapoints is tiny, or if any estimate of the missing value is likely to be inaccurate.

The second option – estimating the value of the datapoint – may be preferable if, say, a fifth of your rows are missing a value for the Age column but have data for 10 other columns so you don’t want to discard those rows completely. We will be introducing some noise into the model with these estimates, but if the estimates are reasonable our model may still be more accurate than if we neglected Age or the rows where there was no value for Age.

In this post I will discuss handling NaNs in the Titanic Dataset on Kaggle. The full Jupyter notebook can be found here.

Finding NaNs

First, we should check if there are any NaNs or infinite values in the data.
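A minimal sketch of such a check with pandas and NumPy. The small DataFrame is a made-up stand-in for the Titanic training data; the column names follow the Kaggle dataset, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training data (values are made up).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.inf, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# True if any cell in the frame is NaN.
has_nan = bool(df.isna().any().any())

# np.isinf only works on numeric data, so restrict to numeric columns first.
numeric = df.select_dtypes(include="number")
has_inf = bool(np.isinf(numeric.to_numpy()).any())

print("Any NaNs:", has_nan)
print("Any infinite values:", has_inf)
```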

Second, if there are NaNs, we should find the location of those NaNs.
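One way to locate them, sketched on a made-up DataFrame (the column names follow the Kaggle dataset, the values do not): count the NaNs per column, then select the rows that contain at least one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training data (values are made up).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, None, "E46"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values in each column.
nan_counts = df.isna().sum()

# Rows that contain at least one missing value.
rows_with_nan = df[df.isna().any(axis=1)]

# Index labels of the rows where Age specifically is missing.
missing_age_index = df.index[df["Age"].isna()].tolist()

print(nan_counts)
print(rows_with_nan)
print(missing_age_index)
```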

Ways of predicting values

Now that we have found the rows with NaNs, we need to either delete those rows or replace the NaNs with estimates.
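Both options are one-liners in pandas. The sketch below uses a made-up DataFrame and, for the second option, fills with the median purely as a placeholder estimate; the choice of estimate is discussed next.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Option 1: delete every row where Age is missing.
dropped = df.dropna(subset=["Age"])

# Option 2: keep every row and replace the missing Age with an estimate.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

print(len(df), len(dropped), len(filled))
```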

There are two straightforward methods:

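  1. Mean: the average of the values, sum(x)/len(x), where x is an array of datapoints.
  2. Median: the middle value once the data is sorted, sorted(x)[len(x)//2] for an odd number of datapoints (for an even number, the average of the two middle values).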
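The two definitions can be written out in plain Python. This is a sketch for illustration; in practice you would normally use the `statistics` module or pandas, and the sample ages are made up.

```python
import statistics


def mean(x):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(x) / len(x)


def median(x):
    """Middle value of the sorted data (average of the two middle
    values when the number of datapoints is even)."""
    s = sorted(x)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


ages = [35, 22, 58, 26]  # made-up values, deliberately unsorted

print(mean(ages), median(ages))

# The hand-written versions agree with the standard library.
assert mean(ages) == statistics.mean(ages)
assert median(ages) == statistics.median(ages)
```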

Which one you use depends on the distribution of the data. The median is usually the safer choice, particularly if:

  • The distribution of the data is skewed (not symmetric).
  • You want to guard against outliers.

Figure: Histograms of skewed distributions
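To see why skew and outliers matter, compare the mean and median of a small right-skewed sample. The fare values below are made up for illustration; a single very large value drags the mean far from the bulk of the data, while the median barely moves.

```python
import pandas as pd

# Made-up fares: mostly small values plus one extreme outlier.
fares = pd.Series([7.25, 7.92, 8.05, 8.46, 13.0, 26.55, 512.33])

fare_mean = fares.mean()
fare_median = fares.median()
fare_skew = fares.skew()  # positive for a right-skewed sample

print("mean:", fare_mean)
print("median:", fare_median)
print("skewness:", fare_skew)
```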

Implementing in code

In the Titanic dataset, many of the Age values were given as NaN. We decided to use the median age for each gender within each passenger class as a proxy. Below, we replace the NaNs with our new estimates. The full code can be found here.
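A minimal sketch of this group-wise median imputation, assuming the Kaggle column names `Age`, `Sex` and `Pclass`. The rows are made up, and this is one possible way to do it with pandas rather than necessarily the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [38.0, 35.0, np.nan, 22.0, np.nan, 30.0],
})

# Median age of each (class, gender) group, aligned to the original rows.
# NaNs are skipped when the median is computed.
group_median = df.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Replace only the missing ages with their group's median.
df["Age"] = df["Age"].fillna(group_median)

print(df)
```

If an entire group has no known ages, its median is itself NaN and those rows stay missing, so it is worth re-checking for NaNs afterwards.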

It may be useful later on to know whether the Age was originally missing. There may even be systematic reasons why that piece of information is missing, which we could use in our model. For example, maybe those people did not fill in certain forms before boarding the Titanic, which might tell us something about their behaviour.
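One way to keep that information is to record an indicator column before filling in the estimates. The column name `AgeWasMissing` is hypothetical, the values are made up, and a plain median fill is used here only to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up).
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})

# Record which ages were missing BEFORE overwriting them.
# "AgeWasMissing" is a hypothetical column name.
df["AgeWasMissing"] = df["Age"].isna().astype(int)

# Now fill in the estimates.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```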
