*Hastily compiled from Udacity's Intro to Machine Learning videos.*

Here’s a general **recipe for removing outliers from your data**:

1. Train with all data.

2. Remove ~10% of data (points with highest residual error).

3. Train again.
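The three steps above can be sketched roughly as follows. This is a minimal illustration, not the course's code – the synthetic data, variable names, and the scikit-learn estimator are my own assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (made up for illustration): ages vs. net worths,
# with a few injected outliers.
rng = np.random.RandomState(42)
X = rng.uniform(20, 60, size=(100, 1))           # e.g. ages
y = 6.25 * X.ravel() + rng.normal(0, 10, 100)    # e.g. net worths
y[:5] += 500                                     # inject some outliers

# 1. Train with all data.
reg = LinearRegression().fit(X, y)

# 2. Remove the ~10% of points with the largest residual errors.
residuals = np.abs(y - reg.predict(X))
keep = np.argsort(residuals)[: int(len(y) * 0.9)]
X_clean, y_clean = X[keep], y[keep]

# 3. Train again on the cleaned data.
reg_clean = LinearRegression().fit(X_clean, y_clean)
```

On data like this, the refit slope sits much closer to the true one than the first fit, since the injected points no longer drag the line around.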

Obviously, don’t remove outliers blindly – sometimes they are important and deserve your attention. But outliers that are the result of data-entry errors, sensor malfunctions, or irrelevant freak events should be discarded. You need to exercise judgement.

Here is the new scatter plot for the data presented above. I’ve removed the outliers (defined above) and fitted it with a new linear regression line.

And here is the code I used to remove the outliers. There’s a lot of debugging print output. This is written in Python 2.7.

```python
#!/usr/bin/python

def outlierCleaner(predictions, ages, net_worths):
    """
    Clean away the 10% of points that have the largest residual
    errors (difference between the prediction and the actual net worth).
    Return a list of tuples named cleaned_data where each tuple is
    of the form (age, net_worth, error).
    """
    print "outlierCleaner activated"

    cleaned_data = []
    tuples = []
    length = len(ages)
    for i in range(length):
        tuples.append((ages[i][0], net_worths[i][0],
                       net_worths[i][0] - predictions[i][0]))
    print "Tuples: ", tuples

    differences_tuples = []
    for i in range(length):
        differences_tuples.append((abs(net_worths[i][0] - predictions[i][0]), i))
    print "Differences: ", differences_tuples
    differences_sorted = sorted(differences_tuples)

    # Collect the indices of the datapoints to be removed
    indices_to_remove = []
    for i in range(int(length/10)):
        indices_to_remove.append(differences_sorted[length - 1 - i][1])
    indices_to_remove = sorted(indices_to_remove, reverse=True)
    print "Indices to remove: ", indices_to_remove

    # Remove the relevant tuples
    for i in indices_to_remove:
        del tuples[i]
    cleaned_data = tuples
    print "Cleaned_data: ", cleaned_data
    print "Number of items in cleaned data: ", len(cleaned_data)

    return cleaned_data
```
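If you’re on Python 3, the same idea can be written much more compactly with NumPy. This is my own sketch, not part of the course materials, and the function name is my own:

```python
import numpy as np

def outlier_cleaner(predictions, ages, net_worths):
    """Keep the 90% of points with the smallest absolute residuals;
    return a list of (age, net_worth, error) tuples."""
    predictions = np.asarray(predictions).ravel()
    ages = np.asarray(ages).ravel()
    net_worths = np.asarray(net_worths).ravel()
    errors = net_worths - predictions
    # Indices of the points sorted by |residual|, smallest first;
    # keep all but the top 10%.
    keep = np.argsort(np.abs(errors))[: int(len(ages) * 0.9)]
    return list(zip(ages[keep], net_worths[keep], errors[keep]))
```

Here `np.argsort` does the work of the manual sort-and-delete loops: sorting the absolute residuals once gives you the survivors directly.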

One last thing – look at this outlier! This is bonus amounts plotted against salaries from a portion of the Enron scandal corpus. Can you guess who or what that outlier is? I’ll post the answer in my next post.