Removing Outliers from your Data

Jessica Yung | Data Science

Hastily compiled from Udacity’s Intro to Machine Learning videos.

Here’s a general recipe for removing outliers from your data:

1. Train with all data.
2. Remove ~10% of data (points with highest residual error).
3. Train again.
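The three steps above can be sketched roughly as follows. This is a minimal illustration in plain Python 3, not the code from the course or from this post: the helper names `fit_line` and `remove_outliers`, the use of squared residuals from a simple least-squares line, and the toy data are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def remove_outliers(xs, ys, fraction=0.10):
    """Hypothetical helper: fit, drop the worst `fraction` of points, refit."""
    # Step 1: train with all data.
    slope, intercept = fit_line(xs, ys)

    # Step 2: rank points by squared residual and drop the largest ones.
    residuals = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    n_keep = len(xs) - int(round(fraction * len(xs)))
    order = sorted(range(len(xs)), key=lambda i: residuals[i])
    keep = sorted(order[:n_keep])  # restore the original point order
    kept_xs = [xs[i] for i in keep]
    kept_ys = [ys[i] for i in keep]

    # Step 3: train again on the cleaned data.
    return kept_xs, kept_ys, fit_line(kept_xs, kept_ys)


# Toy data (made up for illustration): y = 2x + 1 with one corrupted point.
xs = [float(x) for x in range(10)]
ys = [2 * x + 1 for x in xs]
ys[5] = 100.0  # pretend this is a data entry error

kept_xs, kept_ys, (new_slope, new_intercept) = remove_outliers(xs, ys)
print(len(kept_xs), new_slope, new_intercept)
```

The 10% figure is only a rule of thumb, so `fraction` is left as a parameter. With a real dataset you would probably want to look at the dropped points before discarding them.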

Obviously don’t remove outliers blindly – sometimes they are important and you should pay attention to them. But outliers that are results of data entry errors, sensor malfunction or irrelevant freak events should be ignored. You need to exercise judgement.

Here is the new scatter plot for the data presented above. I’ve removed the outliers (defined above) and fitted it with a new linear regression line.

screenshot

And here is the code I used to remove the outliers. It contains a lot of debugging print statements. It is written in Python 2.7.

One last thing – look at this outlier! This is bonus amounts plotted against salaries from a portion of the Enron scandal corpus. Can you guess who or what that outlier is? I’ll post the answer in my next post.

screenshot
