Lending Club Data: Proportion of Loans that end in Default by US State

Jessica Yung, Data Science


Lending Club Loans Dataset: complete loan data (over 800k records with up to ~70 attributes each!) for all loans issued from 2007 to 2015, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information.

I’ve posted the full code on GitHub. It (1) shows how I obtained the data used in the map above and (2) includes relevant exploratory plots drawn using Seaborn.

On Feature Engineering
loan_status was given as strings (text), specifically with values in the set
{'Default', 'In Grace Period', 'Does not meet the credit policy. Status:Fully Paid', 'Issued', 'Late (16-30 days)', 'Late (31-120 days)', 'Does not meet the credit policy. Status:Charged Off', 'Current', 'Fully Paid', 'Charged Off'}.

As is often the case, I had to translate categorical data (text) into numerical data to analyse it effectively. The challenge is choosing an encoding that means something. Arbitrarily assigning a number to each category might not be useful, because the numbers would imply an ordering between categories that isn't really there.

In this case, there were two clear cuts:

  1. Whether the loan had ‘ended’ (one cannot make judgements about the default status of a loan that is still current), and
  2. If the loan had ended, whether it ended in default.

These were then used to construct the proportion of default:

Proportion of default = (no. of loans that ended in default) / (no. of loans that ended)
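
To make the two cuts concrete, here is a minimal sketch in plain Python. It is not the code from the GitHub repo. The post does not say which loan_status values count as ‘ended’ or as ‘ended in default’, so the two sets below are my assumption, and the function name and sample rows are made up for illustration.

```python
from collections import defaultdict

# Assumption: which statuses count as "ended in default" and as "ended".
# The post does not list them; adjust these sets to match your own definition.
DEFAULTED = {
    "Default",
    "Charged Off",
    "Does not meet the credit policy. Status:Charged Off",
}
ENDED = DEFAULTED | {
    "Fully Paid",
    "Does not meet the credit policy. Status:Fully Paid",
}


def default_proportion_by_state(loans):
    """Map each state to (defaulted loans) / (ended loans).

    `loans` is an iterable of (state, loan_status) pairs. States with no
    ended loans are left out, since the proportion is undefined for them.
    """
    ended = defaultdict(int)
    defaulted = defaultdict(int)
    for state, status in loans:
        if status not in ENDED:
            continue  # current, late, issued, or in grace period: outcome unknown
        ended[state] += 1
        if status in DEFAULTED:
            defaulted[state] += 1
    return {state: defaulted[state] / ended[state] for state in ended}


# Made-up rows for illustration only, not taken from the dataset.
sample = [
    ("CA", "Fully Paid"),
    ("CA", "Charged Off"),
    ("CA", "Current"),
    ("NY", "Fully Paid"),
    ("NY", "Late (31-120 days)"),
    ("TX", "Issued"),
]
proportions = default_proportion_by_state(sample)
```

Loans that have not ended are dropped from both the numerator and the denominator, so a state with many current loans is not made to look safer than it is.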

Note: Sometimes it is also useful to translate numerical data into other numerical data. An example is adding an isChild column constructed from ages (years).
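
As a rough sketch of the isChild idea, the snippet below derives a boolean column from a list of ages. The threshold of 18 years is my assumption (the right cut-off depends on the problem), and the names are hypothetical.

```python
# Assumption: a child is anyone under 18; pick the limit that suits your data.
CHILD_AGE_LIMIT = 18


def is_child(age_years, limit=CHILD_AGE_LIMIT):
    """Return True if the age (in years) is below the limit."""
    return age_years < limit


# Made-up ages for illustration only.
ages = [4, 17, 18, 42]
is_child_column = [is_child(age) for age in ages]
```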
