Debugging a Classification Model: Refining Evaluation Metrics

Jessica Yung | Data Science

A New Evaluation Metric

In the previous post, I discussed the problems of using a pure accuracy metric for multi-label classification when you have many labels and a small number of labels assigned to each input. Even when my model assigned no labels to anything, it had an accuracy of 92%.

In this post, I will discuss and go through the code for a more detailed evaluation metric. We will be building our own metric function as opposed to using a pre-built one.

Our metric will:

  1. Calculate accuracy by motion rather than by category, which is how we calculated it in the previous post.
  2. Allow us to assign variable penalties for false positives and false negatives.
  3. Display the number of true and false positives and negatives.

Let’s walk through this step by step.

Problem Recap:
Build a machine learning model to classify debating topics (e.g. This House Would Break Up the Eurozone) into categories (e.g. ‘Economics’ and ‘International Relations’).

  • Input: A debating motion (string of text).
  • Output: A list of categories that the debating motion belongs to.

1. Calculating accuracy by motion

Last time we were left with this code:

This code calculates accuracy by category and then aggregates the per-category results. We would like to calculate accuracy by individual motion and aggregate the figures that way instead. Working per motion will make it easier later to calculate metrics that depend on what proportion of the errors are false positives and what proportion are false negatives.
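To make the difference concrete, here is a minimal sketch of the two ways of aggregating. It is not the code from the previous post. It assumes the labels are stored as a binary NumPy matrix with one row per motion and one column per category, and the function names and the data are hypothetical.

```python
import numpy as np

def accuracy_per_category(y_true, y_pred):
    """Fraction of motions labelled correctly, for each category (column)."""
    return (np.asarray(y_true) == np.asarray(y_pred)).mean(axis=0)

def accuracy_per_motion(y_true, y_pred):
    """Fraction of categories labelled correctly, for each motion (row)."""
    return (np.asarray(y_true) == np.asarray(y_pred)).mean(axis=1)

# Hypothetical data: 3 motions x 4 categories.
# A 1 means the motion belongs to that category.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 0, 1]])
y_pred = np.zeros_like(y_true)  # a model that never assigns any label

per_motion = accuracy_per_motion(y_true, y_pred)
per_category = accuracy_per_category(y_true, y_pred)

print(per_motion)         # one accuracy figure per motion
print(per_motion.mean())  # aggregate over motions
```

In this sketch both aggregates come out the same, because every motion is scored on the same number of categories. The benefit of going per motion is that each row's errors are now available to be split by type.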

2. Assigning variable weights to false positives and false negatives

Idea: Previously we discovered that the model predicted that no motion in our test set belonged to any category. That is, all the errors were false negatives.

Suppose we consider false negatives to be worse errors than false positives.
One possible reason: the category labels determine which motions are shown when people search for, e.g., ‘Politics’. We might prefer that people see some search results (motions labelled ‘Politics’) that are not completely accurate rather than see no motions at all.

Thus, given two models with similar overall accuracy, we might prefer the one with the lower false negative rate. We can express this preference by calculating a different score that takes the type of error into account.
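As a rough illustration of why plain accuracy cannot separate such models, the sketch below compares two made-up models that have identical accuracy but very different false negative rates. The helper functions and the data are hypothetical. The false negative rate is taken here to be FN / (FN + TP), pooled over all motion-category pairs.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of all motion-category pairs predicted correctly."""
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())

def false_negative_rate(y_true, y_pred):
    """FN / (FN + TP): share of true labels that the model missed."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    fn = np.sum(y_true & ~y_pred)
    tp = np.sum(y_true & y_pred)
    positives = fn + tp
    # Return 0.0 when there are no true labels at all (nothing to miss).
    return float(fn / positives) if positives > 0 else 0.0

# Hypothetical data: 2 motions x 4 categories.
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0]])
model_a = np.zeros_like(y_true)       # never assigns a label
model_b = np.array([[1, 1, 0, 0],     # finds every true label,
                    [0, 1, 1, 0]])    # plus one wrong extra label per motion

for name, y_pred in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(y_true, y_pred), false_negative_rate(y_true, y_pred))
```

Both models get 6 of the 8 predictions right, but model A misses every true label while model B misses none.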

In Code
To do this, we alter the mean_accuracy_per_motion() function. Instead of calculating scores per motion as before, we divide the cases into correct and incorrect predictions. We then divide correct predictions into true positives and true negatives, and incorrect predictions into false positives and false negatives.

For correct predictions, we add 1 to the score as before.

For incorrect predictions, we subtract false_pos from the score if the prediction is a false positive, and subtract false_neg from the score if the prediction is a false negative.
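For a single motion, the scoring rule just described might look like the sketch below. The function name score_motion is hypothetical, false_pos and false_neg are assumed to be non-negative penalty weights, and the score is left unnormalised because the original normalisation is not shown here.

```python
def score_motion(true_labels, predicted_labels, false_pos=1.0, false_neg=1.0):
    """Score one motion: +1 per correct label, minus a penalty per error.

    true_labels and predicted_labels are equal-length sequences of 0/1 flags,
    one flag per category (an assumed representation).
    """
    score = 0.0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            score += 1              # true positive or true negative
        elif pred == 1:
            score -= false_pos      # predicted a label that is not there
        else:
            score -= false_neg      # missed a label that is there
    return score

# Hypothetical motion with 4 categories: one miss, one wrong extra label.
truth = [1, 0, 0, 1]
pred = [0, 1, 0, 1]
print(score_motion(truth, pred, false_pos=0.5, false_neg=2.0))
```

Here the two correct predictions contribute +2, the false negative costs 2.0 and the false positive costs 0.5, so a model that misses labels is punished more heavily than one that over-assigns them.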

3. Print the number of true and false positives and negatives

Once we have divided the correct and incorrect predictions into true and false positives and negatives, it is trivial to print the total number in each category.

The code for these two ideas is shown below:
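What follows is a minimal sketch of how the two ideas could be combined, not the original listing. It assumes binary NumPy label matrices (rows are motions, columns are categories). The function name weighted_mean_accuracy_per_motion, its return value, and the choice to divide each motion's score by the number of categories are all assumptions made for illustration.

```python
import numpy as np

def weighted_mean_accuracy_per_motion(y_true, y_pred, false_pos=1.0, false_neg=1.0):
    """Mean weighted score per motion, plus counts of each outcome type."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    # Boolean masks, one entry per (motion, category) pair.
    tp = y_true & y_pred
    tn = ~y_true & ~y_pred
    fp = ~y_true & y_pred
    fn = y_true & ~y_pred

    counts = {
        "true positives": int(tp.sum()),
        "true negatives": int(tn.sum()),
        "false positives": int(fp.sum()),
        "false negatives": int(fn.sum()),
    }
    for name, count in counts.items():
        print(f"{name}: {count}")

    # +1 per correct prediction, minus the penalty for each kind of error.
    raw = (tp | tn).sum(axis=1) - false_pos * fp.sum(axis=1) - false_neg * fn.sum(axis=1)
    # Assumption: normalise by the number of categories so that a perfect
    # motion scores 1.0.
    per_motion = raw / y_true.shape[1]
    return float(per_motion.mean()), counts

# Hypothetical data: 2 motions x 4 categories.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0]])
y_pred = np.array([[0, 1, 0, 1],
                   [0, 1, 0, 0]])

score, counts = weighted_mean_accuracy_per_motion(y_true, y_pred, false_pos=0.5, false_neg=2.0)
print(score)
```

With both weights left at 1, each error simply cancels out one correct prediction. Raising false_neg above false_pos lowers the score of models that miss labels relative to models that assign a few too many.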
