##### A New Evaluation Metric

In the previous post, I discussed the problems of using a pure accuracy metric for multi-label classification when you have many labels and a small number of labels assigned to each input. Even when my model assigned no labels to anything, it had an accuracy of 92%.
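To see why this happens, here is a minimal sketch with made-up numbers (100 motions, 25 categories, exactly 2 true labels per motion; these are illustrative, not the real dataset). With labels this sparse, a model that assigns nothing to anything is still "right" for 23 of every 25 category slots:

```python
import numpy as np

# Made-up dataset: 100 motions, 25 categories, exactly 2 true labels
# per motion (illustrative numbers, not the post's real data)
rng = np.random.default_rng(0)
n_motions, n_categories = 100, 25

y_true = np.zeros((n_motions, n_categories), dtype=int)
for row in y_true:
    # Mark 2 randomly chosen categories as this motion's true labels
    row[rng.choice(n_categories, size=2, replace=False)] = 1

# A degenerate "model" that assigns no labels to anything
y_pred = np.zeros_like(y_true)

# 23 of every 25 entries are correctly predicted as 0
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.92
```

The 92% is purely a consequence of label sparsity, which is exactly the problem the metric below is designed to avoid.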

In this post, I will walk through the code for a more informative evaluation metric. We will build our own metric function rather than use a pre-built one.

Our metric will:

- Calculate accuracy by motion and not by category as we did in the previous post.
- Allow us to assign variable penalties for false positives and false negatives.
- Display the number of true and false positives and negatives.

Let’s walk through this step by step.

**Problem Recap**

Build a machine learning model to classify debating topics (e.g. This House Would Break Up the Eurozone) into categories (e.g. ‘Economics’ and ‘International Relations’).

- Input: A debating motion (string of text).
- Output: A list of categories that the debating motion belongs to.
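As a concrete sketch of this input/output shape (the category list here is illustrative, not the post's real label set), the per-category classifiers used later effectively treat the output as one 0/1 indicator per category:

```python
# One labelled example, using the motion from the recap above.
# The category list is illustrative, not the post's real one.
labels = ['Economics', 'International Relations', 'Politics', 'Ethics']

motion = 'This House Would Break Up the Eurozone'
true_categories = {'Economics', 'International Relations'}

# One binary target per category for this motion
y = [1 if label in true_categories else 0 for label in labels]
print(y)  # [1, 1, 0, 0]
```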

##### 1. Calculating accuracy by motion

Last time we were left with this code:

```python
# v1 Evaluation of the performance on the test set
import numpy as np

def mean_accuracy_across_categories():
    """
    Prints mean accuracy of predictions.
    """
    overall_mean_accuracy_across_categories = []
    # For each category
    for category_to_test in labels:
        # Ask model to predict whether or not motions in ``X_test``
        # belong to category ``category_to_test``.
        predicted = category_clfs_dict[category_to_test].predict(X_test)
        # Calculate the mean accuracy of these predictions
        # for ``category_to_test``
        category_mean_accuracy = np.mean(
            predicted == Y_dict[category_to_test][train_size:total_labelled])
        print(category_to_test, category_mean_accuracy)
        # Append the mean accuracy to the array
        # containing mean accuracy for all categories
        overall_mean_accuracy_across_categories.append(category_mean_accuracy)
    # Calculate and print the mean accuracy across all categories
    # and motions in the test set.
    print('Overall Mean Accuracy across Categories: ',
          np.mean(overall_mean_accuracy_across_categories))
```

This code calculates accuracy by category and then aggregates. We would instead like to calculate accuracy by individual motion and aggregate that way, because it will later make it easier to compute metrics that depend on what proportion of the errors are false positives versus false negatives.

```python
# v2 Evaluation of the performance on the test set
def accuracy_by_category(return_predictions=False):
    """
    Returns an array of arrays of booleans that indicates whether
    each prediction matched the true value. Each row is a category
    and each value within the row is a motion.
    Weighs false positives the same as false negatives.

    e.g. if the prediction was 1 (in category) but the true value
    was 0 (not in category), the value is False.

    If ``return_predictions=True``, returns the array of arrays
    of predictions as well.
    """
    accuracy_by_category = []
    predictions = []
    # For each category
    for category_to_test in labels:
        # Generate predictions for all motions in test set
        # as to whether or not a motion belongs to a category
        predicted = category_clfs_dict[category_to_test].predict(X_test)
        # Add this to our array of predictions
        predictions.append(predicted)
        # Accuracy: Is the prediction the same as the true label?
        accuracy_by_category.append(
            predicted == Y_dict[category_to_test][train_size:total_labelled])
    # If we want to, the function can return the array of predictions
    # as well as the accuracy by category.
    # (Specify ``return_predictions=True`` in func argument)
    if return_predictions:
        return accuracy_by_category, predictions
    else:
        return accuracy_by_category

def mean_accuracy_per_motion():
    """
    Prints the mean accuracy per motion.
    Assumes we weigh false positives the same as false negatives.
    """
    accuracy_per_motion = []
    acc_by_category = accuracy_by_category()
    # For each motion
    for i in range(test_size):
        score = 0
        # For the per-category prediction for each motion
        for j in range(len(labels)):
            # Add 1 to the score if the prediction was accurate,
            # add 0 to the score if the prediction was inaccurate
            score += acc_by_category[j][i]
        # Normalise the score such that it's between 0 and 1 inclusive
        accuracy_for_one_motion = score / len(labels)
        # Put all the accuracy scores into one array
        accuracy_per_motion.append(accuracy_for_one_motion)
    print('Mean Accuracy Per Motion: ', np.mean(accuracy_per_motion))

mean_accuracy_per_motion()
```

##### 2. Assigning variable weights to false positives and false negatives

**Idea**: Previously we discovered that the model predicted that every motion in our test set did not belong to any category. That is, all the errors were false negatives.

Suppose we considered false negatives to be worse errors than false positives.

A possible reason: the category labels determine which motions are shown when someone searches for, say, ‘Politics’. We might prefer users to see some search results (motions labelled ‘Politics’) that are not completely accurate rather than no motions at all.

Thus, given two models with similar overall accuracy, we might prefer the one with the lower false negative rate. We can express this preference by calculating a score that takes the type of error into account.
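To make the idea concrete, here is a toy comparison with invented counts: two models, each with 90 correct predictions and 10 errors, differing only in the kind of error they make. Under the penalty weights the post adopts (false_neg = 5, false_pos = 1), their scores come apart sharply:

```python
# Penalty weights: a false negative costs 5x as much as a false positive
false_neg = 5
false_pos = 1

def weighted_score(correct, fp, fn):
    # +1 per correct prediction, minus the weighted penalties for errors
    return correct - false_pos * fp - false_neg * fn

# Model A: 90 correct, all 10 errors are false negatives
score_all_fn = weighted_score(90, fp=0, fn=10)  # 90 - 5*10 = 40
# Model B: 90 correct, all 10 errors are false positives
score_all_fp = weighted_score(90, fp=10, fn=0)  # 90 - 1*10 = 80
print(score_all_fn, score_all_fp)
```

Both models have identical unweighted accuracy, but the weighted score prefers the one whose errors are false positives.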

**In Code**

To do this, we alter the `mean_accuracy_per_motion()` function. Instead of calculating scores per motion the previous way

```python
# For each motion
for i in range(test_size):
    score = 0
    # For the per-category prediction for each motion
    for j in range(len(labels)):
        # Add 1 to the score if the prediction was accurate,
        # add 0 to the score if the prediction was inaccurate
        score += acc_by_category[j][i]
    # Normalise the score such that it's between 0 and 1 inclusive
    accuracy_for_one_motion = score / len(labels)
```

we divide the cases into correct and incorrect predictions. We then split correct predictions into true positives and true negatives, and incorrect predictions into false positives and false negatives.

For correct predictions, we add 1 to the score as before.

For incorrect predictions, we subtract `false_pos` from the score if the prediction is a false positive, and subtract `false_neg` from the score if the prediction is a false negative.

##### 3. Print the number of true and false positives and negatives

Once we have divided the correct and incorrect predictions into true and false positives and negatives, it is trivial to print the total number in each category.

The code for these two ideas is shown below:

```python
# Suppose we weigh each false negative with weight ``false_neg`` and
# each false positive with weight ``false_pos``.
false_neg = 5
false_pos = 1

## v2 Evaluation of the performance on the test set
def mean_score_per_motion_v2():
    """
    Prints the mean score per motion.
    """
    # Booleans, actual predictions.
    # Rows: categories. Values within rows: motions.
    acc_by_category, predicted = accuracy_by_category(return_predictions=True)
    # Initialise counts
    tp, tn, fp, fn = 0, 0, 0, 0
    score_per_motion = []
    # For each motion
    for i in range(test_size):
        score = 0
        # For the per-category prediction for each motion
        for j in range(len(labels)):
            # Was the prediction accurate?
            pred_accuracy = acc_by_category[j][i]
            # We need the prediction to see if it is a
            # true positive or negative
            pred = predicted[j][i]
            # If the prediction is accurate
            if pred_accuracy == 1:
                score += 1
                # True positive
                if pred == 1:
                    tp += 1
                # True negative
                elif pred == 0:
                    tn += 1
            # Else if the prediction is not accurate
            elif pred_accuracy == 0:
                # False positive
                if pred == 1:
                    score -= false_pos
                    fp += 1
                # False negative
                elif pred == 0:
                    score -= false_neg
                    fn += 1
        # Normalise the score
        score_for_one_motion = score / len(labels)
        # Put all the scores into one array
        score_per_motion.append(score_for_one_motion)
    print('Mean Score Per Motion: ', np.mean(score_per_motion))
    print('True Positives: ', tp, '\n',
          'True Negatives: ', tn, '\n',
          'False Positives: ', fp, '\n',
          'False Negatives: ', fn)

mean_score_per_motion_v2()
```