What colour is this rainbow?
Yesterday we wrangled debating motions data using Google Sheets. Today we’ll discuss building a machine learning model to classify these debating topics (e.g. This House Would Break Up the Eurozone) into categories (e.g. ‘Economics’ and ‘International Relations’).
Why is this problem interesting?
- It is primarily a text classification problem.
- It is a multi-label classification problem. That is, each topic can belong to multiple categories. For example, it can concern International Relations, Economics and Feminism at the same time.
- Feature engineering: the dataset contains many features that may be tangentially relevant to the motion category but are more likely to cause overfitting or confuse the model, such as tournament date, tournament name and tournament location.
- It would be interesting if we could find correlations between the people who set the motions (the chief adjudicators, or CAs) and motion categories, but the number of entries per individual chief adjudicator is too few to draw conclusions.
- There are few data entries: currently c. 180 labelled motions and c. 890 unlabelled motions, for a total of c. 1070. It will be interesting to see how good the model can be with so little training data. I will likely have to label more motions semi-manually to improve the model. There are also over 3000 additional unlabelled motions I could add (these still require formatting).
In this post I will discuss my approach to multi-label classification and one possible error in evaluating performance. We may discuss text classification methods and other interesting aspects of this problem in future posts. We are thus left with this simplified problem:
Simplified Problem description:
Input: A debating motion (string of text).
Output: A list of categories that the debating motion belongs to.
- The list of possible categories is fixed and has length 26.
- In the training and test sets, each motion belongs to at least one and at most three categories.
Approach to Multi-Label Classification
I chose to transform this problem into 26 single-label binary classification problems. That is, for each motion, we need to decide whether or not it belongs to the category ‘Art and Culture’, whether or not it belongs to the category ‘Business’ and so on for the remaining 24 categories. (I have not yet investigated direct methods for tackling multi-label classification – do tell if you have suggestions.)
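As a sketch, this binary-relevance transformation is straightforward to express in scikit-learn: `MultiLabelBinarizer` turns each motion's list of categories into a row of 26 yes/no indicators, and `OneVsRestClassifier` fits one binary classifier per indicator column. The motions and categories below are toy examples, not the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: each motion maps to one or more category labels.
motions = [
    "This House Would Break Up the Eurozone",
    "This House Would Ban Private Schools",
    "This House Believes That Art Galleries Should Be Free",
]
labels = [
    ["Economics", "International Relations"],
    ["Education"],
    ["Art and Culture"],
]

# Encode the label lists as a binary indicator matrix:
# one column per category, 1 if the motion carries that label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# OneVsRestClassifier fits one binary classifier per column,
# i.e. one independent yes/no model per category.
model = make_pipeline(
    CountVectorizer(),
    OneVsRestClassifier(MultinomialNB()),
)
model.fit(motions, Y)

pred = model.predict(["This House Would Leave the Eurozone"])
print(mlb.inverse_transform(pred))
```

With the real data there would be 26 columns instead of 4, but the shape of the solution is the same.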
This opens the interesting but shifty-seeming possibility of using different classifiers for different categories. It is plausible that different categories can be better modelled in different ways because of language structure. (Might do a post on this some time.)
A possible error in evaluating performance
When I asked my Multinomial Naive Bayes model to predict motion categories for the test set (size 60), I got the following results:
For 24 of the 26 categories, the accuracy was greater than 85%, the exceptions being 'Economics' and 'International Relations'. I was surprised, given that the model was trained on only 120 examples and text classification is difficult.
But when I tried to predict the categories of a new motion, it became clear that my model seemed accurate not because it was good, but because most categories appeared infrequently. The model was almost always predicting that a motion did not belong to any category.
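This failure mode is easy to reproduce in isolation. The sketch below (with illustrative numbers, not the real test set) uses scikit-learn's `DummyClassifier` to show that a model which never predicts a rare category still posts a high per-category accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Illustrative: a rare category that appears in 3 of 60 test motions.
y_true = np.zeros(60, dtype=int)
y_true[:3] = 1

# A baseline that always predicts the majority class
# ("not in this category") never fires at all...
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(np.zeros((60, 1)), y_true)
y_pred = dummy.predict(np.zeros((60, 1)))

# ...yet still scores 57/60 = 95% accuracy.
print(accuracy_score(y_true, y_pred))  # 0.95
```

So a per-category accuracy above 85% tells us very little when most categories are this sparse.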
Testing the hypothesis using a constructed example
I further confirmed this by asking the model to predict the categories of a custom-constructed motion containing keywords that, from a human's perspective, should obviously place it in certain motion categories. The motion used was 'schools teachers students politicians elections government China', and I expected it to be categorised as 'Education' and 'Politics', but it was not put in any category. This testing method should work because we are using the Bag of Words representation, which considers only word frequencies and not word ordering or placement.
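To see why such a keyword string is a fair probe, consider how the Bag of Words representation treats it. A minimal sketch with scikit-learn's `CountVectorizer` (toy input, not the real pipeline): any reordering of the same words produces an identical count vector, so only the keywords themselves can influence the prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The same words in two different orders.
probe = "schools teachers students politicians elections government China"
shuffled = "China government elections politicians students teachers schools"

vec = CountVectorizer()
X = vec.fit_transform([probe, shuffled])

# Bag of Words discards ordering: both strings map to the
# exact same count vector, so the rows are identical.
print((X[0] != X[1]).nnz == 0)  # True
```

If the model cannot classify this string, its per-category classifiers are genuinely ignoring the keywords rather than being confused by unnatural word order.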
The accuracy for 'Economics' and 'International Relations' had been lower because many motions carried those labels, whereas only one motion was labelled 'Funny' across both the training and the test set. My guess is that that one motion was in the training set, so a model that predicted no motion had that label would score an accuracy of 1.0 on the test set. Perfectly misleading!
To provide a more accurate measure of performance, I will develop a metric that measures performance across individual motions as opposed to across individual categories. This post is long enough as it is, so stay tuned for the full solution.
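One standard option in this family, whatever the eventual metric turns out to be, is the sample-averaged Jaccard score: each motion is scored by the overlap between its predicted and true label sets, and the scores are averaged over motions rather than over categories. A sketch with illustrative indicator matrices (not the real data):

```python
import numpy as np
from sklearn.metrics import jaccard_score

# One row per motion, one column per category (toy numbers).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 0]])

# average="samples": per-motion |intersection| / |union|,
# then the mean over motions. Motion 1 scores 1/2, motion 2
# scores 0, so a label-shy model is punished rather than rewarded.
print(jaccard_score(y_true, y_pred, average="samples"))  # 0.25
```

Under this metric, the always-predict-nothing model from above would score 0 instead of ~95%.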
In other news, I made my first merged (accepted) pull request to scikit-learn! So psyched! 😀
Scikit-learn Text classification tutorial