To get something going with text (or any) classification algorithm is easy enough, all you need is an algorithm, such as Maximum Entropy or Naive Bayes, an implementation of each is available in many different flavors across various programming languages (I use NLTK on Python for text classification), and a bunch of already classified corpus data to train your algorithm on and that is it, you got yourself a basic classifier.
But the story rarely ends here, and to get any decent production-level performance or accuracy out of your classification algorithm, you’ll need to iteratively test your algorithm for optimum configuration, understand how different classes interact with each other, and diagnose any abnormality or irregularity you’re algorithm is experiencing.
In this post I hope to cover some basic mathematical tools for diagnosing and testing a classification algorithm, I will be taking a real life algorithm that I have worked as an example, and explore the various techniques we used to better understand how well it is performing, and when it is not performing, what is the underlying characteristic of this failure.
Before (or during) embarking on any classification that is actually intended to reach production (and isn’t just done for research purposes), it is critical to set yourself a realistic target, but in order to do so you’ll have to understand what metrics you can use to establish those targets. As a running example I will be referring to a Topic Classification algorithm that I have worked on, the numbers below are real performance figures for what the algorithm has achieved at some point during the optimisation process.
In this example we tested our classifier against 110 manually classified entries (test set), the classifier had 10 classes, these were Arts, Business, Computers, Games, Health, Home, Recreation, Science, Society and Sports.
Classification Confusion Matrix
The Confusion Matrix (or Matching Matrix) is a neat way to visualize the performance of a classification algorithm, particularly useful if your expanding your binary classifier into a mutli-class one, giving you access to data that can help you better understand how your algorithm is under-performing, which helps you make targeted adjustment to the algorithm to better it, as well as provide a holistic overview of performance.
For this project I used a “pimped-up” version of the Confusion Matrix, appending statistics such as over-all prediction accuracy, classification recall (False Negative) and precision (False Positive), which I used as the basis for diagnosing and testing my classification algorithm performance, this matrix can be seen below.
Ok so lets go through and explain each bit:
- Row labels: The actual classification of our test data.
- Column labels: The classification that has been predicted by our system.
- Diagonally across the matrix (Green coloured entries): correct hits by the classification algorithm
- Red coloured entries in the matrix: incorrect hits by the classification algorithm.
- Yellow cell (end of matrix): our over-all accuracy figure.
- Vertical calculation at the end of the matrix (Grey): Our recall (FN) value, which is a row based calculation on the matrix (shown by the blue dotted line).
- Horizontal calculation at the end of the matrix (Grey): Our precision (FP) value, which is a column based calculation on the matrix (shown by the orange dotted line).
Prediction Class Precision (False Positive)
Precision as a metric is essentially answering the question:
Given the number of times the classification algorithm predicted a particular class (class total predicted), what is the fraction of correct hits.
And is calculated using the following equation:
Precision (Class) = [Class Correct]/[Class Total Predicted]
So if we take the Games class in the matrix above, we have 11 correct hits (where Game horizontally and vertically intersect in the Matrix) of 13 total hits for the Game class (sum of all data in column), hence our precision value for the Games class is 0.846
The higher the precision (closer to 1.0) the better the algorithm will be in producing less false positives for this class.
Prediction Class Recall (False Negative)
Given the number of items in our test set that belongs to a particular class, what is the fraction of correct hits compared to the total count of items in that class.
Recall (Class) = [Class Correct]/[Class Total Actual]
Text Classification Accuracy
Accuracy is the most common “over-all” metric you’ll encounter when doing classification, and it is essentially an indication of how close your algorithm was to the actual classification values, there are two ways to calculate accuracy depending on how you will use your algorithm:
A classification result is either right or wrong, in which case your accuracy is:
Accuracy (Total) = [All Correct] / Total
So if we take the matrix above, our test set size is 110 and we have 62 correct hits, this means our accuracy is 0.56.
Some classifiers have a built-in accuracy function that takes into consideration how far-off it was from the target (using the underlying probability space), rather than just report accuracy on top value pass/failure. This approach should be considered if you are considering multiple results from your classifier (with different probability distributions), rather than just the single top value (the class with the highest probability distribution). Many built-in classifiers report the 2nd value, and so tend to be a bit more optimistic in their accuracy figures.
One interesting relationship to observe between recall and precision is that they usually can be traded off (or tweaked ) against each other, in order to either improve recall or precision. This is interesting because it allows you to customize the algorithm according to what is a more acceptable failure criteria for the classification problem at hand. Girija has an awesome post that discusses how to tweak precision and recall