In this article I discuss some methods you could adopt to improve the accuracy of your text classifier. I’ve taken a generalized approach, so the recommendations here should apply to most text classification problems you are dealing with, be it Sentiment Analysis, Topic Classification or any other text-based classifier. This is by no means a comprehensive list, but it should provide a nice introduction to the subject of text classification algorithm optimisation.
Without further ado, here are the tips you could try to help improve the results of your text classification algorithm.
Eliminate Low Quality Features (Words)
Low quality features in your training data-set are more likely to contribute negatively to your classification results (particularly since they might not be classified correctly), so eliminating them can often lead to better classification results. In my experience this generally leads to a very healthy increase in over-all accuracy.
It is sometimes difficult to select a cut-off point for the most important features. It is generally recommended to build a research test-bench application that iteratively tries different cut-off points and selects the one with the best accuracy (against a test data-set). For example, for a Topic classifier I found that considering only the top 15,000 most frequent words (in my training set) led to the best average performance against my multiple test data-sets.
A nice by-product of eliminating low quality features is that your algorithm can be trained a lot faster, since the probability space is much smaller, which opens up the possibility of tweaking the algorithm more readily.
There is an excellent article by Jacob on Eliminating Low Information Features in NLTK, which can also be generalized to many classification algorithms.
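The cut-off search described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function names are made up for this example, documents are assumed to be already tokenized into lists of words, and `evaluate` is a hypothetical callback that you would write yourself to train the classifier on a restricted vocabulary and return its accuracy on a held-out test data-set.

```python
from collections import Counter


def top_k_vocabulary(documents, k):
    """Return the set of the k most frequent words across tokenized documents."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, _ in counts.most_common(k)}


def filter_features(tokens, vocabulary):
    """Drop every token that did not survive the cut-off."""
    return [t for t in tokens if t in vocabulary]


def best_cutoff(documents, cutoffs, evaluate):
    """Try each cut-off and return (cutoff, score) for the best score.

    `evaluate` is a hypothetical callback: it receives a vocabulary, trains
    the classifier using only those words and returns the accuracy measured
    on a test data-set. On ties the earliest cut-off in `cutoffs` wins.
    """
    results = [(k, evaluate(top_k_vocabulary(documents, k))) for k in cutoffs]
    return max(results, key=lambda result: result[1])
```

Passing something like `cutoffs=[5000, 10000, 15000, 20000]` would mirror the kind of sweep described above; the right range depends entirely on the size of your own training set.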
Recursively Grow your Stopword List
I usually have at least 5 different stopword lists per classification project, each of which grows as the algorithm is re-optimised and tweaked throughout the life-time of the project so that the classifier meets its target accuracy figure. Some of the stopword lists include:
- Frequently used English (or any language) words: this will include about 500 words that don’t really contribute to context, such as: “and, the, for, if, I” etc.
- Temporal words: this will include about 100 words such as: “Tuesday, tomorrow, January”, etc.
And many others. Obviously not all classification projects will use all stopword lists; you will need to match the right stopwords with the right classification problem. You could grow your stopword lists by iteratively analyzing the top features in your algorithm for words that shouldn’t be in there… and of course use logic!
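As a small sketch of how several stopword lists might be combined and then grown, the snippet below merges the lists, filters tokens, and prints out the most frequent surviving words for manual review. The two tiny lists and all function names are placeholders invented for this example; real lists would hold the hundreds of words mentioned above.

```python
from collections import Counter

# Tiny placeholder lists; real lists would be much longer.
COMMON_WORDS = {"and", "the", "for", "if", "i"}
TEMPORAL_WORDS = {"tuesday", "tomorrow", "january"}


def build_stopwords(*word_lists):
    """Merge the stopword lists chosen for this classification problem."""
    combined = set()
    for words in word_lists:
        combined.update(word.lower() for word in words)
    return combined


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


def stopword_candidates(documents, stopwords, top_n=20):
    """List the most frequent surviving words so a human can review them."""
    counts = Counter(
        token.lower()
        for doc in documents
        for token in remove_stopwords(doc, stopwords)
    )
    return [word for word, _ in counts.most_common(top_n)]
```

The review step stays manual on purpose: `stopword_candidates` only surfaces frequent words, and deciding which of them "shouldn’t be in there" is still a judgment call for the problem at hand.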
Look Beyond Unigram into Bigrams and Trigrams
In text classification, Unigrams are single words, Bigrams are pairs of adjacent words (the useful ones being words that frequently appear next to each other in text), and Trigrams are just the next extension of that concept.
I found that considering Bigrams in a classification algorithm often really boosts performance, since the increased long-tail specificity of the feature means that the classifier can more easily determine which class has a higher probability, leading to better classifications. In my experience Trigrams do not offer the same boost as Bigrams, but they are worth considering and could be essential for certain types of classifiers. You could also go beyond Trigrams if you feel that the classification problem requires it.
The important thing to remember here is to apply the same logic for eliminating low quality bigrams and trigrams as you would with unigrams.
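To make this concrete, here is a minimal sketch of extracting n-grams and applying a frequency cut-off to them, in the same spirit as the unigram filtering. The helper names and the `min_count` threshold are assumptions made for this example, and raw frequency is only one possible way to judge n-gram quality.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(documents, n, min_count=2):
    """Keep only n-grams seen at least min_count times in the corpus."""
    counts = Counter(gram for doc in documents for gram in ngrams(doc, n))
    return {gram for gram, count in counts.items() if count >= min_count}


def extract_features(tokens, kept_bigrams, kept_trigrams):
    """Unigrams plus whichever bigrams and trigrams survived the cut-off."""
    features = list(tokens)
    features += [" ".join(g) for g in ngrams(tokens, 2) if g in kept_bigrams]
    features += [" ".join(g) for g in ngrams(tokens, 3) if g in kept_trigrams]
    return features
```

For instance, if `("new", "york")` survives the cut-off, `extract_features(["new", "york", "city"], {("new", "york")}, set())` returns the three unigrams followed by the feature `"new york"`.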
Diversify your Training Corpus
This point cannot be stressed enough, particularly if you wish to create a production-ready text classifier that behaves in the real world as it does in the lab.
Diversity in the training corpus helps dilute word features that are specific to one particular corpus, allowing the classification algorithm to only select features that have a root contribution towards the text classification problem at hand. Obviously you need to select the corpus intelligently and not just add more data for the sake of adding more data; it is all about the context of the classification problem.
For example, if you were building a sentiment classification algorithm that will be used to classify social media sentiment, you need to make sure that your training set includes a variety of sources and not just training data from Twitter, ignoring communication from other social hubs like Facebook (a space in which you intend to run your algorithm). Training on Twitter alone will cause features specific to Twitter data to appear in your classification probability space, leading to poor results when applied to other input sources.
Tweak Precision and Recall
I have written an article that discusses precision and recall in the context of the Confusion Matrix. The idea here is to tweak the system so that when it fails, it does so in a manner that is more tolerable. This can be done by shifting False Positive (or precision) under-performance to False Negative (or recall) under-performance, and vice versa, according to what is best for your system.
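One common way to make this trade-off, sketched below for a two-class problem, is to move the decision threshold applied to the classifier's probability for the positive class. This assumes your classifier exposes such a probability; the function name and the convention of reporting 1.0 when a ratio is undefined are choices made for this example only.

```python
def precision_recall(labels, scores, threshold):
    """Compute (precision, recall) for a given decision threshold.

    labels: True where the item really belongs to the positive class.
    scores: the classifier's probability of the positive class per item.
    """
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    # Convention for this sketch: report 1.0 when the ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


labels = [True, True, False, False]
scores = [0.9, 0.6, 0.7, 0.2]
strict = precision_recall(labels, scores, 0.8)   # (1.0, 0.5)
lenient = precision_recall(labels, scores, 0.5)  # precision 2/3, recall 1.0
```

Raising the threshold removes the False Positive at the cost of a False Negative, and lowering it does the opposite, which is exactly the kind of shift described above.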
Eliminate Low Quality Predictions (Learn to Say “I Don’t Know”)
Sometimes the algorithm might not be sure about which class the input text belongs to. This could be because:
- The text does not contain features that the algorithm has been trained on, for example words that do not appear often enough in the training set.
- The input text contains words from a number of different classes, resulting in the probability being distributed evenly across those classes. For example, a sentiment classifier trying to classify the input “I am happy and angry”.
In these scenarios the classifier usually returns the item with the highest probability even though it is a very low quality guess.
If your model can tolerate a reduction in the coverage of what it can classify, you could greatly improve the accuracy of what is being classified by returning “Class Unknown” when the classifier is too uncertain (the highest probability is lower than a threshold). The threshold can be chosen by analyzing candidate probability filter thresholds against accuracy and coverage.
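A minimal sketch of this probability filter is shown below. It assumes the classifier returns a dictionary mapping each class to its probability; the function names are invented for this example.

```python
UNKNOWN = "Class Unknown"


def classify_with_threshold(probabilities, threshold):
    """Return the most probable class, or UNKNOWN when it is too uncertain."""
    label, probability = max(probabilities.items(), key=lambda item: item[1])
    return label if probability >= threshold else UNKNOWN


def accuracy_and_coverage(predictions, gold_labels):
    """Accuracy over the answered items, and the fraction that was answered."""
    answered = [(p, g) for p, g in zip(predictions, gold_labels) if p != UNKNOWN]
    coverage = len(answered) / len(gold_labels) if gold_labels else 0.0
    correct = sum(p == g for p, g in answered)
    accuracy = correct / len(answered) if answered else 0.0
    return accuracy, coverage
```

Running `accuracy_and_coverage` over a test data-set for a range of thresholds gives you the accuracy versus coverage curve, from which you can pick the point your system can tolerate.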
Canonicalize Words through Lemma Reduction
The same word can have different forms depending on its grammatical usage (verb, adjective, noun, etc.). The idea of canonicalization is to reduce words to their base form (lemma), assuming that the grammatical placement of words is an irrelevant feature to your classifier. For example the words:
This approach to reducing the word space can sometimes be extremely powerful when used in the right context. It not only reduces the probability space of the algorithm generated by the training set (giving the same word 1 score is better and more accurate than 10 different scores), but also reduces the chances of encountering new words that the algorithm has not been trained on (when the algorithm is deployed), since all text will be reduced to its canonical form. This leads to improved practical accuracy of the algorithm.
This could also extend to normalizing exaggerations in speech, which is a very common problem when classifying social data. This includes reducing words like “haaaaapppppyyyyy” to “happy”, or even better, reducing all exaggerated lengths of the word “happy” to a canonical format different from the non-exaggerated form, for example reducing both “haaaaapppppyyyyy” and “haaaaaaaaaaaaaapppppyyy” to “haapppyy”. This differentiates it from the non-exaggerated form when scoring it for classification, but still reduces the over-all probability space by normalizing the word. A good example of where this might be applicable is when classifying conversational intensity.
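The exaggeration normalization can be approximated with a single regular expression. The sketch below uses a simpler rule than the example above: any letter repeated three or more times is squeezed down to exactly two, so the canonical exaggerated form comes out as “haappyy” rather than “haapppyy”. The function name and the choice of two repetitions are assumptions for this example.

```python
import re

# A letter (no digits or underscores) followed by 2+ copies of itself.
_ELONGATED = re.compile(r"([^\W\d_])\1{2,}")


def squeeze_elongation(word, keep=2):
    """Collapse runs of 3+ identical letters down to `keep` copies."""
    return _ELONGATED.sub(lambda match: match.group(1) * keep, word)


squeeze_elongation("haaaaapppppyyyyy")         # "haappyy"
squeeze_elongation("haaaaaaaaaaaaaapppppyyy")  # "haappyy"
squeeze_elongation("happy")                    # "happy" (unchanged)
```

Both exaggerated spellings collapse to one feature that stays distinct from the plain word. Mapping them all the way back to “happy” would need a dictionary lookup, which this sketch does not attempt.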
Eliminate Numerals, Punctuation and Corpus Specific Text
If any characters in the training and testing corpus, and in the input text, contribute nothing towards the classification then they should be taken out, as all they will do is clutter your probability space with features that look like this:
Such features dilute the real probability that should be associated with the word “Running”, while occupying space in your high quality features that could be put to better use.
Sometimes you might need to consider the nature of the training data-set itself, and whether there are any peculiarities that you need to take out. For example, if you are dealing with a data-set from Twitter, you might want to eliminate (using a RegEx perhaps) any usernames (of the format @thinknook) because they do not contribute towards the classification problem you have at hand.
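As a rough sketch of this kind of clean-up, the snippet below strips Twitter-style usernames, numerals and punctuation before tokenizing. The patterns are deliberately crude assumptions for this example: they only keep ASCII letters, so they would also split contractions such as "don't" and drop accented characters, which may or may not suit your corpus.

```python
import re

USERNAME = re.compile(r"@\w+")             # Twitter-style handles
NON_LETTERS = re.compile(r"[^A-Za-z\s]+")  # numerals, punctuation, symbols


def clean_text(text):
    """Remove usernames, numerals and punctuation, then split into tokens."""
    text = USERNAME.sub(" ", text)
    text = NON_LETTERS.sub(" ", text)
    return text.split()


raw = "@thinknook Running, running... Running! 10 times"
naive_tokens = raw.split()      # keeps "Running," and "Running!" as separate features
clean_tokens = clean_text(raw)  # ["Running", "running", "Running", "times"]
```

With naive whitespace splitting, each punctuated variant of “Running” becomes its own feature; after cleaning they all count towards the same word.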
I toyed with the idea of breaking text into its grammatical structure and removing a particular grammatical class (say Interjection) completely, but over-all the results weren’t very successful for my particular situation, and a comprehensive stopwords list that included all encountered Interjection words worked better.
Try a Different Classification Algorithm
Classification algorithms come in many different forms: some are intended as a speedier way to execute the same algorithm, while others might offer more consistent performance or higher over-all accuracy for the specific problem you have at hand. For example, if you are currently running your classifier on a Naive Bayes algorithm, then it might be worth considering a Maximum Entropy one.
Many Natural Language Processing tools come with many flavors of classification algorithms. In a separate article I go through the NLTK classification algorithms and present a 64-bit Linux compiled library of my favorite Maximum Entropy algorithm, MEGAM.
To Lower or Not To Lower (your Text)
This again relates to the problem of canonicalizing the features (or words) in the probability space (and input data). Whether lowercasing all text makes sense (and yields better accuracy for your classifier) depends on what exactly you are trying to classify. For example, if you are classifying Intensity or Mood, then capitalization might be an important feature that contributes positively to the accuracy of the predictions, but if you are trying to classify text into topics or categories, then lowercasing all text (training, test and input data) might have a very healthy impact on over-all accuracy.
You could get a bit more clever than the brute force approach, and selectively keep certain words capitalized because capitalization helps differentiate them from their lowercase counterparts.
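A minimal sketch of the selective approach: lowercase everything except an explicit set of tokens whose capitalized form carries meaning. The `keep_case` set is a hypothetical, hand-maintained list for this example; how you choose its contents depends on your classification problem.

```python
def selective_lower(tokens, keep_case):
    """Lowercase every token except those listed in keep_case."""
    return [t if t in keep_case else t.lower() for t in tokens]


# Hypothetical example: "US" (the country) should stay distinct from "us".
tokens = selective_lower(["The", "US", "told", "us"], {"US"})
# tokens == ["the", "US", "told", "us"]
```

The same function must be applied to the training, test and input data alike, otherwise the features will not line up.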
Targeted Manual Injection and Curation of Corpus Data
A low quality corpus is the Achilles’ heel of a classification algorithm. You could tweak all you want, implement awesome feature extraction techniques and follow all the recommendations above, and still get nowhere if you do not have a comprehensive, good quality training corpus.
It is highly recommended that you dedicate time towards a level of manual curation of your training set, particularly if the training set involves human entry, or people trying to game the system for their own benefit. For example, if you are using a blog directory as a training set for Topic classification, users entering their blog details might try to trick the topic cataloguing system and get more traffic by ticking as many topics as possible, leading to a poor training corpus and a poor classification algorithm. There is a cool article by Alistair that takes a generalized look at data mining and prediction using public (minimally administered) data, and the issues surrounding that.
The point about corpus diversity does help to a certain extent in diluting the impact of this issue, but as far as I can tell at some point you will hit a brick wall where the only way to improve accuracy is to manually curate the training data. This could be at 10% accuracy or at 90%, depending on your situation. I also found that sometimes using a good quality subset of an over-all bad quality corpus leads to better results than using the whole corpus, which seems to suggest that quality is more important than quantity.
Sometimes it is also necessary to plug targeted content into your training set, intended to remove ambiguity between two classes in the classification algorithm, lower a high probability factor for a feature against a particular class, or introduce a new content area that is not explored by the initial training corpus.
If you’ve made it this far then I salute you sir, and wish you the best with your classification endeavors!