10 Tips to Improve your Text Classification Algorithm Accuracy and Performance
In this article I discuss some methods you could adopt to improve the accuracy of your text classifier. I’ve taken a generalized approach, so the recommendations here should apply to most text classification problems you are dealing with, be it Sentiment Analysis, Topic Classification or any other text-based classifier. This is by no means a comprehensive list, but it should provide a nice introduction to the subject of text classification algorithm optimisation.
Without further ado, here are 10 tips you could try to help improve the results of your text classification algorithm.
Eliminate Low Quality Features (Words)
Low quality features in your training data-set are likely to contribute negatively to your classification results (particularly since they might not be classified correctly), and eliminating these low quality features can often lead to better classification results. In my experience this generally leads to a very healthy increase in over-all accuracy.
It is sometimes difficult to select a cut-off point for the most important features, so it is generally recommended to build a benchmarking application that recursively tries different cut-off points and selects the one with the best accuracy (against a test data-set). For example, for a Topic classifier I found that considering only the top 15,000 most frequent words (in my training set) leads to the best average performance against my multiple test data-sets.
A nice by-product of eliminating low quality features is that your algorithm can be trained a lot faster, since the probability space is much smaller, which opens up the possibility of tweaking the algorithm more readily.
There is an excellent article by Jacob on Eliminating Low Information Features in NLTK, which can also be generalized to many classification algorithms.
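As a rough sketch of that benchmarking approach (assuming NLTK and a list of (tokens, label) training pairs; the candidate cut-off values are just placeholders to tune):

```python
import nltk
from nltk import FreqDist

def best_cutoff(labelled_tokens, cutoffs=(5000, 10000, 15000, 20000)):
    # Count word frequencies across the whole training corpus
    all_words = FreqDist(w.lower() for tokens, _ in labelled_tokens for w in tokens)
    split = int(len(labelled_tokens) * 0.8)
    results = {}
    for cutoff in cutoffs:
        # Keep only the top-N most frequent words as features
        top_words = set(w for w, _ in all_words.most_common(cutoff))
        featurised = [({w: True for w in tokens if w.lower() in top_words}, label)
                      for tokens, label in labelled_tokens]
        train_set, test_set = featurised[:split], featurised[split:]
        classifier = nltk.NaiveBayesClassifier.train(train_set)
        results[cutoff] = nltk.classify.accuracy(classifier, test_set)
    # Return the cut-off with the best test accuracy, plus all the scores
    return max(results, key=results.get), results
```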
Recursively Grow your Stopword List
I usually have at least 5 different stopword lists per classification project, each of which grows as the algorithm is re-optimised and tweaked throughout the life-time of the project, in order for the classifier to meet the target accuracy figure. Some of the stopword lists include:
- Frequently used English (or any language) words: this will include about 500 words that don’t really contribute to context, such as: “and, the, for, if, I”, etc.
- Countries
- Cities
- Names
- Adjectives
- Temporal words: this will include about 100 words such as: “Tuesday, tomorrow, January”, etc.
And many others. Obviously not every classification project will use every stopword list; you will need to match the right stopwords with the right classification problem. You could grow your stopword lists by iteratively analyzing the top features in your algorithm for words that shouldn’t be in there… and of course use logic!
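As a minimal sketch of combining several such lists before feature extraction (the project-specific lists below are invented examples, and the NLTK stopwords corpus needs to be downloaded first):

```python
from nltk.corpus import stopwords

# Frequently used English words from NLTK, plus hypothetical project-specific lists
english_stopwords = set(stopwords.words('english'))
temporal_stopwords = {'monday', 'tuesday', 'tomorrow', 'yesterday', 'january'}
name_stopwords = {'john', 'sarah'}

all_stopwords = english_stopwords | temporal_stopwords | name_stopwords

def remove_stopwords(tokens):
    # Drop any token that appears in one of the stopword lists
    return [t for t in tokens if t.lower() not in all_stopwords]
```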
Look Beyond Unigram into Bigrams and Trigrams
In text classification, Unigrams are single words, Bigrams are pairs of related words (words that frequently appear next to each other in text), and Trigrams are the next extension of that concept.
I found that considering Bigrams in a classification algorithm often really boosts performance, since the increased long-tail specificity of the feature means that the classifier can more easily determine which class has a higher probability, leading to better classifications. In my experience Trigrams do not offer the same boost as Bigrams, but they are worth considering and could be essential for certain types of classifiers. You could also go beyond Trigrams if you felt that the classification problem requires it.
The important thing to remember here is to apply the same logic for eliminating low quality bigrams and trigrams as you would with unigrams.
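A minimal sketch of adding the strongest bigrams to a bag-of-words feature set using NLTK’s collocation tools (the chi-squared measure and the top_n value are just one reasonable choice):

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_words_and_bigrams(tokens, top_n=200):
    # Score candidate bigrams by chi-squared association and keep the strongest ones
    finder = BigramCollocationFinder.from_words(tokens)
    best_bigrams = finder.nbest(BigramAssocMeasures.chi_sq, top_n)
    # Unigrams and the selected bigrams go into one combined feature dictionary
    return {feature: True for feature in tokens + best_bigrams}
```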
Diversify your Training Corpus
This point cannot be stressed enough, particularly if you wish to create a production-ready text classifier that behaves as expected in the lab as it would in the real world.
Diversity in the training corpus helps dilute word features that are specific to one particular corpus, allowing the classification algorithm to select only features that genuinely contribute towards the text classification problem at hand. Obviously you need to select the corpus intelligently and not just add more data for the sake of adding more data; it is all about the context of the classification problem.
For example, if you were building a sentiment classification algorithm that will be used to classify social media sentiment, you need to make sure that your training set includes a variety of sources, and not just training data from Twitter while ignoring communication from other social hubs like Facebook (a space in which you intend to run your algorithm). Otherwise, features specific to Twitter data will appear in your classification probability space, leading to poor results when applied to other input sources.
Tweak Precision and Recall
I have written an article that discusses precision and recall in the context of the Confusion Matrix. The idea here is to tweak the system so that when it fails, it does so in a manner that is more tolerable. This can be done by shifting False Positive (or Precision) under-performance to False Negative (or Recall) under-performance, and vice versa, according to what is best for your system.
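As a hedged sketch of picking such an operating point, assuming a binary classifier that exposes predicted probabilities (scikit-learn shown here; the 0.9 precision target is an arbitrary example):

```python
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_scores, min_precision=0.9):
    # Sweep every decision threshold and return the lowest one that still
    # achieves the requested precision, trading recall away in exchange
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
    for precision, recall, threshold in zip(precisions, recalls, thresholds):
        if precision >= min_precision:
            return threshold, precision, recall
    return None  # the precision target is not reachable on this data
```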
Eliminate Low Quality Predictions (Learn to Say “I Don’t Know”)
Sometimes the algorithm might not be sure which class the input text belongs to. This could be because:
- The text does not contain features that the algorithm has been trained on, for example words that do not appear often enough in the training set.
- The input text contains words from a number of different classes, resulting in the probability being distributed evenly across those classes, for example a sentiment classifier trying to classify the input “I am happy and angry”.
In these scenarios the classifier usually returns the item with the highest probability even though it is a very low quality guess.
If your model can tolerate a reduction in the coverage of what it can classify, you could greatly improve the accuracy of what is being classified by returning “Class Unknown” when the classifier is too uncertain (the highest probability is lower than a threshold). The threshold itself can be chosen by analyzing the probability filter threshold against accuracy and coverage.
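A minimal sketch of that filter using NLTK’s probabilistic classification interface (the 0.7 threshold is an assumption you would tune against accuracy and coverage):

```python
def classify_with_unknown(classifier, features, threshold=0.7):
    # prob_classify returns a probability distribution over all classes
    dist = classifier.prob_classify(features)
    best_label = dist.max()
    # Refuse to guess when even the best class falls below the confidence threshold
    if dist.prob(best_label) < threshold:
        return 'Class Unknown'
    return best_label
```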
Canonicalize Words through Lemma Reduction
The same word can take different forms depending on its grammatical usage (verb, adjective, noun, etc.). The idea of canonicalization is to reduce words to their base form (lemma), assuming that the grammatical placement of words is an irrelevant feature for your classifier. For example, the words:
- Running
- Runner
- Runs
- Ran
- Runners
can all be reduced to the canonical form “run”. This approach of reducing the word space can sometimes be extremely powerful when used in the right context, since it not only reduces the probability space of the algorithm generated by the training set (giving the same word one score is better and more accurate than 10 different scores), but also helps in reducing the chances of encountering new words that the algorithm has not been trained on (when the algorithm is deployed), since all text will be reduced to its lowest canonical form, leading to improved practical accuracy of the algorithm.
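A minimal sketch using NLTK’s WordNet lemmatizer (the double pass over verb and noun forms is just one simple way to handle both; in practice you may prefer proper part-of-speech tagging, or a cruder stemmer such as Porter):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    # Lemmatize as a verb first, then as a noun, so "ran" -> "run" and "runners" -> "runner"
    return [lemmatizer.lemmatize(lemmatizer.lemmatize(t.lower(), pos='v'), pos='n')
            for t in tokens]

print(lemmatize_tokens(['Running', 'Runs', 'Ran', 'Runners']))
# ['run', 'run', 'run', 'runner']
```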
This could also extend to normalizing exaggerations in speech, which is a very common problem when classifying social data. This includes reducing words like “haaaaapppppyyyyy” to “happy”, or even better, reducing all exaggerated lengths of the word “happy” to a canonical format different from the non-exaggerated form, for example reducing both “haaaaapppppyyyyy” and “haaaaaaaaaaaaaapppppyyy” to “haapppyy”. This differentiates the exaggerated form from the non-exaggerated one when scoring it for classification, while still reducing the over-all probability space by normalizing the word. A good example of where this might be applicable is when classifying conversational intensity.
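A minimal regex sketch of that normalization (collapsing runs of three or more repeated letters down to two is an assumption; tune the rule to your data):

```python
import re

# Any single character repeated three or more times in a row
REPEAT_RE = re.compile(r'(.)\1{2,}')

def normalize_exaggeration(word):
    # Collapse 3+ repeats down to exactly 2, so "haaaaapppppyyyyy" and
    # "haaaaaaaaaaaaaapppppyyy" both become "haappyy", while the
    # non-exaggerated "happy" is left untouched
    return REPEAT_RE.sub(r'\1\1', word)

print(normalize_exaggeration('haaaaapppppyyyyy'))  # haappyy
print(normalize_exaggeration('happy'))             # happy
```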
Eliminate Numerals, Punctuation and Corpus Specific Text
If any characters in the training and testing corpus, or the input text, contribute nothing towards the classification, then they should be taken out, as all they will do is clutter your probability space with features that look like this:
- Running,
- Running.
- Running!!!
- Running!?!?
- #Running
- “Running”
These dilute the actual probability that should be associated with the word “Running”, while occupying space among your high quality features that could be put to better use.
Sometimes you might need to consider the nature of the training data-set itself, and whether there are any peculiarities that you need to take out. For example, if you are dealing with a data-set from Twitter, you might want to eliminate (using a RegEx perhaps) any usernames (of the format @thinknook), because they do not contribute towards the classification problem you have at hand.
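A minimal cleaning sketch covering numerals, punctuation and Twitter-style usernames (which of these rules applies is an assumption that depends on your corpus):

```python
import re

USERNAME_RE = re.compile(r'@\w+')           # Twitter-style usernames such as @thinknook
NON_LETTER_RE = re.compile(r'[^a-zA-Z\s]')  # numerals, punctuation, hashes, quotes

def clean_text(text):
    text = USERNAME_RE.sub(' ', text)    # drop corpus-specific tokens first
    text = NON_LETTER_RE.sub(' ', text)  # then strip numerals and punctuation
    return ' '.join(text.split())        # collapse the leftover whitespace

print(clean_text('#Running "Running!!!" with @thinknook in 2012'))
# Running Running with in
```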
I toyed with the idea of breaking text into its grammatical structure and removing a particular grammatical class (say Interjection) completely, but over-all the results weren’t very successful for my particular situation, and a comprehensive stopwords list that included all encountered Interjection words worked better.
Try a Different Classification Algorithm
Classification algorithms come in many different flavors: some are intended as a speedier way to execute the same algorithm, while others might offer more consistent performance or higher over-all accuracy for the specific problem you have at hand. For example, if you are currently running your classifier on a Naive Bayes algorithm, then it might be worth considering a Maximum Entropy one.
Many Natural Language Processing tools come with several flavors of classification algorithms; in this article I go through the NLTK classification algorithms, as well as present a 64-bit Linux compiled library of my favorite Maximum Entropy algorithm, MEGAM.
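A minimal sketch of swapping classifiers behind the same NLTK feature sets (the 'iis' algorithm and max_iter value are placeholders; using MEGAM instead requires the external binary to be installed and configured):

```python
import nltk
from nltk.classify import NaiveBayesClassifier, MaxentClassifier

def compare_classifiers(train_set, test_set):
    results = {}
    # Naive Bayes: fast to train, a good baseline
    nb = NaiveBayesClassifier.train(train_set)
    results['naive_bayes'] = nltk.classify.accuracy(nb, test_set)
    # Maximum Entropy: often more accurate, slower to train
    me = MaxentClassifier.train(train_set, algorithm='iis', max_iter=10)
    results['maxent'] = nltk.classify.accuracy(me, test_set)
    return results
```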
To Lower or Not To Lower (your Text)
This again relates to the problem of canonicalizing the features (or words) in the probability space (and input data). Whether lowering all text makes sense (and yields better accuracy for your classifier) depends on what exactly you are trying to classify. For example, if you are classifying Intensity or Mood, then capitalization might be an important feature that contributes positively to the accuracy of the predictions, but if you are trying to classify text into topics or categories, then lowering all text (training, test and input data) might have a very healthy impact on over-all accuracy.
You could get a bit more clever than the brute force approach, and selectively choose to keep certain words capitalized due to their positive contribution in differentiating them from their lowercase counterparts.
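One possible sketch of that selective approach (keeping fully-capitalized tokens as their own features, e.g. shouting or acronyms, is just an assumption):

```python
def selective_lower(tokens):
    # Lower-case everything except tokens written entirely in capitals,
    # so acronyms and shouted words keep a distinct feature
    return [t if t.isupper() and len(t) > 1 else t.lower() for t in tokens]

print(selective_lower(['This', 'is', 'AMAZING', 'news', 'from', 'the', 'US']))
# ['this', 'is', 'AMAZING', 'news', 'from', 'the', 'US']
```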
Targeted Manual Injection and Curation of Corpus Data
A low quality corpus is the Achilles’ heel of a classification algorithm: you could tweak all you want, implement awesome feature extraction techniques and follow all the recommendations above, and still get nowhere if you do not have a comprehensive, good quality training corpus.
It is highly recommended that you dedicate time towards a level of manual curation of your training set, particularly if the training set involves human entry, or people trying to game the system for their own benefit. For example, if you are using a blog directory as a training set for Topic classification, users entering their blog details might try to trick the topic cataloguing system and get more traffic by ticking as many topics as possible, leading to a poor training corpus and a poor classification algorithm. There is a cool article by Alistair that takes a generalized look at data mining and prediction using public (minimally administered) data, and the issues surrounding that.
The point about corpus diversity does help to a certain extent in diluting the impact of this issue, but as far as I can tell, at some point you will hit a brick wall where the only way to improve accuracy is to manually curate the training data; this could be at 10% accuracy or at 90%, depending on your situation. I also found that sometimes using a good quality subset of an over-all bad quality corpus leads to better results than using the whole corpus, which seems to suggest that quality is more important than quantity.
Sometimes it is also necessary to plug targeted content into your training set, intended to remove ambiguity between two classes in the classification algorithm, lower a high probability factor for a feature against a particular class, or introduce a new content area that is not explored by the initial training corpus.
If you’ve made it this far then I salute you sir, and wish you the best with your classification endeavors!
Assuming you have access to LogisticRegression with L1 and L2 regularization (e.g. sci-kit):
Use L1/L2 regularization to automatically eliminate redundant features rather than the heuristic approach above. Use L1 regularization to get a list of relevant features and then train a different classifier with this subset of features. Let the classifier do the work and tell you which features were found important. The features required really depend on the classification task.
You can also iteratively work on feature engineering by generating more of the relevant features for your *task*. To see which features are relevant, inspect the weights assigned to the features (and feature sets) by the regularized classifier. I typically bucket my features by a name-space to generate feature sets, for example: ‘verbs:xxx’, ‘bigrams:…’, etc. I can then roll up the regularization scores by feature set to determine which types of features are working well, and simply work on generating more of those types of features.
I spend less time on eliminating features and more on generating different types of features that help a given classification task.
I also use the regularization weights to find bugs in my code. If, for a feature set, the weights are lower than expected, it is possible I have a bug in the code and am not generating the features properly.
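A minimal scikit-learn sketch of the L1-based selection described in this comment (the vectorizer settings, the follow-up Naive Bayes classifier and the top_k value are all assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

def l1_selected_features(texts, labels, top_k=5000):
    # Turn raw text into a sparse bag-of-words matrix
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    # L1-penalised logistic regression pushes useless feature weights towards zero
    selector = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
    selector.fit(X, labels)
    # Keep the features with the largest absolute weights across classes
    weights = np.abs(selector.coef_).max(axis=0)
    keep = np.argsort(weights)[::-1][:top_k]
    feature_names = np.array(vectorizer.get_feature_names_out())[keep]
    # Train a different classifier on the surviving subset of features
    clf = MultinomialNB().fit(X[:, keep], labels)
    return feature_names, clf
```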
Hey Manish,
Thanks for the great comment; my understanding of classification algorithms is still somewhat naive (no pun intended), but I understand the general premise of the approach you have highlighted above. MEGAM seems to offer L2 regularization (actually it looks like an implementation of a Gaussian prior, which effectively achieves the same result, I think :S) within NLTK, so I will definitely give that a go… otherwise, as per your suggestion, Sci-Kit seems like the way to go for finer control over L1/L2 regularization, the results of which can then be plugged back into NLTK if need be.
I particularly like the point regarding categorizing features into buckets to diagnose which features are contributing the most towards the classification problem… what really jumps out at me is the fact that this idea can be extended to reflect document structure or word placement (something I have had in the back of my mind for a while now). For example, features appearing in the subject of an email can be tagged with ‘subject:feature’ (since the subject might be more relevant than the body). We can even bucket the words by position within the document, so say have 3 buckets:
0 – 100 words: bucket1
101 – 200 words: bucket2
201+ words: bucket3
And place features within those buckets accordingly, given that feature placement (and position) within the text contributes towards the classification scenario you are dealing with.
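A minimal sketch of that positional bucketing idea (the bucket boundaries mirror the ones above; the prefix naming scheme is an assumption):

```python
def positional_features(tokens):
    # Prefix each word with the bucket it falls into, so the same word near the
    # start of a document becomes a different feature from the word near the end
    features = {}
    for position, token in enumerate(tokens):
        if position < 100:
            bucket = 'bucket1'
        elif position < 200:
            bucket = 'bucket2'
        else:
            bucket = 'bucket3'
        features['{}:{}'.format(bucket, token.lower())] = True
    return features
```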
Really loved your comments, although now I have about 20 new bookmarked sites that I need to read :)… It appears that the more I know, the more I know that I don’t know!!
Hi Ibrahim,
Thanks for your response. Glad you like it.
I use a combination of NLTK (primarily for the Tree and ParentedTree datastructures) and sci-kit for classification/machine learning stuff.
Yeah, the document structure feature is quite powerful. I use it for Wikipedia classification (e.g. infobox:, abstract:), etc. I need to blog about it.
– Manish
Hi,
Can you throw some light on stemming exaggerations in conversations, such as “happpyyyyy” to “happy”?
The existing stemming algorithms don’t serve this purpose.
I would like to appreciate your efforts; the blog is quite informative.
Regards,
Vijay Raajaa G S
Hi Vijay,
I’m really not sure how I missed your comment! Apologies for not replying sooner.
To handle exaggerations in conversation, you will most likely need to implement your own algorithm, which has to be specific to the classification problem you are trying to solve. For example, this type of word reduction might not be a good idea when classifying Sentiment, Conversational Intensity, Mood or even perhaps personality traits, but would yield better results when applied to News Topics or Football Equipment classification problems.
As far as implementation goes, you could just write a Regular Expression (RegEx) that identifies a word with an exceptional number of repeated letters in that language (for example, 3 of the same letter in a row in English could indicate exaggerated speech), and then handle that word in the manner you see fit, such as bucketing it as per Manish’s suggestion above, or reducing it to a canonical word (removing the exaggeration), etc.
Remember to apply this type of word reduction on the training set, as well as the test set and the live data when the algorithm is being used.
I hope this helps!
I’m starting my final year project on sentiment analysis and opinion mining for social networks. Please tell me a classifier algorithm that gives greater accuracy and more advantages; I want the chosen classifier algorithm to be a new approach. Thanks in advance.
Hi Jaseema,
Regarding selecting the right algorithm for the job, I strongly recommend trying out a few algorithms yourself, even perhaps on different packages (NLTK, R or Orange, etc). Once you have the data-set ready, and have identified the best features to extract from the text, then applying various algorithms should be really simple, and there are mountains of tutorials and ready-made code that can help you with that. Personally I found that the MEGAM library on NLTK (which is a MaxEnt algorithm) offers decent accuracy across the many classification tasks I had to deal with.
As for finding a new approach, I recommend reading some of the new papers and research on sentiment analysis for inspiration; there are some really good and unique ideas out there, and hopefully this will help trigger some unique ideas of your own!
I hope this helps, best of luck with your dissertation!
Hello Sir Ibrahim Naji, reading your blog posts has taught me many things, and I thank you for that. I do have some questions too: I am currently doing a thesis on emotion analysis of disaster-related tweets, and I have some problems. How do I take the frequent words that have been collected and translate them into an emotion? For example, the tweet is: “God is our refuge and strength so when you choose friends make sure that most of them are God fearing pray 4 them #bopha”. Thank you, sir, in advance.
Hello Jayson,
Awesome, I’m really glad you found the articles useful.
Regarding your question, there are many approaches to doing emotion classification, each with some advantages and disadvantages. I strongly recommend reading a book on Data Mining to get a better holistic picture of the over-all discipline, and maybe that will also spark some unique ideas of your own, which will make for a great thesis! There is a good book by Bing Liu on Web Data Mining that I would recommend for this task.
As far as your specific problem is concerned, it looks like you are classifying based on Plutchik’s wheel of emotions, in which he identifies 8 primal emotions and uses those to build more complex human emotions. There are already some classification APIs that return results based on this wheel of emotions; I recommend looking at ConveyAPI’s Emotion classifier as a practical example.
As far as building a classifier is concerned, assuming you are building a “supervised learning” classifier, you will need 2 things:
1) A training corpus of already emotion-classified text.
There are a few ways you could build this corpus.
2) A classification engine.
There are many packages out there that really simplify the whole process, for example R (for advanced users) or NLTK (for beginner to mid-level users). Each of those packages has great examples on how to get started, and a huge user community with many blog posts that can provide you with ready-made code. If you decide to go for NLTK, I really recommend going through Jacob Perkins’ Streamhacker blog; he has so many examples on NLTK, and explains things very well.
Once you have those 2 elements, then begins the process of refining the algorithm and results to achieve the desired accuracy. My posts on improving your classification algorithm, or measuring accuracy using the confusion matrix, should aid you some of the way there, or at least give you some ideas on how to get started. There are many interesting articles online if you do some research.
It is important to note that all of this might help you get started, but in order to produce something totally unique and awesome, you will need to dig deep and understand the mathematical models and concepts these algorithms operate on, and read some of the very interesting research out there on various approaches and the results they produced.
I hope this all makes sense, please do not hesitate to hit me up if you need more information.
Goodluck with your thesis!
Cheers,
Good text analytics methods.
I think you have remarked on some very interesting points; appreciate it for the post.
A LOT of value here. Thanks Ibrahim, thanks Manish.
Thanks
Hello Ibrahim Naji, thank you for a great summary of tools and approaches. I am wondering what the best emotion classifiers out there are. Most are based on Plutchik’s wheel. Do you know of any that go beyond that into Value Systems? I am considering having a unique classifier built for market research purposes. Would you recommend using an existing one, or having one built specifically for my application? An example of a classifier “output” is on my website here (manual coding): http://www.heartbeat.marketing/report-example#600-women-1.
Thank you,
Lana
Please, I want to know the figure (value) or range of values that determines whether a classifier’s accuracy is good or bad.
How can I evaluate the content of a data-set, i.e. know whether a data-set is good for a specific domain or not, without using any classifier algorithms?
This question on ResearchGate might be able to help you out, there are multiple approaches to doing this.