In this article I discuss some methods you can adopt to improve the accuracy of your text classifier. I’ve taken a generalized approach, so the recommendations here should apply to most text classification problems you are dealing with, be it Sentiment Analysis, Topic Classification or any other text-based classifier. This is by no means a comprehensive list, but it should provide a nice introduction to the subject of text classification algorithm optimisation.
One way to increase the accuracy of a classification algorithm is to allow it to return an “Unknown” value when the probability of the most likely class is too low for the item to clearly belong to one class. In those cases the algorithm is essentially guessing an answer, and guesses lead to incorrect classifications.
In this post I will explore a method for researching and implementing the “Unknown” result in your classifier, based on the probability distribution that a classification returns. The idea is to give you the tools to find the thresholds that give you the best accuracy while maintaining an acceptable level of overall coverage of the data.
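As a rough illustration of the idea, here is a minimal sketch in plain Python. It assumes the classifier returns a dictionary mapping each label to its probability; the function names, the example probabilities and the threshold values are all hypothetical and are not taken from any particular library.

```python
def classify_with_unknown(probs, threshold=0.6):
    """Return the most probable label, or "Unknown" if it is not confident enough.

    `probs` is assumed to be a dict of label -> probability.
    """
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else "Unknown"


def accuracy_and_coverage(predictions, gold, threshold):
    """Accuracy over the answered items, and the fraction of items answered."""
    answered = 0
    correct = 0
    for probs, truth in zip(predictions, gold):
        label = classify_with_unknown(probs, threshold)
        if label == "Unknown":
            continue  # the classifier declined to answer
        answered += 1
        if label == truth:
            correct += 1
    coverage = answered / len(gold) if gold else 0.0
    accuracy = correct / answered if answered else 0.0
    return accuracy, coverage


# Made-up probability distributions and true labels, for illustration only.
predictions = [
    {"pos": 0.90, "neg": 0.10},
    {"pos": 0.55, "neg": 0.45},
    {"pos": 0.20, "neg": 0.80},
    {"pos": 0.52, "neg": 0.48},
]
gold = ["pos", "neg", "neg", "pos"]

for threshold in (0.5, 0.6):
    accuracy, coverage = accuracy_and_coverage(predictions, gold, threshold)
    print(threshold, accuracy, coverage)
```

On this toy data, raising the threshold from 0.5 to 0.6 drops the two low-confidence items, so accuracy on the answered items goes up while coverage goes down. Sweeping the threshold over a held-out set is one way to look for an acceptable balance between the two.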
Getting something going with a text (or any) classification algorithm is easy enough. All you need is an algorithm, such as Maximum Entropy or Naive Bayes, implementations of which are available in many different flavors across various programming languages (I use NLTK on Python for text classification), and a corpus of already classified data to train it on. That is it: you have yourself a basic classifier.
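To give a sense of how little is needed, here is a minimal sketch using NLTK's Naive Bayes classifier. The tiny training set and the bag-of-words `features` helper are made up for illustration; a real classifier would need far more data and more careful feature extraction.

```python
import nltk


def features(text):
    """Hypothetical feature extractor: mark each lowercased word as present."""
    return {word.lower(): True for word in text.split()}


# A toy, hand-labelled corpus (illustrative only).
labelled = [
    ("great film loved it", "pos"),
    ("wonderful acting and a great story", "pos"),
    ("terrible plot hated it", "neg"),
    ("boring and awful acting", "neg"),
]

train_set = [(features(text), label) for text, label in labelled]
classifier = nltk.NaiveBayesClassifier.train(train_set)

sample = features("loved this great story")
label = classifier.classify(sample)   # the single most likely label
dist = classifier.prob_classify(sample)  # the full probability distribution
print(label, {l: round(dist.prob(l), 3) for l in dist.samples()})
```

`prob_classify` is the call that exposes the per-label probabilities, which is what any confidence-based tuning of the classifier would build on.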
But the story rarely ends there. To get decent production-level performance or accuracy out of your classification algorithm, you’ll need to iteratively test it for the optimum configuration, understand how different classes interact with each other, and diagnose any abnormality or irregularity your algorithm is experiencing.
In this post I hope to cover some basic mathematical tools for diagnosing and testing a classification algorithm. I will take a real-life algorithm that I have worked on as an example, and explore the various techniques we used to better understand how well it is performing and, when it is not performing, what the underlying characteristics of the failure are.
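The excerpt above does not name the specific tools, so the following is only an assumption about the kind of diagnostic meant: a confusion matrix with per-class precision and recall, which is a common first step for seeing how classes interact. The labels and predictions are made up, and the function names are hypothetical.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(gold, predicted))


def precision_recall(gold, predicted, label):
    """Precision and recall for a single class."""
    counts = confusion_counts(gold, predicted)
    true_positives = counts[(label, label)]
    predicted_as = sum(n for (g, p), n in counts.items() if p == label)
    actually = sum(n for (g, p), n in counts.items() if g == label)
    precision = true_positives / predicted_as if predicted_as else 0.0
    recall = true_positives / actually if actually else 0.0
    return precision, recall


# Made-up labels, for illustration only.
gold = ["pos", "pos", "neg", "neg", "neutral", "neutral"]
predicted = ["pos", "neutral", "neg", "neg", "neg", "neutral"]

counts = confusion_counts(gold, predicted)
for (true_label, predicted_label), n in sorted(counts.items()):
    print(f"true={true_label:8} predicted={predicted_label:8} count={n}")

precision, recall = precision_recall(gold, predicted, "neg")
print("neg precision:", precision, "recall:", recall)
```

Off-diagonal pairs, such as a "neutral" item predicted as "neg", show which classes the classifier confuses with each other, which a single overall accuracy figure hides.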