In this article I discuss some methods you could adopt to improve the accuracy of your text classifier, I’ve taken a generalized approach so the recommendations here should really apply for most text classification problem you are dealing with, be it Sentiment Analysis, Topic Classification or any text based classifier. This is by no means a comprehensive list, but it should provide a nice introduction into the subject of text classification algorithm optimisation.
Tag Archive for: classification
To get something going with text (or any) classification algorithm is easy enough, all you need is an algorithm, such as Maximum Entropy or Naive Bayes, an implementation of each is available in many different flavors across various programming languages (I use NLTK on Python for text classification), and a bunch of already classified corpus data to train your algorithm on and that is it, you got yourself a basic classifier.
But the story rarely ends here, and to get any decent production-level performance or accuracy out of your classification algorithm, you’ll need to iteratively test your algorithm for optimum configuration, understand how different classes interact with each other, and diagnose any abnormality or irregularity you’re algorithm is experiencing.
In this post I hope to cover some basic mathematical tools for diagnosing and testing a classification algorithm, I will be taking a real life algorithm that I have worked as an example, and explore the various techniques we used to better understand how well it is performing, and when it is not performing, what is the underlying characteristic of this failure.
Trend analysis in my experience is generally done through manual (human) review and exploration of data through various BI tools, these tools do a great job by visually highlighting data that can be of interest to the data analyst, and when coupled with data-mining techniques such as clustering and forecasting, it gives us invaluable and actionable information that can help us further explore and exploit the business or data model at hand. As far as I can tell, the name of the game these days is “exploratory data analysis and mining”, at least in terms of Business Intelligence products on the market and the direction they are taking.
NLTK (Natural Language Toolkit) is a Python library that allows developers and researchers to extract information and annotations from text, and run classification algorithms such as the Naive Bayes or Maximum Entropy, as well as many other interesting Natural Language tools and processing techniques.
The Maximum Entropy algorithm from NLTK comes in different flavours, this post will introduce the different Max Ent classification algorithm flavours supported by the NLTK library, as well as provide a compiled MEGAM binary on a Linux (Ubuntu) 64-bit machine, which is a requirement for running Max Ent NLTK classification on the megam algorithm.