In this article I discuss some methods you can adopt to improve the accuracy of your text classifier. I’ve taken a generalized approach, so the recommendations here should apply to most text classification problems you are dealing with, be it Sentiment Analysis, Topic Classification or any other text-based classifier. This is by no means a comprehensive list, but it should provide a nice introduction to the subject of text classification algorithm optimisation.
One way to increase the accuracy of a classification algorithm is to allow it to return an “Unknown” value, particularly when the probability that the item being classified belongs to any single class is too low and the algorithm is essentially guessing an answer, leading to incorrect classifications.
In this post I will explore a method for researching and implementing the “Unknown” result in your classifier, based on the probability distributions of its classifications. The idea is to give you the tools to tune the thresholds that yield the best accuracy while maintaining an acceptable level of overall coverage of the data.
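The idea can be sketched in a few lines. This is a minimal illustration, not the post's actual implementation: the function name and the 0.7 threshold are placeholders for values you would tune against your own accuracy/coverage trade-off, and the input is assumed to be a per-class probability distribution such as the one NLTK's prob_classify returns.

```python
def classify_with_unknown(label_probs, threshold=0.7):
    """Return the most probable label, or "Unknown" when the top
    probability falls below the threshold (i.e. the classifier is
    effectively guessing between classes)."""
    best_label = max(label_probs, key=label_probs.get)
    if label_probs[best_label] < threshold:
        return "Unknown"
    return best_label

# A confident distribution keeps its label...
print(classify_with_unknown({"pos": 0.91, "neg": 0.09}))  # pos
# ...while a near-uniform one falls back to "Unknown".
print(classify_with_unknown({"pos": 0.52, "neg": 0.48}))  # Unknown
```

Sweeping the threshold over a held-out dataset and plotting accuracy against coverage is one straightforward way to find the optimum point.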
Getting something going with a text (or any) classification algorithm is easy enough. All you need is an algorithm, such as Maximum Entropy or Naive Bayes (implementations of each are available in many flavours across various programming languages; I use NLTK on Python for text classification), and a corpus of already classified data to train it on. That’s it: you’ve got yourself a basic classifier.
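In NLTK terms, a basic classifier really is only a few lines. The toy training set below stands in for a real labelled corpus, and the "contains(...)" feature names are just an illustrative convention:

```python
import nltk  # assumes NLTK is installed (pip install nltk)

# A tiny hand-labelled training set of (feature-dict, label) pairs;
# a real corpus would be far larger and far more varied.
train_set = [
    ({"contains(great)": True}, "pos"),
    ({"contains(love)": True}, "pos"),
    ({"contains(awful)": True}, "neg"),
    ({"contains(hate)": True}, "neg"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify({"contains(love)": True}))  # pos
```

The same classifier also exposes prob_classify, which returns the full probability distribution over labels rather than just the winning label.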
But the story rarely ends there. To get any decent production-level performance or accuracy out of your classification algorithm, you’ll need to iteratively test it for the optimum configuration, understand how the different classes interact with each other, and diagnose any abnormality or irregularity your algorithm is experiencing.
In this post I hope to cover some basic mathematical tools for diagnosing and testing a classification algorithm. I will take a real-life algorithm that I have worked on as an example, and explore the various techniques we used to better understand how well it performs and, when it fails, what the underlying characteristics of that failure are.
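One of the simplest diagnostic tools for seeing how classes interact is a confusion matrix, which NLTK ships out of the box. The labels below are hypothetical stand-ins for gold-standard annotations and a classifier's predictions:

```python
import nltk  # assumes NLTK is installed (pip install nltk)

# Gold labels versus what a (hypothetical) classifier predicted.
reference = ["pos", "pos", "neg", "neg", "neu", "neu"]
predicted = ["pos", "neg", "neg", "neg", "neu", "pos"]

cm = nltk.ConfusionMatrix(reference, predicted)
print(cm)  # rows = reference labels, columns = predicted labels

# Individual cells can be read off directly, e.g. how often
# "pos" items were mislabelled as "neg":
print(cm["pos", "neg"])  # 1
```

Off-diagonal cells pinpoint exactly which pairs of classes the algorithm confuses, which is usually the first clue to the underlying characteristic of a failure.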
Trend analysis, in my experience, is generally done through manual (human) review and exploration of data through various BI tools. These tools do a great job of visually highlighting data that may be of interest to the data analyst, and when coupled with data-mining techniques such as clustering and forecasting, they give us invaluable, actionable information that can help us further explore and exploit the business or data model at hand. As far as I can tell, the name of the game these days is “exploratory data analysis and mining”, at least in terms of the Business Intelligence products on the market and the direction they are taking.
NLTK (Natural Language Toolkit) is a Python library that allows developers and researchers to extract information and annotations from text and run classification algorithms such as Naive Bayes or Maximum Entropy, and it offers many other interesting Natural Language processing tools and techniques.
The Maximum Entropy algorithm in NLTK comes in different flavours. This post will introduce the Max Ent classification flavours supported by the NLTK library, and also provide a MEGAM binary compiled for a Linux (Ubuntu) 64-bit machine, which is required to run NLTK’s Max Ent classification with the megam algorithm.
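Selecting a flavour is just the algorithm argument to MaxentClassifier.train. The sketch below uses the built-in pure-Python "iis" flavour so it runs without any external binary; the training set is a toy illustration, and the megam path shown in the comment is machine-specific, not a real location on your system:

```python
import nltk  # assumes NLTK is installed (pip install nltk)
from nltk.classify import MaxentClassifier

# Toy (feature-dict, label) training pairs.
train_set = [
    ({"contains(great)": True}, "pos"),
    ({"contains(awful)": True}, "neg"),
]

# Built-in, pure-Python flavour: no external binary needed.
classifier = MaxentClassifier.train(
    train_set, algorithm="iis", trace=0, max_iter=10
)
print(classifier.classify({"contains(great)": True}))  # pos

# The megam flavour needs the compiled MEGAM binary; point NLTK
# at it first (the path below is an example, not a real default):
# nltk.config_megam("/path/to/megam")
# classifier = MaxentClassifier.train(train_set, algorithm="megam")
```

The pure-Python flavours ("iis", "gis") are convenient for experimentation, while megam is typically much faster on larger training sets.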
An essential part of creating a Sentiment Analysis algorithm (or any Data Mining algorithm, for that matter) is having a comprehensive dataset or corpus to learn from, as well as a test dataset to ensure that the accuracy of your algorithm meets the standards you expect. This will also allow you to tweak your algorithm and deduce better (or more precise) natural-language features to extract from the text that contribute towards stronger sentiment classification, rather than relying on a generic bag-of-words approach.
This post contains a corpus of tweets already classified by sentiment. This Twitter sentiment dataset is by no means diverse and should not be used in a final sentiment analysis product, at least not without diluting it with a much more diverse dataset.
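The train/test workflow described above looks something like the following sketch. The six labelled tweets are invented placeholders for the real corpus, and the feature extractor is the deliberately simple bag-of-words baseline the post argues you should eventually improve on:

```python
import nltk  # assumes NLTK is installed (pip install nltk)

def features(tweet):
    """A deliberately simple bag-of-words feature extractor."""
    return {word: True for word in tweet.lower().split()}

# Illustrative stand-ins for a labelled tweet corpus.
labelled = [
    ("i love this phone", "pos"),
    ("what a great day", "pos"),
    ("this is awful", "neg"),
    ("i hate mondays", "neg"),
    ("love love love it", "pos"),
    ("truly awful service", "neg"),
]

featuresets = [(features(text), label) for text, label in labelled]
# Hold out the last items as the test set; real splits should be
# larger and randomized.
train_set, test_set = featuresets[:4], featuresets[4:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```

Measuring accuracy on a held-out set like this is what lets you judge whether a new feature genuinely helps, rather than just fitting the training data better.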
OK, this might be a bit of a general question, as I am sure anyone who found their way to this blog knows a thing or two about BI, but in this post I will try to give a more holistic overview of what a full Business Intelligence offering is, and what dimensions constitute a full analytical offering. Additionally, having a BI infrastructure is all well and good, but at the end of the day the overall goal of any BI platform is to identify and act upon the data as quickly as possible, when the data is most useful for strategic, tactical or operational business decisions, a concept we will explore in this article.
The Data-Mining Excel Plugin for SQL Server 2008 is one of the more awesome tools in the Microsoft BI tool-set, although it might require some configuration before deployment into the business.
While I was trying to roll out this solution (to a test group), I ran into the following error message:
Error (data mining) Session Mining objects (including a special data source view used to process data mining dimensions) cannot be created on this instance
This post goes through how to solve this issue and get your Data-Mining Excel Plugin talking to the back-end SSAS instance.
The Data-Mining Excel Plugin for SQL Server is one of the most powerful tools available to Data Analysts and Power Users. By leveraging SQL Server’s SSAS (Analysis Services), the Data-Mining Plugin makes cutting-edge Data-Mining algorithms very accessible and easy to use, and with extended features such as creating, customizing and training your own custom Data-Mining models, Microsoft’s Self-Service BI (Business Intelligence) offering is truly comprehensive.