An essential part of building a sentiment analysis algorithm (or any data mining algorithm, for that matter) is having a comprehensive dataset or corpus to learn from, as well as a test dataset to check that your algorithm's accuracy meets your expectations. This also lets you tweak your algorithm and identify better (or more precise) natural language features to extract from the text, ones that contribute to stronger sentiment classification than a generic bag-of-words approach.
This post contains a corpus of tweets that have already been classified by sentiment. This Twitter sentiment dataset is by no means diverse and should not be used in a final sentiment analysis product, at least not without diluting it with a much more diverse dataset.
The dataset is based on data from the following two sources:
- University of Michigan Sentiment Analysis competition on Kaggle
- Twitter Sentiment Corpus by Niek Sanders
The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets; each row is marked 1 for positive sentiment or 0 for negative sentiment. I recommend holding out 1/10 of the corpus for testing your algorithm and dedicating the rest to training whatever classifier you are using. I tried this dataset with a very simple Naive Bayes classifier and the result was 75% accuracy. Given that random guessing achieves 50% over time, this simple approach performs 50% better than guesswork in relative terms, which is not so great on its own. However, since roughly 10% of human sentiment judgements can be disputed (particularly for informal social communication), about 90% is the realistic ceiling for any algorithm classifying the overall sentiment of a text, so 75% is not a bad starting point.
Of course, you can get cleverer with your approach and use natural language processing to add some context and better highlight the features of the text that contribute most to sentiment. I had fun running this dataset through NLTK (the Natural Language Toolkit) on Python, which provides a highly configurable platform for different types of natural language analysis and classification techniques.
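To make the approach above concrete, here is a minimal sketch of training NLTK's Naive Bayes classifier on labelled tweets. The tiny inline corpus is a stand-in for the real dataset; in practice you would load the 1.5M rows from the CSV, hold out 1/10 for testing as suggested above, and score the held-out set with `nltk.classify.accuracy`.

```python
import nltk

def word_features(text):
    # Simple bag-of-words: every lowercased token becomes a boolean feature.
    return {word: True for word in text.lower().split()}

# (text, label) pairs; 1 = positive, 0 = negative, as in the dataset.
corpus = [
    ("i love this phone", "1"),
    ("what a great day", "1"),
    ("this is awful", "0"),
    ("i hate waiting", "0"),
]

train_set = [(word_features(text), label) for text, label in corpus]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(word_features("a great phone")))
```

The same feature extractor is all you need for both training and classification, which is why a bag-of-words baseline is so quick to set up, and also why it plateaus: it ignores word order and context entirely.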
One thing to note is that tweets, like any form of informal social communication, contain many shortened words, repeated characters within words, and over-used punctuation, and they may not conform to grammatical rules. This is something you either need to normalize when classifying the text or turn to your advantage. For example, you can infer that the intensity of a particular message is high from the number of exclamation marks used, which can indicate a strong positive or negative emotion rather than a dull (or neutral) one.
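As a small illustration of both ideas, the sketch below normalizes repeated characters and counts exclamation marks as an intensity signal. The specific normalization rule (collapsing runs of three or more identical characters down to two) is a common heuristic I am assuming here, not something prescribed by the dataset.

```python
import re

def tweet_features(text):
    # Use punctuation as a signal: more "!" suggests higher intensity.
    intensity = text.count("!")
    # Normalize elongated words: "soooo" -> "soo", "!!!" -> "!!".
    normalized = re.sub(r"(.)\1{2,}", r"\1\1", text.lower())
    return {"intensity": intensity, "normalized": normalized}

print(tweet_features("I looooove this!!!"))
```

Keeping two repeated characters (rather than collapsing to one) preserves a hint that the word was deliberately elongated, which is itself a weak intensity feature.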
Things start to get really cool when you can break down the sentiment of a statement (or a tweet, in our case) in relation to multiple elements (or nouns) within that statement. For example, let's take the following statement:
I really hate Apple and like Samsung
There are two explicit, opposing sentiments in this statement towards two nouns, so an overall classification of the statement could be misleading. A good natural language processing package that allows you to pivot your classification around a particular element within the sentence is LingPipe. I haven't personally tried it (it is definitely on my to-do list), but I reckon it provides the most comprehensive library that is also enterprise-ready (rather than research-oriented).
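To show what pivoting around a target could look like, here is a very rough sketch using plain string matching rather than LingPipe. The tiny sentiment lexicon and the "nearest preceding opinion word" heuristic are illustrative assumptions; a real implementation would rely on proper parsing.

```python
# Hypothetical mini-lexicon; a real one would be far larger.
POSITIVE = {"like", "love", "enjoy"}
NEGATIVE = {"hate", "dislike", "loathe"}

def per_target_sentiment(text, targets):
    tokens = text.lower().split()
    result = {}
    for target in targets:
        idx = tokens.index(target.lower())
        # Walk backwards from the target to the nearest opinion word.
        for word in reversed(tokens[:idx]):
            if word in POSITIVE:
                result[target] = "positive"
                break
            if word in NEGATIVE:
                result[target] = "negative"
                break
    return result

print(per_target_sentiment("I really hate Apple and like Samsung",
                           ["Apple", "Samsung"]))
# -> {'Apple': 'negative', 'Samsung': 'positive'}
```

Even this naive heuristic separates the two sentiments in the example sentence, whereas a whole-tweet classifier would have to pick a single label for it.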