Twitter Sentiment Analysis Training Corpus (Dataset)

September 22, 2012

An essential part of creating a Sentiment Analysis algorithm (or any Data Mining algorithm, for that matter) is having a comprehensive dataset or corpus to learn from, as well as a test dataset to ensure that the accuracy of your algorithm meets the standards you expect. This will also allow you to tweak your algorithm and deduce better (or more precise) features of natural language that contribute towards stronger sentiment classification, rather than relying on a generic “bag of words” approach.

This post contains a corpus of tweets that have already been classified by sentiment. Note that this Twitter sentiment dataset is by no means diverse, and it should not be used in a final sentiment analysis product, at least not without diluting it with a much more diverse dataset.

The dataset is based on data from the following two sources:

- the University of Michigan Sentiment Analysis competition on Kaggle
- the Twitter Sentiment Corpus by Niek Sanders

The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets; each row is marked as 1 for positive sentiment and 0 for negative sentiment. I recommend using 1/10 of the corpus for testing your algorithm, while the rest can be dedicated towards training whatever algorithm you are using to classify sentiment. I tried using this dataset with a very simple Naive Bayesian classification algorithm, and the result was 75% accuracy. Given that a guesswork approach will achieve around 50% accuracy over time, a simple approach essentially performs 50% better than guesswork. That is not so great on its own, but consider that roughly 10% of human sentiment classifications can be debated (generally, and particularly when it comes to informal social communication), so the maximum relative accuracy any algorithm analysing the overall sentiment of a text can hope to achieve is about 90%. In that light, this is not a bad starting point.

Of course you can get cleverer with your approach, using natural language processing to add some context and to better highlight the features of the text that contribute most strongly towards sentiment. I had fun running this dataset through NLTK (the Natural Language Toolkit) in Python, which provides a highly configurable platform for different types of natural language analysis and classification techniques.
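
For illustration, here is a minimal sketch of that simple bag-of-words Naive Bayes setup using NLTK. The file name and the column names (Sentiment, SentimentText) are assumptions about the corpus layout, so adjust them to whatever your download actually contains:

    import csv
    import random

    import nltk

    def bag_of_words(text):
        # the generic "bag of words" features: every lower-cased token becomes a flag
        return {word: True for word in text.lower().split()}

    # assumed layout: a header row with Sentiment (0/1) and SentimentText columns
    with open("Sentiment Analysis Dataset.csv", encoding="utf-8", errors="ignore", newline="") as f:
        data = [(bag_of_words(row["SentimentText"]), row["Sentiment"])
                for row in csv.DictReader(f)]

    random.shuffle(data)
    cut = len(data) // 10  # hold out ~1/10 of the corpus for testing, as recommended above
    test_set, train_set = data[:cut], data[cut:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))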

One thing to note is that tweets, like any form of informal social communication, contain many shortened words, repeated characters within words, and over-used punctuation, and they may not conform to grammatical rules. This is something you either need to normalize when classifying text or use to your advantage. For example, you can deduce that the intensity of a particular communication is high from the number of exclamation marks used, which could indicate a strong positive or negative emotion rather than a dull (or neutral) one.
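
To make the second option concrete, a hypothetical feature extractor could keep the punctuation signal instead of normalizing it away, along these lines:

    def features_with_intensity(text):
        # the usual word flags, plus a count of exclamation marks as an intensity cue
        features = {word: True for word in text.lower().split()}
        features["exclamation_count"] = text.count("!")
        return features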

Things start to get really cool when you can break down the sentiment of a statement (or a tweet, in our case) in relation to multiple elements (or nouns) within that statement. For example, let's take the following statement:

I really hate Apple and like Samsung

There are two explicit opposing sentiments in this statement towards two nouns, so an overall classification of the statement might be misleading. A good natural language processing package that allows you to pivot your classification around a particular element within the sentence is LingPipe. I haven't personally tried it (it is definitely on my list of things to do), but I reckon it provides the most comprehensive library that is also enterprise-ready (rather than research-oriented).
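
As a toy illustration of pivoting around elements (hand-rolled for this one sentence, with hard-coded opinion words and target nouns; real packages like LingPipe are far more sophisticated):

    OPINIONS = {"hate": "negative", "like": "positive"}
    TARGETS = {"apple", "samsung"}

    def per_target_sentiment(sentence):
        # attach the most recent opinion word to each target noun that follows it
        result, current = {}, None
        for token in sentence.lower().split():
            if token in OPINIONS:
                current = OPINIONS[token]
            elif token in TARGETS and current is not None:
                result[token] = current
        return result

    print(per_target_sentiment("I really hate Apple and like Samsung"))
    # {'apple': 'negative', 'samsung': 'positive'}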

36 replies
  1. Jai says:

    Hello, what annotation guidelines were followed when scoring the entries of the corpus you have posted here (the 1.5 million record corpus)?

    Reply
    • admin says:

      Hey Jai,

      Good question… I'm not really sure. This was only part of a proof-of-concept bit of research I had to do, so I wasn't overly focused on understanding the finer details of the corpus data, which is a must if you are trying to build an accurate and production-ready NLP engine.

      You can try to follow the original sources of the data to learn more about their classification assumptions (links in the article).

      I will also be releasing a more comprehensive positive/negative sentiment corpus in the future (the one I actually used in our production-ready sentiment classifier), with a detailed explanation of all the assumptions that went into the training set and the best features/techniques to use to get the maximum out of it… so if you are interested, watch this space! :)

      Cheers

      Reply
  2. Baldo says:

    Hi! Just a simple question. You say:

    “…given that a guess work approach over time will achieve an accuracy of 50%…”

    Actually, about 70% of the tweets are classified as positive (+), so I think always guessing the most frequent class would give a 70% hit rate, wouldn't it?

    In that case, the improvement from the Naive Bayes approach you talked about is quite low, right?

    Regards.

    Reply
    • admin says:

      Hey Baldo,

      I can see I totally wasn't clear in the text. The 50% refers to the probability of correctly classifying the sentiment of general text (say, in a production environment) without a heuristic algorithm in place; basically, it is like the probability of correctly calling a coin flip (heads/tails = positive/negative sentiment) with a random guess.

      When I tested the NB approach, I did the following:
      > Take out 1,000 positive and 1,000 negative sentiment text from the corpus and put them aside for testing.
      > Then train my NB algorithm (with very simple feature extraction) on the remaining data set.
      > Apply the test set and collate the accuracy results, which came to 70% accuracy on a 2,000-entry (1,000 positive/1,000 negative) test corpus.
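
      A minimal sketch of that balanced hold-out split, assuming the labelled data is a list of (features, label) pairs with "1"/"0" labels:

      import random

      def balanced_holdout(data, n_per_class=1000):
          # data: list of (features, label) pairs labelled "1" (positive) / "0" (negative)
          positives = [d for d in data if d[1] == "1"]
          negatives = [d for d in data if d[1] == "0"]
          random.shuffle(positives)
          random.shuffle(negatives)
          test_set = positives[:n_per_class] + negatives[:n_per_class]
          train_set = positives[n_per_class:] + negatives[n_per_class:]
          return train_set, test_set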

      So that supports the statement that a simple NB algorithm can do better than a “random guess”.

      To be fair though, that figure (70% accuracy) is barely scratching the surface of sentiment classification. With a clever bit of NLP feature extraction you could get awesome results; there are some interesting (and a lot of) papers out there on the subject, definitely worth a read.

      Reply
  3. Ulrike says:

    Hi – I followed up on the two data sources you mention and I’m a bit confused about the numbers. Sanders’ list has ~5k tweets and the University of Michigan Kaggle competition talks about 40k (train + test, didn’t download). How do you get to 1.5 million tweets from that?

    Reply
    • Links Naji says:

      Hey Ulrike,

      Yeah, you are absolutely correct; there must be another source of sentiment-classified tweets that I used here, though I'm not entirely sure which.

      I am actually reviving this project over the next month due to client demand, so I will update the post at some point highlighting what the third source is (if I still have that information somewhere). Thanks for flagging this up!

      Cheers
      L

      Reply
  4. abul says:

    Hi, how about the experiment results on this dataset? Any papers to show?

    Reply
    • Links Naji says:

      Hey Abul,

      Unfortunately no; the algorithm I developed for this particular classification problem, based on the data in the article, was too naive to warrant any proper research papers. A very simple “bag of words” approach (which is what I used) will probably get you as far as 70-80% accuracy (which is better than a coin flip), but in reality any algorithm based on this approach will be unsatisfactory against the practical and more complex constructs of sentiment in language.

      The topic of sentiment analysis via text is a large one, and if you are trying to innovate, or embark on a project that has real world applicability, then I strongly recommend reading the latest scientific papers on the subject.

      Cheers!

      Reply
  5. Afshin says:

    Hi, I am a newly admitted PhD student working on Sentiment Analysis.

    I need a resource for Sentiment Analysis training and found your dataset here. I just wondered whether all the tweets are manually annotated, or whether the positive/negative tags are the result of a classifier algorithm?
    I need to know whether I can use these 1.5 million tweets as a gold standard for training and evaluation, or whether they are not 100% human-labelled and were instead tagged by a classifier.
    Thanks and best.

    Reply
    • Links Naji says:

      Hi Afshin,

      The dataset is actually collated together from various sources, and each source has indicated that they provide manually tagged tweets; whether you believe them or not is up to you really :)

      I am not even sure humans can provide 100% accuracy on a classification problem. This dataset might be “as accurate as possible”, but I wouldn't say it is the ultimate, indisputable corpus for sentiment analysis.

      You could potentially grow your own corpus for training; I've used Mechanical Turk in the past to build a dataset of topic-classified text, although I have to say the accuracy of humans definitely leaves something to be desired :)

      Good luck with your PhD, and all the best.

      Reply
  6. Ahmet Eser says:

    Hello, to clear up some confusion: I believe the corpus refers to Sentiment140, and it's not exactly manually classified. More info can be found here: http://help.sentiment140.com/for-students

    They say the following regarding this dataset: “Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search. This is described in our paper.”
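
    In code, that distant-supervision labelling rule amounts to something like this rough sketch:

    def emoticon_label(tweet):
        # emoticons act as noisy sentiment labels; tweets with neither
        # emoticon are simply skipped when building the training set
        if ":)" in tweet:
            return "positive"
        if ":(" in tweet:
            return "negative"
        return None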

    Reply
  7. Maryem says:

    Hello
    I can’t download the Twitter Sentiment Analysis Dataset; can anyone help me, please?

    Reply
    • Links Naji says:

      Hey Maryem, what’s the issue exactly? I can download the corpus fine!

      Tbh, it’s been a while since this post; I am sure there are more comprehensive and better-“groomed” corpora out there by now… surely! :)

      Reply
  8. Yasen says:

    Yes, the corpus is not manually created. Sanders’ group tried to create a reasonable sentiment classifier based on “distant supervision”: they gathered 1.5 million tweets with the vague idea that if a smiley face is found, the tweet is positive, and a frowny face -> negative.

    They trained some smart algorithms to benefit from this vague knowledge and tested on (if I remember correctly) about 500 manually annotated tweets.

    And your n-gram-based experiment seems to be wrong – it should be super easy for it to learn that :) means positive and :( means negative. Did you exclude punctuation?

    Reply
  9. S.GokulaGokhila says:

    Hi, I am doing MPhil research on sentiment analysis of social media tweets.
    Please send the dataset for this.

    Reply
  10. Kashyap says:

    Hi, I have been working with NLTK for quite a few days now, and I need a dataset for sentiment analysis. I downloaded the 1.5 million tweet dataset, but I guess the file is corrupted; it shows an error while extracting.

    If you could please send me the correct file it would be great. This dataset is very important for my project! I request you to email me the 1.5 million tweet dataset.

    Thanks in advance.

    Reply
    • Kashyap says:

      Hey, very sorry to disturb you. I downloaded the dataset once again and it’s working fine. Sorry for bothering you.

      Thanks a lot for the dataset. Cheers :)

      Reply
  11. Neo says:

    where is the dataset??

    Reply
  12. Tristen Georgiou says:

    Seems like the CSV in this file isn’t well formatted (the tweet content isn’t always escaped properly). I was able to fix this using the following Python code:

    import csv

    if __name__ == "__main__":
        # we know the first 3 columns are consistent; only the tweet text
        # contains stray commas, so split each line on the first three commas
        # and let csv.writer re-quote the re-joined tweet text properly
        with open("../Downloads/Sentiment Analysis Dataset.csv", "r") as fin, \
             open("../Downloads/sad-clean.csv", "w", newline="") as fout:
            writer = csv.writer(fout)
            for row in fin:
                parts = row.strip().split(",")
                out = parts[0:3] + [",".join(parts[3:])]
                writer.writerow(out)
    Reply
  13. anuj says:

    Ibrahim,

    The dataset contains 1,578,627 tweets. Are these hand-labelled?
    The two sources you have cited contain 7,086 and 5,513 labelled tweets, which is less than 1% of your corpus.

    thanks

    Reply
  14. pragati says:

    Hi… can you tell me how to do sentiment analysis using Java? I have to do this in Java and am unable to find how…

    Reply
  15. sopan says:

    Emotion analysis dataset link, please.

    Reply
    • kush shrivastava says:

      Yes, I too need this dataset. I have a question: how can we annotate the dataset with emotion labels?

      Reply
  16. Swapnil Babaji Shinde says:

    Please post some Twitter text datasets with multiple classes (e.g. sports, technology) for text mining.

    Reply
  17. keerti says:

    Can you share the Facebook and Twitter datasets for defining and predicting human behaviour in social IoT using big data analytics?

    Reply
  18. saba says:

    Can you please provide me the labelled Twitter data? I am doing my M.Tech dissertation on Twitter spam detection and am not able to get the labelled data.

    Reply
  19. saba says:

    Can you please provide me the labelled data for spam detection in Twitter?

    Reply
  20. mona says:

    I urgently need an Arabic sentiment analysis dataset that has already been preprocessed for research. Please help me as soon as possible.
    Thanks.

    Reply
  21. Sarker Monojit Asish says:

    Hi
    I am working on Twitter sentiment analysis for a course project. Could you send me the Python source code?

    Thanks

    Reply
  22. Sithara Fernando says:

    Dear sir,

    Can you please provide me a dataset containing hashtags? I need to build a hierarchy using the hashtags. I look forward to hearing from you.

    Thank you.

    Reply

Trackbacks & Pingbacks

  1. […] sklearn package (MLPClassifier). For training data, I used 200,000 of the 1.5M labeled tweets from here, evenly split between positive and negative […]
