Twitter Sentiment Analysis Training Corpus (Dataset)

September 22, 2012

An essential part of creating a Sentiment Analysis algorithm (or any Data Mining algorithm for that matter) is having a comprehensive dataset or corpus to learn from, as well as a test dataset to check that the accuracy of your algorithm meets the standards you expect. A test set also lets you tweak your algorithm and work out better (or more precise) features of natural language to extract from the text, features that contribute towards stronger sentiment classification than a generic “bag of words” approach.

This post contains a corpus of tweets that have already been classified by sentiment. This Twitter sentiment dataset is by no means diverse and should not be used in a final product for sentiment analysis, at least not without diluting it with a much more diverse one.

The dataset is based on data from the following two sources:

The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets; each row is marked 1 for positive sentiment and 0 for negative sentiment. I recommend using 1/10 of the corpus for testing your algorithm, while the rest can be dedicated to training whatever algorithm you are using to classify sentiment. I tried this dataset with a very simple Naive Bayesian classification algorithm and the result was 75% accuracy. Given that a guesswork approach will achieve an accuracy of 50% over time, a simple approach can give you 50% better performance than guesswork (75% against 50%). That is not so great on its own, but generally (and particularly when it comes to social communication) around 10% of sentiment classifications made by humans can be debated, so the maximum relative accuracy any algorithm analysing the overall sentiment of a text can hope to achieve is 90%. Against that ceiling, this is not a bad starting point.
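To make the 90/10 split and the comparison with guesswork concrete, here is a minimal sketch in plain Python. It is not the classifier behind the 75% figure: the whitespace tokenizer, the add-one smoothing, the function names and the eight toy rows are all assumptions made for illustration, and real runs would read (text, label) pairs from the corpus instead.

```python
import math
import random
from collections import Counter, defaultdict


def split_corpus(rows, test_fraction=0.1, seed=0):
    """Shuffle (text, label) rows and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]  # (train, test)


def train_naive_bayes(train):
    """Count word occurrences per label (1 = positive, 0 = negative)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in train:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set().union(*word_counts.values())
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(test, model):
    """Fraction of held-out rows whose predicted label matches the true one."""
    hits = sum(classify(text, model) == label for text, label in test)
    return hits / len(test)


def relative_improvement(acc, baseline):
    """How much better than the baseline, as a fraction of the baseline."""
    return (acc - baseline) / baseline


# Toy rows standing in for the real corpus (hypothetical data).
toy = [
    ("i love this phone", 1), ("what a great day", 1),
    ("love love love it", 1), ("great happy fun", 1),
    ("i hate mondays", 0), ("this is awful", 0),
    ("hate this awful traffic", 0), ("so sad and awful", 0),
]
train, test = split_corpus(toy, test_fraction=0.25)
model = train_naive_bayes(train)
print("held-out accuracy:", accuracy(test, model))
print("75% vs 50% baseline:", relative_improvement(0.75, 0.5))
```

With only eight rows the held-out accuracy is meaningless; the point is the shape of the workflow: split once, train on the larger part, and report accuracy only on rows the model never saw.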

Of course you can get cleverer with your approach and use natural language processing to add some context and better highlight the features of the text that contribute most towards sentiment deduction. I had fun running this dataset through NLTK (the Natural Language Toolkit) in Python, which provides a highly configurable platform for different types of natural language analysis and classification techniques.

One thing to note is that tweets, like any form of informal social communication, contain many shortened words, extra characters within words and over-used punctuation, and may not conform to grammatical rules. This is something that you either need to normalize when classifying text or use to your advantage. For example, you can deduce that the intensity of a particular communication is high from the number of exclamation marks used, which could indicate a strong positive or negative emotion rather than a dull (or neutral) one.
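Both options can be sketched in a few lines of Python. The rules below are assumptions chosen for illustration (squeezing runs of three or more identical characters down to two, counting "!" and "?", counting all-caps words), not a recommended or complete normalization scheme.

```python
import re


def normalize_tweet(text):
    """Lowercase, squeeze runs of 3+ identical characters to 2, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return re.sub(r"\s+", " ", text).strip()


def intensity_features(text):
    """Simple intensity signals taken from the raw, un-normalized tweet."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "all_caps_words": sum(
            1 for w in text.split() if len(w) > 1 and w.isupper()
        ),
    }


raw = "Sooooo   HAPPY!!!!"
print(normalize_tweet(raw))
print(intensity_features(raw))
```

Note the order: the intensity features are read from the raw tweet, because normalizing first would squeeze away exactly the punctuation and capitals being counted.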

Things will start to get really cool when you can break down the sentiment of a statement (or a tweet in our case) in relation to multiple elements (or nouns) within that statement. For example, let’s take the following statement:

I really hate Apple and like Samsung

There are two explicit opposing sentiments in this statement towards two nouns, and an overall classification of this statement might be misleading. A good natural language processing package that allows you to pivot your classification around a particular element within the sentence is LingPipe. I haven’t personally tried it (definitely on my list of things to do), but I reckon it provides the most comprehensive library that is also enterprise ready (rather than research oriented).
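To show why a single overall label is misleading here, the toy sketch below scores each noun separately. It has nothing to do with LingPipe’s API: the two tiny word lists and the rule "a target takes the polarity of the closest sentiment word before it" are assumptions invented for this example, and it ignores negation, sarcasm and real parsing.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"like", "love", "enjoy"}
NEGATIVE = {"hate", "dislike", "loathe"}


def sentiment_per_target(sentence, targets):
    """Give each target the polarity of the closest sentiment word before it."""
    wanted = {t.lower(): t for t in targets}
    scores = {}
    polarity = 0  # 0 means no sentiment word has been seen yet
    for token in sentence.split():
        word = token.strip(".,!?;:").lower()
        if word in POSITIVE:
            polarity = 1
        elif word in NEGATIVE:
            polarity = -1
        elif word in wanted:
            scores[wanted[word]] = polarity
    return scores


scores = sentiment_per_target(
    "I really hate Apple and like Samsung", ["Apple", "Samsung"]
)
print(scores)                # one polarity per noun
print(sum(scores.values()))  # the per-noun polarities cancel out overall
```

Summing the per-noun scores gives zero, which is how a statement with two strong opinions can end up looking neutral to a classifier that only assigns one label per tweet.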

25 replies
  1. Jai says:

    Hello, what annotation guidelines were followed for scoring the entries of the corpus you have posted here? (The 1.5 million record corpus)

    • admin says:

      hey Jai,

      Good question… I am not really sure. This was only part of a proof-of-concept bit of research I had to do, so I wasn’t overly focused on understanding the finer details of the corpus data, which is a must if you are trying to build an accurate and production-ready NLP engine.

      You can try to follow the original sources of the data to learn more about their classification assumptions (links in the article).

      I will also be releasing a more comprehensive positive/negative sentiment corpus in the future (which is the actual one I used on our production ready sentiment classifier), with a detailed explanation of all the assumptions that went into the training set, and the best features/techniques to use to get the maximum out of it… so if you are interested, watch this space! :)

      Cheers

  2. Baldo says:

    Hi! Just a simple question. You say:

    “…given that a guess work approach over time will achieve an accuracy of 50%…”

    Actually, about 70% of the tweets are classified as positive tweets (+), so I think random guess over the most frequent class would give a 70% hit rate, wouldn’t it?

    In that case the improvement from the Naive Bayes approach you talked about is quite low, right?

    Regards.

    • admin says:

      Hey Baldo,

      I can see I totally wasn’t clear in the text, the 50% refers to the probability of classifying sentiment on general text (say in a production environment) without a heuristic algorithm in-place; so basically it is like the probability of correctly calling a coin flip (heads/tails = positive/negative sentiment) with a random guess.

      When I tested the NB approach, I did the following:
      > Take out 1,000 positive and 1,000 negative sentiment texts from the corpus and put them aside for testing.
      > Then train my NB algorithm (with very simple feature extraction) on the remaining data set.
      > Apply the test set and collate the accuracy results, which were 70% accuracy on a 2,000-entry (1,000 positive/1,000 negative) test corpus.

      So that leads to the statement that a simple NB algorithm could lead to better results than “random guess”.

      To be fair though, that figure (70% accuracy) is barely scratching the surface of sentiment classification. With a clever bit of NLP feature extraction you could get awesome results. There are a lot of interesting papers out there on the subject, definitely worth a read.

  3. Ulrike says:

    Hi – I followed up on the two data sources you mention and I’m a bit confused about the numbers. Sanders’ list has ~5k tweets and the University of Michigan Kaggle competition talks about 40k (train + test, didn’t download). How do you get to 1.5 million tweets from that?

    • Links Naji says:

      Hey Ulrike,

      Yeah, you are absolutely correct. There must be another source of sentiment-classified tweets that I used here, though I am not entirely sure which.

      I am actually reviving this project over the next month due to a client demand, I will update the post at some point highlighting what the third source is (if I still have that information somewhere). Thanks for flagging this up!

      Cheers

  4. abul says:

    Hi, how about experiment results on this dataset? Any papers to show?

    • Links Naji says:

      Hey Abul,

      Unfortunately no, the algorithm I developed for this particular classification problem based on the data in the article was too naive to warrant any proper research papers. A very simple “bag of words” approach (which is what I have used) will probably get you as far as 70-80% accuracy (which is better than a coin flip), but in reality any algorithm that is based on this approach will be unsatisfactory against practical and more complex constructs of sentiment in language.

      The topic of sentiment analysis via text is a large one, and if you are trying to innovate, or embark on a project that has real world applicability, then I strongly recommend reading the latest scientific papers on the subject.

      Cheers!

  5. Afshin says:

    Hi, I am a newly admitted PhD student in Sentiment Analysis.

    I need a resource for Sentiment Analysis training and found your dataset here. I just wondered whether all the tweets are manually annotated, or whether the positive/negative tags are the output of a classifier algorithm?
    I need to know whether I can use these 1.5 million tweets as a gold standard for training and evaluation, or whether they are not 100% human-labeled and were tagged by a classifier.
    Thanks and best.

    • Links Naji says:

      Hi Afshin,

      The dataset is actually collated together from various sources, each source has indicated that they provide manually tagged tweets, whether you believe them or not is up to you really :)

      I am not even sure humans can provide 100% accuracy on a classification problem, this dataset might be “as accurate as possible”, but I wouldn’t say this is the ultimate indisputable corpus for sentiment analysis.

      You could potentially grow your own corpus for training, I’ve used Mechanical Turk in the past to build a dataset of topic classified text, although I have to say the accuracy of humans definitely leaves something to be desired :)

      Good luck with your PhD, and all the best

  6. Ahmet Eser says:

    Hello, to clear up some confusion: I believe the corpus refers to Sentiment140, and it’s not exactly manually classified. More info can be found here: http://help.sentiment140.com/for-students

    They say the following regarding this dataset: “Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search. This is described in our paper.”

  7. Maryem says:

    Hello
    I can’t download the Twitter Sentiment Analysis Dataset. Can anyone help me please?

    • Links Naji says:

      Hey Maryem, what’s the issue exactly? I can download the corpus fine!

      Tbh, it’s been a while since this post, I am sure there are more comprehensive and better “groomed” corpora out there by now… surely! :)

  8. Yasen says:

    Yes, the corpus is not manually created. Sanders’ group tried to create a reasonable sentiment classifier based on “distant supervision”: they gathered 1.5 million tweets with the vague idea that if a smiley face is found the tweet is positive, and if a frowny face is found it is negative.

    They trained some smart algorithms to benefit from this vague knowledge and tested on (if I remember correctly) about 500 manually annotated tweets.

    And your n-gram based experiment seems to be wrong – it should be super easy for it to learn that :) means positive and :( means negative. Did you exclude punctuation?

  9. S.GokulaGokhila says:

    Hi, I am doing MPhil research on sentiment analysis of social media tweets.
    Please send the dataset for this.

  10. Kashyap says:

    Hi, I have been working on nltk for quite a few days now… I need a dataset for sentiment analysis. I downloaded the 1.5 million tweet dataset .. But the file is corrupted I guess.. While extracting it shows error….

    If you could please send me the correct file it would be great… This dataset is very important for my project ! Please I request you to email me the 1.5 million tweet dataset…

    Thanks in advance.

    • Kashyap says:

      Hey, very sorry to disturb you… I downloaded the dataset once again… and it’s working fine… Sorry for bothering…

      Thanks a lot for the dataset Cheers :)

  11. Neo says:

    where is the dataset??

  12. Tristen Georgiou says:

    Seems like the CSV in this file isn’t well formatted (the tweet content isn’t always escaped properly). I was able to fix this using the following Python code:

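    import csv

    if __name__ == "__main__":
        # newline="" lets the csv module control line endings (Python 3).
        with open("../Downloads/Sentiment Analysis Dataset.csv", "r", newline="") as fin, \
             open("../Downloads/sad-clean.csv", "w", newline="") as fout:
            writer = csv.writer(fout)
            # We know the first 3 columns are consistent...
            for row in fin:
                parts = row.strip().split(",")
                # ...so everything after them is tweet text; rejoin it into one field.
                out = parts[0:3] + [",".join(parts[3:])]
                # csv.writer quotes the tweet field whenever it contains commas or quotes.
                writer.writerow(out)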
  13. anuj says:

    Ibrahim

    The dataset contains 1,578,627 tweets. Are these hand labeled ??
    The 2 sources you have cited contain 7086 and 5513 labeled tweets, which is less than 1% of your corpus.

    thanks
