NLTK Megam (Maximum Entropy) Library on 64-bit Linux

November 27, 2012

NLTK (Natural Language Toolkit) is a Python library that allows developers and researchers to extract information and annotations from text and to run classification algorithms such as Naive Bayes or Maximum Entropy, along with many other interesting natural language tools and processing techniques.

The Maximum Entropy algorithm in NLTK comes in different flavours. This post introduces the Max Ent classification algorithms supported by the NLTK library, and provides a MEGAM binary compiled on a 64-bit Linux (Ubuntu) machine, which is required for running NLTK Max Ent classification with the megam algorithm.

NLTK Max Ent

I found that, for sentiment classification, Maximum Entropy tends to produce much better accuracy than Naive Bayes, although it is difficult to generalize that statement to all classification types. The choice of algorithm can be highly dependent on what exactly you are trying to classify, the number of classes you have, and how exactly you extract features from a given text. I highly recommend experimenting as much as possible when it comes to natural language classification.

With NLTK, you get the following implementations of Maximum Entropy (or Logistic Regression) out of the box through the class MaxentClassifier (without installing any additional libraries):

GIS: Generalized Iterative Scaling
IIS: Improved Iterative Scaling, which is, for all intents and purposes, the same as above but produces a classifier that is more consistent with the labelled training data.

IIS is the default algorithm used by NLTK if no algorithm is provided (and SciPy is not installed).
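As a rough illustration of the built-in algorithms, the sketch below trains a MaxentClassifier with GIS or IIS on a handful of labelled sentences. The bag-of-words helper word_feats, the wrapper train_builtin_maxent and the toy data are assumptions made up for this example; MaxentClassifier.train and its algorithm, trace and max_iter arguments are NLTK's own.

```python
def word_feats(text):
    """Turn a sentence into a bag-of-words feature dict (hypothetical helper)."""
    return {word: True for word in text.lower().split()}


def train_builtin_maxent(labelled_texts, algorithm="iis", max_iter=10):
    """Train NLTK's MaxentClassifier with a built-in algorithm ('gis' or 'iis')."""
    # Imported lazily so the feature helper above works without NLTK installed.
    from nltk.classify import MaxentClassifier

    train_feats = [(word_feats(text), label) for text, label in labelled_texts]
    # trace=0 silences the per-iteration log; max_iter caps the training iterations.
    return MaxentClassifier.train(
        train_feats, algorithm=algorithm, trace=0, max_iter=max_iter
    )


# Toy data, invented for illustration only.
TOY_DATA = [
    ("I love this film", "pos"),
    ("what a great movie", "pos"),
    ("I hate this film", "neg"),
    ("what a terrible movie", "neg"),
]
```

Calling train_builtin_maxent(TOY_DATA, algorithm="gis") returns a classifier, and classifier.classify(word_feats("great film")) then returns one of the two labels. With this little data the prediction is meaningless; the point is only to show where the algorithm name goes.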

NLTK Max Ent with SciPy

If you include the SciPy library, a scientific Python library that brings complex mathematical modelling into Python, you get the following additional Max Ent algorithms:

CG: Conjugate Gradient
BFGS: Broyden-Fletcher-Goldfarb-Shanno algorithm
Powell: Powell algorithm
LBFGSB: The limited-memory variant of BFGS.
Nelder-Mead: The Nelder-Mead algorithm

CG is the default algorithm used by NLTK if no algorithm is provided (and SciPy is installed).

I personally had a few issues running the CG algorithm on both 64-bit Windows and 64-bit Linux: training on both machines ended up taking days and eating up more than 30 GB of memory, with no result and no error message, but that’s a post for another day.

NLTK Max Ent with MEGAM

MEGAM is an OCaml-based Maximum Entropy project that originated at the University of Utah. It is a very cool external library that can be plugged into NLTK and used to analyse your training corpus and build the classifier. MEGAM tends to perform much better than SciPy’s algorithms in terms of speed and resource consumption.

MEGAM: MEGA Model Optimization Package

Before using MEGAM with NLTK, you need to either download the MEGAM source and compile it on your machine, for which you’ll need OCaml installed, or download the binaries from the MEGAM site, which can be run directly. I have also provided compiled binaries of MEGAM (5th (b) release) for 64-bit Linux, since the site only offers 32-bit executables. The zipped folder contains both MEGAM and MEGAM.OPT, the optimized version of MEGAM.

Also, since MEGAM is an external library, in order to reference it from within your NLTK application you need to add the following line of code:

nltk.config_megam('<FULL-PATH-TO-MEGAM-EXECUTABLE>/./megam.opt')
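One thing worth checking before that call is that the path is absolute and points at a file you are allowed to execute, since Python does not expand ~ in a path string on its own. The helper below is a minimal sketch of such a check. The names resolve_megam_path and configure_megam are hypothetical; only nltk.config_megam comes from NLTK.

```python
import os


def resolve_megam_path(path):
    """Expand '~', make the path absolute, and check it is an executable file."""
    full_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full_path):
        raise FileNotFoundError("No MEGAM binary at %s" % full_path)
    if not os.access(full_path, os.X_OK):
        # On Linux this is usually fixed with: chmod +x <path>
        raise PermissionError("%s is not executable" % full_path)
    return full_path


def configure_megam(path):
    """Point NLTK at the MEGAM binary after validating the path."""
    import nltk  # imported lazily so the path check works without NLTK

    full_path = resolve_megam_path(path)
    nltk.config_megam(full_path)
    return full_path
```

Once configure_megam succeeds, training with MEGAM is a matter of passing algorithm='megam' to MaxentClassifier.train.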

If you are having any issues with the executable, please drop me a comment and I’ll be happy to help out, or even provide the virtual machine I used to compile the OCaml code.

NLTK is really a lot of fun, and MEGAM makes designing a Logistic Regression classifier a breeze. That matters because this is very much a trial and elimination process, especially when it comes to identifying which features in natural language contribute most towards identifying the classification group of a sentence.
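To make that trial and elimination loop a little more concrete, here is a small sketch that trains one classifier per candidate feature extractor and reports accuracy on held-out data. The two extractors and the compare_extractors helper are assumptions for illustration; nltk.classify.accuracy and show_most_informative_features are existing NLTK calls. It defaults to algorithm='megam', so MEGAM must already be configured; pass 'iis' to try it without MEGAM.

```python
def unigram_feats(text):
    """Bag-of-words features (hypothetical extractor)."""
    return {"has(%s)" % w: True for w in text.lower().split()}


def bigram_feats(text):
    """Unigram features plus adjacent word pairs (hypothetical extractor)."""
    words = text.lower().split()
    feats = unigram_feats(text)
    for first, second in zip(words, words[1:]):
        feats["pair(%s %s)" % (first, second)] = True
    return feats


def compare_extractors(train_data, test_data, extractors, algorithm="megam"):
    """Return {extractor name: accuracy on test_data} for each feature extractor."""
    # Imported lazily so the extractors above work without NLTK installed.
    import nltk
    from nltk.classify import MaxentClassifier

    scores = {}
    for name, extract in extractors.items():
        train_feats = [(extract(t), label) for t, label in train_data]
        test_feats = [(extract(t), label) for t, label in test_data]
        classifier = MaxentClassifier.train(train_feats, algorithm=algorithm, trace=0)
        scores[name] = nltk.classify.accuracy(classifier, test_feats)
        # Print the features that carry the most weight for this extractor.
        classifier.show_most_informative_features(5)
    return scores
```

A call such as compare_extractors(train, test, {"unigram": unigram_feats, "bigram": bigram_feats}) takes lists of (text, label) pairs and makes it easy to drop an extractor that does not pay for itself.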

Tags: classification, logistic regression, max ent, megam, natural language processing, nltk

12 replies

Marco Ippolito says:
December 31, 2013 at 1:37 pm

Hi,
thanks for your useful post.

I’m trying to install and use MEGAM on my Ubuntu AWS micro instance:
~/nltk_data/MEGAM]$ ls -a
. .. megam-64 megam-64.opt

nltk.config_megam('~/nltk_data/MEGAM/megam-64.opt')
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/megam.py", line 59, in config_megam
    url='http://www.cs.utah.edu/~hal/megam/')
  File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 528, in find_binary
    url, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 512, in find_file
    raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
LookupError:

===========================================================================
NLTK was unable to find the ~/nltk_data/MEGAM/megam-64.opt file!
Use software specific configuration paramaters or set the MEGAM environment variable.

For more information, on ~/nltk_data/MEGAM/megam-64.opt, see:

What do I have to do in order to make it work?

Kind regards.
Marco
Reply
- Links Naji says:
  January 4, 2014 at 1:21 pm
  
  Hi Marco,
  
  Apologies for the delayed response, I was away for Christmas and New Year.
  
  The error you are getting seems to relate to finding the megam file. There are two reasons that might be happening (that I can think of for now, anyway):
  
  > The file path is incorrect: try using the full path to the megam.opt file, rather than a relative path, when calling the nltk.config_megam() function, just to see if that fixes things. It could be as simple as that.
  
  > There is a mismatch between the architecture the file is compiled for and the Python/Ubuntu versions running: the file is compiled for 64-bit, are you sure the rest of the dependencies are too?
  
  Cheers!
  Reply
  - Marco Ippolito says:
    January 4, 2014 at 2:50 pm
    
    Hi,
    thanks for your suggestions.
    I’ve solved it along with the memory shortage (adding swap memory space).
    Now I have this hurdle to overcome:
    
    Traceback (most recent call last):
      File "classifying.py", line 492, in
        me_classifier = MaxentClassifier.train(train_feats, algorithm='megam')
      File "/usr/local/lib/python2.7/dist-packages/nltk/classify/maxent.py", line 319, in train
        gaussian_prior_sigma, **cutoffs)
      File "/usr/local/lib/python2.7/dist-packages/nltk/classify/maxent.py", line 1522, in train_maxent_classifier_with_megam
        stdout = call_megam(options)
      File "/usr/local/lib/python2.7/dist-packages/nltk/classify/megam.py", line 167, in call_megam
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
      File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
        errread, errwrite)
      File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
        raise child_exception
    OSError: [Errno 13] Permission denied
    
    Any hints to solve it?
    
    Thanks in advance.
    Marco
    Reply
    - Links Naji says:
      January 4, 2014 at 4:29 pm
      
      Hmm… a long shot, but did you try running the process with admin rights? (sudo command perhaps?).
      Reply
      - Marco Ippolito says:
        January 4, 2014 at 4:31 pm
        
        Yes, both sudo classifying.py and sudo -s classifying.py, with the same result as above.
      - Links Naji says:
        January 4, 2014 at 10:47 pm
        
        Well, the steps of what actually happens when the MEGAM algorithm is selected in NLTK are the following:
        
        > NLTK will create temporary files that contain the training set (in a format acceptable to MEGAM).
        > NLTK will spawn a subprocess that calls, via the command line, the MEGAM (.opt) executable, passing the location of the temporary file as a parameter (along with some other parameters).
        > The MEGAM executable will run over the temporary file and, when done, create an output dump of the results and exit.
        > NLTK will resume by consuming the MEGAM output dump and creating the trained classification object.
        
        In the past I have actually identified the exact command line statements that NLTK tries to run (on the MEGAM executable), and ran it directly myself (without NLTK, on a temporary file I created).
        
        So it seems to me that, for some reason, the user associated with the execution context of the command line statement is unable to spawn a subprocess, which will be used to execute the MEGAM executable.
        
        Maybe try changing the permissions on the MEGAM executable (.opt file), or on the location where NLTK creates temporary files; maybe one of them is read-only? Something along the execution flow described above doesn’t seem to be configured correctly, although I am struggling to tell exactly which bit.
        
        You could always try to execute MEGAM directly, without NLTK, and see what happens!
        
        Hope this helps!
Sebastian says:
May 11, 2015 at 4:41 am

Great Post. Thanks for sharing this post of Megam.
Reply
- Links Naji says:
  May 16, 2015 at 2:42 pm
  
  Thanks Sebastian, glad you found it useful.
  Reply
Sushma NBharadwaj says:
November 18, 2016 at 2:54 am

Hi,

Thank you for such a wonderful tutorial. Could you please let me know how I can install it on a Windows system?

Thank you
Reply
zhuguowei says:
October 6, 2017 at 11:52 am

I see continuous output of `optimizing with lambda = 0`. Is that normal?
Reply
- Links Naji says:
  October 9, 2017 at 2:53 am
  
  Honestly, I am not sure. I haven’t used NLTK for a while now, and so not even sure what the latest is with this library.
  I would, however, recommend using TensorFlow, which allows you to vectorize words and train an NLP model (example: https://www.tensorflow.org/tutorials/word2vec). IMO, it is a much more powerful library.
  Reply
