NLTK Megam (Maximum Entropy) Library on 64-bit Linux

November 27, 2012

NLTK (Natural Language Toolkit) is a Python library that allows developers and researchers to extract information and annotations from text and to run classification algorithms such as Naive Bayes or Maximum Entropy, alongside many other interesting natural language tools and processing techniques.

The Maximum Entropy algorithm in NLTK comes in different flavours. This post introduces the Max Ent classification algorithms supported by the NLTK library and provides a MEGAM binary compiled for 64-bit Linux (Ubuntu), which is required for running NLTK’s Max Ent classification with the megam algorithm.

NLTK Max Ent

In my experience, Maximum Entropy classification of sentiment tends to produce much better accuracy than the Naive Bayes approach. It is difficult to generalize that statement to all classification tasks, though, as the best choice of algorithm can depend heavily on what exactly you are trying to classify, the number of classes you have, and how you extract features from a given text. I highly recommend trialling as many combinations as possible when it comes to natural language classification.
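To make that kind of trialling concrete, here is a minimal sketch that trains both classifiers on the same split and reports their accuracy. The helper names (`bag_of_words`, `split_dataset`, `compare_classifiers`) and the tiny corpus are made up for illustration, and the bag-of-words extractor is only one assumption about how features might be built; swap in your own.

```python
import random


def bag_of_words(text):
    """Turn a sentence into the {feature_name: value} dict NLTK classifiers expect."""
    return {"contains(%s)" % word: True for word in text.lower().split()}


def split_dataset(labelled, test_fraction=0.25, seed=0):
    """Shuffle (features, label) pairs reproducibly and split into train/test lists."""
    items = list(labelled)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]


def compare_classifiers(labelled):
    """Train Naive Bayes and Max Ent on the same split and return both accuracies."""
    import nltk  # imported lazily so the helpers above work without NLTK installed

    train, test = split_dataset(labelled)
    nb = nltk.NaiveBayesClassifier.train(train)
    me = nltk.MaxentClassifier.train(train, algorithm="iis", trace=0, max_iter=10)
    return {
        "naive_bayes": nltk.classify.accuracy(nb, test),
        "maxent": nltk.classify.accuracy(me, test),
    }


# Hypothetical toy corpus, only to show the shape of the data.
corpus = [
    ("I love this phone", "pos"),
    ("What a great camera", "pos"),
    ("Really happy with the battery", "pos"),
    ("Fantastic screen and sound", "pos"),
    ("I hate this phone", "neg"),
    ("What a terrible camera", "neg"),
    ("Really unhappy with the battery", "neg"),
    ("Awful screen and sound", "neg"),
]
labelled = [(bag_of_words(text), label) for text, label in corpus]

# With NLTK installed, run: print(compare_classifiers(labelled))
```

On a corpus this small the numbers mean nothing; the point is that changing the feature extractor or the algorithm is a one-line change, which makes side-by-side trials cheap.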

With NLTK, you get the following implementations of Maximum Entropy (or Logistic Regression) out of the box through the class MaxentClassifier (without installing any additional libraries):

  • GIS: Generalized Iterative Scaling
  • IIS: Improved Iterative Scaling

IIS is the default algorithm used by NLTK if no algorithm is provided (and SciPy is not installed).
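As a small sketch of picking one of the bundled trainers explicitly: NLTK’s iterative scaling algorithms are selected by name ("gis" or "iis") through the `algorithm` argument of `MaxentClassifier.train`. The wrapper names below (`check_algorithm`, `train_builtin_maxent`) are hypothetical, and the `max_iter` value is an arbitrary choice, not a recommendation.

```python
# Iterative scaling trainers that ship with NLTK itself.
BUILTIN_ALGORITHMS = ("gis", "iis")


def check_algorithm(name):
    """Normalise an algorithm name and reject anything that needs extra libraries."""
    normalised = name.lower()
    if normalised not in BUILTIN_ALGORITHMS:
        raise ValueError(
            "%r is not bundled with NLTK; choose one of %s"
            % (name, ", ".join(BUILTIN_ALGORITHMS))
        )
    return normalised


def train_builtin_maxent(train_feats, algorithm="iis", max_iter=20):
    """Train a MaxentClassifier on (featureset, label) pairs with a bundled algorithm."""
    algorithm = check_algorithm(algorithm)
    from nltk.classify import MaxentClassifier  # lazy import: only needed for training

    # trace=0 silences the per-iteration log; max_iter caps the number of iterations.
    return MaxentClassifier.train(
        train_feats, algorithm=algorithm, trace=0, max_iter=max_iter
    )
```

Usage would look like `classifier = train_builtin_maxent(train_feats, "gis")`, followed by `classifier.classify(featureset)` on unseen feature sets.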

NLTK Max Ent with SciPy

If you include the SciPy library, which is a scientific Python library that brings complex mathematical modelling into Python, you get the following additional Max Ent algorithms:

  • CG: Conjugate Gradient
  • BFGS
  • Powell
  • LBFGSB
  • Nelder-Mead

CG is the default algorithm used by NLTK if no algorithm is provided (and SciPy is installed).

I personally had a few issues running the CG algorithm on both 64-bit Windows and 64-bit Linux: training on both machines ended up taking days and eating more than 30 GB of memory, with no result and no error message. But that’s a post for another day.

NLTK Max Ent with MEGAM

MEGAM is an OCaml-based Maximum Entropy project that originated at the University of Utah. It is a very cool external library that can be plugged into NLTK and used to analyse your training corpus and build the classifier. MEGAM tends to perform much better than the SciPy algorithms in terms of speed and resource consumption.

  • MEGAM: MEGA Model Optimization Package

Before using MEGAM with NLTK, you need to download the MEGAM source and compile it on your machine, for which you’ll need OCaml installed. Alternatively, you can download the binaries from the MEGAM site and run them directly. Since the site only offers 32-bit executables, I have also provided compiled binaries of MEGAM (5th (b) release) for 64-bit Linux. The zipped folder contains both MEGAM and MEGAM.OPT, which is the optimized version of MEGAM.

Also, since MEGAM is an external library, in order to reference it from within your NLTK application you need to add the following line of code:

nltk.config_megam('<FULL-PATH-TO-MEGAM-EXECUTABLE>/./megam.opt')
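Two things commonly go wrong with that line: a path starting with `~` is not expanded by Python on its own, and a freshly unzipped binary may lack the executable bit. The sketch below guards against both before handing the path to `nltk.config_megam`. The helper names (`resolve_megam_path`, `ensure_executable`, `configure_megam`) are hypothetical, and it assumes you are allowed to change the file’s mode.

```python
import os
import stat


def resolve_megam_path(path):
    """Expand '~' and environment variables, then return an absolute path."""
    return os.path.abspath(os.path.expanduser(os.path.expandvars(path)))


def ensure_executable(path):
    """Add the owner-executable bit if it is missing.

    Returns True if the file is executable by the current user afterwards.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    if not os.access(path, os.X_OK):
        mode = os.stat(path).st_mode
        os.chmod(path, mode | stat.S_IXUSR)
    return os.access(path, os.X_OK)


def configure_megam(path):
    """Resolve the path, fix permissions, and register the binary with NLTK."""
    import nltk  # lazy import: the helpers above do not need NLTK

    full_path = resolve_megam_path(path)
    ensure_executable(full_path)
    nltk.config_megam(full_path)
    return full_path
```

With that in place, `configure_megam('~/nltk_data/MEGAM/megam.opt')` (an example location, not a required one) followed by `MaxentClassifier.train(train_feats, algorithm='megam')` should pick up the binary.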

If you are having any issues with the executable, please drop me a comment and I’ll be happy to help out, or even provide the virtual machine that I used to compile the OCaml code.

NLTK is really a lot of fun, and MEGAM makes designing a Logistic Regression classifier a breeze. That matters because this is very much a process of trial and elimination, particularly when identifying which features of natural language contribute most towards identifying the classification group of a sentence.

 

12 replies
  1. Marco Ippolito says:

    Hi,
    thanks for your useful post.

    I’m trying to install and use MEGAM on my Ubuntu AWS micro instance:
    ~/nltk_data/MEGAM]$ls -a
    . .. megam-64 megam-64.opt

    nltk.config_megam('~/nltk_data/MEGAM/megam-64.opt')
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/local/lib/python2.7/dist-packages/nltk/classify/megam.py", line 59, in config_megam
    url='http://www.cs.utah.edu/~hal/megam/')
    File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 528, in find_binary
    url, verbose)
    File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 512, in find_file
    raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
    LookupError:

    ===========================================================================
    NLTK was unable to find the ~/nltk_data/MEGAM/megam-64.opt file!
    Use software specific configuration paramaters or set the MEGAM environment variable.

    For more information, on ~/nltk_data/MEGAM/megam-64.opt, see:

    What do I have to do in order to make it work?

    Kind regards.
    Marco

    • Links Naji says:

      Hi Marco,

      Apologies for the delayed response, I was away for Christmas and New Year.

      The error you are getting seems to be related to finding the megam file. There are two reasons this might be happening (that I can think of for now, anyway):

      > The file path is incorrect: try using the full path to the megam.opt file, rather than a relative path, when calling the nltk.config_megam() function, just to see if that fixes things. It could be as simple as that.

      > There is a mismatch between the architecture the file is compiled for and the Python/Ubuntu versions you are running: the file is compiled for 64-bit, so are you sure the rest of the dependencies are too?

      Cheers!

      • Marco Ippolito says:

        Hi,
        thanks for your suggestions.
        I’ve solved it along with the memory shortage (adding swap memory space).
        Now I have this hurdle to overcome:

        Traceback (most recent call last):
        File "classifying.py", line 492, in <module>
        me_classifier = MaxentClassifier.train(train_feats, algorithm='megam')
        File "/usr/local/lib/python2.7/dist-packages/nltk/classify/maxent.py", line 319, in train
        gaussian_prior_sigma, **cutoffs)
        File "/usr/local/lib/python2.7/dist-packages/nltk/classify/maxent.py", line 1522, in train_maxent_classifier_with_megam
        stdout = call_megam(options)
        File "/usr/local/lib/python2.7/dist-packages/nltk/classify/megam.py", line 167, in call_megam
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
        errread, errwrite)
        File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
        raise child_exception
        OSError: [Errno 13] Permission denied

        Any hints to solve it?

        Thanks in advance.
        Marco

        • Links Naji says:

          Hmm… a long shot, but did you try running the process with admin rights? (sudo command perhaps?).

          • Marco Ippolito says:

            Yes, both sudo classifying.py and sudo -s classifying.py, with the same result as above.

          • Links Naji says:

            Well, here is what actually happens when the MEGAM algorithm is selected in NLTK:

            > NLTK creates a temporary file that contains the training set (in a format acceptable to MEGAM).
            > NLTK spawns a subprocess that calls the MEGAM (.opt) executable via the command line, passing the location of the temporary file as a parameter (along with some other parameters).
            > The MEGAM executable runs over the temporary file and, when done, creates an output dump of the results and exits.
            > NLTK resumes by consuming the MEGAM output dump and creating the trained classification object.
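
            The steps above can be sketched roughly as follows. This is a simplified illustration and not NLTK’s actual code: the function names are hypothetical, only the `-fvals` and `-quiet` options and the `multiclass` model type are passed (NLTK passes more), and features are written by name, so feature names must not contain whitespace.

```python
import os
import subprocess
import tempfile


def write_megam_file(train_toks, labels, path):
    """Write (featureset, label) pairs, one instance per line, as:
    <class index> <feature> <value> <feature> <value> ...
    """
    with open(path, "w") as out:
        for feats, label in train_toks:
            parts = [str(labels.index(label))]
            for name, value in sorted(feats.items()):
                parts.append("%s %s" % (name, float(value)))
            out.write(" ".join(parts) + "\n")


def build_megam_command(megam_bin, train_path):
    """Command line for a multiclass model whose input carries feature values."""
    return [megam_bin, "-fvals", "-quiet", "multiclass", train_path]


def run_megam(megam_bin, train_toks, labels):
    """Write a temporary training file, run MEGAM on it, and return its stdout dump."""
    fd, train_path = tempfile.mkstemp(prefix="megam-", suffix=".txt")
    os.close(fd)
    try:
        write_megam_file(train_toks, labels, train_path)
        cmd = build_megam_command(megam_bin, train_path)
        # Popen raises OSError (e.g. Errno 13) if the binary cannot be executed.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        stdout, _ = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError("megam exited with status %d" % proc.returncode)
        return stdout.decode("utf-8")
    finally:
        os.remove(train_path)
```

            Running `run_megam` by hand with the path to your megam.opt is a quick way to see whether a failure comes from the executable itself or from the NLTK side.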

            In the past I have actually identified the exact command line statements that NLTK tries to run (on the MEGAM executable), and ran it directly myself (without NLTK, on a temporary file I created).

            So it seems to me that, for some reason, the user associated with the execution context of the command line statement is unable to spawn a subprocess, which will be used to execute the MEGAM executable.

            Maybe try changing the permissions on the MEGAM executable (.opt file), or on the location where NLTK creates temporary files; maybe one of them is read-only? Something along the execution flow described above doesn’t seem to be configured correctly, although I am struggling to tell exactly which bit.

            You could always try to execute MEGAM directly, without NLTK, and see what happens!

            Hope this helps!

  2. Sebastian says:

    Great Post. Thanks for sharing this post of Megam.

  3. Sushma NBharadwaj says:

    Hi,

    Thank you for such a wonderful tutorial. Could you please let me know how I can install it on a Windows system?

    Thank you

  4. zhuguowei says:

    I see continuous output of `optimizing with lambda = 0`. Is that normal?


