NLTK Megam (Maximum Entropy) Library on 64-bit Linux
NLTK (Natural Language Toolkit) is a Python library that allows developers and researchers to extract information and annotations from text and run classification algorithms such as Naive Bayes or Maximum Entropy, along with many other interesting Natural Language processing tools and techniques.
The Maximum Entropy algorithm in NLTK comes in different flavours. This post introduces the different MaxEnt classification flavours supported by the NLTK library, and provides a MEGAM binary compiled on a 64-bit Linux (Ubuntu) machine, which is a requirement for running NLTK MaxEnt classification with the megam algorithm.
NLTK Max Ent
I found that Maximum Entropy text classification tends to produce much better accuracy than the Naive Bayes approach for sentiment, although it is really difficult to generalise that statement to all classification tasks: the choice of algorithm can be highly dependent on what exactly you are trying to classify, how many classes you have, and how exactly you extract features from a given text for classification. I highly recommend experimenting as much as possible when it comes to natural language classification.
With NLTK, you get the following implementations of Maximum Entropy (or Logistic Regression) out of the box through the MaxentClassifier class (without installing any additional libraries):
- GIS: Generalized Iterative Scaling
- IIS: Improved Iterative Scaling, which is, for all intents and purposes, the same as above but tends to fit the labelled training data more consistently.
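Both built-in flavours are selected through the algorithm argument of MaxentClassifier.train(). A minimal sketch, assuming NLTK is installed, using a toy sentiment training set (the feature names are made up for illustration):

```python
# Minimal sketch: training NLTK's built-in MaxentClassifier with GIS on a
# toy sentiment dataset. Each training example is a (feature dict, label) pair.
from nltk.classify import MaxentClassifier

train = [
    ({'contains(good)': True}, 'pos'),
    ({'contains(great)': True}, 'pos'),
    ({'contains(bad)': True}, 'neg'),
    ({'contains(awful)': True}, 'neg'),
]

# algorithm='GIS' selects Generalized Iterative Scaling; 'IIS' is passed the
# same way. trace=0 silences the per-iteration training log.
classifier = MaxentClassifier.train(train, algorithm='GIS', trace=0, max_iter=10)

print(classifier.classify({'contains(good)': True}))
```

The same call shape is used for every other algorithm discussed below; only the algorithm string changes.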
NLTK Max Ent with SciPy
If you include the SciPy library, a scientific Python library that brings complex mathematical modelling into Python, you get the following additional MaxEnt algorithms:
- CG: Conjugate Gradient
- BFGS: Broyden-Fletcher-Goldfarb-Shanno algorithm
- Powell: Powell algorithm
- LBFGSB: A limited-memory variant of BFGS.
- Nelder-Mead: The Nelder-Mead algorithm
CG is the default algorithm used by NLTK if no algorithm is provided (and SciPy is installed).
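Which algorithm names are actually accepted depends on your NLTK version (more recent releases dropped the SciPy-backed trainers entirely), so it is worth checking the list programmatically:

```python
# Print the algorithm names your installed NLTK build accepts for
# MaxentClassifier.train(). On older versions with SciPy installed this
# included CG, BFGS, Powell, LBFGSB and Nelder-Mead; newer versions expose
# a shorter list.
from nltk.classify import MaxentClassifier

print(MaxentClassifier.ALGORITHMS)
```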
I personally had a few issues running the CG algorithm on both 64-bit Windows and 64-bit Linux: training on both machines ended up taking days and eating more than 30 GB of memory, with no result and no error message. But that's a post for another day.
NLTK Max Ent with MEGAM
MEGAM is an OCaml-based Maximum Entropy project that originated from the University of Utah. It is a very cool external library that can be plugged into NLTK and used to analyse your training corpus and build out the classifier, and it tends to perform much better than the SciPy algorithms in terms of speed and resource consumption.
- MEGAM: MEGA Model Optimization Package
Before using MEGAM with NLTK, you need to download the MEGAM library and compile it on your machine, for which you'll need OCaml installed; alternatively, you can download the binaries from the MEGAM site, which can be run directly. I have also provided the compiled binaries of MEGAM (5th (b) release) for 64-bit Linux, since the site only offers the 32-bit executables. The zipped folder contains both MEGAM and MEGAM.OPT, the latter being the optimized version of MEGAM.
Also, since MEGAM is an external library, in order to reference it from within your NLTK application you need to add the following line of code:
If you are having any issues with the executable, please drop me a comment and I'll be happy to help out, or even provide the virtual machine I used to compile the OCaml code.
NLTK is really a lot of fun, and MEGAM makes designing a Logistic Regression classifier a breeze, particularly since this is very much a trial-and-elimination process: identifying which features of natural language contribute most towards identifying the classification group of a sentence.
Thanks for your useful post.
I'm trying to install and use MEGAM on my Ubuntu AWS micro instance:
. .. megam-64 megam-64.opt
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/nltk/classify/megam.py", line 59, in config_megam
File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 528, in find_binary
File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 512, in find_file
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
NLTK was unable to find the ~/nltk_data/MEGAM/megam-64.opt file!
Use software specific configuration parameters or set the MEGAM environment variable.
For more information on ~/nltk_data/MEGAM/megam-64.opt, see:
What do I have to do in order to make it work?
Apologies for the delayed response, I was away for Christmas and New Year.
The error you are getting seems to relate to finding the megam file; two reasons why that might be happening (that I can think of for now, anyway):
> The file path is incorrect: try using the full path to the megam.opt file, rather than a relative path, when calling the nltk.config_megam() function, just to see if that fixes things. It could be as simple as that.
> There is a mismatch between the architecture the file is compiled for and the Python/Ubuntu versions running: the file is compiled for 64-bit, are you sure the rest of the dependencies are too?
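For the second point, a quick way to compare the binary's architecture against the machine's, using the path from the error message above (adjust to yours):

```shell
# Inspect the binary: a 64-bit build should report "ELF 64-bit ... x86-64".
file ~/nltk_data/MEGAM/megam-64.opt

# And the machine itself: a 64-bit Ubuntu prints x86_64.
uname -m
```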
thanks for your suggestions.
I've solved it, along with a memory shortage (by adding swap space).
Now I have this hurdle to overcome:
Traceback (most recent call last):
File "classifying.py", line 492, in
me_classifier = MaxentClassifier.train(train_feats, algorithm='megam')
File "/usr/local/lib/python2.7/dist-packages/nltk/classify/maxent.py", line 319, in train
File "/usr/local/lib/python2.7/dist-packages/nltk/classify/maxent.py", line 1522, in train_maxent_classifier_with_megam
stdout = call_megam(options)
File "/usr/local/lib/python2.7/dist-packages/nltk/classify/megam.py", line 167, in call_megam
p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
OSError: [Errno 13] Permission denied
Any hints to solve it?
Thanks in advance.
Hmm… a long shot, but did you try running the process with admin rights (via sudo, perhaps)?
Yes, both sudo classifying.py and sudo -s classifying.py, with the same result as above.
Well, here is what actually happens when the MEGAM algorithm is selected in NLTK:
> NLTK creates a temporary file that contains the training set (in a format acceptable to MEGAM).
> NLTK spawns a subprocess that calls, via the command line, the MEGAM (.opt) executable, providing the location of the temporary file as a parameter (along with some other parameters).
> The MEGAM executable runs over the temporary file and, when done, writes an output dump of the results and exits.
> NLTK resumes by consuming the MEGAM output dump and creating the trained classification object.
In the past, I have actually identified the exact command-line statements NLTK tries to run (against the MEGAM executable) and ran them directly myself, without NLTK, on a temporary file I created.
So it seems to me that, for some reason, the user associated with the execution context of the command-line statement is unable to spawn the subprocess used to execute the MEGAM executable.
Maybe try changing the permissions on the MEGAM executable (.opt file), or on the location where NLTK creates temporary files; maybe one of them is read-only? Something along the execution flow described above doesn't seem to be configured correctly, although I am struggling to tell exactly which bit.
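One concrete thing to check for Errno 13 is the execute bit on the binary itself. A sketch, using the path from the earlier error message (adjust to yours):

```shell
# Errno 13 from subprocess.Popen usually means the file exists but the
# current user is not allowed to execute it. Grant execute permission:
chmod +x ~/nltk_data/MEGAM/megam-64.opt

# Verify the execute bit is now set (look for 'x' in the mode string):
ls -l ~/nltk_data/MEGAM/megam-64.opt
```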
You could always try to execute MEGAM directly, without NLTK, and see what happens!
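A sketch of that direct run, assuming the implicit input format and the 'multiclass' model type from the MEGAM documentation (the paths and feature names are examples):

```shell
# Hand-build a tiny training file in MEGAM's implicit format:
# one instance per line, "LABEL feature1 feature2 ...".
cat > /tmp/train.megam <<'EOF'
pos contains_good
pos contains_great
neg contains_bad
neg contains_awful
EOF

# MEGAM writes the learned weights to stdout; guard on the binary existing.
if [ -x ~/nltk_data/MEGAM/megam-64.opt ]; then
  ~/nltk_data/MEGAM/megam-64.opt multiclass /tmp/train.megam > /tmp/weights.txt
fi
```

If this direct call also fails with a permission error, the problem is with the binary or its location rather than with NLTK.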
Hope this helps!
Great Post. Thanks for sharing this post of Megam.
Thanks Sebastian, glad you found it useful.
Thank you for such a wonderful tutorial. Could you please let me know how I can install it on a Windows system?
I keep seeing continuous output of `optimizing with lambda = 0`; is that normal?
Honestly, I am not sure. I haven’t used NLTK for a while now, and so not even sure what the latest is with this library.
I would, however, recommend using TensorFlow, which allows you to vectorize words and train an NLP model (example: https://www.tensorflow.org/tutorials/word2vec). IMO, it is a much more powerful library.