NLTK (Natural Language Toolkit) is a Python library that allows developers and researchers to extract information and annotations from text and run classification algorithms such as Naive Bayes or Maximum Entropy, among many other interesting Natural Language processing tools and techniques.
The Maximum Entropy algorithm in NLTK comes in several flavours. This post introduces the different Max Ent classification flavours supported by the library, and also provides a MEGAM binary compiled on a 64-bit Linux (Ubuntu) machine, which is required for running NLTK Max Ent classification with the megam algorithm.
NLTK Max Ent
I have found that Maximum Entropy text classification tends to produce much better accuracy than the Naive Bayes approach for sentiment, although it is difficult to generalize that statement to all classification tasks: the best choice of algorithm depends heavily on what exactly you are trying to classify, how many classes you have, and how you extract features from a given text. I highly recommend experimenting as much as possible when it comes to natural language classification.
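As a sketch of that kind of experimentation, the snippet below trains both classifiers on a tiny, made-up sentiment corpus and compares their accuracy on held-out examples (the corpus and the bag-of-words feature extractor are illustrative assumptions, not data from this post):

```python
from nltk.classify import NaiveBayesClassifier, MaxentClassifier
from nltk.classify.util import accuracy

# Simple bag-of-words presence features; real features would be richer.
def word_feats(words):
    return {w: True for w in words}

# Toy labelled corpus, purely for illustration.
train = [(word_feats(s.split()), label) for s, label in [
    ('great fun film', 'pos'), ('dull boring plot', 'neg'),
    ('really enjoyable fun', 'pos'), ('awful dull acting', 'neg')]]
test = [(word_feats('fun film'.split()), 'pos'),
        (word_feats('boring acting'.split()), 'neg')]

nb = NaiveBayesClassifier.train(train)
# max_iter keeps the demo quick; trace=0 silences the per-iteration log.
me = MaxentClassifier.train(train, algorithm='gis', max_iter=10, trace=0)
print('Naive Bayes:', accuracy(nb, test))
print('Max Ent:   ', accuracy(me, test))
```

On a corpus this small the numbers mean little; the point is that swapping classifiers and feature extractors is cheap, so trying several is usually worth it.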
With NLTK, you get the following implementations of Maximum Entropy (also known as Logistic Regression) out of the box through the MaxentClassifier class, without installing any additional libraries:
- GIS: Generalized Iterative Scaling
- IIS: Improved Iterative Scaling, which is, for all intents and purposes, the same as the above but tends to fit the labelled training data more consistently.
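Both built-in trainers are selected by name via the algorithm parameter of MaxentClassifier.train. A minimal sketch, using a hypothetical four-example training set:

```python
from nltk.classify import MaxentClassifier

# Toy training set of (feature dict, label) pairs; the features here are
# just word-presence flags, chosen only to make the example self-contained.
train_set = [
    ({'great': True, 'fun': True}, 'pos'),
    ({'awful': True, 'boring': True}, 'neg'),
    ({'fun': True, 'enjoyable': True}, 'pos'),
    ({'boring': True, 'dull': True}, 'neg'),
]

# algorithm='iis' (or 'gis') picks the built-in iterative-scaling trainer;
# max_iter caps the training iterations and trace=0 silences logging.
classifier = MaxentClassifier.train(train_set, algorithm='iis',
                                    max_iter=10, trace=0)
print(classifier.classify({'fun': True}))
```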
NLTK Max Ent with SciPy
If you install the SciPy library, a scientific Python library that brings complex mathematical modelling into Python, you get the following additional Max Ent algorithms:
- CG: Conjugate Gradient
- BFGS: Broyden-Fletcher-Goldfarb-Shanno algorithm
- Powell: Powell algorithm
- LBFGSB: A limited-memory variant of BFGS.
- Nelder-Mead: The Nelder-Mead algorithm
CG is the default algorithm used by NLTK if no algorithm is provided (and SciPy is installed).
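You can check which algorithm names your installed NLTK actually accepts through the classifier's ALGORITHMS attribute; note that newer NLTK releases dropped the SciPy-backed trainers, so depending on your version the list may contain only GIS, IIS, MEGAM and TADM:

```python
from nltk.classify import MaxentClassifier

# Names accepted by MaxentClassifier.train(..., algorithm=...).
# Which SciPy-backed entries appear depends on your NLTK version.
print(MaxentClassifier.ALGORITHMS)
```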
I personally had a few issues running the CG algorithm on both 64-bit Windows and 64-bit Linux: training on both machines took days and consumed more than 30 GB of memory with no result and no error message, but that's a post for another day.
NLTK Max Ent with MEGAM
MEGAM is an OCaml-based Maximum Entropy project that originated at the University of Utah. It is a very useful external library that can be plugged into NLTK and used to analyse your training corpus and build out the classifier, and it tends to perform much better than the SciPy algorithms in terms of speed and resource consumption.
- MEGAM: MEGA Model Optimization Package
Before using MEGAM with NLTK, you need to download the MEGAM library and compile it on your machine, which requires an OCaml installation; alternatively, you can download the binaries from the MEGAM site and run them directly. I have also provided compiled binaries of MEGAM (5th (b) release) for 64-bit Linux, since the site only offers 32-bit executables. The zipped folder contains both MEGAM and MEGAM.OPT, the optimized version of MEGAM.
Also, since MEGAM is an external library, in order to reference it from within your NLTK application you need to add the following line of code:
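(The original snippet does not appear here; the standard way to register the binary with NLTK is config_megam, shown below with a placeholder path that you should point at wherever you extracted megam or megam.opt.)

```python
import nltk

# Placeholder path: adjust to the location of your compiled megam binary.
MEGAM_PATH = '/usr/local/bin/megam'

# config_megam registers the binary with NLTK; it raises a LookupError
# if no binary is found at the given path.
try:
    nltk.config_megam(MEGAM_PATH)
except LookupError:
    print('megam binary not found; adjust MEGAM_PATH')

# Once configured, select the external optimizer by name when training:
# classifier = nltk.classify.MaxentClassifier.train(train_set, algorithm='megam')
```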
If you are having any issues with the executable, please drop me a comment and I'll be happy to help out, or even provide the virtual machine I used to compile the OCaml code.
NLTK is really a lot of fun, and MEGAM makes building Logistic Regression classifiers a breeze, particularly since this is very much a trial-and-error process, especially when it comes to identifying which features in natural language contribute most towards identifying the classification group of a sentence.