Keyword Stemming and Lemmatisation with Apache Solr
I recently started working with Apache Solr, and I am hugely impressed. The search technology is very solid and packs many IR, advanced search and NLP features out of the box.
In this post I will provide an overview of how to set up keyword stemming on a field in your Solr core. A stemming filter reduces each word to its stem, both when documents are indexed and when a query is analysed, so a search effectively returns documents containing other forms of the search term in addition to the term itself.
To quickly explain stemming in the context of Solr, let's take an example. Suppose the following documents have been indexed in a field called content within your Solr core:
- The boy ran to the store.
- The boy is running to the store.
- The boy will run to the store.
- The boy runs to the store every day.
- The boy will grow up to be a runner, if he keeps going to the store.
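For anyone who wants to reproduce the example, the five sentences above could be posted to the core as a Solr XML update message along these lines. This is only a sketch: the id field and its values are my assumption (most schemas define a unique key, but yours may be named differently), and only the content field comes from the example.

```xml
<!-- Hypothetical update message: the "id" field and its values are assumed, not part of the example -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="content">The boy ran to the store.</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="content">The boy is running to the store.</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="content">The boy will run to the store.</field>
  </doc>
  <doc>
    <field name="id">4</field>
    <field name="content">The boy runs to the store every day.</field>
  </doc>
  <doc>
    <field name="id">5</field>
    <field name="content">The boy will grow up to be a runner, if he keeps going to the store.</field>
  </doc>
</add>
```

A commit is needed after posting before the documents become searchable.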
When stemming is not set up on the content field (containing the documents above), a Solr query for the term “run” (essentially a search parameter of q=content:run) will return only the 3rd document, “The boy will run to the store”. If stemming is set up on the content field, all or a subset of the 5 documents will be returned in the result set. How many of them are returned depends on the stemming algorithm being applied; for example, not every algorithm reduces the irregular form “ran” or the noun “runner” to “run”.
Stemming in Solr is actually very simple to set up. To carry on with the example above, assume the Solr schema file (schema.xml) has a field called content on which you want stemming-enabled keyword search, and that the field's type attribute is set to text_stem. The XML node for this field might look something like this:
<field name="content" type="text_stem" indexed="true" stored="true" multiValued="true"/>
Then in order to enable stemming, you will need to define the fieldType for text_stem as follows:
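<!-- The class attribute is required on a fieldType; solr.TextField is the analysed text type -->
<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>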
The fieldType definition is pretty self-explanatory, but just in case:
- Tokenizer solr.WhitespaceTokenizerFactory: This breaks the text up into words (tokens), using whitespace as the delimiter.
- Filter solr.SnowballPorterFilterFactory: This filter applies a stemming algorithm to each word (token). In the example above I have chosen the Snowball Porter stemming algorithm. Solr provides implementations of several popular stemming algorithms.
That should be it! You will need to restart Solr (or reload the core) and re-index any documents already in the core, because the stemming filter is also applied at index time. The next time a search is performed on the content field, documents containing other forms of the search term that share its stem should be returned in the result set.
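One caveat with the whitespace tokenizer is that punctuation stays attached to the token (so “store.” is indexed with its full stop), and without lowercasing, “Run” and “run” end up as different terms. A slightly more forgiving variant, sketched below, swaps in the standard tokenizer and adds a lowercase filter before the stemmer. The name text_stem_std is just one I made up for illustration, and positionIncrementGap is included only because the content field is multiValued.

```xml
<!-- Hypothetical variant of text_stem; the type name is an assumption -->
<fieldType name="text_stem_std" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on word boundaries and drops most punctuation -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase before stemming so "Run" and "run" produce the same term -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- language defaults to English when omitted; it is spelled out here for clarity -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Because a single analyzer element is declared, the same chain is applied at both index time and query time, which is what makes the stems line up.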
Solr Stemming Algorithm Implementations
Solr supports a few flavours of stemming algorithm, some more aggressive than others:
| Algorithm (Solr Filter) | Languages Supported | Notes |
|---|---|---|
| SnowballPorterFilterFactory | 19 | Implements different language-specific flavours |
| PorterStemFilterFactory | 1 (English) | Fast algorithm that removes common endings |
| HunspellStemFilterFactory | 99 | Combines dictionary and rules |
| KStemFilterFactory | 1 (English) | Faster algorithm, less aggressive than Porter |
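As a rough guide, each of the filters in the table is declared inside the analyzer in much the same way, and you would normally pick only one of them per analysis chain. The Hunspell dictionary and affix file names below are assumptions on my part; that filter needs dictionary files which you have to supply yourself.

```xml
<!-- Choose ONE of these stemming filters for a given analyzer -->

<!-- Snowball: the language attribute selects the language-specific stemmer -->
<filter class="solr.SnowballPorterFilterFactory" language="English"/>

<!-- Original Porter stemmer, English only -->
<filter class="solr.PorterStemFilterFactory"/>

<!-- KStem, English only, less aggressive than Porter -->
<filter class="solr.KStemFilterFactory"/>

<!-- Hunspell: file names are hypothetical and must exist in the core's conf directory -->
<filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true"/>
```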
These stemming algorithms can be customised in Solr to suit the particular data set you are searching, either by protecting certain words from being stemmed or by overriding the stemmer's output with a custom mapping dictionary for certain terms.
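A minimal sketch of both customisations might look like the following, assuming two files that you create yourself in the core's conf directory: protwords.txt with one protected word per line, and stemdict.txt with one tab-separated word and stem pair per line. The type name text_stem_custom and both file names are only examples.

```xml
<!-- Hypothetical field type; the type name and file names are assumptions -->
<fieldType name="text_stem_custom" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Marks the words listed in protwords.txt so that later stemmers leave them untouched -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Replaces words found in stemdict.txt with the given stem and protects the result -->
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
    <!-- Stems whatever was not protected or overridden above -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

The order matters here: the marker and override filters have to come before the stemmer, otherwise the words will already have been stemmed by the time they are checked.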
Lemmatisation using Solr
Lemmatisation is very similar to stemming, with a small but key difference: lemmatisation takes into account the context of a term and its meaning within the overall sentence, while stemming acts on each term in isolation, with no regard to context or meaning.
Lemmatisation algorithms tend to be very domain specific, and in general you will find that a custom approach yields much stronger results than out-of-the-box algorithms. Currently Solr has no implementation of a lemmatisation algorithm, but the SynonymFilterFactory filter can be used to build a custom one using token matching and replacement. The process of building out the dictionary files can be tedious though.
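To give a feel for the SynonymFilterFactory approach, here is a rough sketch. It assumes a hand-built mapping file, which I have called lemmas.txt, in the core's conf directory; the type name text_lemma is likewise made up. Note that this only does dictionary lookups on individual tokens, so it approximates lemmatisation rather than taking sentence context into account.

```xml
<!-- Hypothetical field type; text_lemma and lemmas.txt are assumed names -->
<!-- Each line of lemmas.txt maps word forms to a lemma, for example:
       ran, running, runs => run -->
<fieldType name="text_lemma" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Replaces every token on the left of "=>" with the token on the right -->
    <filter class="solr.SynonymFilterFactory" synonyms="lemmas.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```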
Note: I was using Solr 4.3 on Windows, installed with the Bitnami Solr installer.
Any idea if we can use the keywordTokenizer along with stemming? For me it seems like it does not stem all the words in the sentence, but just stems the last word. The problem that I see is that since the keywordTokenizer does not break the text into words, it does not work. Thoughts?
Tried the exact steps and added a class attribute to the field definition as it was required. Still looks like stemming is not working.
From my config.xml:
<fieldType name="text_stem" class="solr.TextField">