SQL Server English Thesaurus for FULLTEXT Search

February 7, 2012


SQL Server offers Full-Text Search capabilities integrated within its framework. Full-Text Search provides very fast searching over large text columns, along with advanced features such as stemming and thesaurus expansion, as well as inflectional (tense) searches and proximity searches, to name a few.

Out of the box, SQL Server's Full-Text Search does not currently ship with a populated thesaurus for any language. Here I present an English thesaurus that can be used to expand English-language searches.

Microsoft did promise that a thesaurus would be included in future editions of SQL Server; until then, here is a SQL Server English Language Thesaurus (XML), formatted in the exact manner accepted by SQL Server's FULL-TEXT Search engine.

This XML thesaurus is based on the open-source CSV thesaurus that can be found at Stormrage.

There is an excellent post by Lowell on SQL Server Central, which shows how you can import the thesaurus CSV file into a structured table.
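As a rough sketch, the two staging tables could look something along these lines; the table and column names below are inferred from the FOR XML query further down, so adjust them to match whatever shape your import produces:

-- A minimal sketch of the two staging tables, assuming the shape
-- implied by the FOR XML query below; adjust to match your import.
CREATE TABLE dbo.TheThesaurus
(
    ThesaurusID INT NOT NULL PRIMARY KEY,
    Word        NVARCHAR(100) NOT NULL   -- the pattern word (<pat>)
);

CREATE TABLE dbo.TheSynonyms
(
    SynonymID     INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ThesaurusID   INT NOT NULL REFERENCES dbo.TheThesaurus (ThesaurusID),
    TheDefinition NVARCHAR(100) NOT NULL  -- one synonym per row (<sub>)
);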

The imported data will appear in two tables, TheThesaurus and TheSynonyms. Since SQL Server's thesaurus takes in an XML file, you will need to write a FOR XML query that joins these two tables to produce the output required by the FT engine. I like FOR XML PATH; here is the query I used to convert the data into a format acceptable to SQL Server:

DECLARE @ThesaurusXML XML

-- Each row in TheThesaurus becomes a <replacement> element containing the
-- pattern word (<pat>), the word itself as a substitute (<sub>), and one
-- <sub> element per synonym found in TheSynonyms.
SET @ThesaurusXML = (SELECT
    T.[Word] AS pat,  -- the pattern to match
    T.[Word] AS sub,  -- keep the original word as one of the substitutes
    (
        SELECT S.[TheDefinition] AS sub  -- one <sub> per synonym
        FROM TheSynonyms S
        WHERE T.ThesaurusID = S.ThesaurusID
        FOR XML PATH(''), TYPE
    )
FROM
    [TheThesaurus] T
--WHERE
--  T.ThesaurusID IN (252, 1)  -- uncomment to test on a couple of entries
FOR XML PATH('replacement'), ROOT('thesaurus'), TYPE
)

--SELECT @ThesaurusXML  -- inspect the generated XML
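For reference, the generated fragment is shaped along these lines (the words here are purely illustrative). Note that a complete SQL Server thesaurus file wraps this fragment in an <XML ID="Microsoft Search Thesaurus"> element and includes a diacritics setting, as in the files shipped in the FTData folder:

<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <replacement>
      <pat>car</pat>
      <sub>car</sub>
      <sub>automobile</sub>
      <sub>vehicle</sub>
    </replacement>
    <!-- ...one <replacement> element per thesaurus word... -->
  </thesaurus>
</XML>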

SQL Server stores thesaurus XML files by default within its installation structure; they can generally be found under the following path:
..\Program Files\Microsoft SQL Server\MSSQL.\MSSQL\FTData

There you will see a collection of files; aside from the global file (tsglobal.xml), each file corresponds to a specific language code.

The global thesaurus file is, by default, applied to every query that uses the keyword expansion feature, in addition to the language-specific thesaurus file. Be careful not to overload your global thesaurus, particularly if you are running a multi-language Full-Text search engine.
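As an illustration, thesaurus expansion is triggered explicitly with FORMSOF(THESAURUS, ...) inside a CONTAINS predicate (FREETEXT queries apply the thesaurus automatically); the table and column names here are hypothetical:

-- Expand "car" through the loaded thesaurus before matching.
-- dbo.Documents and the Body column are placeholder names.
SELECT DocumentID, Title
FROM dbo.Documents
WHERE CONTAINS(Body, 'FORMSOF(THESAURUS, car)');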

The XML thesaurus file is set to be diacritics insensitive; you can change the value in the XML to 1 (from 0) if diacritics sensitivity is a business requirement.
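For example, to make the thesaurus diacritics sensitive, flip the value of the corresponding element in the file:

<diacritics_sensitive>1</diacritics_sensitive>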

After uploading the XML thesaurus file, you will need to load it into the Full-Text engine using the following SQL command:

sp_fulltext_load_thesaurus_file <lang-id>

Where <lang-id> is 0 for the global thesaurus file. So:

EXEC sp_fulltext_load_thesaurus_file 0
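To load a language-specific thesaurus instead, pass that language's LCID; for example, 1033 loads the US English file (tsenu.xml):

-- Load the US English thesaurus (tsenu.xml); 1033 is the US English LCID.
EXEC sp_fulltext_load_thesaurus_file 1033;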

Make sure you have enough memory and CPU resources to process the XML file; it can be resource intensive.

Full-Text is a very cool technology, and easy to set up too; I really recommend having a browse through the advanced features offered by this search framework.

8 replies
  1. Drew says:

    Great post – this helped me get a thesaurus loaded into my server and test it out. I have horrible performance, however! A query on a two-word string took 5 minutes to return a result from a FREETEXTTABLE or CONTAINS query. Are there any ways to optimize this? My full-text index really isn't that big, in my opinion: just 50,000 records on one column, with an average of 5-6 words in each row.

    Reply
  2. admin says:

    Hey Drew,

    Thanks!

    Regarding performance, I noticed that as well, due to the large size of the thesaurus. Fortunately for me, I only had to set this up as a proof of concept, so performance was not really an issue. That being said, I don't think things were as bad as 5 minutes for a 2-word query when I tested it on even larger tables.

    Optimizing FULL-TEXT search performance with a thesaurus tends to be difficult; at the time I was researching this, the recommendations for optimization were along the lines of:

    > Reduce the size of the thesaurus and the stemming; basically, the recommendation is to create a “vertical” thesaurus relevant to the problem at hand, rather than a general one that covers all scenarios.

    The question as well is: how fast is your FULL-TEXT search without a thesaurus, and what does your query look like?

    Things such as the size of the unique index (which FULL-TEXT is built upon) can drastically affect the performance of your FULL-TEXT search queries.

    I would also recommend having a read of articles on optimizing Full-Text search in general, but you have probably had a look at those anyway :).

    Cheers
    />L

    Reply
  3. Jason says:

    Not sure if you're still monitoring this page, however I have a question. Let's say I have a column in a table called Description. Now anyone from any country can put in a description, in any language they choose. Would I basically need a single XML file that contains the thesaurus of EVERY language that people might be writing a description in? If so, is there a better way of doing it without forcing the description writer, and the searcher, to choose what language they're using?

    Reply
    • Links Naji says:

      Hey Jason,

      In the past I had a similar situation to the one you are describing.

      The single XML file approach works great if you have a small enough file, perhaps a vertical thesaurus specific to the type of content that might be housed in the Description column, rather than a general thesaurus that tries to cover everything. However, if each language's thesaurus is similar in size to the English one attached to this article, then this approach would be too slow at query time to be practical.

      If you do not want to trouble the users with selecting the language they are writing in, you could automatically detect the language using a simple classifier, or perhaps a language detection API; some APIs offer free limited usage, such as detectlanguage.com. This is the approach I opted for, as it keeps the language files small enough for querying, as well as providing context to the Description field. I used a combination of multiple free language detection APIs; once a (daily) limit is reached in one of the APIs, I just switch to the next one.

      Although we are mainly discussing “Query Time” operations here, it is important to note that language settings are also configured at “Index Time” in a SQL Server FULL-TEXT index. The choice of language at “Index Time” drives the selection of the correct word breaker, which operates on different languages differently.
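      For illustration, the index-time language is set per column when creating the full-text index; the table, key index and catalog names below are hypothetical:

      -- Word-break the Description column as US English (LCID 1033) at index time.
      -- dbo.Documents, PK_Documents and FTCatalog are placeholder names.
      CREATE FULLTEXT INDEX ON dbo.Documents
      (
          Description LANGUAGE 1033
      )
      KEY INDEX PK_Documents
      ON FTCatalog;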

      You might want to read this article on SQL Server Full Text Search Language Features by Hilary Cotter for more information on Index Time settings.

      Hope this helps!

      Reply
      • Jason says:

        Thank you Naji,
        It was a lot of help, and I ended up going with Google's language detection. While not free, at $20 per 1 million characters it's not bad.

        I AM however having a new problem that perhaps you can help with. When I first implemented this, using the thesaurus that was provided on this site, it worked fine (and fast, at least for a tiny sample of 3 records). However now, just a couple of days later with no code changes, when I try to run the search on 3 records, it takes forever (and never finishes). When I check it out in Task Manager, I can see SQL Server eating up my memory until all 8 gigs are gone and my system freezes up (again, I'm only searching 3 records, so there's no way it's that inefficient). Is this anything you've run into before or might have a viewpoint on?
        Whether I run the code in my website, or run it as a standalone query in the SQL management console, the same thing happens; it gobbles memory until I manually end the process. Any thoughts would be wonderful.
        Ty for all the input,
        Jason

        Reply
        • Links Naji says:

          Hey Jason,

          Glad that worked out!

          Regarding the memory issues you are having, given the description you provided, I am starting to think that during your original test the thesaurus was not being applied (or loaded).

          Essentially, if you have not explicitly loaded the thesaurus after SQL Server is restarted, it will load automatically on the first FULL-TEXT query executed that uses the thesaurus. That is why you are noticing long delays on the first query you execute against the table.

          You could verify that loading the thesaurus is causing the long delays by manually forcing it to load; you can do that by executing the stored procedure sp_fulltext_load_thesaurus_file (usage shown above) and seeing how long it takes to finish executing.

          Loading this particular thesaurus managed to place considerable pressure on my SQL Server VM, and it has at least twice the memory resources compared to what you have. I would be really surprised if it worked as fast as you said in your original tests, even with a tiny number of documents; the process of loading the thesaurus can be extremely demanding, irrespective of the size and number of documents in the FULL-TEXT index.

          There are many ways you could optimize your FULL-TEXT index to reduce memory pressure, and you could (and really should) limit how much memory SQL Server consumes in total, in order to avoid total system freezes. At the end of the day, though, the thesaurus size will be a bottleneck, and unless you can fork out for additional server resources, you might have to select a subset of keywords instead.
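          As a quick illustration of capping SQL Server's total memory (the 4096 MB figure is just an example value; size it to your machine):

          -- Cap SQL Server's memory so the OS and other processes keep some headroom.
          EXEC sp_configure 'show advanced options', 1;
          RECONFIGURE;
          EXEC sp_configure 'max server memory (MB)', 4096;  -- example value
          RECONFIGURE;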

          Let me know if you need any resources/ideas on optimizing full-text search, or on how to select the right keywords to include in your trimmed thesaurus.

          Hope this helps!
          />L

          Reply
          • Jason says:

            Thank you Naji,
            I could have sworn I had tested it out, but I guess I was mistaken. I tried it with a small sample file and it loaded immediately. While I could buy some more memory, the time it takes to pre-load the full file is a non-starter for me, especially while I'm running it on my local machine for testing.

            I'd definitely appreciate any resources or ideas you have on optimizing full-text search; this is my first foray into using FTS, so I'm still learning. Is there an “easy” way to trim down the thesaurus? It seems to me I can either start with the huge file and delete what I don't need (which would take ages), or start from scratch, which would take even longer.

            Thanks again, This has been a lot of help!
            Jason

          • Links Naji says:

            Hey Jason,

            That's awesome, no worries!

            If you look at the second comment, you should find a link on optimizing your FTS performance. There is also a nice (but rather old) article that lists 10 ways to optimize FTS; a lot of the advice should still be applicable. Lastly, there is an article on SQLCAT that discusses best practice with regard to implementing FTS.

            Although, since your problem is specifically to do with loading the thesaurus, not much of the advice in these articles will be applicable to speed up your situation… aside from trimming the thesaurus.

            So with regard to trimming, one approach I took was to selectively include only the words that appear in the content I am housing. Essentially, for every document in my FTS index, I tokenized it on whitespace, and for every keyword that appeared in the text, I made sure the corresponding entry was added to the thesaurus (if available). This means that your thesaurus will only contain the terms that actually exist in your documents, so if you have a relatively low number of documents (and the growth of the index isn't that big), the thesaurus will be trimmed considerably. I actually ran an overnight job that updated the thesaurus XML file daily based on the data in FTS.
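            A rough sketch of that filtering step: SQL Server exposes the terms stored in a full-text index through the sys.dm_fts_index_keywords DMV, so you could keep only the thesaurus rows whose pattern word actually occurs in the index (dbo.Documents is a placeholder table name):

            -- Keep only thesaurus entries whose pattern word appears in the
            -- full-text index; sys.dm_fts_index_keywords lists the indexed terms.
            SELECT DISTINCT T.ThesaurusID, T.Word
            FROM dbo.TheThesaurus AS T
            WHERE EXISTS
            (
                SELECT 1
                FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('dbo.Documents')) AS K
                WHERE K.display_term = T.Word
            );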

            Another approach that *might* be suitable is to start with a trimmed-down thesaurus seed set, say the top 1000 words used in the English language (you can find a lot of sites online that provide this type of data), then, based on your users' search patterns (by tracking what terms they use when searching), slowly introduce into your thesaurus new terms that get searched for a lot.

            Each approach obviously has some merits, while also suffering from certain drawbacks; I think you need to look at your specific situation and see what approach works best!

            Cheers J,
            />L
