This is a quick post to describe a solution to a problem I was having with my local installation of the HDInsight Hadoop cluster.
Archive for month: March, 2013
Hashing can be a very useful technique when dealing with the storage and lookup of large text fields (say, a table of URLs or search keywords). These fields incur high resource utilization on any database engine when they are used directly in DML statements that filter or aggregate on them. Any index built on these fields is costly to maintain, if it is possible at all, given that SQL Server limits the index key size to 900 bytes.
Using hashing functions, we can make large textual data easier to handle in the relational engine, which improves performance when these fields are compared to satisfy a query. Hashing can also be used to build unique and non-unique indexes that are easier to manage than indexes defined directly on the text fields. In this post we will discuss a few options for hashing large text data using functions native to SQL Server, as well as external hashing algorithms that we can integrate into Microsoft’s SQL Server (or any RDBMS, for that matter) and that might provide better practical performance.
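To make the idea concrete, here is a minimal sketch of a hash-backed lookup. It is not SQL Server code: it uses Python's built-in `sqlite3` module as a stand-in relational engine, and the table name `pages`, the column `url_hash`, and the helpers `url_hash`, `build_table`, and `find_page` are all hypothetical names chosen for illustration. The hash here is the first 8 bytes of SHA-256, which is an assumption made for the example; on SQL Server a native function such as `HASHBYTES` or `CHECKSUM` would play that role.

```python
import hashlib
import sqlite3


def url_hash(text: str) -> int:
    """Return a signed 64-bit integer taken from the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def build_table(urls):
    """Create an in-memory table with a narrow, non-unique index on the hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, url_hash INTEGER NOT NULL)"
    )
    # The index covers only the 8-byte hash, never the long text column.
    conn.execute("CREATE INDEX ix_pages_url_hash ON pages (url_hash)")
    conn.executemany(
        "INSERT INTO pages (url, url_hash) VALUES (?, ?)",
        [(u, url_hash(u)) for u in urls],
    )
    return conn


def find_page(conn, url):
    """Filter on the indexed hash, then compare the full text to rule out collisions."""
    row = conn.execute(
        "SELECT id FROM pages WHERE url_hash = ? AND url = ?",
        (url_hash(url), url),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_table(["http://example.com/a", "http://example.com/b"])
    print(find_page(conn, "http://example.com/b"))
```

The second predicate (`AND url = ?`) matters: two different strings can share a hash value, so the hash only narrows the search and the full text comparison confirms the match. For the same reason, a hash column like this one suits a non-unique index unless the hash is wide enough that collisions are acceptable for the data at hand.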