Hashing can be a very useful technique when storing and looking up large text fields (say, a table of URLs or search keywords). Used directly in DML statements, whether filtered by or aggregated on, these fields incur high resource utilization on any database engine. Any index built on them is costly to maintain, if it is possible at all, given that SQL Server limits index key size to 900 bytes.
Hashing functions make large textual data easier to handle in the relational engine, improving performance when these fields are compared to satisfy a query. Hashing can also be used to build unique and non-unique indexes that are easier to manage than indexes defined directly on the text fields. In this post we will discuss a few options for hashing large text data using functions native to SQL Server, as well as external hashing algorithms that can be integrated into Microsoft’s SQL Server (or any RDBMS, for that matter) and might offer better practical performance.
Hashing (in the context of this post, at least) is the process of reducing a large textual field to a smaller, easily comparable value of a workable data type, such as reducing an NVARCHAR(1000) field in SQL Server to an INT. There are many properties of a hashing function to consider before making a choice for your database; I will evaluate the hashing functions discussed here against the following criteria:
- The performance of computing the hash function: an indication of how long it takes to generate the hash of a given string. Obviously, the longer it takes to generate hashes, the lower the overall performance.
- The data type of the hash function: some hash functions' results can be stored in an INTEGER field, while others require VARCHAR or even NVARCHAR. It is important to choose the function with the smallest (and most easily comparable) data type that still satisfies all your other requirements.
- The collision performance of the hash function: since we are reducing large textual content to fit a smaller data type, we will run into situations where two different pieces of text produce the same hash result. Collision frequency is a performance measure based on the hash algorithm's distribution and uniformity over a particular dataset; clearly, the fewer collisions experienced over the dataset, the better the algorithm performs.
Any hashing function's result needs to be deterministic: given the same input, the function must always return the same result.
It is important to note that hashing algorithms do not need to produce unique results in order to be useful in a search argument (SARG) within SQL Server. You can create a non-unique index on the result of the hash function and use the actual non-hashed field to eliminate any duplicate values, at a much lower cost to the database engine. For example:
WHERE CHECKSUM(name) = 1596235592
AND name = 'model'
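The two-step pattern above (a cheap integer comparison narrows the candidates, then the original text eliminates collisions) can be sketched outside SQL Server as well. The illustrative Python sketch below uses `zlib.crc32` as a stand-in 32-bit hash, since CHECKSUM itself cannot be reproduced outside the engine; the "table" and its values are hypothetical:

```python
import zlib

# Hypothetical "table": rows of text values, indexed by row id.
rows = ["model", "Model", "modem", "model"]

# Non-unique "index": hash value -> list of row ids (collisions allowed).
index = {}
for row_id, name in enumerate(rows):
    index.setdefault(zlib.crc32(name.encode("utf-8")), []).append(row_id)

def lookup(name):
    # Step 1: cheap integer comparison narrows the candidates via the index.
    candidates = index.get(zlib.crc32(name.encode("utf-8")), [])
    # Step 2: compare the original text to eliminate any hash collisions,
    # mirroring the residual "AND name = 'model'" predicate above.
    return [rid for rid in candidates if rows[rid] == name]

print(lookup("model"))  # row ids whose text is exactly 'model'
```

The index only needs to be *selective*, not unique; the residual comparison on the original field guarantees correctness.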
Let's take a look at a few of the available hashing functions.
CHECKSUM Hash Function
The CHECKSUM SQL Server system function can be used to generate an INT (32-bit) hash value from one or more fields (the order and collation of the fields are taken into account). The input can be anything apart from non-comparable types (such as XML or text).
- Hash Computation Performance: High, CHECKSUM can be computed very quickly and across a list of fields.
- Hash Data Type: INTEGER (32-bit). The range of the hashing function is therefore -2^31 to 2^31 - 1 (all possible values of a signed INT).
- Hash Collision Frequency: High. The algorithm has a high probability of collision, particularly when passing a single value to the function.
- Can be Replicated Outside SQL Server: No. This algorithm is native to SQL Server, so to use it from code you need to connect to a SQL Server engine. The algorithm is comparable in implementation to CRC32 techniques.
- CHECKSUM_AGG function: used to generate a hash result over the values in a group, mainly to detect and track changes in the underlying values of an aggregation result. Keep in mind that it aggregates hash results using an XOR operator, which means CHECKSUM_AGG will return the same value if the changes in the underlying group amount to a "swap" of values between different rows. For example: 43, 341, 3, 764, 4 changed to 3, 341, 43, 764, 4.
- BINARY_CHECKSUM function: similar to CHECKSUM, except that it ignores the COLLATION setting of the string data; so, for example, ‘Abcde‘ and ‘abcde‘ always produce different checksums, regardless of the collation setting.
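The "swap" blindness of CHECKSUM_AGG follows directly from XOR being commutative. A small Python sketch illustrates it, again using `zlib.crc32` as a stand-in for the per-row CHECKSUM (which cannot be reproduced outside the engine):

```python
from functools import reduce
import zlib

def xor_agg(values):
    # XOR-combine per-row hashes, analogous to how CHECKSUM_AGG
    # aggregates per-row CHECKSUM results.
    return reduce(lambda acc, v: acc ^ zlib.crc32(str(v).encode("utf-8")),
                  values, 0)

before = [43, 341, 3, 764, 4]
after  = [3, 341, 43, 764, 4]   # 43 and 3 swapped between rows

# XOR is commutative and order-independent, so the aggregate
# cannot detect the swap:
print(xor_agg(before) == xor_agg(after))  # True
```

Any genuine value change (not just a reordering) will, with high probability, change the aggregate.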
HASHBYTES Hash Function
Traditionally used for cryptography, the HASHBYTES function within SQL Server returns the result of a number of MD- and SHA-series hashing algorithms:
- MD2: Can be stored in column of type CHAR(34) in SQL Server
- MD4: Can be stored in column of type CHAR(34) in SQL Server
- MD5: Can be stored in column of type CHAR(34) in SQL Server
- SHA1: Can be stored in column of type CHAR(42) in SQL Server
- SHA2 (256 or 512 bits): Can be stored in column of type CHAR(66) and CHAR(130) respectively in SQL Server
These can be used to generate low collision hashes at the expense of the returned data-type and over-all computational performance.
- Hash Computation Performance: Low; HASHBYTES takes considerably longer than CHECKSUM to compute.
- Hash Data Type: Although the returned value is VARBINARY, the result can be stored (as a hexadecimal string) in the data types indicated above without loss of information.
- Hash Collision Frequency: Extremely low; for example, the probability of a collision between two random values under the MD5 algorithm is around 2.7×10^-20.
- Can be Replicated Outside SQL Server: Yes, these are standard encryption and hashing algorithms, many libraries offer an implementation of each of these algorithms across a variety of programming languages.
MD5 is generally considered to offer the best balance between performance and collisions among the HASHBYTES algorithms, although if minimizing collisions is of paramount importance, then an SHA-based hash might be more suitable.
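Because these are standard algorithms, the same digests can be produced outside SQL Server. The sketch below uses Python's hashlib (which covers MD5 and the SHA family, though not MD2/MD4) and shows where the CHAR sizes quoted above come from: two characters for a `0x` prefix plus two hex characters per digest byte:

```python
import hashlib

text = "some large text value".encode("utf-8")

# Hex-string length per algorithm, including a '0x' prefix:
# MD5 -> 34, SHA1 -> 42, SHA2_256 -> 66, SHA2_512 -> 130,
# matching the CHAR column sizes listed above.
for name, algo in [("MD5", hashlib.md5), ("SHA1", hashlib.sha1),
                   ("SHA2_256", hashlib.sha256), ("SHA2_512", hashlib.sha512)]:
    digest = algo(text).hexdigest()
    print(name, len("0x" + digest))
```

Any library implementing these standard algorithms will produce byte-identical digests to HASHBYTES for the same input bytes (watch out for encoding: NVARCHAR data hashes as UTF-16).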
FNV Hashing Algorithm
The Fowler–Noll–Vo hash function is my favorite, as it offers an acceptable collision rate (lower than CHECKSUM) without sacrificing the computational performance of the hash. There is also a handy SQL Server CLR implementation of the FNV hash algorithm, which I use across many databases that have a hashing requirement.
- Hash Computation Performance: High
- Hash Data Type: Comes in three flavors: 16-bit, 32-bit and 64-bit integers, depending on requirements. The 64-bit variant offers the lowest collision frequency but requires a BIGINT data type for storage.
- Hash Collision Frequency: Medium
- Can be Replicated Outside SQL Server: Yes
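FNV is simple enough to implement in a few lines in any language, which is what makes it so easy to replicate outside SQL Server. Below is a minimal sketch of the FNV-1a variant in Python, using the published offset-basis and prime constants for the 32-bit and 64-bit sizes:

```python
def fnv1a(data: bytes, bits: int = 32) -> int:
    # Published FNV-1a parameters: (offset basis, FNV prime) per size.
    params = {32: (0x811C9DC5, 0x01000193),
              64: (0xCBF29CE484222325, 0x100000001B3)}
    h, prime = params[bits]
    mask = (1 << bits) - 1
    for byte in data:
        h ^= byte               # xor the next octet into the hash...
        h = (h * prime) & mask  # ...then multiply by the FNV prime
    return h

print(hex(fnv1a(b"http://thinknook.com/about?xyz=123")))      # 32-bit
print(hex(fnv1a(b"http://thinknook.com/about?xyz=123", 64)))  # 64-bit
```

Note that the 64-bit result is an unsigned value; a CLR implementation would typically reinterpret it as a signed 64-bit integer so it fits SQL Server's BIGINT.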
MurmurHash Hashing Algorithm
The MurmurHash function is a fast algorithm that offers even lower collision rates than FNV (when utilizing the 128-bit result). There are three generations of this algorithm: MurmurHash, MurmurHash2 and MurmurHash3, with MurmurHash3 being the latest version. There is a very handy C# implementation of MurmurHash3 that can easily be compiled into a CLR assembly and used within SQL Server.
- Hash Computation Performance: High, can be computed quickly.
- Hash Data Type: MurmurHash3 comes in two flavors: a 32-bit hash and a 128-bit hash (with separate x86 and x64 variants of the latter). The 128-bit version offers the lowest collision frequency.
- Hash Collision Frequency: Low
- Can be Replicated Outside SQL Server: Yes
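As an illustration of the algorithm's portability, here is a sketch of the 32-bit (x86) variant of MurmurHash3 in Python; a C# CLR version would follow exactly the same steps:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit variant."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    # Body: process the input four bytes at a time.
    for i in range(0, len(data) & ~3, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotate left 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotate left 13
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: mix in the remaining 1-3 bytes, if any.
    tail, k = data[len(data) & ~3:], 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization: force avalanche of the final bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

print(hex(murmur3_32(b"http://thinknook.com/about?xyz=123")))
```

The result is an unsigned 32-bit value; as with FNV, a SQL Server CLR implementation would reinterpret it as a signed INT for storage.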
Other Techniques for Handling Large String Data in SQL Server
Although hashing, used in the right context, can be a very good approach for handling large text fields, it does come with certain drawbacks; for example, you still need to store the original text (string data), as well as the result of the hash function, in order for the hash approach to work.
There are alternative solutions that can be explored, which can help when working with large string fields in specific scenarios, such as:
- Short Text Compression Algorithms: a variety of algorithms and libraries provide short-text compression; some are built for specific jobs (for example, URLs or web pages), others are generic. Being able to compress a large string field into a smaller, workable (and perhaps indexable) one can offer a great advantage in terms of storage as well as performance. These algorithms tend to use dictionaries of the most frequently used text to perform the compression, and can potentially produce output larger than the input in certain scenarios.
- Writing a Scenario-Specific Text Shortener: there are scenarios where the string data itself exhibits repeating patterns; these patterns can be extracted and normalized out of the string data, resulting in a shorter overall representation of the original data. For example, take a string field that contains URL data, such as http://thinknook.com/about?xyz=123: we could extract the HTTP protocol and the domain information into a separate table, so that the row in the URL table references a domain id and contains only about?xyz=123.
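That normalization can be sketched as follows; the Python below splits the example URL into a hypothetical Domain "table" and a URL "table" (the table layout here is purely illustrative):

```python
from urllib.parse import urlsplit

# Hypothetical normalized storage: a Domain table holding
# "scheme://host" prefixes, and a URL table holding only the
# short remainder plus a reference to the domain row.
domains = {}   # "scheme://host" -> domain_id
urls = []      # (domain_id, remainder)

def store_url(url: str):
    parts = urlsplit(url)
    prefix = f"{parts.scheme}://{parts.netloc}"
    # Reuse the existing domain row, or create a new one.
    domain_id = domains.setdefault(prefix, len(domains))
    remainder = parts.path.lstrip("/")
    if parts.query:
        remainder += "?" + parts.query
    urls.append((domain_id, remainder))
    return domain_id, remainder

print(store_url("http://thinknook.com/about?xyz=123"))
```

Every URL on the same domain now shares one domain row, and the per-row string shrinks to just the path and query.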
Last Words on SQL Server Hash Functions
It is important to remember that hash functions are not suitable if you are trying to perform a wildcard string search, such as:
WHERE URLText LIKE 'google.com%'
The hash of a full string tells you nothing about its prefix; however, if an index is possible and exists on the URLText field itself in the example above, then SQL Server will be able to use that index to speed up the partial-match search operation.
Additionally, hash functions can be combined, either by hashing the result of another hash function or by storing the results of two or more different functions, to reduce collisions. This can offer better performance in certain scenarios, although you will need to take into account the cost of computing all the required functions.
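One way to sketch such a combination in Python is to pack two independent 32-bit hashes into a single 64-bit (BIGINT-sized) value; the particular pair chosen here (CRC32 and a truncated MD5) is illustrative, not a recommendation:

```python
import hashlib
import zlib

def combined_hash(text: str) -> int:
    # Pack two independent 32-bit results into one 64-bit value:
    # a CRC32 in the high half and the first four bytes of an MD5
    # digest in the low half. A collision on the combined value
    # requires BOTH functions to collide on the same pair of inputs.
    data = text.encode("utf-8")
    crc = zlib.crc32(data)
    md5_32 = int.from_bytes(hashlib.md5(data).digest()[:4], "big")
    return (crc << 32) | md5_32

print(hex(combined_hash("http://thinknook.com/about?xyz=123")))
```

The trade-off is exactly as described above: every insert and lookup now pays the cost of computing both hash functions.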
If you use any other hashing functions, or handle large text fields using other techniques not mentioned here, then please let me know in the comment section as I would love to hear about it.