Merge Join can be a very fast join operation, although it has some underlying behavior that could dramatically increase the over-all cost this operator has on the query execution plan.
In this post I will go through how SQL Server utilizes the Merge Join operator for performing logical join operations, as well as the different types of Merge Join and how to optimize a Merge Join operation.
Merge Join for Joining SQL Server Data-Sets
Merge Join uses a sorted input to quickly and efficiently determine the required joined output (inner join, left outer joint, etc.) through comparing the input rows.
Essentially by going through the two sorted data-sets involved in the join operation simultaneously, SQL Server is able to efficiently determine the required output by simply going through the data-sets, rather than performing a look-up on the sets (as with some other Join Operator types). This can lead to a very efficient and powerful Join operation, even on very larger data-sets, which is one occasion were SQL Server might opt-in for the Merge Join option for joining tables.
It is important to note that Merge Join will require both data-sets involved in the join operation to be sorted, either by design or by explicitly inserting a SORT operation on both or either data-sets as required, this will be seen in the Graphical Execution Plan as a SORT operator existing before the Merge Join operation.
Types of Merge Join Operator
There are two types of Merge Join, these are:
- One-To-Many Merge Join (or Regular Merge Join): This is typical when performing a one-to-one join
- Many-To-Many Merge Join: This is typical when performing a many-to-many Merge Join operation. When performing a many-to-many Merge Join, SQL Server uses a temporary table to store rows while being processed; this over-head is required to perform deduplication on the many-to-many data-sets being joined.
Optimizing Merge Join Operations
Obviously by behavior, it is much better to have both data-sets sorted when you are in a Merge Join scenario, since SORT is a requirement for this type of operator. This can be enforced through the clustered index on the actual table itself, or through creating any B-tree index which will sort data inherently by structure (such as non-clustered index or materialized views with the correct sort order).
The many-to-many Merge Join is typically slower considering the requirement for the temporary storage, as well as the data stream rate will be slower (a functionality that many database connectors take advantage of) due to the duplication handle mechanism. Obviously there is little you could do in this area to optimize the Merge Join performance, as generally this will be an unavoidable requirement of the returned data-set, but it is a behavior that is worth keeping in mind when interrogating query performance.
Sometimes SQL Server decides that a SORT operation is the best option due to a later on operator that requires the input to be sorted as-well, for example an aggregation operation (such as a ROLLUP or CUBE GROUP BY), and so SQL Server decides that sorting the input once to satisfy both heavy weight operations could be the better option. This is very important to realize when trying to switch to another Join Operator (such as Nested Loops Join), as such a switch (or enforcement through Query Hints) could lead to lower cost on the Join part of the execution plan, but higher over-all cost due to the delayed SORT operation that will show up before the aggregation.