Recently one of our Data Warehouse SQL server instances started throwing a bunch of alerts (severity level: 20), the alert error message looked like this:
The client was unable to reuse a session with SPID [SPID], which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing
In this post I will go through the initial steps to diagnosing this issue, as well as introduce the new (SQL Server 2008+) Connectivity Ring Buffer feature, which lives in the sys.dm_os_ring_buffers DMV, and captures any server-side initiated SQL Server connection closures.
On first inspection of the error message, there seems to be some sort of resource starvation or inability to utilize certain resources to complete a SQL operation (for example, the Network is giving out). Essentially the SQL Server instance is for some reason dropping connections.
As with most SQL Server errors, the best place to start looking is the SQL Server ERRORLOG, the default path for SQL Server’s ERRORLOG file is:
:\Program Files\Microsoft SQL Server\MSSQL.\MSSQL\Log\ERRORLOG
Although the log is not always very helpful, I did notice a large number of deadlocks encountered just before the error message appeared, which started to sound like there might be a large number of connections trying to access similar resources, which in turn is causing these deadlocks, this does give the impression that we are experiencing resource starvation on the server, since deadlocks are costly and tend to keep connections open for longer than necessary.
Another good place to start digging around is the sys.dm_os_ring_buffers DMV’s Connectivity Ring Buffer. This ring buffer type will give you information regarding any server-side initiated connection closure (including connection closed using the KILL [SPID] function). Obviously you can establish the same result using server-side traces.
A great description of how to use the new RING_BUFFER_CONNECTIVITY feature can be found on msdn, this will include description of each XML tag, as well as a few handy ways to query the records.
I have found this handy script written by Jonathan Kehayias on MSDN, which helps convert the XML into table result for quicker analysis:
record.value('(Record/@id)', 'int') as id,
record.value('(Record/@type)', 'varchar(50)') as type,
record.value('(Record/ConnectivityTraceRecord/RecordType)', 'varchar(50)') as RecordType,
record.value('(Record/ConnectivityTraceRecord/RecordSource)', 'varchar(50)') as RecordSource,
record.value('(Record/ConnectivityTraceRecord/Spid)', 'int') as Spid,
record.value('(Record/ConnectivityTraceRecord/SniConnectionId)', 'uniqueidentifier') as SniConnectionId,
record.value('(Record/ConnectivityTraceRecord/SniProvider)', 'int') as SniProvider,
record.value('(Record/ConnectivityTraceRecord/OSError)', 'int') as OSError,
record.value('(Record/ConnectivityTraceRecord/SniConsumerError)', 'int') as SniConsumerError,
record.value('(Record/ConnectivityTraceRecord/State)', 'int') as State,
record.value('(Record/ConnectivityTraceRecord/RemoteHost)', 'varchar(50)') as RemoteHost,
record.value('(Record/ConnectivityTraceRecord/RemotePort)', 'varchar(50)') as RemotePort,
record.value('(Record/ConnectivityTraceRecord/LocalHost)', 'varchar(50)') as LocalHost,
record.value('(Record/ConnectivityTraceRecord/LocalPort)', 'varchar(50)') as LocalPort,
record.value('(Record/ConnectivityTraceRecord/RecordTime)', 'datetime') as RecordTime,
record.value('(Record/ConnectivityTraceRecord/LoginTimers/TotalLoginTimeInMilliseconds)', 'bigint') as TotalLoginTimeInMilliseconds,
record.value('(Record/ConnectivityTraceRecord/LoginTimers/LoginTaskEnqueuedInMilliseconds)', 'bigint') as LoginTaskEnqueuedInMilliseconds,
record.value('(Record/ConnectivityTraceRecord/LoginTimers/NetworkWritesInMilliseconds)', 'bigint') as NetworkWritesInMilliseconds,
record.value('(Record/ConnectivityTraceRecord/LoginTimers/NetworkReadsInMilliseconds)', 'bigint') as NetworkReadsInMilliseconds,
record.value('(Record/ConnectivityTraceRecord/LoginTimers/SslProcessingInMilliseconds)', 'bigint') as SslProcessingInMilliseconds,
record.value('(Record/ConnectivityTraceRecord/LoginTimers/SspiProcessingInMilliseconds)', 'bigint') as SspiProcessingInMilliseconds,
record.value('(Record/ConnectivityTraceRecord/LoginTimers/LoginTriggerAndResourceGovernorProcessingInMilliseconds)', 'bigint') as LoginTriggerAndResourceGovernorProcessingInMilliseconds,
record.value('(Record/ConnectivityTraceRecord/TdsBuffersInformation/TdsInputBufferError)', 'int') as TdsInputBufferError,
record.value('(Record/ConnectivityTraceRecord/TdsBuffersInformation/TdsOutputBufferError)', 'int') as TdsOutputBufferError,
record.value('(Record/ConnectivityTraceRecord/TdsBuffersInformation/TdsInputBufferBytes)', 'int') as TdsInputBufferBytes,
record.value('(Record/ConnectivityTraceRecord/TdsDisconnectFlags/PhysicalConnectionIsKilled)', 'int') as PhysicalConnectionIsKilled,
record.value('(Record/ConnectivityTraceRecord/TdsDisconnectFlags/DisconnectDueToReadError)', 'int') as DisconnectDueToReadError,
record.value('(Record/ConnectivityTraceRecord/TdsDisconnectFlags/NetworkErrorFoundInInputStream)', 'int') as NetworkErrorFoundInInputStream,
record.value('(Record/ConnectivityTraceRecord/TdsDisconnectFlags/ErrorFoundBeforeLogin)', 'int') as ErrorFoundBeforeLogin,
record.value('(Record/ConnectivityTraceRecord/TdsDisconnectFlags/SessionIsKilled)', 'int') as SessionIsKilled,
record.value('(Record/ConnectivityTraceRecord/TdsDisconnectFlags/NormalDisconnect)', 'int') as NormalDisconnect
--record.value('(Record/ConnectivityTraceRecord/TdsDisconnectFlags/NormalLogout)', 'int') as NormalLogout
( SELECT CAST(record as xml) as record
WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY') as tab
FROM connectivity_ring_buffer c
LEFT JOIN sys.messages m ON c.SniConsumerError = m.message_id AND m.language_id = 1033
ORDER BY c.RecordTime DESC
The ring buffer connectivity records should give you an idea of which connections are being closed (remote or local), the reason for their closure, and allows you to identify any trends in the connections being closed, as well as many other interesting traces.
The above should give you a good place to start investigating the issue, as for solutions, it really depends on what the underlying issue is, the problem with failure ID is 29 is that its just a generic error message to an underlying system problem (or even just a SQL Server bug), either way, there is no one fix.
That being said, it might be worth trying out the following:
- This might be a SQL Server bug, which Microsoft has resolved in their SQL Server Service Packs or Cumulative Updates.
- Inefficient queries could be over-utilizing the server resources, it might be worth experimenting with optimizing queries or placing a resource governance strategies. You can use DMVs to identify high resource utilization queries.
- If you have resource-intensive queries, a high degree of parallelism could be causing too many DOPs per query, which in turn would cause resource starvation, this MSDN post goes through a few theories as to why this could be happening.
There is also a closed ticket with Microsoft Connect that discusses the SQL Server failure ID 29 issue.
Anyway I hope this helped, or at least set you along the right path, goodluck!
SQL Server Connection Reset for Connection Pooling rated 4 out of 5 by 1 readers