Troubleshooting SQL Server Hang using system_health Xevents

Recently my customer reported they were seeing intermittent server hangs, connection timeouts and slow performance on their SQL Server instance. The first question I had, do we have PSSDIAG running on the SQL Instance for me to be able to understand the cause of the hang. However since we won’t be running PSSDIAG proactively on any of the servers we didn’t had a logs to look at.  So we started looking at the system_health Xevents session data at the time of the slow performance.

If you are not familiar with default system health Xevent sessions, I would suggest you read the blog post from Bob Dorr here

Reading from System_Health Xevents session is not the most convenient way of troubleshooting and hence I was used a solution which is developed my fellow PFE colleague Denzil which has blogged here

We observed the issue occurs intermittently when the application hangs and doesn’t allow any further connections to the SQL Server.

Resource Utilization

We do not see CPU or Memory pressure on the SQL Server instance. The System CPU utilization is 40% on any average with some spikes causing the CPU Utilization to rise to 100 % however during these times the SQL CPU Utilization is consistent to 30-35% on an average.

The Available Memory on the server is unchanged and is consistent while there is no low memory alert observed on the SQL instance

CPUUtil

MemoryUtil

 

 

 

 

 

 

 

 

Query Processing & THREADPOOL Waits

We see very high waits for THREADPOOL wait type which suggests we are running out of worker threads due to which we observed slowness and hangs on the SQL Instance. From the following Query Processing logs, we see that there is sudden surge in the worker threads from 198 to 806 worker threads in 10 min interval from 7:10 AM to 7:20 AM…. (Server time would be 2:10 AM to 2:20AM).

QueryProcessingEvent

 

 

 

 

When we check sys.dm_exec_requests data captured during the same time we see there are concurrent requests to SQL instance increased from 80 to 500 during the same interval which justifies the surge in the worker threads created to serve the 500 concurrent requests.

However we need to understand here, THREADPOOL waitype is not the cause but a victim since the assigned worker threads are not completing their assigned tasks ASAP. The root cause of the issue in this case was the blocking observed on the SQL instance which prevents the worker threads to complete its task quickly and release the worker thread back to worker pool.

Blocking

In this case, we found that the Head blocker is waiting on the non-io Buffer Latch PAGELATCH_XX (PAGELATCH_EX, PAGELATCH_UP, PAGELATCH_SH)  caused due to sequential inserts in a large table (Last Page Contention).  In this case we suggested to partition the table and build the indexes on alternate column to avoid Last page content (Further info on resolving latch contention can be found in this white paper).

 

Blocking

 

 

 

 

 

 

 

 

 

Using the System Health Xevents session and solution developed by Denzil we were able to nail down the issue on the SQL Instance without capturing any additional data.

Hope this helps !!!

Parikshit Savjani
Premier Field Engineer

Leave a comment

Your email address will not be published. Required fields are marked *