I killed Reindexing Job and now it is in KILLED/ROLLBACK State for more than 9 hours

 

One of our Customers ran into this issue where in he killed a Reindexing Job and following that it went into KILLED/ROLLBACK State and was in that State for more than 9 hours.

First of all we would not recommend you to kill Job like Reindexing or DBCC CHECKDB which performs internal operations.

However due to some reason if you are in this mess, you can follow the steps which we used to resolve this issue

We queried reindexing spid from sysprocesses to understand what is it waiting on.

We queried sysprocesses multiple times to see if the CPU used or physical IO performed by the spid increases. we found that neither CPU nor IO consumed changes for 2 different snapshot of sysprocesses which it indicates that it hung and not doing any operation

Spliting the output into 2 table as my Blog theme doesn’t shupport single table output

 

Sysprocesses
spid kpid blocked waittime lastwaittype waitresource
55 13056 0 33037675 EXECSYNC                                                                                                                                                                                                                                                                                       
55 15508 55 100643062 LCK_M_X                         KEY:5:72057594101563392 (3601a3e3f92d)                                                                                                                                                                                                                        
55 10676 55 100638124 LCK_M_X                         KEY:5:72057594101563392 (e3008001e214)                                                                                                                                                                                                                        
Blocked
55 15508 55 101000192 LCK_M_X                         KEY:5:72057594101563392 (3601a3e3f92d)                                                                                                                                                                                                                        
55 10676 55 100995253 LCK_M_X                         KEY:5:72057594101563392 (e3008001e214)                                                                                                                                                                                                                        

 

dbid cpu physical_io ecid open_tran status cmd
5 1565934 1192250 0 1 suspended                     KILLED/ROLLBACK 
5 2147483647 9714 17 0 suspended                     KILLED/ROLLBACK 
5 2147483647 9480 33 0 suspended                     KILLED/ROLLBACK 
5 2147483647 9714 17 0 suspended                     KILLED/ROLLBACK 
5 2147483647 9480 33 0 suspended                     KILLED/ROLLBACK 

 

To understand the above sysprocesses output,

Multiple entries of SPID 55 is due to parallelism. Also the waittype of the first SPID EXECSYNC suggest that it is parent thread in parallelism responsible for synchronization

The 2 parallel threads spawned by SPID 55 are waiting on each other to release lock LCK_M_X on database id 5 and hobt id 72057594101563392 (hobt id can be obtained fro sys.partitions)

From the above output of Sysprocesses it was clear that spid 55 which is reindexing job is in suspended state and it blocked by itself.(self blocking)

In other words it is in Intra-query parallel query deadlock which was once explained by Bart Duncan in his blog (http://blogs.msdn.com/b/bartd/archive/2008/09/24/today-s-annoyingly-unwieldy-term-intra-query-parallel-thread-deadlocks.aspx)

In order to trace out the Deadlock we can enable Trace Flag 1222 on the server and following was our Deadlock Graph

 

11/01/2010 18:44:05,spid20s,Unknown,Deadlock encountered …. Printing deadlock information

11/01/2010 18:44:05,spid20s,Unknown,ResType:LockOwner Stype:’OR’Xdes:0x00000000801507E0 Mode: X SPID:55 BatchID:0 ECID:33 TaskProxy:(0x00000003782D7DA0) Value:0x940f9280 Cost:(-30/0)

11/01/2010 18:44:05,spid20s,Unknown,Requested by:

11/01/2010 18:44:05,spid20s,Unknown,Owner:0x0000000197568800 Mode: X        Flg:0x40 Ref:0 Life:02000000 SPID:55 ECID:17 XactLockInfo: 0x00000000858A79B0

11/01/2010 18:44:05,spid20s,Unknown,Grant List 1:

11/01/2010 18:44:05,spid20s,Unknown,KEY: 5:72057594101563392 (e3008001e214) CleanCnt:2 Mode:X Flags: 0x1

11/01/2010 18:44:05,spid20s,Unknown,Node:2

11/01/2010 18:44:05,spid20s,Unknown,Log Viewer could not read information for this log entry. Cause: Data is Null. This method or property cannot be called on Null values.. Content:

11/01/2010 18:44:05,spid20s,Unknown,ResType:LockOwner Stype:’OR’Xdes:0x00000000858A7970 Mode: X SPID:55 BatchID:0 ECID:17 TaskProxy:(0x00000003782D77A0) Value:0x99460c40 Cost:(-30/0)

11/01/2010 18:44:05,spid20s,Unknown,Requested by:

11/01/2010 18:44:05,spid20s,Unknown,Owner:0x000000020C81A7C0 Mode: X        Flg:0x40 Ref:0 Life:02000000 SPID:55 ECID:33 XactLockInfo: 0x0000000080150820

11/01/2010 18:44:05,spid20s,Unknown,Grant List 2:

11/01/2010 18:44:05,spid20s,Unknown,KEY: 5:72057594101563392 (3601a3e3f92d) CleanCnt:2 Mode:X Flags: 0x1

11/01/2010 18:44:05,spid20s,Unknown,Node:1

11/01/2010 18:44:05,spid20s,Unknown,Log Viewer could not read information for this log entry. Cause: Data is Null. This method or property cannot be called on Null values.. Content:

 

Ideally when you kill an open user transaction and if that SPID goes into ROLLBACK state we would recommend you to wait until the ROLLBACK is completed and time required to ROLLBACK depends on the work done by the open transaction which needs to be rolled back (Further reading http://blogs.msdn.com/b/psssql/archive/2008/09/12/sql-server-2000-2005-2008-recovery-rollback-taking-longer-than-expected.aspx)

In such case you should never attempt to restart the sql server service since at restart the database will go in recovery state where database will undergo 3 states of recovery viz Analysis,REDO and UNDO. So if there is a SPID in ROLLBACK state which is not completed and if you restart the SQL Service the database might continue to remain in recovery until entire ROLLBACK is completed. SO previously only the table involved in transaction were locked by ROLLBACK operation but now entire database is unavailable to users.

However in our case where we are facing Intra-query parallelism Deadlock, our only solution is to restart the sql server service. Here as well the Sql Service may take time to start depending on the amount of T-log entries but it should be faster as compared to the above case if there is no pending open user transaction.

In our case it took 40 mins for SQL Service to start since our Tlog had grown upto 160GB.

However for such Intra-query parallelism Deadlock we should not delay for restart since the more we delay to restart the service the more Tlog grows and intrun delays the startup of Sql Service.

So to resolve Intra-query parallelism Deadlock for Killed Transactions you should restart sql server service as soon as you detect it for all the other cases wait for the ROLLBACK to complete and never restart sql service during the ROLLBACK phase.

Hope this makes sense Smile

Parikshit Savjani
Premier Field Engineer, Microsoft

3 comments

  1. Hi Parikshit,

    Good to see that article but you have any solution if this prblem is not related to Intra-query parallelism Deadlock and in that case what we can do. It showing to me rollback is going on and its not self blocking.

    1. Hi Gaurav,

      Unfortunately if the spid is in ROLLBACK state and if you are not seeing any self-blocking, then it indicates that spid is performing the ROLLBACK and in that case your only option is wait until the ROLLBACK is completed. SInce in order to maintain the ACID properties of the database, rollback has to complete.

      Until that time, table will be locked and all the other spids requiring to access the table will be blocked.

      1. Hi Parikshit,

        Thank you for the article.

        Atleast do we have any option to check the status. I mean how much more time it will take to rollback? (STATUSONLY most of the time does not shows the correct information)

        Thanks,
        Vikas B Sahu

Leave a Reply to Vikas Cancel reply

Your email address will not be published. Required fields are marked *