DeadLockCheck is buggy - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | DeadLockCheck is buggy |
Date | |
Msg-id | 14276.979684024@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: DeadLockCheck is buggy
|
List | pgsql-hackers |
Create three tables and start four transactions, then do: XACT 1: LOCK TABLE A; XACT 2: LOCK TABLE B IN ROW SHARE MODE; XACT 3: LOCK TABLE B IN ROW EXCLUSIVE MODE; XACT 4: LOCK TABLE C; XACT 2: LOCK TABLE C; XACT 3: LOCK TABLE C; XACT 1: LOCK TABLE B IN SHARE MODE; << wait at least 1 second here >> XACT 4: LOCK TABLE A; The system is now deadlocked: 4 waits for 1, 1 waits for 3, 3 waits for 4 ... but DeadLockCheck doesn't notice. Deadlock *is* detected if LOCK TABLE C is issued in xact 3 before xact 2, however. The problem is that after a recursive invocation of DeadLockCheck, the code checks to see if the waitProc is blocked by thisProc, and assumes no deadlock if not. But each waitProc will only be visited once, so if we happen to reach it first by way of a proc that is not blocking it, we fail to notice that there really is a deadlock. In this example, we reach xact 1 by way of xact 2 (which is hit first in the scan of waiters for lock C); xact 2 is waiting on the same lock as xact 3, but only xact 3 blocks xact 1. I have been studying DeadLockCheck for most of a day now, and I doubt that this is the only bug lurking in it. I think that we really ought to throw it away and start over, because it doesn't look to me at all like a standard deadlock-detection algorithm. The standard way of doing this is that you trace outward along waits-for relationships and see if you come back to your starting point or not. This is not doing that. One reason is that the existing locktable structure makes it extremely painful to discover just which processes are holding a given lock. You can only find which ones are *waiting* for a given lock. That's not a killer problem, we can simply trace the waits-for relationships in reverse, but it's not really doing that either. In particular, it's only doing the tracing in a literal sense for conflicts between locks requested and locks already held. It's not doing things that way for the case where A is waiting for B because they are requesting (not holding) conflicting locks and A is behind B in the queue for the lock. I think there are bugs associated with that case too, though I don't have an example yet. Anyway, the bottom line is that I want to start over using a standard deadlock-check algorithm, along the following lines: 1. Make it relatively cheap to find all the procs already holding a given lock, by extending the lock datastructures so that there's a list of already-granted holder objects for each lock, not only pending holders. (In other words, each HOLDER object will now be in two linked lists, one for its proc and one for its lock, not only one list. This will also make it a lot easier to write a useful lock-state-dumping routine, whenever someone gets around to that.) 2. Rewrite DeadLockCheck to recurse outwards to procs that the current proc is waiting for, rather than vice versa, and consider both hard blocking (he's already got a conflicting lock) and priority blocking (he's got a conflicting request and is ahead of me in the wait queue). 3. As now, if a cycle is found that includes a priority block, try to break the cycle by awakening the lower-priority waiter out of order, rather than aborting a transaction. 4. Clean up DeadLockCheck so that it can be started from any proc, not only MyProc. This is needed for point #5. 5. Change ProcLockWakeup to address Hiroshi's concern about possible starvation if we awaken waiters out-of-order. If we find a wakable waiter behind one or more nonwakable waiters, awaken it only if DeadLockCheck says it is deadlocked. Under normal circumstances that won't happen and we won't break priority-of-arrival ordering, but if it's necessary to avoid deadlock then we will. (We could avoid excess computation here by only running DeadLockCheck if the waiting process has already done so on its own behalf; otherwise don't, just let it happen if the waiter is still blocked when its timeout occurs. That would be easy if we add a bool I've-run-DeadLockCheck flag to the PROC structures.) Comments? regards, tom lane
pgsql-hackers by date: