SPGist "triple parity" concept doesn't work - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | SPGist "triple parity" concept doesn't work |
Date | |
Msg-id | 5829.1370555205@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: SPGist "triple parity" concept doesn't work
Re: SPGist "triple parity" concept doesn't work Re: SPGist "triple parity" concept doesn't work |
List | pgsql-hackers |
I've been looking into the problem reported at http://www.postgresql.org/message-id/519A5917.40704@qunar.com and what I find is that we have spgist insertion operations deadlocking against each other because one is descending from page A to page B while the other descends from page B to page A. According to the README file, the "triple parity" page selection algorithm is supposed to prevent that: : While descending the tree, the insertion algorithm holds exclusive lock on : two tree levels at a time, ie both parent and child pages (parent and child : pages can be the same, see notes above). There is a possibility of deadlock : between two insertions if there are cross-referenced pages in different : branches. That is, if inner tuple on page M has a child on page N while : an inner tuple from another branch is on page N and has a child on page M, : then two insertions descending the two branches could deadlock. To prevent : deadlocks we introduce a concept of "triple parity" of pages: if inner tuple : is on page with BlockNumber N, then its child tuples should be placed on the : same page, or else on a page with BlockNumber M where (N+1) mod 3 == M mod 3. : This rule guarantees that tuples on page M will have no children on page N, : since (M+1) mod 3 != N mod 3. That would work fine as long as the invariant is maintained accurately. However, there are at least two cases where the existing code fails to maintain the invariant: 1. In spgAddNodeAction, if the enlarged inner tuple doesn't fit on the current page anymore, we do this: /* * obtain new buffer with the same parity as current, since it will be * a child of same parent tuple */ current->buffer = SpGistGetBuffer(index, GBUF_INNER_PARITY(current->blkno), ... That's fine as long as the parent tuple wasn't also on the current page. If it was on the current page, we end up re-downlinking the parent to a page having the same parity it has, not one more as it should be. I tried to fix this like so: /* * get a new buffer that has the right parity to store a child of * the current tuple's parent */ current->buffer = SpGistGetBuffer(index, GBUF_INNER_PARITY(parent->blkno+ 1), ... but that just moves the problem somewhere else: the link from the parent to the new inner tuple is now guaranteed to follow the parity rules, but the downlinks leading out of it don't follow them anymore. 2. In spgSplitNodeAction, we split an inner tuple into a "prefix" tuple that replaces that inner tuple, and a "postfix" tuple that contains the same downlinks the original tuple did. That's fine as long as we can fit the postfix tuple on the same page. If we can't, we assign it to a page that's one parity level below the current page, and then its outgoing links violate the parity rules. (Keeping the postfix tuple on the current page wouldn't make things better, since we'd still violate the parity rules with respect to either the incoming or outgoing links of the prefix tuple, if it has to go to another page.) With a few repetitions of either of these cases, and some bad luck about placement of the new tuples, you get into situations where two pages each contain downlinks leading to the other; and then a deadlock is just a matter of time. I don't immediately see any good way to fix this. I think the "triple parity" rule as it stands is hopelessly broken, but I don't know what to replace it with, even granting that we don't need to maintain on-disk compatibility. (We'd have to tell people to reindex SPGist indexes anyway, because of the risk that they already contain circular links; so switching to a new layout rule doesn't seem to add any more pain.) Or we could try to modify the insertion algorithm so it doesn't lock two levels of the tree at once, but that seems pretty risky. Thoughts? regards, tom lane
pgsql-hackers by date: