Tricky bugs in concurrent index build - Mailing list pgsql-hackers
From: Tom Lane
Subject: Tricky bugs in concurrent index build
Msg-id: 9364.1156267572@sss.pgh.pa.us
List: pgsql-hackers
I see fairly nasty problems in the concurrent-index patch when it's trying to build a unique index. I think it's solvable, but I want some more eyeballs on my reasoning.

Look at the code in IndexBuildHeapScan where we are deciding whether or not to include a tuple in the index ("indexIt") and also whether or not to include it in the uniqueness check ("tupleIsAlive").

In a normal non-concurrent build, we have to include recently-dead tuples in the index because transactions started before ours might try to use the index after we finish it. (Remember that system catalogs generally operate on SnapshotNow, so a query could use a newly created index even though it will be run with a serializable snapshot much older than the index.) So we have to put tuples into the index if any active transaction might wish to see those tuples. OTOH, we should exclude dead tuples from the uniqueness check: the uniqueness constraint ought to be enforced across currently-valid tuples only. In particular, for tuples previously created or deleted by our own transaction, we certainly must include the created ones, and exclude the deleted ones, in the uniqueness check.

In the past, the only way we could see HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS was for tuples created/deleted by our own transaction, and so the action taken by IndexBuildHeapScan is to include the tuple in the index in both cases, but exclude DELETE_IN_PROGRESS tuples from the uniqueness check. This does not work for a concurrent build, though, because if the in-progress delete is from another transaction, it could roll back after we look. In that case we have an entry that is in the index but has escaped the uniqueness check; if it conflicts with another tuple also entered into the index in the first pass, we'll never notice that.

I think we can solve this by having IndexBuildHeapScan not index DELETE_IN_PROGRESS tuples if it's doing a concurrent build. The problem of old transactions trying to use the index does not exist, because we'll wait 'em out before marking the index valid, so we need not worry about preserving validity for old snapshots. And if the deletion does in fact roll back, we'll insert the tuple during the second pass, and catch any uniqueness violation at that point.
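To make the first-pass rule concrete, here's a rough sketch of the decision table with that change, written as a standalone helper. The HTSV_Result codes and the indexIt/tupleIsAlive outputs are the real ones discussed above; the function name and the is_concurrent flag are just illustrative stand-ins for however the build mode gets passed down, so don't read this as actual patch text.

    #include "postgres.h"
    #include "utils/tqual.h"        /* for HTSV_Result */

    /*
     * Sketch of IndexBuildHeapScan's per-tuple classification, with the
     * proposed change for concurrent builds.  The HEAPTUPLE_* codes are
     * the real visibility results from HeapTupleSatisfiesVacuum; the
     * function name and is_concurrent flag are illustrative only.
     */
    static void
    classify_tuple_for_build(HTSV_Result vacstat, bool is_concurrent,
                             bool *indexIt, bool *tupleIsAlive)
    {
        switch (vacstat)
        {
            case HEAPTUPLE_LIVE:
                *indexIt = true;        /* index it, and check uniqueness */
                *tupleIsAlive = true;
                break;
            case HEAPTUPLE_DEAD:
                *indexIt = false;       /* no snapshot can see it anymore */
                *tupleIsAlive = false;
                break;
            case HEAPTUPLE_RECENTLY_DEAD:
                *indexIt = true;        /* older snapshots may still need it */
                *tupleIsAlive = false;  /* but don't count it for uniqueness */
                break;
            case HEAPTUPLE_INSERT_IN_PROGRESS:
                *indexIt = true;        /* historically only our own insert */
                *tupleIsAlive = true;
                break;
            case HEAPTUPLE_DELETE_IN_PROGRESS:
                /*
                 * Proposed: in a concurrent build, skip it.  No old snapshot
                 * can use the index before we wait out older transactions,
                 * and if the delete rolls back the second pass will index
                 * the tuple and uniqueness-check it then.
                 */
                *indexIt = !is_concurrent;
                *tupleIsAlive = false;
                break;
        }
    }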
But wait, there's more: in the patch as it stands, the second pass over the table ignores DELETE_IN_PROGRESS tuples, which is wrong. It's entirely possible for a tuple that is RECENTLY_DEAD or DELETE_IN_PROGRESS to have no entry in the index, if it was inserted during the first pass and the deletion occurred after the first pass (and either has or hasn't committed yet). If we ignore DELETE_IN_PROGRESS and then the deleter rolls back, the tuple never gets into the index at all. Furthermore, the patch also tries to insert RECENTLY_DEAD tuples, which is good for MVCC coverage but wrong for uniqueness checking --- keep in mind that in the second pass we are just doing normal index insertions, so anything we insert into the index will be uniqueness-checked as though it were still alive. We could get a uniqueness failure that should not occur, e.g. from trying to insert the old version of a recently-updated row.

What I think we can do about this is to include DELETE_IN_PROGRESS tuples in the set of candidate tuples to insert in the second pass. During the merge step that verifies whether the tuple is already in the index, if we find that it's not, then we must wait for the deleter to commit or roll back. If the deleter commits then we ignore the tuple. If the deleter rolls back then we have to insert the tuple in the index. (I think we have to actually take a FOR UPDATE or possibly FOR SHARE lock on the tuple while we do this, else we have race conditions against someone else starting a new deletion attempt on the tuple.) In the commit case we've essentially waited long enough to transform DELETE_IN_PROGRESS into RECENTLY_DEAD, and for both of those statuses we are not going to insert into the index, for fear of causing a false uniqueness violation.
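To spell that out, a rough sketch of the merge-step handling might look like the function below. This is just an illustration of the idea, not the patch: XactLockTableWait and TransactionIdDidCommit are the real primitives for waiting on and checking another transaction's fate, while the already-in-index flag, the tuple-locking helper, and the index-insertion helper are hypothetical placeholders for whatever the second pass actually does.

    /*
     * Sketch of proposed handling, inside the second-pass merge loop, for a
     * DELETE_IN_PROGRESS candidate tuple.  Only XactLockTableWait and
     * TransactionIdDidCommit are real backend calls; the two helpers at the
     * bottom are hypothetical stand-ins.
     */
    static void
    handle_delete_in_progress_candidate(HeapTuple heapTuple,
                                        bool already_in_index)
    {
        TransactionId deleter;

        if (already_in_index)
            return;                 /* index entry exists; nothing to do */

        deleter = HeapTupleHeaderGetXmax(heapTuple->t_data);

        /* Block until the deleting transaction commits or aborts. */
        XactLockTableWait(deleter);

        if (TransactionIdDidCommit(deleter))
        {
            /*
             * Effectively RECENTLY_DEAD now: leave it out of the index to
             * avoid a false uniqueness violation; the final wait before
             * marking the index valid covers old snapshots.
             */
            return;
        }

        /*
         * The deletion rolled back, so the tuple is live and must be
         * indexed.  Lock it first (FOR UPDATE or possibly FOR SHARE style)
         * so no one can start a new deletion underneath us.
         */
        lock_tuple_against_new_deleters(heapTuple);          /* hypothetical */
        insert_into_index_with_uniqueness_check(heapTuple);  /* hypothetical */
    }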
The upshot of not indexing those tuples is that after we finish the second pass, we need *another* wait-for-everyone-to-die before we can mark the index valid. Otherwise we have the same MVCC problem as before: someone might try to use the index for a query with a snapshot old enough that it should be able to see the not-inserted tuple.

Have I missed anything? This is tricky stuff. In any case it's clear that the patch still needs major work. Greg, do you have cycles to spare now?

            regards, tom lane