Found: some pretty ugly VACUUM bugs - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Found: some pretty ugly VACUUM bugs |
Date | |
Msg-id | 10156.1124394535@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Found: some pretty ugly VACUUM bugs
|
List | pgsql-hackers |
I believe I've traced down the cause of the Assert trap in VACUUM FULL that Teodor reported here: http://archives.postgresql.org/pgsql-hackers/2005-06/msg01278.php The case that VACUUM is tripping up on is one in which some concurrent transaction (call it X1) updates and then later deletes a row. By the time VACUUM starts, X1 is committed, but it's still within the OldestXmin horizon, so there are open transactions that should not see its effects. We have two tuples in the table, call 'em T1 and T2, representing the original and updated states of the row: XMIN XMAX t_ctid T1 X0 X1 points to T2 T2 X1 X1 points to T2 (ie, itself) (X0 is whichever transaction originally inserted T1; it's old enough to not be interesting. The reason T1 must point to T2 is that a READ COMMITTED transaction that decides to update T1 must be able to find T2 instead.) The reason this configuration is troublesome is that HeapTupleSatisfiesVacuum has this (premature?) optimization in it: if (TransactionIdEquals(HeapTupleHeaderGetXmin(tuple), HeapTupleHeaderGetXmax(tuple))) { /* * inserter also deleted it, so it was never visible to anyone * else */ return HEAPTUPLE_DEAD; } This code causes VACUUM to delete T2, even though T1 is still pointing at it --- and because T1's XMAX is past the OldestXmin horizon, T1 will not be deleted. The Assert that Teodor saw arose from the following specific series of events (which can only occur if T2 is on a lower- numbered page than T1): * T2 is removed and its slot is marked free.* repair_frag moves some unrelated tuple into T2's slot.* repair_frag visitsT1, decides it has to move a tuple chain, and moves the new occupant of T2's slot (which is already wrong anyway).*when update_hint_bits scans T2's page, it finds the wrong number of MOVED_IN tuples because the tuple that wasmoved into T2's slot is now MOVED_OFF. This triggers the Assert. However this is merely the least interesting symptom of the problem. If the state with T1 pointing to a tuple that is actually unrelated is allowed to persist, then after the VACUUM is done someone else could find T1 and then update the new occupant of T2's slot under the mistaken assumption that it's the descendant of T1. My first instinct for a quick fix was to delete the above-quoted section of HeapTupleSatisfiesVacuum. This would ensure that we never remove a tuple that some other tuple might still be pointing at, unless we are going to remove the referencing tuple as well. But this only fixes the problem for VACUUM FULL, which exclusive-locks the whole table. With lazy VACUUM, concurrent transactions can see the intermediate state with T2 gone and T1 not. Thus they could write some new tuple into T2's slot and then make the mistaken update before VACUUM gets a chance to remove T1. The only solution I can see (short of abandoning lazy VACUUM) is that we have to make the code that follows t_ctid chains more wary. That code is already aware (at least in the places I looked at) that a t_ctid link might lead to an empty slot, but if there is a tuple in the slot it just assumes that that is really a descendant version of the tuple pointing to it. That won't do. But I believe it would work if we also test that the XMIN of the tuple in the slot equals the XMAX of the referencing tuple. If they are unequal, we can conclude that the original child tuple is dead and has been removed, so there is no current version of the referencing tuple. It is clearly true that if the XMIN and XMAX are different, the putative child tuple isn't really the child. I believe this test is sufficient, because VACUUM never removes rows generated by still-in-progress transactions, and so it is not possible for a single transaction to store different tuples into the same slot over its lifetime. This is going to require a number of changes since there are several places that follow t_ctid chains. With the change in place, though, I think it's OK to leave the xmin=xmax optimization in place in HeapTupleSatisfiesVacuum. Comments? Anyone see any flaws in the reasoning? regards, tom lane
pgsql-hackers by date: