Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated) - Mailing list pgsql-bugs
From | Thomas Munro |
---|---|
Subject | Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated) |
Date | |
Msg-id | CAEepm=3ctG4RZZDjUycMx0_TkSUAVmVKJzowGDwzTy_BEFZcjQ@mail.gmail.com |
In response to | Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated) (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated) |
List | pgsql-bugs |
On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On May 9, 2015, at 8:00 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> On Sat, May 9, 2015 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>>> Thomas Munro wrote:
>>>>> I think the fix is something like "if nextMXact == oldestMultiXactId,
>>>>> then there are no active multixacts, so the offsetStopLimit should be
>>>>> set to nextOffset - (a segment's worth)".
>>>>
>>>> Makes sense.
>>>
>>> Here's a patch that attempts to implement this.
>>
>> Thanks. I think I have managed to reproduce something like the data
>> loss race that we were speculating about.
>>
>> 0. initdb, autovacuum = off, set up explode_mxact_members.c as
>> described elsewhere in the thread.
>> 1. Fill up the members SLRU completely (ie reach state where you can
>> no longer create a new multixact of any size). pg_multixact/members
>> contains 82040 files and the last one is named 14077.
>> 2. Issue CHECKPOINT, but use a debugger to stop inside
>> TruncateMultiXact after it has read
>> MultiXactState->lastCheckpointedOldest and released the lock, but
>> before it calls SlruScanDirectory to delete files...
>> 3. Run VACUUM FREEZE in all databases (including template0). datminmxid moves.
>> 4. Create lots of new multixacts. pg_multixact/members now contains
>> 82041 files and the last one is named 14078 (ie one extra segment,
>> with the highest possible segment number, which couldn't be created
>> before vacuuming because of the one segment gap enforced by
>> DetermineSafeOldestOffset). Segments 0000-0016 have new modified
>> times.
>> 5. ... allow the checkpoint started in step 2 to continue. It
>> deletes segments, keeping only 0000-0016. The segment 14078 which
>> contained active member data has been incorrectly deleted.
>
> OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid catastrophe?

Yes, in a quick test, at step 4 I couldn't proceed. I need to prod
this some more on Monday, and also see how it interacts with
autovacuum's view of what work needs to be done.

Here is my attempt at a summary.

In master, we have 3 arbitrarily overlapping processes:

1. VACUUM advances oldest multixact and member tail.
2. CHECKPOINT observes member tail and head (separately) and then
deletes storage.
3. Regular transaction observes tail, checks boundary and advances
head.

Information flows from 1 to 2, from 3 to 2, and from 1 to 3. 2
doesn't have a consistent view of head and tail, and doesn't prevent
them moving while deleting storage, so the effect is that we can
delete the wrong range of storage.

With the patch, we have 3 arbitrarily overlapping processes:

1. VACUUM advances oldest multixact.
2. CHECKPOINT observes oldest multixact, deletes storage and then
advances member tail.
3. Regular transaction observes member tail, checks boundary and
advances member head.

Information flows from 1 to 2 and from 2 to 3. Although 2 works with
a snapshot of the oldest multixact which may move before it deletes
storage, 2 knows that the member tail isn't moving (that is its job),
and that 3 can't move the head past the tail (or rather the stop
limit, which is the tail minus a gap), so the effect of using an
out-of-date oldest multixact is that we err on the side of being too
conservative with our allocation of member space, which is good.

I suppose you could have a workload that eats member space really fast
and checkpoints too infrequently so that you run out of space before a
checkpoint advances the tail. I think that is why you were suggesting
triggering checkpoints automatically in some cases. But I think that
would be a pretty insane workload (I can't convince my computer to
consume 2^32 member elements in under a couple of hours using the
pathological explode_mxact_members.c workload, and you can't set
checkpoint_timeout above 1 hour).

-- 
Thomas Munro
http://www.enterprisedb.com
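
As a rough cross-check of the segment names in steps 1 and 4, and of the claim that a stale tail only makes allocation more conservative, here is a minimal sketch in Python. It is toy arithmetic, not PostgreSQL source. The constants (1636 members per page, 32 pages per segment) are assumptions based on the default 8 kB block size, the helper names are hypothetical, and the stop limit is modelled as "start of the tail's segment minus one whole segment", which is one reading of "the tail minus a gap".

```python
# Toy arithmetic only, not PostgreSQL source. Constants assume the default
# 8 kB block size; all helper names are hypothetical.
MEMBERS_PER_PAGE = 1636          # assumed: 409 groups of 4 members per 8 kB page
PAGES_PER_SEGMENT = 32           # assumed: SLRU pages per segment file
MEMBERS_PER_SEGMENT = MEMBERS_PER_PAGE * PAGES_PER_SEGMENT
OFFSET_SPACE = 2 ** 32           # member offsets are 32-bit and wrap around


def segment_name(offset):
    """Hex file name of the members segment that holds this offset."""
    return format((offset % OFFSET_SPACE) // MEMBERS_PER_SEGMENT, "04X")


def stop_limit(tail):
    """Start of the tail's segment, minus one whole segment as a gap."""
    segment_start = tail - tail % MEMBERS_PER_SEGMENT
    return (segment_start - MEMBERS_PER_SEGMENT) % OFFSET_SPACE


def room(head, tail):
    """Members the head may still consume before reaching the stop limit.

    Assumes the head is not already inside the gap.
    """
    return (stop_limit(tail) - head) % OFFSET_SPACE


if __name__ == "__main__":
    # With the tail in segment 0000, the stop limit falls in segment 14077.
    print(segment_name(stop_limit(0)))
    # The highest possible segment, reachable only after the tail moves.
    print(segment_name(OFFSET_SPACE - 1))
    # A stale (older) tail never grants more room than a fresh one.
    head = 4_000_000_000
    print(room(head, 0) < room(head, 3_000_000_000))
```

Under these assumptions the stop limit for a tail in segment 0000 lands in segment 14077, so 14078 cannot be created until the tail advances, which matches the file counts reported in steps 1 and 4. The last comparison illustrates why working from an out-of-date tail errs on the safe side: the head is simply given less room.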