Re: WAL logging problem in 9.4.3? - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: WAL logging problem in 9.4.3? |
Date | |
Msg-id | 559FA0BA.3080808@iki.fi |
In response to | Re: WAL logging problem in 9.4.3? (Andres Freund <andres@anarazel.de>) |
Responses |
Re: WAL logging problem in 9.4.3?
|
List | pgsql-hackers |
On 07/10/2015 12:14 PM, Andres Freund wrote:
> On 2015-07-10 11:50:33 +0300, Heikki Linnakangas wrote:
>> On 07/10/2015 02:06 AM, Tom Lane wrote:
>>> cab9a0656c36739f was based on an actual user complaint, so we have good
>>> evidence that there are people out there who care about the cost of
>>> truncating a table many times in one transaction.
>>
>> Yeah, if we specifically made that case cheap, in response to a complaint,
>> it would be a regression to make it expensive again. We might get away with
>> it in a major version, but would hate to backpatch that.
>
> Sure. But making COPY slower would also be one. Of a longer standing
> behaviour, with massively bigger impact if somebody relies on it? I mean
> a new relfilenode includes a couple heap and storage options. Missing
> the skip wal optimization can easily double or triple COPY durations.

Completely disabling the skip-WAL optimization is not acceptable either,
IMO. It's a false dichotomy that we have to choose between those two
options. We'll have to consider the exact scenarios where we'd have to
disable the optimization vs. using a new relfilenode.

>>>> My tentative guess is that the best course is to
>>>> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
>>>>    truncation replay issue.
>>>> b) Force new pages to be used when using the heap_sync mode in
>>>>    COPY. That avoids the INIT danger you found. It seems rather
>>>>    reasonable to avoid using pages that have already been the target of
>>>>    WAL logging here in general.
>>>
>>> And what reason is there to think that this would fix all the problems?
>>> We know of those two, but we've not exactly looked hard for other cases.
>>
>> Hmm. Perhaps that could be made to work, but it feels pretty fragile.
>
> It does. I'm not very happy about this mess.
>
>> For
>> example, you could have an insert trigger on the table that inserts
>> additional rows to the same table, and those inserts would be intermixed
>> with the rows inserted by COPY.
>
> That should be fine? As long as copy only uses new pages INSERT can use
> the same ones without problem. I think...
>
>> Full-page images in general are a problem.
>
> With the above rules I don't think it'd be. They'd contain the previous
> contents, and we'll not target them again with COPY.

Well, you really have to ensure that COPY never uses a page that any
other operation (INSERT, DELETE, UPDATE, hint-bit-update) has ever
touched and created a FPW for. The naive approach, where you just reset
the target block at the beginning of COPY and use the HEAP_INSERT_SKIP_FSM
option, is not enough. It's possible, but requires a lot more bookkeeping
than might seem at first glance.

>> I think we should
>> 1. reliably and explicitly keep track of whether we've WAL-logged any
>> TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations on
>> the relation, and
>> 2. make sure we never skip WAL-logging again if we have.
>>
>> Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
>> when a new relfilenode is created, i.e. whenever rd_createSubid or
>> rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
>> (including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
>> only skip WAL-logging if the flag is still set. To deal with the case that
>> the flag gets cleared in the middle of COPY, also check the flag whenever
>> we're about to skip WAL-logging in heap_insert, and if it's been cleared,
>> ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.
>
> Am I missing something or will this break the BEGIN; TRUNCATE; COPY;
> pattern we use ourselves and have suggested a number of times ?

Sorry, I was imprecise above. I meant "whenever an XLOG_SMGR_TRUNCATE
record is WAL-logged", rather than "whenever a TRUNCATE [command] is
WAL-logged". TRUNCATE on a table that wasn't created in the same
transaction doesn't emit an XLOG_SMGR_TRUNCATE record, because it
creates a whole new relfilenode. So that's OK.

In the long term, I'd like to refactor this whole thing so that we never
WAL-log any operations on a relation that's created in the same
transaction (when wal_level=minimal). Instead, at COMMIT, we'd fsync()
the relation, or if it's smaller than some threshold, WAL-log the
contents of the whole file at that point. That would move all that
more-difficult-than-it-seems-at-first-glance logic from COPY and the
index AMs to a central location, and it would allow the same
optimization for all operations, not just COPY. But that probably isn't
feasible to backpatch.

- Heikki
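For illustration only, the rd_skip_wal_safe rule described in this message could be sketched roughly as below. This is a standalone toy model, not PostgreSQL code: DemoRelation, DEMO_INSERT_SKIP_WAL and all demo_* functions are hypothetical stand-ins invented for the sketch, and "WAL-logging" is reduced to bumping a counter. Only the flag name and the set/clear/check rule are taken from the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the insert option discussed in the thread. */
#define DEMO_INSERT_SKIP_WAL 0x0001

/* Hypothetical stand-in for the relevant part of RelationData. */
typedef struct DemoRelation
{
    bool rd_skip_wal_safe; /* proposed flag: still safe to skip WAL? */
    int  wal_records;      /* toy counter standing in for emitted WAL */
} DemoRelation;

/* A new relfilenode was created in this transaction: set the flag. */
static void
demo_new_relfilenode(DemoRelation *rel)
{
    assert(rel != NULL);
    rel->rd_skip_wal_safe = true;
}

/*
 * An XLOG_SMGR_TRUNCATE record or a full-page image (including
 * INSERT/UPDATE+INIT) was WAL-logged for the relation: clear the flag,
 * so that WAL-logging is never skipped again.
 */
static void
demo_log_truncate_or_full_page(DemoRelation *rel)
{
    assert(rel != NULL);
    rel->wal_records++;
    rel->rd_skip_wal_safe = false;
}

/*
 * Toy heap_insert: honour the skip-WAL option only while the flag is
 * still set. Returns true if the insert was WAL-logged.
 */
static bool
demo_heap_insert(DemoRelation *rel, int options)
{
    assert(rel != NULL);

    bool skip = (options & DEMO_INSERT_SKIP_WAL) != 0 &&
                rel->rd_skip_wal_safe;

    if (!skip)
        rel->wal_records++; /* option ignored or absent: WAL-log anyway */

    return !skip;
}
```

Because the check sits in the insert path itself, a flag cleared in the middle of a COPY (for example by a trigger whose insert produces a full-page image) takes effect on the very next row, which is the mid-COPY case the proposal calls out.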