Re: WAL logging problem in 9.4.3? - Mailing list pgsql-hackers
From | Heikki Linnakangas
---|---
Subject | Re: WAL logging problem in 9.4.3?
Date |
Msg-id | 55AFC302.1060805@iki.fi
In response to | Re: WAL logging problem in 9.4.3? (Simon Riggs <simon@2ndQuadrant.com>)
Responses | Re: WAL logging problem in 9.4.3? (×3)
List | pgsql-hackers
On 07/22/2015 11:18 AM, Simon Riggs wrote:
> On 10 July 2015 at 00:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>>> different performance problem, perhaps we could have COPY create a new
>>>> relfilenode when it does this. That should be safe if the table was
>>>> previously empty.
>>
>>> I'm not convinced that cab9a0656c36739f needs to survive in that
>>> form. To me only allowing one COPY to benefit from the wal_level =
>>> minimal optimization has a significantly higher cost than
>>> cab9a0656c36739f.
>>
>> What evidence have you got to base that value judgement on?
>>
>> cab9a0656c36739f was based on an actual user complaint, so we have good
>> evidence that there are people out there who care about the cost of
>> truncating a table many times in one transaction. On the other hand,
>> I know of no evidence that anyone's depending on multiple sequential
>> COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
>> for having this COPY optimization at all was to make restoring pg_dump
>> scripts in a single transaction fast; and that use-case doesn't care
>> about anything but a single COPY into a virgin table.
>
> We have to backpatch this fix, so it must be both simple and effective.
>
> Heikki's suggestions may be best, maybe not, but they don't seem
> backpatchable.
>
> Tom's suggestion looks good. So does Andres' suggestion. I have coded both.

Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap that
would normally be WAL-logged, we check if the relation is in the hash table,
and skip WAL-logging if so.

That was a simplified explanation. In reality, when a WAL-skipping COPY
begins, we also memorize the current size of the relation. Any actions on
blocks greater than the old size are not WAL-logged, and any actions on
smaller-numbered blocks are. This ensures that if you did any INSERTs on the
table before the COPY, any new actions on the blocks that were already
WAL-logged by the INSERT are also WAL-logged. And likewise, if you perform
any INSERTs after (or during, by trigger) the COPY and they modify the new
pages, those actions are not WAL-logged. So starting a WAL-skipping COPY
splits the relation into two parts: the first part that is WAL-logged as
usual, and the later part that is not WAL-logged. (There is one loose end
marked with XXX in the patch on this, when one of the pages involved in a
cold UPDATE is before the watermark and the other is after.)

The actual fsync() has been moved to the end of transaction, as we are now
skipping WAL-logging of any actions after the COPY as well.

And truncations complicate things further. If we emit a truncation WAL record
in the transaction, we also make an entry in the hash table to record that.
All operations on a relation that has been truncated must be WAL-logged as
usual, because replaying the truncate record will destroy all data even if we
fsync later. But we still optimize for "BEGIN; CREATE; COPY; TRUNCATE; COPY;"
style patterns, because if we truncate a relation that has already been
marked for fsync-at-COMMIT, we don't need to WAL-log the truncation either.
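To make that bookkeeping concrete, here is a minimal standalone C sketch of the idea. This is not the patch itself: the names (PendingSync, pending_sync_register, must_wal_log) are hypothetical, and a plain linked list stands in for the backend-local hash table.

```c
/*
 * Minimal standalone sketch of the pending-fsyncs bookkeeping described
 * above.  NOT the actual patch: all names are hypothetical, and a linked
 * list stands in for the hash table.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;
typedef uint32_t RelId;                 /* stand-in for a relfilenode */

typedef struct PendingSync
{
    RelId       relid;
    BlockNumber skip_from;              /* relation size when the WAL-skipping
                                         * COPY began (the "watermark") */
    bool        truncated;              /* a WAL-logged truncate happened */
    struct PendingSync *next;
} PendingSync;

static PendingSync *pending_syncs = NULL;

static PendingSync *
pending_sync_lookup(RelId relid)
{
    for (PendingSync *e = pending_syncs; e != NULL; e = e->next)
        if (e->relid == relid)
            return e;
    return NULL;
}

/* Called when a WAL-skipping COPY begins: remember the current size. */
static void
pending_sync_register(RelId relid, BlockNumber nblocks_now)
{
    if (pending_sync_lookup(relid) == NULL)
    {
        PendingSync *e = malloc(sizeof(PendingSync));

        e->relid = relid;
        e->skip_from = nblocks_now;
        e->truncated = false;
        e->next = pending_syncs;
        pending_syncs = e;
    }
}

/*
 * Would a change to this block have to be WAL-logged?  Blocks below the
 * watermark may already have WAL records (e.g. from an earlier INSERT), so
 * changes to them keep being logged; blocks at or above the watermark are
 * covered by the fsync at commit instead.  A relation hit by a WAL-logged
 * truncate always logs.
 */
static bool
must_wal_log(RelId relid, BlockNumber blkno)
{
    PendingSync *e = pending_sync_lookup(relid);

    if (e == NULL || e->truncated)
        return true;
    return blkno < e->skip_from;
}
```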
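Continuing the same hypothetical sketch, the truncation rule and the commit-time flush could look roughly like this; fsync_relation() is a placeholder for whatever the real code would call to force the relation to disk.

```c
/*
 * Still the same hypothetical sketch.  Decide how a truncate inside the
 * transaction interacts with the pending-fsync state; returns true if the
 * truncation itself must be WAL-logged.
 */
static bool
pending_sync_at_truncate(RelId relid)
{
    PendingSync *e = pending_sync_lookup(relid);

    if (e != NULL && !e->truncated)
    {
        /*
         * Already marked for fsync-at-commit: the truncation need not be
         * WAL-logged either, and the whole new contents can keep skipping
         * WAL, which keeps "BEGIN; CREATE; COPY; TRUNCATE; COPY;" fast.
         */
        e->skip_from = 0;
        return false;
    }

    /*
     * Otherwise the truncate record goes to WAL, and because replaying it
     * destroys the data no matter what we fsync later, every later change
     * to this relation must be WAL-logged as usual.
     */
    if (e == NULL)
    {
        pending_sync_register(relid, 0);
        e = pending_sync_lookup(relid);
    }
    e->truncated = true;
    return true;
}

extern void fsync_relation(RelId relid);    /* placeholder for the real sync */

/* At COMMIT, flush every relation whose later changes skipped WAL. */
static void
pending_syncs_at_commit(void)
{
    for (PendingSync *e = pending_syncs; e != NULL; e = e->next)
    {
        if (!e->truncated)
            fsync_relation(e->relid);
    }
    /* a real implementation would also free/reset the list here */
}
```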
This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.

>> And what reason is there to think that this would fix all the problems?
>
> I don't think either suggested fix could be claimed to be a great solution,
> since there is little principle here, only heuristic. Heikki's solution
> would be the only safe way, but is not backpatchable.

I can't get too excited about a half-fix that leaves you with data corruption
in some scenarios.

I wrote a little test script to test all these different scenarios
(attached). Both of your patches fail with the script.

- Heikki