[HACKERS] Subtle bug in "Simplify tape block format" commit - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | [HACKERS] Subtle bug in "Simplify tape block format" commit |
Date | |
Msg-id | CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com Whole thread Raw |
Responses |
Re: [HACKERS] Subtle bug in "Simplify tape block format" commit
|
List | pgsql-hackers |
It seems that commit 01ec2563 has a subtle bug, which stems from the fact that logtape.c no longer follows the rule described above ltsGetFreeBlock(): /* * Select a currently unused block for writing to. * * NB: should only be called when writer is ready to write immediately, * to ensure that first write pass is sequential. */ static long ltsGetFreeBlock(LogicalTapeSet *lts) { ... } Previously, all calls to ltsGetFreeBlock() were immediately followed by a corresponding call to ltsWriteBlock(); we wrote out the newly-allocated-in-first-pass block there and then. It's a good idea for a corresponding ltsWriteBlock() call to happen immediately, and it's *essential* for it to happen before any later block is written during the first write pass (when tuples are initially dumped out to form runs), since, as noted above ltsWriteBlock(): /* * Write a block-sized buffer to the specified block of the underlying file. * * NB: should not attempt to write beyond current end of file (ie, create * "holes" in file), since BufFile doesn't allow that. The first write pass * must write blocks sequentially. ... However, a "write beyond current end of file" now seems to actually be attempted at times, resulting in an arcane error during sorting. This is more or less because LogicalTapeWrite() doesn't heed the warning from ltsGetFreeBlock(), as seen here: while (size > 0) { if (lt->pos >= TapeBlockPayloadSize) { ... /* * First allocate the next block, so that we can store it in the * 'next' pointer of this block. */ nextBlockNumber = ltsGetFreeBlock(lts); /* set the next-pointer and dump the current block. */ TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber; ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer); ... } ... } Note that LogicalTapeWrite() now allocates each block (calls ltsGetFreeBlock()) one block in advance of current block, immediately before *current* block is written (not the next block/block just allocated). So, the corresponding ltsWriteBlock() call *must* happen at an unspecified time in the future, typically the next time control ends up at exactly the same point, where the block that was "next" becomes "current". I'm not about to argue that we should go back to following this ltsGetFreeBlock() rule, though; I can see why Heikki refactored LogicalTapeWrite() to allocate blocks a block in advance. Still, we need to be more careful in avoiding the underlying problem that the ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-* has logtape.c be sure to write out a tape's buffer every time tuplesort ends a run. I have a test case. Build Postgres with only 0002-* applied, which makes MAX_PHYSICAL_FILESIZE far smaller than its current value (current value is 2^30, while this reduces it to BLCKSZ). Then: postgres=# set replacement_sort_tuples = 0; set work_mem = 64; SET SET postgres=# with a as (select repeat('a', 7000) i from generate_series(1, 7)) select * from a order by i; ERROR: XX000: could not write block 5 of temporary file: Success LOCATION: ltsWriteBlock, logtape.c:204 This "block 5" corresponds to an fd.c file/BufFile segment that does not in fact exist when this error is raised. Since BufFiles multiplex BLCKSZ-sized segment files in this build of Postgres, it's quite likely, in general, that writes marking the end of a run will straddle "segment boundaries", which is what it takes for buffile.c/logtape.c to throw this error. (The BufFile contract does not allow "holes" in files, but segment boundaries must be crossed as the critical point (just before a tape switch) to actually see this error -- you'd have to be very unlucky to have things line up in exactly the wrong way with 1GiB BufFile segments, but it can still happen.) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: