tackling full page writes - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | tackling full page writes |
Date | |
Msg-id | BANLkTimhopkDvD2y_S-0Kf874ueX-gQD8Q@mail.gmail.com |
List | pgsql-hackers |
While eating good Indian food and talking about aviation accidents on the last night of PGCon, Greg Stark, Heikki Linnakangas, and I found some time to brainstorm about possible ways to reduce the impact of full_page_writes. I'm not sure that these ideas are much good, but for the sake of posterity:

1. Heikki suggested that instead of doing full page writes, we might try to write only the parts of the page that have changed. For example, if we had 16 bits to play with in the page header (which we don't), then we could imagine the page as being broken up into 16 512-byte chunks, one per bit. Each time we update the page, we write whatever subset of the 512-byte chunks we're actually modifying, except for any that have been written since the last checkpoint. In more detail, when writing a WAL record, if a checkpoint has intervened since the page LSN, then we first clear all 16 bits, set the bits for the chunks we're modifying, and XLOG those chunks. If no checkpoint has intervened, then we set the bits for any chunks that we are modifying and for which the corresponding bits aren't yet set, and XLOG the corresponding chunks. As I think about it a bit more, we'd need to XLOG not only the parts of the page we're actually modifying, but also any that the WAL record would need to be correct on replay.

(It was further suggested that, in our grand tradition of bad naming, we could name this feature "partial full page writes" and enable it either with a setting of full_page_writes=partial, or better yet, add a new GUC partial_full_page_writes. The beauty of the latter is that it's completely ambiguous what happens when full_page_writes=off and partial_full_page_writes=on. Actually, we could invert the sense and call it disable_partial_full_page_writes instead, which would probably remove all hope of understanding. This all seemed completely hilarious when we were talking about it, and we weren't even drunk.)

2. The other fairly obvious alternative is to adjust our existing WAL record types to be idempotent, i.e. to not rely on the existing page contents. For XLOG_HEAP_INSERT, we currently store the target TID and the tuple contents. I'm not sure if there's anything else, but we would obviously need the offset where the new tuple should be written, which we currently infer from reading the existing page contents. For XLOG_HEAP_DELETE, we store just the TID of the target tuple; we would certainly need to store its offset within the block, and maybe the infomask. For XLOG_HEAP_UPDATE, we'd need the old and new offsets and perhaps also the old and new infomasks. Assuming that's all we need and I'm not missing anything (which I won't bet on), that means we'd be adding, say, 4 bytes per insert or delete and 8 bytes per update. So, if checkpoints are spread out widely enough that there will be more than ~2K operations per page between checkpoints (8192 bytes / 4 bytes per operation = 2048, with the default 8kB block size), then it makes more sense to just do a full page write and call it good. If not, this idea might have legs.

3. Going a bit further, Greg proposed the idea of ripping out our current WAL infrastructure altogether and instead just having one WAL record that says "these byte ranges on this page changed to have these new contents". That's elegantly simple, but I'm afraid it would bloat the records quite a bit. For example, as Heikki pointed out, XLOG_HEAP_DELETE relies on the XID in the record header to figure out what to write, and all the heap-modification operations implicitly specify the visibility map change when they specify the heap change. We currently have a flag to indicate whether the visibility map actually requires an update, but it's just one bit. However, one possible application of this concept is that we could add something like this in along with our existing WAL record types. It might be useful, for example, for third-party index AMs, which are currently pretty much out of luck.

That's about as far as we got.
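To make the bookkeeping in idea 1 a little more concrete, here is a rough Python sketch of the per-page chunk bitmap. It is only an illustration under stated assumptions, not a proposal for actual code: the names `chunks_touched`, `chunks_to_log`, `logged_mask`, and `redo_lsn` are all hypothetical, LSNs are plain integers, and "a checkpoint has intervened since the page LSN" is modeled as `page_lsn <= redo_lsn`.

```python
CHUNK_SIZE = 512        # bytes per chunk, as in the example above
CHUNKS_PER_PAGE = 16    # 16 * 512 = 8192, i.e. one default-sized page


def chunks_touched(offset, length):
    """Bitmask of the 512-byte chunks overlapped by [offset, offset + length)."""
    if length <= 0 or offset < 0 or offset + length > CHUNK_SIZE * CHUNKS_PER_PAGE:
        raise ValueError("byte range does not fit in the page")
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    mask = 0
    for i in range(first, last + 1):
        mask |= 1 << i
    return mask


def chunks_to_log(logged_mask, page_lsn, redo_lsn, needed_mask):
    """Return (new_logged_mask, mask_of_chunks_to_xlog).

    logged_mask: the hypothetical 16 bits in the page header.
    needed_mask: chunks this WAL record modifies, plus any others it needs
                 to be correct on replay.
    """
    # Assumption: the page was last touched before the latest checkpoint's
    # redo point, so nothing logged earlier can be relied upon any more.
    if page_lsn <= redo_lsn:
        logged_mask = 0
    # Log only the chunks not already written since the last checkpoint.
    to_log = needed_mask & ~logged_mask
    return logged_mask | to_log, to_log
```

For instance, a 20-byte change starting at byte 500 straddles chunks 0 and 1. If chunk 0 has already been logged since the last checkpoint, only chunk 1 needs to be logged; if a checkpoint has intervened, both do.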
Though I haven't convinced anyone else yet, I still think there's some merit to the idea of just writing the portion of the page that precedes pd_upper. WAL records would have to assume that the tuple data might be clobbered, but they could rely on the early portion of the page to be correct. AFAICT, that would be OK for all of the existing WAL records except for XLOG_HEAP2_CLEAN (i.e. vacuum), with the caveat that, prior to the minimum recovery point, they'd need to apply their changes unconditionally rather than considering the page LSN. Tom has argued that won't work, but I'm not sure he's convinced anyone else yet...

Anyone else have good ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company