Re: Logical to physical page mapping - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Logical to physical page mapping |
Date | |
Msg-id | CA+TgmobE70334xF2-xdPQz-xfc_WN9FQYATsD7QJ=8SAaQNhNw@mail.gmail.com |
In response to | Logical to physical page mapping (Jan Wieck <JanWieck@Yahoo.com>) |
Responses |
Re: Logical to physical page mapping
Re: Logical to physical page mapping |
List | pgsql-hackers |
On Sat, Oct 27, 2012 at 1:01 AM, Jan Wieck <JanWieck@yahoo.com> wrote:
> The reason why we need full_page_writes is that we need to guard against
> torn pages or partial writes. So what if smgr would manage a mapping between
> logical page numbers and their physical location in the relation?

This sounds a lot like http://en.wikipedia.org/wiki/Shadow_paging

According to my copy of Gray and Reuter, shadow paging is in fact a workable way of providing atomicity and durability, but as of its writing (1992) shadow paging had been essentially abandoned because it didn't have very good performance characteristics. One of the big problems is that you lose locality of reference - e.g. there's nothing at all sequential about a sequential scan if, below the mapping layer, the blocks are scattered about the disk, which is a likely outcome, if they are frequently updated, or in the long run even if they are only occasionally updated.

It's occurred to me before to think that this might work if we did it, not at the block level, but at some higher level, with say 64MB segments. That wouldn't impinge too much on sequential access, but it would allow vacuum to clip out an entire 64MB segment anywhere in the relation if it happened to be empty, or perhaps to rewrite a 64MB segment of a relation without rewriting the whole thing. But it wouldn't do anything about torn pages.

Another idea that's been previously proposed (and which is used by MySQL, and previously proposed by VMware for inclusion in PostgreSQL) for torn-page avoidance is that of a double-write buffer - i.e. instead of including full page images in WAL, write them to the double-write buffer; if we crash, start by restoring all the pages from the double-write buffer; then, replay WAL. This avoids passing the full-page images through the WAL stream sent from master to slave, because the slave can have its own double-write buffer.
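To make the double-write ordering concrete, here is a minimal in-memory Python sketch. It is not PostgreSQL or MySQL code: the `Storage` class, its method names, and the use of a CRC32 per page are all assumptions made for illustration. It only shows the sequence described above: write the full image to the double-write area first, then write in place, and after a crash restore every intact image from the double-write area before any WAL replay.

```python
import zlib

PAGE_SIZE = 8192  # assumed page size, for illustration only


def checksum(data: bytes) -> int:
    """Toy page checksum (CRC32); a real system would store it in the page."""
    return zlib.crc32(data)


class Storage:
    """Hypothetical stand-in for a data file plus a double-write area."""

    def __init__(self):
        self.data = {}          # page number -> (payload, stored checksum)
        self.double_write = {}  # page number -> (payload, stored checksum)

    def write_page(self, pageno: int, payload: bytes, tear: bool = False):
        # Step 1: the full image goes to the double-write area first
        # (a real implementation would fsync here before continuing).
        self.double_write[pageno] = (payload, checksum(payload))
        # Step 2: write in place. `tear` simulates a crash mid-write,
        # leaving half new and half old bytes on disk.
        if tear:
            old = self.data.get(pageno, (bytes(PAGE_SIZE), 0))[0]
            half = PAGE_SIZE // 2
            payload_on_disk = payload[:half] + old[half:]
        else:
            payload_on_disk = payload
        self.data[pageno] = (payload_on_disk, checksum(payload))

    def page_is_valid(self, pageno: int) -> bool:
        payload, crc = self.data[pageno]
        return checksum(payload) == crc

    def recover(self):
        """Restore all intact double-write images; WAL replay would follow."""
        restored = []
        for pageno, (payload, crc) in self.double_write.items():
            if checksum(payload) != crc:
                # Image torn inside the double-write area itself, so the
                # in-place write never started; leave the data page alone.
                continue
            self.data[pageno] = (payload, crc)
            restored.append(pageno)
        return restored
```

For example, writing page 0 once cleanly and once with `tear=True` leaves `page_is_valid(0)` false; calling `recover()` puts the complete image back, after which ordinary WAL replay could proceed without needing a full-page image in the WAL itself.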
This would probably also allow slaves to perform restart-points at arbitrary locations independent of where the master performs checkpoints.

In the patch as proposed, the double-write buffer was kept very small, in the hopes of keeping it within the presumed BBWC, so that very-frequent fsyncs would all reference the same pages and therefore all be absorbed by the cache. This delivers terrible performance without a BBWC, though, because the fsyncs are so frequent.

Alternatively, you could imagine a large double-write buffer which only gets flushed once per checkpoint cycle or so - i.e. basically what we have now, but just separating the FPW traffic from the main WAL stream.

Indeed, you could extend that a bit further: why throw out the double-write buffer just because there's been a checkpoint cycle? In a workload like pgbench, it seems likely that the same pages will be written over and over again. You could have a checkpoint whose purpose is only to minimize the recovery time in cases where no pages are torn. You could then also have a less frequent "super-checkpoint" cycle and retain WAL back to the last "super-checkpoint". In the hopefully-unlikely event that we detect a torn page (through a checksum failure, presumably) then we hunt backwards through WAL (something our current infrastructure doesn't really support) and find the last FPI for that torn page and then begin selective replay from that point, scanning through all of the WAL since the last super-checkpoint and replaying all and only records pertaining to that page. But when no pages are torn then you only need to recover from the last "normal" checkpoint.

I have heard reports (on this mailing list, I think) that Oracle does something like this, but I haven't tried to verify for myself whether that is in fact the case.
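The selective-replay step can be sketched roughly as follows. This is a toy model, not existing PostgreSQL infrastructure: `WalRecord`, its fields, and the "redo appends bytes" rule are assumptions chosen to keep the example short. The list `wal` stands for the WAL retained since the last super-checkpoint.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WalRecord:
    """Hypothetical WAL record touching exactly one page."""
    lsn: int
    page: int
    fpi: Optional[bytes] = None  # full-page image, if this record carries one
    delta: bytes = b""           # toy redo action: bytes appended to the page


def repair_torn_page(wal: List[WalRecord], page: int) -> bytes:
    """Rebuild one torn page from WAL retained since the last super-checkpoint."""
    # Hunt backwards for the most recent full-page image of this page.
    start = None
    for i in range(len(wal) - 1, -1, -1):
        if wal[i].page == page and wal[i].fpi is not None:
            start = i
            break
    if start is None:
        raise LookupError(f"no full-page image retained for page {page}")

    # Selective replay: apply all and only the later records for this page.
    image = wal[start].fpi
    for rec in wal[start + 1:]:
        if rec.page == page:
            image = image + rec.delta
    return image
```

Records for other pages are skipped entirely, which is what distinguishes this from ordinary recovery starting at the last normal checkpoint.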
Yet another idea we've tossed around is to make only vacuum records include FPWs, and have the more common heap insert/update/delete operations include enough information that they can still be applied correctly even if the page has been "torn" by the previous replay of such a record. This would involve modifying the recovery algorithm so that, until consistency is reached, we replay all records, regardless of LSN, which would cost some extra I/O, but maybe not too much to live with? It would also require that, for example, a heap-insert record mention the line pointer index used for the insertion; currently, we count on the previous state of the page to tell us that. For checkpoint cycles of reasonable length, the cost of storing the line pointer in every WAL record seems like it'll be less than the cost of needing to write an FPI for the page once per checkpoint cycle, but this isn't certain to be the case for all workloads.

OK, I'll stop babbling now...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
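As an illustration of the last idea in the message (heap-insert records that name their line pointer index so they can be replayed regardless of LSN), here is a small Python sketch. The `HeapInsert` and `Page` types and the `redo_insert` function are hypothetical and far simpler than real heap pages; the point is only that a record carrying its own slot number can be re-applied to a page in any partially written state and still converge to the same result.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HeapInsert:
    """Hypothetical heap-insert WAL record."""
    lsn: int
    slot: int          # line pointer index carried in the record (the proposed addition)
    tuple_data: bytes


@dataclass
class Page:
    """Toy heap page: an LSN plus an array of line pointer slots."""
    lsn: int = 0
    slots: List[Optional[bytes]] = field(default_factory=list)


def redo_insert(page: Page, rec: HeapInsert) -> None:
    """Apply the record unconditionally, ignoring the page LSN."""
    # Grow the line pointer array if the torn page lost later slots.
    while len(page.slots) <= rec.slot:
        page.slots.append(None)
    # Writing to the named slot is idempotent: replaying twice is harmless.
    page.slots[rec.slot] = rec.tuple_data
    page.lsn = max(page.lsn, rec.lsn)
```

A redo routine that instead picked "the next free slot" from the page's current contents would give different results on a torn page, which is why the record has to carry the index itself.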