Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CA+Tgmoavrc6iHRrxqyoe-YSq6OzmGswyvKOWxZpn=ULtSUPyyQ@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On Wed, Aug 7, 2019 at 6:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Yeah, that's also a problem with complicated WAL record types. Hopefully
> the complex cases are an exception, not the norm. A complex case is
> unlikely to fit any pre-defined set of fields anyway. (We could look at
> how e.g. protobuf works, if this is really a big problem. I'm not
> suggesting that we add a dependency just for this, but there might be
> some patterns or interfaces that we could mimic.)

I think what you're calling the complex cases are going to be pretty normal cases, not something exotic, but I do agree with you that making the infrastructure more generic is worth considering. One idea I had is to use the facilities from pqformat.h: have the generic code read whatever the common fields are, and then pass the StringInfo to the AM, which can do whatever it wants with the rest of the record. Probably these facilities would make it pretty easy to handle either a series of fixed-length fields or, alternatively, variable-length data. What do you think of that idea?

(That would not preclude doing compression on top, although I think that feeding everything through pglz or even lz4/snappy may eat more CPU cycles than we can really afford. The option is there, though.)

> If you remember, we did a big WAL format refactoring in 9.5, which moved
> some information from AM-specific structs to the common headers. Namely,
> the information on the relation blocks that the WAL record applies to.
> That was a very handy refactoring, and allowed tools like pg_waldump to
> print more detailed information about all WAL record types. For WAL
> records, moving the block information was natural, because there was
> special handling for full-page images anyway. However, I don't think we
> have enough experience with UNDO log yet, to know which fields would be
> best to include in the common undo header, and which to leave as
> AM-specific payload. I think we should keep the common header slim, and
> delegate to the AM routines.

Yeah, I remember. I'm not really sure I totally buy your argument that we don't know what besides XID should go into an undo record: tuples are a pretty important concept, and although there might be some exceptions here and there, I have a hard time imagining that undo is going to be primarily about anything other than identifying a tuple and recording something you did to it. On the other hand, you might want to identify several tuples, or identify a tuple with a TID that's not 6 bytes, so that's a good reason for allowing more flexibility.

Another point in favor of being more flexible is that it's not clear that there's any use case for third-party tools that work using undo. WAL drives replication and logical decoding, and could be used to drive incremental backup, but it's not really clear that similar applications exist for undo. If it's just private to the AM, the AM might as well be responsible for it. If that leads to code duplication, we can create a library of common routines and AMs can use them if they want.

> Hmm. If you're following an UNDO chain, from newest to oldest, I would
> assume that the newer record has enough information to decide whether
> you need to look at the previous record. If the previous record is no
> longer interesting, it might already be discarded away, after all.

I actually thought zedstore might need this pattern. If you store an XID with each undo pointer, as the current zheap code mostly does, then you have enough information to decide whether you care about the previous undo record before you fetch it. But if a tuple stores only an undo pointer, then once you determine that the undo isn't discarded, you have to fetch the record first and only then can you decide that you already had the right version in the first place.
Now, maybe that pattern doesn't repeat, because the undo records could be set up to contain both an XMIN and an XMAX, but not necessarily. I don't know exactly what you have in mind, but it doesn't seem totally crazy that an undo record might contain the XID that created that version but not the XID that created the prior version, and if so, you'll iterate backwards until you either hit the end of undo or go one undo record past the version you can see.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company