Re: First draft of snapshot snapshot building design document - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: First draft of snapshot snapshot building design document
Msg-id: CAEYLb_Xj=t-4CW6gLV5jUvdPZSsYwSTbZtUethsW2oMpd58jzA@mail.gmail.com
In response to: First draft of snapshot snapshot building design document (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
On 16 October 2012 12:30, Andres Freund <andres@2ndquadrant.com> wrote:
> Here's the first version of the promised document. I hope it answers most of
> the questions.

This makes for interesting reading. So, I've taken a closer look at the snapshot building code in light of this information. What follows are my thoughts on that aspect of the patch (in particular, the merits of snapshot time-travel as a method of solving the general problem of making sense of WAL during what I'll loosely call "logical recovery", performed by what one design document [1] calls an "output plugin"), which I previously skirted over somewhat.

You can think of this review as a critical exposition of what happens during certain steps of my high-level schematic of "how this all fits together", which covers this entire patchset (this comes under my remit as the reviewer of one of the most important and complex patches in that patchset, "0008-Introduce-wal-decoding-via-catalog-timetravel.patch"), as described in my original, high-level review [2]. To recap, those steps are:

***SNIP***

(from within plugin, currently about to decode a record for the first time)
 |
\ /
9. During the first call (within the first record within a call to decode_xlog()), we allocate a snapshot reader.
 |
\ /
10. Builds snapshot callback. This scribbles on our snapshot state, which essentially encapsulates a snapshot. The state (and snapshot) changes continually, once per call.
 |
\ /
11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at an XLogRecord. Decodes based on record type. Let's assume it's an XLOG_HEAP_INSERT.
 |
\ /
12. DecodeInsert() called. This in turn calls DecodeXLogTuple(). We store the tuple metadata in our ApplyCache (some ilists, somewhere, each corresponding to an XID). We don't store the relation oid, because we don't know it yet (only the relfilenode is known from WAL).
 |
\ /
13. We're back in XLogReader(). It calls the only callback of interest to us covered in step 3 (and not of interest to XLogReader()/Heikki) - decode_change(). It does this through the apply_cache.apply_change hook. This happens because we encounter another record, this time a commit record (in the same codepath as discussed in step 12).
 |
\ /
14. In decode_change(), the actual function that raises the interesting WARNINGs within Andres' earlier example [3], showing actual integer/varchar Datum values for tuples previously inserted. Resolve table oid based on relfilenode (albeit unsatisfactorily). Using a StringInfo, tupledescs, syscache and typcache, build the WARNING string.

(No further steps. Aside: If I've done a good job of writing my initial review [2], I should be able to continually refer back to this as I drill down on other steps in later reviews.)

It's fairly obvious why steps 9 and 10 are interesting to us here. Step 11 (the first call of SnapBuildCallback() - this is a bit of a misnomer, since the function isn't ever called through a pointer) is where the visibility-wise decision to decode a particular XLogRecord/XLogRecordBuffer occurs. Here is how the patch describes our reasons for needing this call (curiously, this comment appears not above SnapBuildCallback() itself, but above the decode.c call of said function, which may hint at a lapse in separation of concerns):

+	/*---------
+	 * Call the snapshot builder. It needs to be called before we analyze
+	 * tuples for two reasons:
+	 *
+	 * * Only in the snapshot building logic we know whether we have enough
+	 *   information to decode a particular tuple
+	 *
+	 * * The Snapshot/CommandIds computed by the SnapshotBuilder need to be
+	 *   added to the ApplyCache before we add tuples using them
+	 *---------
+	 */
Step 12 is a step that I'm not going into for this review. That's all decoding related.

Step 13 is not really worthy of separate consideration here, because it just covers what happens when we call DecodeCommit() within decode.c, where text representations of tuples are ultimately printed in simple elog WARNINGs, as in Andres' original example [3], due to the apply_cache.apply_change hook being called.

Step 14 *is* kind-of relevant, because this is one place where we resolve the relation OID based on the relfilenode, via a call to LookupTableByRelFileNode(), which is part of snapbuild.c (the only other such call is within snapbuild.c itself, as part of checking whether a particular relation + xid have had catalogue changes, which can be a concern due to DDL - the basic problem that snapbuild.c seeks to solve). Assuming that it truly is necessary to have a LookupTableByRelFileNode() function, I don't think your plugin (which will soon actually be a contrib module, right?) has any business calling it. Rather, this should all be part of some higher-level abstraction, probably within applycache, that spoonfeeds your example contrib module changesets without it having to care about system cache invalidation and what-not.

As I've already noted, snapbuild.c (plus one or two other isolated places) has rather heavy-handed and out-of-place InvalidateSystemCaches() calls like these:

+ HeapTuple
+ LookupTableByRelFileNode(RelFileNode *relfilenode)
+ {
+	Oid spc;
+
+	InvalidateSystemCaches();

However, since you've privately told me that your next revision will address the need to execute sinval messages granularly, rather than using this heavy-handed kludge, I won't once again stress the need to do better. If I've understood correctly, you've indicated that this can be done by processing the shared invalidation messages in each xl_xact_commit (i.e. each XLOG_XACT_COMMIT entry) during decoding (which I guess means that decoding's responsibilities have been widened a bit, but that's still the place to do it - within the decoding "dispatcher"/glue code). Apparently we can expect this revision in the next couple of days. Thankfully, I *think* that this implies that plugins don't need to directly concern themselves with cache invalidation.

By the way, shouldn't this use InvalidOid?:

+	/*
+	 * relations in the default tablespace are stored with a reltablespace = 0
+	 * for some reason.
+	 */
+	spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ?
+		0 : relfilenode->spcNode;

The objectionable thing about having your "plugin" handle cache invalidation, apart from the fact that, as we all agree, the way you're doing it is just not acceptable, is that your plugin is doing it directly *at all*. You need to analyse the requirements of the universe of possible logical changeset subscriber plugins, and whittle them down to a lowest common denominator interface that likely generalises cache invalidation, and ideally doesn't require plugin authors to even know what a relfilenode or syscache is - some might say this shouldn't be a priority, but I take the view that it should.
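To make that "higher-level abstraction" suggestion a little more concrete, here is the rough shape of what I'd like applycache (or a thin module on top of it) to hand the plugin. Every name below is invented purely for illustration; nothing like this exists in the patch:

/*
 * Purely illustrative - invented names, not the patch's API.  The point is
 * that by the time the plugin sees a change, core code has already resolved
 * the relfilenode to a Relation and processed any sinval messages, so the
 * plugin never calls LookupTableByRelFileNode() or touches the syscache
 * machinery directly.
 */
typedef struct LogicalChange
{
	Relation	rel;			/* already resolved from the relfilenode */
	HeapTuple	oldtuple;		/* NULL except for UPDATE/DELETE */
	HeapTuple	newtuple;		/* NULL except for INSERT/UPDATE */
} LogicalChange;

/* the one callback a minimal output plugin would have to provide */
typedef void (*apply_change_cb) (LogicalChange *change, void *plugin_private);

Your example plugin could then build its WARNING strings from change->rel's tuple descriptor and the typcache alone, without ever learning what a relfilenode is.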
Exposing innards like relfilenodes and the syscache to these plugins is wrongheaded, and to do so will likely paint us into a corner. Now, I grant that catcache + relcache can only be considered private innards in a relative sense with Postgres, and indeed a few contrib modules do use the syscache directly for simple stuff, like hstore, which needs syscache + typcache for generating text representations in a way that is not completely unlike what you have here. Still, my feeling is that all the heavy lifting should be encapsulated elsewhere, in core. I think that you could easily justify adding another module/translation unit that is tasked with massaging this data into a form amenable to serving the needs of client code/plugins. If you don't get things quite right, plugin authors can still do it all for themselves just as well.

I previously complained about having to take a leap of faith in respect of xmin horizon handling [2]. This new document also tries to allay some of those concerns. It says:

> == xmin Horizon Handling ==
>
> Reusing MVCC for timetravel access has one obvious major problem:
> VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
> obviously, we want autovacuum/manual vacuum to work as before.
>
> The idea here is to reuse the infrastrcuture built for hot_standby_feedback
> which allows us to keep the xmin horizon of a walsender backend artificially
> low. We keep it low enough so we can restart decoding from the last location
> the client has confirmed to be safely received. The means that we keep it low
> enough to contain the last checkpoints oldestXid value.

These ideas are still underdeveloped. For one thing, it seems kind of weak to me that we're obliged to have vacuum grind to a halt across the cluster because some speculative plugin has its proc's xmin pegged to some value due to a logical standby still needing it for reading catalogues only. Think of the failure modes - what happens when the standby participating in a plugin-based logical replication system croaks for indeterminate reasons? Doing this with the WAL sender as part of hot_standby_feedback is clearly much less hazardous, because there *isn't* a WAL sender when streaming replication isn't active in respect of some corresponding standby, and hot_standby_feedback need only support deferring vacuum for the streaming replication standby whose active snapshot's xmin is furthest in the past. If a standby is taken out of commission for an hour, it can probably catch up without difficulty (difficulty like needing a base backup), and it has no repercussions for the master or anyone else. However, I believe that logical change records would be rendered meaningless in the same scenario, because VACUUM, having not seen a "pegged" xmin horizon due to the standby's unavailability, goes ahead and vacuums past a point still needed to keep the standby in a consistent state.

Maybe you can invent a new type of xmin horizon that applies only to catalogues so this isn't so bad, and I see you've suggested as much in follow-up mail to Robert, but it might be unsatisfactory to have that impact the PGXACT performance optimisation introduced in commit ed0b409d, or obfuscate that code. You could have the xmin be a special xmin, say through PGXACT.vacuumFlags indicating this, but wouldn't that necessarily preclude the backend from having a non-special notion of its xmin? How does that bode for this actually becoming maximally generic infrastructure?
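To make the "special xmin" idea concrete (and to show why I'm asking that question), here is the sort of thing I imagine - purely hypothetical, since neither the flag nor the second horizon exists in the patch or in core today - inside a GetOldestXmin()-style scan of the proc array:

#define PROC_CATALOG_XMIN_ONLY	0x10	/* invented vacuumFlags bit */

static void
accumulate_horizons(PGXACT *pgxact,
					TransactionId *oldestXmin,
					TransactionId *oldestCatalogXmin)
{
	/*
	 * Note that pgxact->xmin is a single field: once it is flagged as
	 * catalog-only, the same backend cannot simultaneously advertise an
	 * ordinary xmin for its own regular snapshots, which is the question
	 * raised above.
	 */
	TransactionId xmin = pgxact->xmin;

	if (!TransactionIdIsNormal(xmin))
		return;

	if (pgxact->vacuumFlags & PROC_CATALOG_XMIN_ONLY)
	{
		/* holds back vacuuming of catalog tables only */
		if (TransactionIdPrecedes(xmin, *oldestCatalogXmin))
			*oldestCatalogXmin = xmin;
	}
	else
	{
		/* ordinary backends hold back vacuum everywhere, as today */
		if (TransactionIdPrecedes(xmin, *oldestXmin))
			*oldestXmin = xmin;
	}
}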
You do have some ideas about how to re-sync a speculative in-core logical replication system standby, and indeed you talk about this in the design document posted a few weeks back [1] (4.8. Setup of replication nodes). This process is indeed similar to a base backup, and we'd hope to avoid having to do it for similar reasons - it would be undesirable to have to do it much more often in practice due to these types of concerns.

You go on to say:

> That also means we need to make that value persist across restarts/crashes in a
> very similar manner to twophase.c's. That infrastructure actually also useful
> to make hot_standby_feedback work properly across primary restarts.

So here you are anticipating my criticism above about needing to peg the xmin horizon. That's fine, but what I still don't know is how you propose to make this work in a reasonable way, free from all the usual caveats about leaving prepared transactions open for an indefinitely long time (what happens when we need to do XID wraparound? how does the need to hold a transaction open interfere with a given plugin backend's ability to execute regular queries? etc.). Furthermore, I don't know how it's going to be independently useful to make hot_standby_feedback work across primary restarts, because the primary cannot then generate VACUUM cleanup WAL records that the standby might replay, resulting in a hard conflict. Maybe there's something incredibly obvious that I'm missing, but doesn't that problem almost take care of itself? Granted, those cleanup records aren't the only reason for a hard conflict, but they've got to be by far the most important, and are documented as such. Either the cleanup record already exists and you're usually already out of luck anyway, or it doesn't and you're not. Are you thinking about race conditions?

You talk about the relfilenode/oid mapping problem some:

> == Catalog/User Table Detection ==
>
> To detect whether a record/transaction does catalog modifications - which we
> need to do for memory/performance reasons - we need to resolve the
> RelFileNode's in xlog records back to the original tables. Unfortunately
> RelFileNode's only contain the tables relfilenode, not their table oid. We only
> can do catalog access once we reached FULL_SNAPSHOT, before that we can use
> some heuristics but otherwise we have to assume that every record changes the
> catalog.

What exactly are the implications of having to assume catalogue changes? I see that right now, you're just skipping actual decoding once you've taken care of snapshot building chores within SnapBuildCallback():

+	if (snapstate->state == SNAPBUILD_START)
+		return SNAPBUILD_SKIP;

> The heuristics we can use are:
> * relfilenode->spcNode == GLOBALTABLESPACE_OID
> * relfilenode->relNode <= FirstNormalObjectId
> * RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid
>
> Those detect some catalog tables but not all (think VACUUM FULL), but if they
> detect one they are correct.

If the heuristics aren't going to be completely reliable, why is that acceptable? You've said it "seems to work fine", but I don't quite follow. I see this within SnapBuildCallback() (the function whose name I said was a misnomer).

> After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us
> a table might not be a catalog table. For that we use the new RELFILENODE
> syscache with (spcNode, relNode).

I share some of Robert's concerns here. The fact that relfilenode mappings have to be time-relativised may tip the scales against this approach.
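For reference, the quoted heuristics amount to a predicate along these lines (a sketch only, reusing the calls exactly as named in the quoted document; the function name is mine):

/*
 * Sketch of the quoted heuristics: a true return means the relfilenode
 * certainly belongs to a catalog table; a false return only means that it
 * might not (think VACUUM FULL rewriting a catalog's relfilenode), so
 * before FULL_SNAPSHOT we must still assume the worst, and afterwards we
 * confirm via catalog access.
 */
static bool
HeuristicCatalogRelfilenode(RelFileNode *relfilenode)
{
	/* shared catalogs live in the global tablespace */
	if (relfilenode->spcNode == GLOBALTABLESPACE_OID)
		return true;

	/* relfilenodes assigned during initdb belong to system objects */
	if (relfilenode->relNode <= FirstNormalObjectId)
		return true;

	/* mapped relations (pg_class and friends) are catalogs too */
	if (RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid)
		return true;

	return false;
}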
As Robert has said, it may be that simply adding the table OID to the WAL header is the way forward. It's not as if we can't optimise that later. One compromise might be to only do that when we haven't yet reached FULL_SNAPSHOT.

On the subject of making decoding restartable, you say:

> == Restartable Decoding ==
>
> As we want to generate a consistent stream of changes we need to have the
> ability to start from a previously decoded location without going to the whole
> multi-phase setup because that would make it very hard to calculate up to where
> we need to keep information.

Indeed, it would.

> To make that easier everytime a decoding process finds an online checkpoint
> record it exlusively takes a global lwlock and checks whether visibility
> information has been already been written out for that checkpoint and does so
> if not. We only need to do that once as visibility information is the same
> between all decoding backends.

Where and how would that visibility information be represented? So, typically you'd expect it to say "no catalogue changes for entire checkpoint" much of the time?

The patch we've seen (0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't address the question of transactions that contain DDL:

+	/* FIXME: For now skip transactions with catalog changes entirely */
+	if (ent && ent->does_timetravel)
+		does_timetravel = true;
+	else
+		does_timetravel = SnapBuildHasCatalogChanges(snapstate, xid, relfilenode);

but you do address this question (or a closely related question which, I gather, is the crux of the issue: how to mix DDL and DML in transactions?) in the new doc (README.SNAPBUILD.txt [6]):

> == mixed DDL/DML transaction handling ==
>
> When a transactions uses DDL and DML in the same transaction things get a bit
> more complicated because we need to handle CommandIds and ComboCids as we need
> to use the correct version of the catalog when decoding the individual tuples.

Right, so it becomes necessary to think about time-travelling not just to a particular transaction, but to a particular point in a particular transaction - the exact point at which the catalogue showed a structure consistent with sanely interpreting logical WAL records created during that window after the last DDL command (if any), but before the next (if any). This intelligence is only actually needed when decoding tuples created in that actual transaction, because only those tuples have their format change throughout a single transaction.

> CommandId handling itself is relatively simple, we can figure out the current
> CommandId relatively easily by looking at the currently used one in
> changes. The problematic part is that those CommandId frequently will not be
> actual cmin or cmax values but ComboCids. Those are used to minimize space in
> the heap. During normal operation cmin/cmax values are only used within the
> backend emitting those rows and only during one toplevel transaction, so
> instead of storing cmin/cmax only a reference to an in-memory value is stored
> that contains both. Whenever we see a new CommandId we call
> ApplyCacheAddNewCommandId.

Right. So in general, transaction A doesn't have to concern itself with the order in which other transactions had tuples become visible or invisible (cmin and cmax); transaction A need only concern itself with whether they're visible or invisible, based on whether the relevant transactions (xids) committed, on its own xid, and on each tuple's xmin and xmax. It is this property of cmin/cmax that enabled the combocid optimisation in 8.3, which introduced an array in *backend local* memory to map a single HeapTupleHeader field (where previously there were two - cmin and cmax) to an entry in that array, under the theory that it's unusual for a tuple to be created and then deleted in the same transaction. Most of the time, that one HeapTupleHeader field won't have a mapping to the local array - rather, it will simply hold a cmin or a cmax. That's how we save heap space.
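From memory, the backend-local state behind that optimisation (in src/backend/utils/time/combocid.c) looks about like this - paraphrased, so details may be slightly off:

typedef struct ComboCidKeyData
{
	CommandId	cmin;
	CommandId	cmax;
} ComboCidKeyData;

/*
 * When HEAP_COMBOCID is set, the tuple header's single command id field
 * holds an index into this array; the hash table just lets us reuse
 * entries.  Both are backend-local and are thrown away at transaction end.
 */
static ComboCidKeyData *comboCids;
static HTAB *comboHash;

Nothing here is visible to any other backend, which is why a decoding backend can't do anything useful with a bare combocid (an array index), even one that has been logged.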
> To resolve this problem during heap_* whenever we generate a new combocid
> (detected via an new parameter to HeapTupleHeaderAdjustCmax) in a catalog table
> we log the new XLOG_HEAP2_NEW_COMBOCID record containing the mapping. During
> decoding this ComboCid is added to the applycache
> (ApplyCacheAddNewComboCid). They are only guaranteed to be visible within a
> single transaction, so we cannot simply setup all of them globally.

This seems more or less reasonable. The fact that the combocid optimisation uses a local memory array isn't actually an important property of combocids as a performance optimisation - it's just that there was no need for the actual cmin and cmax values to be visible to other backends, until now. Comments in combocid.c go on at length about how infrequently actual combocids are used in practice. For ease of implementation, but also because real combocids are expected to be needed infrequently, I suggest that rather than logging the mapping, you log the values directly in the record (i.e. the full cmin and cmax, mapped to the catalogue + the catalogue tuple's ctid). You could easily exhaust the combocid space otherwise, and besides, you cannot do anything with the mapping from outside of the backend that originated the combocid for that transaction (you don't have the local array, or the local hashtable used for combocids).

> Before calling the output plugin ComboCids are temporarily setup and torn down
> afterwards.

How? You're using a combocid-like array + hashtable local to the plugin backend? Anyway, for now, this is unimplemented, which is perhaps the biggest concern about it:

+	/* check if its one of our txids, toplevel is also in there */
+	else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+	{
+		CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+		/* no support for that yet */
+		if (tuple->t_infomask & HEAP_COMBOCID){
+			elog(WARNING, "combocids not yet supported");
+			return false;
+		}
+		if (cmin >= snapshot->curcid)
+			return false;	/* inserted after scan started */
+	}

Above, you aren't taking this into account (code from HeapTupleHeaderGetCmax()):

	/* We do not store cmax when locking a tuple */
	Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));

Sure, you're only interested in cmin, so this doesn't look like it applies (isn't this just a sanity check?), but actually, based on this it seems to me that the current sane representation of cmin needs to be obtained in a more concurrency-aware fashion - having the backend-local data structures that originated the tuple isn't even good enough. The paucity of other callers to HeapTupleHeaderGetRawCommandId() seems to support this. Apart from contrib/pageinspect, there is only this one other direct caller, within heap_getsysattr():

	/*
	 * cmin and cmax are now both aliases for the same field, which
	 * can in fact also be a combo command id.  XXX perhaps we should
	 * return the "real" cmin or cmax if possible, that is if we are
	 * inside the originating transaction?
	 */
	result = CommandIdGetDatum(HeapTupleHeaderGetRawCommandId(tup->t_data));
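Contrast that raw accessor with the checked one in combocid.c which, quoting from memory (so details may be slightly off), spells out the contract I'm about to lean on:

CommandId
HeapTupleHeaderGetCmin(HeapTupleHeader tup)
{
	CommandId	cid = HeapTupleHeaderGetRawCommandId(tup);

	Assert(!(tup->t_infomask & HEAP_MOVED));
	Assert(TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)));

	if (tup->t_infomask & HEAP_COMBOCID)
		return GetRealCmin(cid);
	else
		return cid;
}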
So these few direct call-sites that inspect CommandId outside of their originating backend are all non-critical observers of the CommandId that are, roughly speaking, allowed to be wrong. Callers to the higher-level abstractions (HeapTupleHeaderGetCmin()/HeapTupleHeaderGetCmax()) know that the CommandId must be a cmin or a cmax respectively, because they have as their contract that TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)) and TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tup)) hold, respectively. I guess this can all happen when you write that XLOG_HEAP2_NEW_COMBOCID record within the originating transaction (that is, when the xmin is that of the tuple in the originating transaction, you are indeed dealing with a cmin), but these are hazards that you need to consider in addition to the more obvious ComboCid hazard. I have an idea that the HeapTupleHeaderGetRawCommandId(tuple) call in your code could well be bogus even when (t_infomask & HEAP_COMBOCID) == 0.

I look forward to seeing your revision that addresses the issue with sinval messages.

[1] http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com
[2] http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com
[3] http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com
[4] http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com
[5] http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com
[6] http://archives.postgresql.org/message-id/201210161330.37967.andres@2ndquadrant.com

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services