Re: Catalog/Metadata consistency during changeset extraction from wal - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Msg-id: CA+TgmobL-GY7pXou-9huHiyEU3so1YYXBocXMcUR1OeoY3rMhQ@mail.gmail.com
In response to: Re: Catalog/Metadata consistency during changeset extraction from wal (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: Catalog/Metadata consistency during changeset extraction from wal (2 replies)
List: pgsql-hackers
On Mon, Jun 25, 2012 at 9:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > The only theoretical way I see against that problem would be to postpone
>> > all relation unlinks until everything that could possibly read them has
>> > finished. Doesn't seem too alluring, although it would be needed if we
>> > ever move more things off SnapshotNow.
>> >
>> > Input/Ideas/Opinions?
>>
>> Yeah, this is slightly nasty. I'm not sure whether or not there's a
>> way to make it work.
> Postponing all non-rollback unlinks to the next "logical checkpoint" is the
> only thing I can think of...

There are a number of cool things we could do if we postponed unlinks. Like, why can't we allow concurrent read-only queries while a CLUSTER operation is in progress? Well, two reasons. The first is that we currently can't do ANY DDL with less than a full table lock because of SnapshotNow-related race conditions. The second is that people might still need to look at the old heap after the CLUSTER transaction commits. Some kind of delayed unlink facility, where we garbage-collect relation backing files when their refcount falls to zero, would solve the second problem - not that that's any help by itself without a solution to the first one, but hey.

>> I had another idea. Suppose decoding happens directly on the primary,
>> because I'm still hoping there's a way to swing that. Suppose further
>> that we handle DDL by insisting that (1) any backend which wants to
>> add columns or change the types of existing columns must first wait
>> for logical replication to catch up and (2) if a backend which has
>> added columns or changed the types of existing columns then writes to
>> the modified table, decoding of those writes will be postponed until
>> transaction commit. I think that's enough to guarantee that the
>> decoding process can just use the catalogs as they stand, with plain
>> old SnapshotNow.
> I don't think it's that easy. If you e.g. have multiple ALTERs in the same
> transaction interspersed with inserted rows they will all have different
> TupleDescs.

If new columns were added, then tuples created with those older tuple descriptors can still be interpreted with the latest tuple descriptor. Columns that are dropped or retyped are a little trickier, but honestly... how much do we care about those cases? How practical is it to suppose we're going to be able to handle them sanely anyway?

Suppose that the user defines a type which works just like int4 except that the output function writes out each number in pig latin (and the input function parses pig latin). The user defines the two types as binary coercible to each other and then does ALTER TABLE on a large table with an int4 column, transforming it into an int4piglatin column. Due to Noah Misch's fine work, we will conclude that no table rewrite is needed. But if logical replication is in use, then in theory we should scan the whole table and generate an LCR for each row saying "the row with primary key X was updated, and column Y, which used to contain 42, now contains ourty-two-fay". Otherwise, if we're doing heterogeneous replication into a system that just stores that column as text, it'll end up with the wrong contents. On the other hand, if we're trying to ship data to another PostgreSQL instance where the column hasn't yet been updated, then all of those LCRs are just going to error out when we try to apply them.

A more realistic scenario where you have the same problem is with something like ALTER TABLE .. ADD COLUMN .. DEFAULT.
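For concreteness, the two operations above look roughly like this at the SQL level (the int4piglatin type is assumed to already exist with suitable I/O functions, and the table, column, and sequence names are made up for illustration - this is only a sketch, not anything in core):

    -- Binary-coercible type change: with the casts declared WITHOUT FUNCTION,
    -- no table rewrite happens and nothing is written to WAL for the existing
    -- rows, even though the column's textual output changes completely.
    CREATE CAST (int4 AS int4piglatin) WITHOUT FUNCTION AS IMPLICIT;
    CREATE CAST (int4piglatin AS int4) WITHOUT FUNCTION AS IMPLICIT;
    ALTER TABLE big_table ALTER COLUMN y TYPE int4piglatin;

    -- Single-step ADD COLUMN with a default: the table is rewritten and every
    -- existing row gets a value, which can differ from row to row when the
    -- default is volatile (nextval), unlike a constant or now().
    ALTER TABLE big_table ADD COLUMN added_at timestamptz DEFAULT now();
    ALTER TABLE big_table ADD COLUMN new_id bigint
        DEFAULT nextval('new_synthetic_pkey_seq');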
If you add a column with a default in a single step (as opposed to first adding the column and then setting its default), we rewrite the table and set every row to the default value. Should that generate LCRs showing every row being updated to add that new value, or should we generate no LCRs and assume that the DBA will independently do the same operation on the remote side? Either answer could be correct, depending on how the LCRs are being used.

If you're just rewriting with a constant default, then perhaps the sensible thing is to generate no LCRs, since it will be more efficient to mimic the operation on the remote side than to replay the changes row by row. But what if the default isn't a constant, like maybe it's nextval('new_synthetic_pkey_seq') or even something like now()? In those cases, it seems quite likely that if you don't generate LCRs, manual user intervention will be required to get things back on track. On the other hand, if you do generate LCRs, the remote side will become horribly bloated on replay, unless the LCRs also instruct the far side that they should be applied via a full-table rewrite.

Can we just agree to punt all this complexity for version 1 (and maybe versions 2, 3, and 4)? I'm not sure what Slony does in situations like this, but I bet that for a lot of replication systems the answer is "do a full resync". In other words, we either forbid the operation outright when the table is enabled for logical replication, or else we emit an LCR that says, in effect, "transaction 12345 monkeyed with the table, please resync".

It strikes me that it's really the job of some higher-level control logic to decide what the "correct" behavior is in these cases; the decoding process doesn't really have enough information about what the user is trying to do to make a sensible decision anyway. It would be nice to be able to support some simple cases like "adding a column that has no default" or "dropping a column" without punting, but going much further than that seems like it will require embedding policy decisions that should really be happening at a higher level.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company