Re: Changeset Extraction v7.1 - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: Changeset Extraction v7.1 |
| Date | |
| Msg-id | CA+TgmobObSwQS0qwvLr-JQN8QobDWov4b2H7rWk43k1vRMpg2g@mail.gmail.com |
| In response to | Re: Changeset Extraction v7.1 (Andres Freund <andres@2ndquadrant.com>) |
| Responses | Re: Changeset Extraction v7.1 |
| List | pgsql-hackers |
On Fri, Jan 24, 2014 at 12:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> But this code is riddled with places where you track a catalog xmin
>> and a data xmin separately. The only point of doing it that way is to
>> support a division that hasn't been made yet.
>
> If you think it will make stuff more manageable I can try separating all
> lines dealing with catalog_xmin into another patch as long as data_xmin
> doesn't have to be renamed.
>
> That said, I don't really think it's a big problem that the division
> hasn't been made, essentially the meaning is different, even if we don't
> take advantage of it yet. data_xmin is there for streaming replication
> where we need to prevent all removal, catalog_xmin is there for
> changeset extraction.

I spent some more time studying the 0001 and 0002 patches this afternoon, with a side dish of 0004. I'm leaning toward thinking we should go ahead and make that division.

I'm also wondering about whether we've got the right naming here. AFAICT, it's not the case that we're going to use the "catalog xmin" for catalogs and the "data xmin" for non-catalogs. Rather, the "catalog xmin" is going to always be included in globalxmin calculations, so IOW it should just be called "xmin". The "data xmin" is going to be included only for non-catalog tables. I guess "data" is a reasonable antonym for catalog, but I'm slightly tempted to propose RecentGlobalNonCatalogXmin and similar. Maybe that's too ugly to live, but I can see someone failing to guess what the exact distinction is between "xmin" and "data xmin", and I bet they'd be a lot less likely to misguess if we wrote "non catalog".

It's interesting (though not immediately relevant) to speculate about how we might extend this to fine-grained xmin tracking more generally. The design sketch that comes to mind (and I think parts of this have been proposed before) is to have a means by which backends can promise not to lock any more tables except under a new snapshot. At the read committed isolation level, or in any single-statement transaction, backends can so promise whenever (a) all tables mentioned in the query have been locked and (b) all functions to be invoked during the query via the fmgr interface promise (e.g. via function labeling) that they won't directly or indirectly do such a thing. If they break their promise, we detect it and ereport(ERROR). Backends that have made such a guarantee can be ignored for global-xmin calculations that don't involve the tables they have locked.

One idea is to keep a hash table keyed by <dboid, reloid> with some limited number of entries in shared memory; it caches the table-specific xmin, a usage counter, and a flag indicating whether the cached xmin might be stale. In order to promise not to lock any new tables, backends must make or update entries for all the tables they already have locked in this hash table; if there aren't enough entries, they're not allowed to promise. Thus, backends wishing to prune can use the cached xmin value if it's present (optionally updating it if it's stale) and the minimum of the xmins of the backends that haven't made a promise if it isn't. This is a bit hairy though; access to the shared hash table had better be *really* fast, and we'd better not need to recompute the cached value too often.
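To make that a little more concrete, here's roughly the kind of shared-memory entry I'm picturing. This is purely a sketch: none of these type or field names exist anywhere, and it leans on the usual backend typedefs (Oid, TransactionId):

/* Hypothetical shared-memory cache entry for per-table xmin tracking. */
typedef struct RelXminCacheKey
{
	Oid			dboid;			/* database containing the table */
	Oid			reloid;			/* the table itself */
} RelXminCacheKey;

typedef struct RelXminCacheEntry
{
	RelXminCacheKey key;		/* hash key: <dboid, reloid> */
	TransactionId cached_xmin;	/* oldest xmin that matters for this table */
	uint32		usage_count;	/* for evicting rarely-used entries */
	bool		stale;			/* must be recomputed before it can be trusted */
} RelXminCacheEntry;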
Anyway, whether we end up pursuing something like that or not, I think I'm persuaded that this particular optimization won't really be a problem for hypothetical future work in this area; and also that it's a good idea to do this much now specifically for logical decoding.

>> I have zero confidence that it's OK to treat fsync() as an operation
>> that won't fail. Linux documents EIO as a plausible error return, for
>> example. (And really, how could it not?)
>
> But quite fundamentally having the most basic persistency building
> block fail is something you can't really handle safely. Note that
> issue_xlog_fsync() has always (and I wager, will always) treated that as
> a PANIC.
>
> I don't recall many complaints about that for WAL. All of the ones I
> found in a quick search were like "oh, the fs invalidated my fd because
> of corruption". And few.

Well, you have a point. And certainly this version looks much better than the previous version in terms of the likelihood of PANIC occurring in practice. But I wonder if we couldn't cut it down even further without too much effort. Suppose we define a slot to exist if, and only if, the state file exists. A directory without a state file is subject to arbitrary removal. Then we can proceed as follows (there's a rough code sketch of this a bit further down):

- mkdir() the directory.
- open() state.tmp
- write() state.tmp
- close() state.tmp
- fsync() parent directory, directory and state.tmp
- rename() state.tmp to state
- fsync() state

If any step except the last one fails, no problem. The next startup can nuke the leftovers; any future attempt to create a slot with the same name can ignore an EEXIST failure from mkdir() and proceed just as above. Only a failure of the very last fsync is a PANIC. In some ways I think this'd be simpler than what you've got now, because we'd eliminate the dance with renaming the directory as well as the state file; only the state file matters.

To drop a slot, just unlink the state file and fsync() the directory. If the unlink fails, it's just an error. If the fsync() fails, it's a PANIC. Once the state file is gone, removing everything else is only an ERROR, and you don't even need to fsync() it again.

To update a slot, open, write, close, and fsync state.tmp, then rename it to state and fsync() again. None of these steps need PANIC; hold off on updating the values in memory until they're all done. If any step fails, the attempt to update the slot fails, but either memory and disk are still consistent, or the disk has an xmin newer than memory, but still legal. On restart, when restoring slots, fsync() each state file, dying horribly if we can't, and remove any directories that don't contain one.

As compared with what you have here, this eliminates the risk of PANIC entirely for slot updates, which is good because those will be quite frequent. For creating or dropping a slot, it doesn't quite eliminate the risk entirely, but only one fsync() call per create or drop is at risk. We still risk startup-time failures, but that's unavoidable anyway if the data we need can't be read; the chances of blowing up a running system are very low.
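Spelled out with bare POSIX calls, the create path might look about like this. Just a sketch, not tested; the real thing would obviously go through the backend's usual file and error-reporting wrappers, and the function names and return conventions here are invented:

/*
 * Sketch of slot creation: the slot exists iff "state" exists, so only
 * the final fsync() has to be a PANIC.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* fsync a directory so entries created or renamed in it are durable */
static int
fsync_dir(const char *path)
{
	int		fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fsync(fd) < 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Returns 0 on success, -1 for ordinary ERROR cases, -2 for the one PANIC. */
static int
create_slot_on_disk(const char *parentdir, const char *slotdir,
					const char *contents, size_t len)
{
	char	tmppath[1024];
	char	path[1024];
	int		fd;

	snprintf(tmppath, sizeof(tmppath), "%s/state.tmp", slotdir);
	snprintf(path, sizeof(path), "%s/state", slotdir);

	/*
	 * Everything up to the final fsync() can fail with a plain ERROR: the
	 * slot doesn't exist until "state" exists, so startup (or a retry,
	 * which may see EEXIST here) just nukes the leftovers.
	 */
	if (mkdir(slotdir, S_IRWXU) < 0 && errno != EEXIST)
		return -1;

	fd = open(tmppath, O_CREAT | O_TRUNC | O_WRONLY, S_IRUSR | S_IWUSR);
	if (fd < 0)
		return -1;
	if (write(fd, contents, len) != (ssize_t) len ||
		fsync(fd) < 0 ||		/* fsync state.tmp before renaming it */
		close(fd) < 0)
		return -1;

	/* make the mkdir() and the new file durable before exposing "state" */
	if (fsync_dir(parentdir) < 0 || fsync_dir(slotdir) < 0)
		return -1;

	if (rename(tmppath, path) < 0)
		return -1;

	/*
	 * Only now does the slot exist.  If we can't make that durable, we no
	 * longer know what state we're in, so this is the one PANIC case.
	 */
	fd = open(path, O_WRONLY);
	if (fd < 0 || fsync(fd) < 0)
		return -2;
	close(fd);

	return 0;
}

The update path is essentially the tail of that function without the mkdir(), and since the in-memory values aren't bumped until every step has succeeded, even its final fsync() only needs to be an ERROR.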
Looking over patch 0002, I see that there's code to allow a walsender to create or drop a physical replication slot. Also, if we've acquired a replication slot, there's code to update it, and code to make sure we disconnect from it, but there's no code to acquire it. I think maybe the hunk in StartReplication() is supposed to be calling ReplicationSlotAcquire() instead of ReplicationSlotRelease(), which (ahem) makes one wonder how thoroughly this code has been tested.

There's also no provision for walsender (or pg_receivexlog?) to send the new SLOT option to walreceiver, which seems somewhat necessary. I'm tempted to suggest also adding something to src/bin/scripts to create and drop slots, though I suppose we could just recommend psql -c 'CREATE_REPLICATION_SLOT SLOT zippy PHYSICAL' 'replication=true'. (BTW, isn't that kind of a strange syntax, with the word SLOT appearing twice? I think we could drop the second one.)

It also occurred to me that we need documentation for all of this; I see that's in patch 0010, but I haven't reviewed it in detail yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company