Re: Declarative partitioning - another take - Mailing list pgsql-hackers
From | Amit Langote |
---|---|
Subject | Re: Declarative partitioning - another take |
Date | |
Msg-id | a002d027-66cb-a221-4069-c1ebcc8dfd0e@lab.ntt.co.jp Whole thread Raw |
In response to | Re: Declarative partitioning - another take (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Declarative partitioning - another take
|
List | pgsql-hackers |
On 2016/08/18 5:23, Robert Haas wrote: > On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I am slightly tempted to eliminate the pg_partition catalog and associated >> syscache altogether and add a column to pg_class as Robert suggested. >> That way, all relid_is_partition() calls will be replaced by >> rel->rd_partbound != NULL check. But one potential problem with that >> approach is that now whenever a parent relation is opened, all the >> partition relations must be opened to get the partbound value (to form the >> PartitionDesc to be stored in parent relation's rd_partdesc). Whereas >> currently, we just look up the pg_partition catalog (or the associated >> cache) for every partition and that gets us the partbound. > > Well, you could just look up the pg_class row without opening the > relation, too. There is a system cache on pg_class.oid, after all. I Yes, I somehow didn't think of that. > think the issue is whether it's safe to read either one of those > things without a lock on the child relation. If altering the > partitioning information for a relation requires holding only > AccessExclusiveLock on that relation, and no lock on the parent, then > you really can't read the information for any child relation without > taking at least AccessShareLock. Otherwise, it might change under > you, and that would be bad. I'd imagine this won't be a problem because we take an AccessExclusiveLock on the parent when adding/removing a partition. > I'm inclined to think that changing the partitioning information for a > child is going to require AccessExclusiveLock on both the child and > the parent. That seems unfortunate from a concurrency point of view, > but we may be stuck with it: suppose you require only > ShareUpdateExclusiveLock on the parent. Well, then a concurrent read > transaction might see the partition boundaries change when it does a > relcache rebuild, which would cause it to suddenly start expecting the > data to be in a different plan in mid-transaction, perhaps even in > mid-scan. Maybe that's survivable with really careful coding, but it > seems like it's probably a bad thing. For example, it would mean that > the executor would be unable to rely on the partitioning information > in the relcache remaining stable underneath it. Moreover, the > relcache is always going to be scanned with the most recent possible > MVCC snapshot, but the transaction snapshot may be older, so such a > system creates all sorts of nasty possibilities for there to be skew > between the snapshot being used to via the data and the snapshot being > used to read the metadata that says where the data is. We do take a lock on the parent because we would be changing its partition descriptor (relcache). I changed MergeAttributes() such that an AccessExclusiveLock instead of ShareUpdateExclusiveLock is taken if the parent is a partitioned table. > This may need some more thought, but if we go with that approach of > requiring an AccessExclusiveLock on both parent and child, then it > seems to me that maybe we should consider the partitioning information > to be a property of the parent rather than the child. Just take all > the partitioning information for all children and put it in one big > node tree and store it in the pg_class or pg_partition_root entry for > the parent as one big ol' varlena. Now you can open the parent and > get all of the partitioning information for all of the children > without needing any lock on any child, and that's *really* good, > because it means that some day we might be able to do partition > elimination before locking any of the children! That would be > excellent. If we need an AccessExclusiveLock on parent to add/remove a partition (IOW, changing that child table's partitioning information), then do we need to lock the individual partitions when reading partition's information? I mean to ask why the simple syscache look-ups to get each partition's bound wouldn't do. Thanks, Amit
pgsql-hackers by date: