Re: ATTACH/DETACH PARTITION CONCURRENTLY - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Date:
Msg-id: CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w@mail.gmail.com
In response to: Re: ATTACH/DETACH PARTITION CONCURRENTLY (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: ATTACH/DETACH PARTITION CONCURRENTLY
List: pgsql-hackers
On Wed, Nov 7, 2018 at 1:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Maybe you could give my patch a look.
>
> I have, a bit.

While thinking about this problem a bit more, I realized that what is called RelationBuildPartitionDesc in master and BuildPartitionDesc in Alvaro's patch has a synchronization problem as soon as we start to reduce lock levels. At some point, find_inheritance_children() gets called to get a list of the OIDs of the partitions. Then, later, SysCacheGetAttr(RELOID, ...) gets called for each one to get its relpartbound value. But since catalog lookups use the most current snapshot, they might not see a compatible view of the catalogs. That could manifest in a few different ways:

- We might see a newer version of relpartbound, where it's now null because the partition has been detached.

- We might see a newer version of relpartbound where it now has an unrelated value because the partition has been detached and then reattached to some other partitioned table.

- We might see newer versions of relpartbound for some tables than others. For instance, suppose we had partition A for 1..200 and B for 201..300. Then we realize that this is not what we actually wanted to do, so we detach A and reattach it with a bound of 1..100 and detach B and reattach it with a bound of 101..300. If we perform the syscache lookup for A before this happens and the syscache lookup for B after this happens, we might see the old bound for A and the new bound for B, and that would be sad, 'cuz they overlap.

- Seeing an older relpartbound for some other table is also a problem for other reasons -- we will have the wrong idea about the bounds of that partition and may put the wrong tuples into it.

Without AccessExclusiveLock, I don't think there is anything that keeps us from reading stale syscache entries. Alvaro's patch defends against the first of these cases by throwing an error, which, as I already said, I don't think is acceptable, but I don't see any defense at all against the other cases. The root of the problem is the way catalog lookups work today: each individual lookup uses the latest available snapshot, but there is zero guarantee that consecutive lookups use the same snapshot. Therefore, as soon as you start lowering lock levels, you are at risk for inconsistent data.

I suspect the only good way of fixing this problem is using a single snapshot to perform both the scan of pg_inherits and the subsequent pg_class lookups (a rough sketch of what I mean appears below). That way, you know that you are seeing the state of the whole partitioning hierarchy as it existed at some particular point in time -- every commit is either fully reflected in the constructed PartitionDesc or not reflected at all. Unfortunately, that would mean that we can't use the syscache to perform the lookups, which might have unhappy performance consequences.

Note that this problem still exists even if you allow concurrent attach but not concurrent detach, but it's not as bad, because when you encounter a concurrently-attached partition, you know it hasn't also been concurrently detached from someplace else. Presumably you either see the latest value of the partition bound or the NULL value which preceded it, but not anything else. If that's so, then maybe you could get by without using a consistent snapshot for all of your information gathering: if you see NULL, you know that the partition was concurrently added and you just ignore it. There's still no guarantee that all parallel workers would come to the same conclusion, though, which doesn't feel too good.
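Just to make the single-snapshot idea concrete, here is a rough sketch (against current sources, using table_open/table_close naming; the function name and structure are illustrative only, not code from either patch, and error handling and the actual parsing of relpartbound are elided) of doing both the pg_inherits scan and the pg_class lookups under one registered snapshot:

/*
 * Illustrative sketch only: collect each child's relpartbound using one
 * registered snapshot for every catalog read, instead of
 * find_inheritance_children() plus per-child SysCacheGetAttr(RELOID, ...)
 * calls that may each see a different catalog state.
 */
#include "postgres.h"

#include "access/genam.h"
#include "access/htup_details.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "access/table.h"
#include "catalog/indexing.h"
#include "catalog/pg_class.h"
#include "catalog/pg_inherits.h"
#include "utils/fmgroids.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

static void
build_partition_bounds_consistently(Oid parentrelid)
{
	Snapshot	snapshot;
	Relation	inhrel;
	SysScanDesc inhscan;
	ScanKeyData inhkey;
	HeapTuple	inhtup;

	/* One snapshot governs all of the catalog reads below. */
	snapshot = RegisterSnapshot(GetLatestSnapshot());

	/* Find the children by scanning pg_inherits, not via the syscache. */
	inhrel = table_open(InheritsRelationId, AccessShareLock);
	ScanKeyInit(&inhkey,
				Anum_pg_inherits_inhparent,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(parentrelid));
	inhscan = systable_beginscan(inhrel, InheritsParentIndexId, true,
								 snapshot, 1, &inhkey);

	while (HeapTupleIsValid(inhtup = systable_getnext(inhscan)))
	{
		Oid			childrelid;
		Relation	classrel;
		SysScanDesc classscan;
		ScanKeyData classkey;
		HeapTuple	classtup;

		childrelid = ((Form_pg_inherits) GETSTRUCT(inhtup))->inhrelid;

		/*
		 * Fetch the child's pg_class row with the *same* snapshot.  A
		 * SysCacheGetAttr(RELOID, ...) call here would use the latest
		 * snapshot instead, and so could see a detach or reattach that
		 * committed after the pg_inherits scan above.
		 */
		classrel = table_open(RelationRelationId, AccessShareLock);
		ScanKeyInit(&classkey,
					Anum_pg_class_oid,
					BTEqualStrategyNumber, F_OIDEQ,
					ObjectIdGetDatum(childrelid));
		classscan = systable_beginscan(classrel, ClassOidIndexId, true,
									   snapshot, 1, &classkey);
		classtup = systable_getnext(classscan);
		if (HeapTupleIsValid(classtup))
		{
			bool		isnull;
			Datum		bound = heap_getattr(classtup,
											 Anum_pg_class_relpartbound,
											 RelationGetDescr(classrel),
											 &isnull);

			/* A real version would collect (childrelid, bound) here. */
			(void) bound;
		}
		systable_endscan(classscan);
		table_close(classrel, AccessShareLock);
	}

	systable_endscan(inhscan);
	table_close(inhrel, AccessShareLock);
	UnregisterSnapshot(snapshot);
}

Because every read above uses the same registered snapshot, a concurrent detach-and-reattach either shows up in both the child list and the fetched relpartbound or in neither -- but, as noted above, bypassing the syscache this way could have unhappy performance consequences.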
Personally, I don't think it's right to blame that problem on parallel query. The problem is more general than that: we assume that holding any kind of a lock on a relation is enough to keep the important details of the relation static, and therefore it's fine to do staggered lookups within one backend, and it's also fine to do staggered lookups across different backends. When you remove the basic assumption that any lock is enough to prevent concurrent DDL, then the whole idea that you can do different lookups at different times with different snapshots (possibly in different backends) and get sane answers also ceases to be correct. But the idea that you can look up different bits of catalog data at whatever time is convenient undergirds large amounts of our current machinery -- it's built into relcache, syscache, sinval, ... I think that things get even crazier if we postpone locking on individual partitions until we need to do something with that partition, as has been proposed elsewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company