Re: ATTACH/DETACH PARTITION CONCURRENTLY - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: ATTACH/DETACH PARTITION CONCURRENTLY |
Date | |
Msg-id | 20181025202622.d3x4y4ch7m4pxwnn@alvherre.pgsql |
In response to | ATTACH/DETACH PARTITION CONCURRENTLY (David Rowley <david.rowley@2ndquadrant.com>) |
Responses | Re: ATTACH/DETACH PARTITION CONCURRENTLY |
List | pgsql-hackers |
Hello

Here's my take on this feature, owing much to David Rowley's version.

Firstly, I took Robert's advice and removed the CONCURRENTLY keyword from the syntax; we just do it that way always. When there's a default partition, only that partition is locked with an AEL; all the rest are locked with ShareUpdateExclusive only. I added some isolation tests for it -- they all pass for me.

There are two main ideas supporting this patch:

1. The partition descriptor cache module (partcache.c) now contains a long-lived hash table that lists all the current partition descriptors; when an invalidation message is received for a relation, we unlink the partdesc from the hash table *but do not free it*. The hash-table entry is rebuilt the next time the partdesc is requested, so many copies might exist in memory for one partitioned table. (A sketch of this scheme appears below.)

2. Snapshots have their own cache (hash table) of partition descriptors. If a partdesc is requested and the snapshot has already obtained that partdesc, the original one is returned -- we don't request a new one from partcache.

Then there are a few other implementation details worth mentioning:

3. Parallel query: when a worker starts on a snapshot that has a partition descriptor cache, we need to transmit those partdescs from the leader via shmem ... but we cannot send the full struct, so we just send the OID list of partitions and rebuild the descriptor in the worker. Side effect: if a partition is detached right between the leader taking the partdesc and the worker starting, the partition loses its relpartbound, so it's not possible to reconstruct the partdesc; in that case we raise an error. Hopefully this should be rare. (See the second sketch below.)

4. If a partitioned table is dropped, but was listed in a snapshot's partdesc cache, and then parallel query starts, the worker will try to restore the partdesc for that table, but there are no catalog rows for it. The implementation choice here is to ignore the table and move on. I would like to just remove the partdesc from the snapshot, but that would require a relcache inval callback, and a) it'd kill us to scan all snapshots for every relation drop; b) it doesn't work anyway, because we have no way to distinguish invals arriving because of DROP from invals arriving because of anything else, say ANALYZE.

5. Snapshots are copied a lot. Copies share the same hash table as the "original", because surely all copies should see the same partition descriptor. This leads to the pinning/unpinning business you see for the structs in snapmgr.c.

Some known defects:

6. This still leaks memory. Not as terribly as my earlier prototypes, but clearly it's something I need to address.

7. I've considered the idea of tracking snapshot-partdescs in resowner.c to prevent future memory-leak mistakes. Not done yet; closely related to item 6.

8. Header changes may need some cleanup yet -- e.g. I'm not sure snapmgr.h compiles standalone.

9. David Rowley recently pointed out that we can modify CREATE TABLE .. PARTITION OF to likewise not obtain AEL anymore. Apparently it just requires removing three lines in MergeAttributes.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
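A minimal sketch of the unlink-without-free scheme from point 1, built on PostgreSQL's dynahash and relcache-invalidation APIs. The names PartdescCacheEntry, lookup_cached_partdesc and build_partdesc_somehow are hypothetical, not what the patch actually uses, and the descriptor payload is stubbed out; the point is only that an invalidation detaches the entry while snapshots keep using the old descriptor they already hold.

```c
#include "postgres.h"

#include "utils/hsearch.h"
#include "utils/inval.h"
#include "utils/memutils.h"

typedef struct PartdescCacheEntry
{
	Oid			relid;			/* hash key: the partitioned table's OID */
	void	   *partdesc;		/* stand-in for the real PartitionDesc */
} PartdescCacheEntry;

static HTAB *PartdescCache = NULL;

/* hypothetical builder, stands in for partcache.c's rebuild path */
extern void *build_partdesc_somehow(Oid relid);

/*
 * Relcache invalidation callback: unlink the entry from the hash table but
 * do NOT free the descriptor itself -- snapshots may still hold pointers to
 * it.  (A real implementation must also handle relid == InvalidOid, which
 * means "all relations were invalidated".)
 */
static void
partdesc_inval_callback(Datum arg, Oid relid)
{
	if (PartdescCache != NULL)
		(void) hash_search(PartdescCache, &relid, HASH_REMOVE, NULL);
}

static void
init_partdesc_cache(void)
{
	HASHCTL		ctl;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(PartdescCacheEntry);
	ctl.hcxt = CacheMemoryContext;

	PartdescCache = hash_create("partition descriptor cache", 64, &ctl,
								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
	CacheRegisterRelcacheCallback(partdesc_inval_callback, (Datum) 0);
}

/*
 * Return the cached descriptor for 'relid'.  If an invalidation unlinked the
 * previous entry, a fresh descriptor is built here; the old one lives on,
 * no longer referenced by the hash table, which is why several copies can
 * coexist for one partitioned table.
 */
static void *
lookup_cached_partdesc(Oid relid)
{
	PartdescCacheEntry *entry;
	bool		found;

	if (PartdescCache == NULL)
		init_partdesc_cache();

	entry = hash_search(PartdescCache, &relid, HASH_ENTER, &found);
	if (!found)
		entry->partdesc = build_partdesc_somehow(relid);
	return entry->partdesc;
}
```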
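A companion sketch for point 3, showing the worker-side consequence of shipping only the bare OID list through shared memory: the worker has to look each partition up in the catalogs again, and a concurrently detached partition has a null pg_class.relpartbound, so the partdesc cannot be reconstructed and an error is raised. The function name and error wording are invented for illustration; SearchSysCache1/SysCacheGetAttr are the stock syscache calls.

```c
#include "postgres.h"

#include "access/htup_details.h"
#include "catalog/pg_class.h"
#include "utils/syscache.h"

static void
check_partition_bounds_present(Oid *partoids, int nparts)
{
	int			i;

	for (i = 0; i < nparts; i++)
	{
		HeapTuple	tuple;
		bool		isnull;

		tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(partoids[i]));
		if (!HeapTupleIsValid(tuple))
			ereport(ERROR,
					(errmsg("cache lookup failed for partition %u",
							partoids[i])));

		(void) SysCacheGetAttr(RELOID, tuple,
							   Anum_pg_class_relpartbound, &isnull);
		if (isnull)
			ereport(ERROR,
					(errmsg("partition %u was detached concurrently; cannot rebuild partition descriptor",
							partoids[i])));

		/* a real implementation would now parse the bound and build the partdesc */
		ReleaseSysCache(tuple);
	}
}
```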