Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CA+TgmoZdQGv35Qzrrj+-X_GypCPqv0FqfpQLmUfva1rY75hCfw@mail.gmail.com Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: POC: Cleaning up orphaned files using undo logs
|
List | pgsql-hackers |
On Wed, Jul 10, 2019 at 12:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > Broadly, you are correct to point out that you need to avoid chasing > stale pointers, and there are a bunch of ways to accomplish that: > approach #1 avoids using real pointers, and the rest just make sure > that any stale pointers don't stick around long enough to cause any > harm. There are probably also several other totally realistic > alternatives, and I don't know for sure what is best, or how much it > matters. After some off-list discussion with Andres ... Another possible approach here, which I think I like better, is to switch from using a binary heap to using an rbtree. That wouldn't work well in DSM because of the way it uses pointers, but here we're putting data in the shared memory segment so it seems like it should work. The idea would be to allocate an array of entries with a freelist, and then have allocfunc and freefunc defined to push and pop the freelist. Unlike a binary heap, an rbtree lets us (a) do peek-ahead in sorted order and (b) delete elements from an arbitrary position without rebuilding anything. If we adopt this approach, then I think a bunch of the problems we've been talking about actually get a lot easier. If we pull an item from the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can remove it from the other one cheaply, because we can store a pointer to the RBTNode in the main object. So then we never have any stale pointers in any data structure, which means we don't have to have a strategy to avoid accidentally following them. The fact that we can peak-ahead correctly without any new code is also very nice. I'm still concerned that peeking ahead isn't the right approach in general, but if we're going to do it, peeking ahead to the actually-next-highest-priority item is a lot better than peeking ahead to some-item-that-may-be-fairly-high-priority. One problem which Andres spotted is that rbt_delete() can actually move content around, so if you just cache the RBTNode returned by rbt_insert(), it might not be the right one by the time you rbt_delete(), if other stuff has been deleted first. There are several possible approaches to that problem, but one that I'm wondering about is modifying rbt_delete_node() so that it doesn't rely on rbt_copy_data. The idea is that if y != z, instead of copying the data from y to z, copy the left/right/parent pointers from z into y, and make z's left, right, and parent nodes point to y instead. Then we always end up removing the correct node, which would make things much easier for us and might well be helpful to other code that uses rbtree as well. Another small problem, also spotted by Andres, is that rbt_create() uses palloc. That seems easy to work around: just provide an rbt_intialize() function that a caller can use instead of it wants to initialize an already-allocated block of memory. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: