Re: VACUUM in SP-GiST - Mailing list pgsql-hackers
From:           yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Subject:        Re: VACUUM in SP-GiST
Msg-id:         20120116214335.D88C414A22B@mail.netbsd.org
In response to: VACUUM in SP-GiST (Tom Lane <tgl@sss.pgh.pa.us>)
List:           pgsql-hackers
hi,

> I started reading the spgist vacuum code last night, and didn't like it
> at all.  I found a number of smaller issues, but it seems to me that
> the design is just fundamentally wrong.  Unless I'm misunderstanding it,
> the approach is to recursively traverse the tree in sort of the same way
> that a full-index search would do, except that it continues to hold lock
> on the immediate parent page of whichever page is currently being
> visited, so that it can update the downlink if it has to delete the
> first leaf tuple of a chain.  I've got several complaints about that:
>
> 1. Holding exclusive locks on upper pages while we visit
> possibly-hundreds of leaf pages seems pretty bad from a concurrency
> standpoint.  It doesn't hold locks all the way up to the root, so it's
> not a complete loss, but even a first-level inner page can have a lot of
> children.
>
> 2. I do not trust this code to visit every leaf page (which, if it
> failed to do so and thereby missed deleting a heap reference, would
> result in a silently corrupt index).  Unlike a regular index search,
> which makes a list of lower-level targets while examining an inner tuple
> just once, this code depends on the assumption that it can reacquire
> lock on an upper page and then resynchronize itself with whatever's been
> done meanwhile to the inner tuple it was looking at.  I don't trust
> that.  The mechanism that's being used relies on page LSN having been
> changed if anything was changed on the page, which does not work in
> unlogged tables.  Furthermore, if the inner tuple did change, it has to
> rescan *everything* that the inner tuple leads to.  That's horrendously
> inefficient, and it's not even clear that it would ever terminate in the
> face of a constant stream of concurrent updates.
>
> 3. Even if it all worked, it would be slow as can be, because it
> requires a whole lot of nonsequential accesses to the index.
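the hazard in point 2 is easy to see with a toy model.  this is a
standalone sketch, not code from the patch; all names here are invented
for illustration.  the point is just that comparing page LSNs only
detects a concurrent change if modifications actually advance the LSN,
which an unlogged table's pages never do:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-in for a page header: just an LSN and a tuple count. */
typedef struct PageSketch
{
    uint64_t    lsn;        /* LSN of last WAL record touching the page */
    int         ntuples;
} PageSketch;

/* Modify the page; only WAL-logged relations advance the page LSN. */
static void
page_modify(PageSketch *page, bool wal_logged, uint64_t wal_lsn)
{
    page->ntuples++;
    if (wal_logged)
        page->lsn = wal_lsn;
}

/* The resynchronization check: "did anything change while we were away?" */
static bool
page_changed_since(const PageSketch *page, uint64_t saved_lsn)
{
    return page->lsn != saved_lsn;
}
```

for an unlogged table, page_modify() leaves the LSN untouched, so
page_changed_since() returns false even though the page did change --
which is exactly the silent-miss scenario described above.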
> For instance, the same leaf page might be visited many times (once for
> each leaf chain on the page), and not necessarily close together either.
>
> Also, the code doesn't seem to perform any cleanup at all on inner
> pages.  I am not expecting it to try to get rid of childless inner
> tuples, but surely it ought to at least convert old redirects to dead
> and get rid of dead tuples at the end of the page, much as for leaf
> pages?
>
> What I'm envisioning doing instead is making a single sequential scan
> over the entire index.  On both leaf and inner pages we could clean
> redirects and remove end dead tuples.  On leaf pages we'd also check for
> tuples to be deleted.  There's no real difficulty in removing deleted
> tuples as long as they are not at the heads of their chains.  (The code
> would have to reverse-engineer which tuples are at the heads of their
> chains, but that's easily done while scanning the page; we just make a
> reverse-lookup map for the nextOffset pointers.)  If we have to delete a
> tuple at the head of its chain, but there's still at least one live
> tuple in its chain, we could reshuffle the tuples to bring another live
> tuple to that same offset number, thereby not invalidating the parent
> downlink.  The only hard part is what to do when we delete all members
> of a chain: we can't reset the parent downlink because we do not know
> where it is.  What I propose to do in that case is install a variant
> form of dead tuple that just indicates "this chain is empty".  One way
> to represent that would be a redirect tuple pointing nowhere, but I
> think it would be cleaner to repurpose the isDead and isRedirect bits as
> a two-bit field with four states.  We'd have LIVE, DEAD, REDIRECT, and
> this fourth state for a dead tuple that is not recyclable because we
> know there's a link to it somewhere.  A scan would ignore such a tuple.
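for what it's worth, both mechanisms above seem straightforward to
sketch.  this is illustrative code only, not the eventual
implementation; the type names, the capacity constant, and the exact
state names are my inventions, following the LIVE/DEAD/REDIRECT plus
fourth-state scheme and the reverse-lookup idea described in the mail:

```c
#include <assert.h>
#include <stdbool.h>

/* The two-bit field replacing the separate isDead/isRedirect bits. */
typedef enum SpGistTupleStateSketch
{
    TUPSTATE_LIVE = 0,          /* normal tuple */
    TUPSTATE_REDIRECT = 1,      /* forwarding pointer left by a page split */
    TUPSTATE_DEAD = 2,          /* "this chain is empty"; not recyclable
                                 * because a parent downlink points here */
    TUPSTATE_PLACEHOLDER = 3    /* removable; exists only to preserve the
                                 * offset numbers of other tuples */
} SpGistTupleStateSketch;

/*
 * Reverse-engineering which leaf tuples head their chains: build a
 * reverse-lookup map over the nextOffset pointers while scanning the
 * page.  Any offset that no other tuple's nextOffset points at must be
 * a chain head.  Offsets are 1-based; nextOffset == 0 ends a chain.
 */
#define SKETCH_MAX_OFFSET 64    /* invented page capacity for the sketch */

typedef struct LeafPageSketch
{
    int     ntuples;
    int     next_offset[SKETCH_MAX_OFFSET + 1];
} LeafPageSketch;

static void
find_chain_heads(const LeafPageSketch *page,
                 bool heads[SKETCH_MAX_OFFSET + 1])
{
    bool    pointed_to[SKETCH_MAX_OFFSET + 1] = {false};
    int     i;

    /* first pass: record every offset that some nextOffset points at */
    for (i = 1; i <= page->ntuples; i++)
        if (page->next_offset[i] != 0)
            pointed_to[page->next_offset[i]] = true;

    /* second pass: anything not pointed at is the head of its chain */
    for (i = 1; i <= page->ntuples; i++)
        heads[i] = !pointed_to[i];
}
```

under that scheme, deleting every member of a chain would overwrite the
head with a TUPSTATE_DEAD marker in place rather than removing it, so
the unknown parent downlink stays valid.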
> An insertion that arrives at such a tuple could either replace it (if
> there's enough room on the page) or convert it to a redirect (if not).
>
> Last night I was thinking the fourth state could be named TOMBSTONE,
> but maybe it'd be better to use DEAD for this case and rename the
> existing "removable dead tuple" state to PLACEHOLDER, since that case
> only exists to preserve the offset numbers of other tuples on the
> page.
>
> Comments?

with a concurrent split moving leaf tuples between pages, can't it make
bulkdelete fail to report some TIDs, and so make CREATE INDEX
CONCURRENTLY create duplicates?

YAMAMOTO Takashi

> 			regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers