Re: Re: Buffer access rules, and a probable bug - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: Re: Buffer access rules, and a probable bug
Msg-id: 26093.994199237@sss.pgh.pa.us
In response to: RE: Re: Buffer access rules, and a probable bug ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>)
List: pgsql-hackers
Okay, on to the next concern.  I've been thinking some more about the
restrictions needed to make the world safe for concurrent VACUUM.
I previously said:

> 5. To physically remove a tuple or compact free space on a page, one
> must hold a pin and an exclusive lock, *and* observe while holding the
> exclusive lock that the buffer's shared reference count is one (ie,
> no other backend holds a pin).  If these conditions are met then no other
> backend can perform a page scan until the exclusive lock is dropped, and
> no other backend can be holding a reference to an existing tuple that it
> might expect to examine again.  Note that another backend might pin the
> buffer (increment the refcount) while one is performing the cleanup, but
> it won't be able to actually examine the page until it acquires shared
> or exclusive lock.

This is OK when considering a page in isolation, but it does not get the
job done when one is deleting related index tuples and heap tuples.  It
seems to me that there *must* be some cross-page coupling to make that
work safely.  Otherwise you could have this scenario:

1. Indexscanning process visits an index tuple, decides to access the
corresponding heap tuple, drops its lock on the index buffer page.

2. VACUUMing process visits the index buffer page and marks the index
tuple dead.  (It won't try to delete the tuple yet, since it sees the pin
still held on the page by process #1, but it can acquire exclusive lock
and mark the tuple dead anyway.)

3. VACUUMing process is the first to acquire pin and lock on the heap
buffer page.  It sees no other pin, so it deletes the tuple.

4. Indexscanning process finally acquires pin and lock on the heap page,
tries to access what is now a gone tuple.  Oops.

(Even if we made that not an error condition, it could be worse: what if
a third process already reused the line pointer for a new tuple?)

It does not help to postpone the actual cleaning of an index or heap page
until its own pin count drops to zero --- the problem here is that an
indexscanner has acquired a reference into a heap page from the index,
but does not yet hold a pin on the heap page to ensure that the reference
stays good.  So we can't just postpone the cleaning of the index page
till it has pin count zero; we have to make the related heap page(s)'
cleanup wait for that to happen too.

I can think of two ways of guaranteeing that this problem cannot happen.
One is for an indexscanning process to retain its shared lock on the
index page until it has acquired at least a pin on the heap page.  This
is very bad for concurrency --- it means that we'd typically be holding
index page shared locks for the time needed to read in a
randomly-accessed disk page.  And it's very complicated, since we still
need all the other rules, plus the mechanism for postponing cleanups
until pin count goes to zero.  It'd cause considerable changes to the
index access method API, too.

The other is to forget about asynchronous cleaning, and instead have the
VACUUM process directly wait for pin count zero, then clean the index
page.  Then when it does the same for the heap page, we know for sure
there are no indexscanners in transit to the heap page.  This would be
logically a lot simpler, it seems to me.  Another advantage is that we
need only one WAL entry per cleaned page, not two (one for the initial
mark-dead step and one for the postponable compaction step), and there's
no need for an intermediate "gone but not forgotten" state for index
tuples.
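For concreteness, here is a minimal C sketch of the rule-5 precondition
check as described above.  This is not the actual bufmgr code: BufDesc,
AcquireExclusiveLock, ReleaseExclusiveLock, and CompactPage are all
illustrative names, and the spinlock normally needed to read the shared
header fields safely is omitted.

    #include <stdbool.h>

    /*
     * Simplified stand-in for a shared buffer descriptor.  Only the
     * field relevant to rule 5 is shown.
     */
    typedef struct BufDesc
    {
        int   refcount;     /* shared pin count across all backends */
        /* ... buffer tag, lock state, flags, etc. ... */
    } BufDesc;

    /* Assumed primitives; not real PostgreSQL function names. */
    extern void AcquireExclusiveLock(BufDesc *buf);
    extern void ReleaseExclusiveLock(BufDesc *buf);
    extern void CompactPage(BufDesc *buf);

    /*
     * Try to physically clean the page.  The caller must already hold
     * a pin, so a refcount of exactly 1 means no *other* backend holds
     * one.  Returns true if cleanup was performed.
     */
    static bool
    TryCleanupPage(BufDesc *buf)
    {
        AcquireExclusiveLock(buf);
        if (buf->refcount == 1)
        {
            /*
             * No other backend holds a pin, and none can begin a page
             * scan until we drop the exclusive lock, so it is safe to
             * remove tuples and compact free space.
             */
            CompactPage(buf);
            ReleaseExclusiveLock(buf);
            return true;
        }
        /* Another pin exists: a scanner may still hold tuple references. */
        ReleaseExclusiveLock(buf);
        return false;
    }

As the scenario above shows, this per-page test alone is not enough once
index and heap pages are coupled, which is what motivates the two
alternatives just discussed.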
We could implement this in pretty nearly the same way as the "mark for
cleanup" facility that you partially implemented a while back:
essentially, the cleanup callback would send a signal or semaphore
increment to the waiting process, which would then try to acquire pin and
exclusive lock on the buffer.  If it succeeded in observing pin count 1
with exclusive lock, it could proceed with cleanup, else loop back and
try again.  Eventually it'll get the lock.  (It might take a while, but
for a background VACUUM I think that's OK.)

What I'm wondering is whether you had any other intended use for "mark
for cleanup" than VACUUM.  The cheapest implementation would allow only
one process to be waiting for cleanup on a given buffer, which is OK for
VACUUM because we'll only allow one VACUUM at a time on a relation
anyway.  But if you had some other uses in mind, maybe the code needs to
support multiple waiters.

Comments?

			regards, tom lane
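To make the proposed wait loop concrete, here is a hedged C sketch of
what the VACUUM side might look like, building on the TryCleanupPage
sketch above.  RegisterCleanupWaiter, UnregisterCleanupWaiter,
SemaphoreWait, PinBuffer, and UnpinBuffer are all hypothetical names; the
other half of the mechanism (the backend dropping the last other pin does
the semaphore increment) is implied but not shown.

    /* Assumed primitives; not real PostgreSQL function names. */
    extern void PinBuffer(BufDesc *buf);
    extern void UnpinBuffer(BufDesc *buf);
    extern void RegisterCleanupWaiter(BufDesc *buf, int backend_id);
    extern void UnregisterCleanupWaiter(BufDesc *buf, int backend_id);
    extern void SemaphoreWait(int backend_id);  /* sleep until signaled */

    /*
     * Synchronously wait until the page can be cleaned, then clean it.
     * The single-waiter slot reflects the "cheapest implementation"
     * mentioned above: only one VACUUM runs on a relation at a time.
     */
    static void
    WaitAndCleanupPage(BufDesc *buf, int my_backend_id)
    {
        PinBuffer(buf);

        /*
         * Register before the first check so that a wakeup sent between
         * a failed check and the semaphore wait cannot be lost (counting
         * semaphore semantics assumed).
         */
        RegisterCleanupWaiter(buf, my_backend_id);

        while (!TryCleanupPage(buf))
        {
            /*
             * Some other backend still holds a pin.  Sleep until the
             * last other pin is dropped, then retry: a new pin may have
             * arrived in the meantime, so the check must loop.
             */
            SemaphoreWait(my_backend_id);
        }

        UnregisterCleanupWaiter(buf, my_backend_id);
        UnpinBuffer(buf);
    }

Applying this first to the index page and then to the heap page gives the
ordering guarantee described above: once the index page has been cleaned
at pin count 1, no indexscanner can still be in transit to the heap page.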