Plans for solving the VACUUM problem - Mailing list pgsql-hackers
From | Tom Lane
---|---
Subject | Plans for solving the VACUUM problem
Date |
Msg-id | 12833.990140724@sss.pgh.pa.us
List | pgsql-hackers
I have been thinking about the problem of VACUUM and how we might fix it for 7.2. Vadim has suggested that we should attack this by implementing an overwriting storage manager and transaction UNDO, but I'm not totally comfortable with that approach: it seems to me that it's an awfully large change in the way Postgres works. Instead, here is a sketch of an attack that I think fits better into the existing system structure.

First point: I don't think we need to get rid of VACUUM, exactly. What we want for 24x7 operation is to be able to do whatever housekeeping we need without locking out normal transaction processing for long intervals. We could live with routine VACUUMs if they could run in parallel with reads and writes of the table being vacuumed. They don't even have to run in parallel with schema updates of the target table (CREATE/DROP INDEX, ALTER TABLE, etc). Schema updates aren't things you do lightly for big tables anyhow. So what we want is more of a "background VACUUM" than a "no VACUUM" solution.

Second: if VACUUM can run in the background, then there's no reason not to run it fairly frequently. In fact, it could become an automatically scheduled activity like CHECKPOINT is now, or perhaps even a continuously running daemon (which was the original conception of it at Berkeley, BTW). This is important because it means that VACUUM doesn't have to be perfect. The existing VACUUM code goes to huge lengths to ensure that it compacts the table as much as possible. We don't need that; if we miss some free space this time around but can expect to get it the next time (or eventually), we can be happy. This leads to thinking of space management in terms of steady-state behavior, rather than the periodic "big bang" approach that VACUUM represents now.

But having said that, there's no reason to remove the existing VACUUM code: we can keep it around for situations where you need to crunch a table as much as possible and you can afford to lock the table while you do it. The new code would be a new command, maybe "VACUUM LAZY" (or some other name entirely).

Enough handwaving, what about specifics?

1. Forget moving tuples from one page to another. Doing that in a transaction-safe way is hugely expensive and complicated. Lazy VACUUM will only delete dead tuples and coalesce the free space thus made available within each page of a relation.

2. This does no good unless there's a provision to re-use that free space. To do that, I propose a free space map (FSM) kept in shared memory, which will tell backends which pages of a relation have free space. Only if the FSM shows no free space available will the relation be extended to insert a new or updated tuple.

3. Lazy VACUUM processes a table in five stages:
   A. Scan relation looking for dead tuples; accumulate a list of their TIDs, as well as info about existing free space. (This pass is completely read-only and so incurs no WAL traffic.)
   B. Remove index entries for the dead tuples. (See below for details.)
   C. Physically delete dead tuples and compact free space on their pages.
   D. Truncate any completely-empty pages at relation's end. (Optional, see below.)
   E. Create/update FSM entry for the table.

Note that this is crash-safe as long as the individual update operations are atomic (which can be guaranteed by WAL entries for them). If a tuple is dead, we don't care whether its index entries are still around or not, so there's no risk to logical consistency.
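As a rough illustration of how these five stages might be strung together, here is a minimal C sketch. Every type and helper name in it (scan_heap, bulk_delete_index_entries, try_exclusive_lock, and so on) is a hypothetical stand-in invented for the sketch, not an actual PostgreSQL routine; it is only meant to show the control flow described above.

```c
/*
 * Hypothetical sketch of a lazy-VACUUM driver for one relation.
 *   A: collect dead-tuple TIDs and free-space info (read-only).
 *   B: bulk-remove the matching index entries.
 *   C: delete the dead tuples and compact free space page by page.
 *   D: try to truncate empty pages at the relation's end (optional).
 *   E: publish the collected free-space info to the shared FSM.
 */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef struct { BlockNumber block; uint16_t offset; } TupleId;       /* stand-in for a TID */
typedef struct { TupleId *tids; size_t ntids; } DeadTidList;           /* stage A output */
typedef struct { BlockNumber block; size_t free_bytes; } PageFreeInfo;
typedef struct { PageFreeInfo *pages; size_t npages; } FreeSpaceInfo;  /* stage E input */

/* The helpers below are assumed to exist; they stand in for real heap/index/lock/FSM code. */
extern void scan_heap(const char *rel, DeadTidList *dead, FreeSpaceInfo *space);       /* A */
extern void bulk_delete_index_entries(const char *rel, const DeadTidList *dead);       /* B */
extern void remove_dead_tuples_and_compact(const char *rel, const DeadTidList *dead);  /* C */
extern int  try_exclusive_lock(const char *rel);   /* conditional lock: nonzero on success */
extern void truncate_empty_end_pages(const char *rel);                                 /* D */
extern void release_exclusive_lock(const char *rel);
extern void update_fsm(const char *rel, const FreeSpaceInfo *space);                   /* E */

void
lazy_vacuum_rel(const char *rel)
{
    DeadTidList   dead = {0};
    FreeSpaceInfo space = {0};

    scan_heap(rel, &dead, &space);               /* A: read-only, no WAL traffic */
    bulk_delete_index_entries(rel, &dead);       /* B */
    remove_dead_tuples_and_compact(rel, &dead);  /* C */

    if (try_exclusive_lock(rel))                 /* D: only if we can get it without waiting */
    {
        truncate_empty_end_pages(rel);
        release_exclusive_lock(rel);
    }

    update_fsm(rel, &space);                     /* E */
}
```

Stage D is attempted only when the exclusive lock can be taken without blocking, matching the conditional-lock behavior described under "Locking issues" below.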
4. Observe that lazy VACUUM need not really be a transaction at all, since there's nothing it does that needs to be cancelled or undone if it is aborted. This means that its WAL entries do not have to hang around past the next checkpoint, which solves the huge-WAL-space-usage problem that people have noticed while VACUUMing large tables under 7.1.

5. Also note that there's nothing saying that lazy VACUUM must do the entire table in one go; once it has accumulated a big enough batch of dead tuples, it can proceed through steps B, C, D, E even though it hasn't scanned the whole table. This avoids a rather nasty problem that VACUUM has always had with running out of memory on huge tables.

Free space map details
----------------------

I envision the FSM as a shared hash table keyed by table ID, with each entry containing a list of page numbers and the free space in each such page. The FSM is empty at system startup and is filled by lazy VACUUM as it processes each table. Backends then decrement/remove page entries as they use free space.

Critical point: the FSM is only a hint and does not have to be perfectly accurate. It can omit space that's actually available without harm, and if it claims there's more space available on a page than there actually is, we haven't lost much except a wasted ReadBuffer cycle. This allows us to take shortcuts in maintaining it. In particular, we can constrain the FSM to a prespecified size, which is critical for keeping it in shared memory. We just discard entries (pages or whole relations) as necessary to keep it under budget. Obviously, we'd not bother to make entries in the first place for pages with only a little free space. Relation entries might be discarded on a least-recently-used basis.

Accesses to the FSM could create contention problems if we're not careful. I think this can be dealt with by having each backend remember (in its relcache entry for a table) the page number of the last page it chose from the FSM to insert into. That backend will keep inserting new tuples into that same page, without touching the FSM, as long as there's room there. Only then does it go back to the FSM, update or remove that page entry, and choose another page to start inserting on. This reduces the access load on the FSM from once per tuple to once per page. (Moreover, we can arrange that successive backends consulting the FSM pick different pages if possible. Then, concurrent inserts will tend to go to different pages, reducing contention for shared buffers; yet any single backend does sequential inserts in one page, so that a bulk load doesn't cause disk traffic scattered all over the table.)

The FSM can also cache the overall relation size, saving an lseek kernel call whenever we do have to extend the relation for lack of internal free space. This will help pay for the locking cost of accessing the FSM.
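As a concrete, purely illustrative picture of what such an FSM entry could look like, here is a rough C sketch. The field names, the fixed per-relation page budget, and the lookup helper are assumptions made for the sketch, not part of the proposal itself.

```c
/*
 * Hypothetical sketch of free space map (FSM) entries kept in shared memory.
 * One entry per relation, holding a bounded list of (page, free bytes)
 * pairs.  The whole map is only a hint: it may be incomplete or stale, so
 * callers must re-verify free space on the page they are handed.
 */
#include <stdint.h>

#define FSM_PAGES_PER_REL 64            /* arbitrary per-relation budget for the sketch */

typedef uint32_t Oid;
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct
{
    BlockNumber block;                  /* page number within the relation */
    uint32_t    free_bytes;             /* approximate free space on that page */
} FSMPageSlot;

typedef struct
{
    Oid         rel_id;                 /* hash key: which table this entry describes */
    uint32_t    num_pages;              /* how many slots below are in use */
    BlockNumber rel_size;               /* cached relation size, saves an lseek() on extend */
    FSMPageSlot pages[FSM_PAGES_PER_REL];
} FSMRelation;

/*
 * Return a page believed to have at least `needed` bytes free, or
 * InvalidBlockNumber if the FSM has nothing to offer (in which case the
 * caller extends the relation instead).
 */
BlockNumber
fsm_find_page(const FSMRelation *entry, uint32_t needed)
{
    for (uint32_t i = 0; i < entry->num_pages; i++)
    {
        if (entry->pages[i].free_bytes >= needed)
            return entry->pages[i].block;
    }
    return InvalidBlockNumber;
}
```

In this picture, a backend would call something like fsm_find_page only when the page it remembered in its relcache entry fills up, so the shared map is touched roughly once per page of inserted data rather than once per tuple.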
Locking issues
--------------

We will need two extensions to the lock manager:

1. A new lock type that allows concurrent reads and writes (AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else. Lazy VACUUM will grab this type of table lock to ensure the table schema doesn't change under it. Call it a VacuumLock until we think of a better name.

2. A "conditional lock" operation that acquires a lock if available, but doesn't block if not.

The conditional lock will be used by lazy VACUUM to try to upgrade its VacuumLock to an AccessExclusiveLock at step D (truncate table). If it's able to get exclusive lock, it's safe to truncate any unused end pages. Without exclusive lock, it's not, since there might be concurrent transactions scanning or inserting into the empty pages. We do not want lazy VACUUM to block waiting to do this, since if it does that it will create a lockout situation (reader/writer transactions will stack up behind it in the lock queue while everyone waits for the existing reader/writer transactions to finish). Better to not do the truncation.

Another place where lazy VACUUM may be unable to do its job completely is in compaction of space on individual disk pages. It can physically move tuples to perform compaction only if there are not currently any other backends with pointers into that page (which can be tested by looking to see if the buffer reference count is one). Again, we punt and leave the space to be compacted next time if we can't do it right away.

The fact that inserted/updated tuples might wind up anywhere in the table, not only at the end, creates no headaches except for heap_update. That routine needs buffer locks on both the page containing the old tuple and the page that will contain the new one. To avoid possible deadlocks between different backends locking the same two pages in opposite orders, we need to constrain the lock ordering used by heap_update. This is doable but will require slightly more code than is there now.

Index access method improvements
--------------------------------

Presently, VACUUM deletes index tuples by doing a standard index scan and checking each returned index tuple to see if it points at any of the tuples to be deleted. If so, the index AM is called back to delete the tested index tuple. This is horribly inefficient: it means one trip into the index AM (with associated buffer lock/unlock and search overhead) for each tuple in the index, plus another such trip for each tuple actually deleted. This is mainly a problem of a poorly chosen API. The index AMs should offer a "bulk delete" call, which is passed a sorted array of main-table TIDs. The loop over the index tuples should happen internally to the index AM. At least in the case of btree, this could be done by a sequential scan over the index pages, which avoids the random I/O of an index-order scan and so should offer additional speedup.

Further out (possibly not for 7.2), we should also look at making the index AMs responsible for shrinking indexes during deletion, or perhaps via a separate "vacuum index" API. This can be done without exclusive locks on the index --- the original Lehman & Yao concurrent-btrees paper didn't describe how, but more recent papers show how to do it. As with the main tables, I think it's sufficient to recycle freed space within the index, and not necessarily try to give it back to the OS.

We will also want to look at upgrading the non-btree index types to allow concurrent operations. This may be a research problem; I don't expect to touch that issue for 7.2. (Hence, lazy VACUUM on tables with non-btree indexes will still create lockouts until this is addressed. But note that the lockout only lasts through step B of the VACUUM, not the whole thing.)
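To make the shape of such a bulk-delete call concrete, here is a hedged C sketch. The function names, the opaque page-access stand-ins, and the binary search over the sorted TID array are all assumptions invented for the illustration; they are not the eventual index AM API.

```c
/*
 * Hypothetical sketch of a "bulk delete" index AM entry point.  The caller
 * hands over a sorted array of dead heap TIDs; the AM walks its own pages
 * sequentially and drops any index entry whose heap pointer appears in the
 * array.  Membership is tested with a binary search over the sorted array.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t block; uint16_t offset; } HeapTid;

static int
tid_cmp(const HeapTid *a, const HeapTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/* Binary search: is this heap TID in the sorted dead-TID array? */
static bool
tid_is_dead(const HeapTid *tid, const HeapTid *dead, size_t ndead)
{
    size_t lo = 0, hi = ndead;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        int    c = tid_cmp(&dead[mid], tid);

        if (c == 0)
            return true;
        if (c < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}

/* Stand-ins for real index page access; assumed to exist for the sketch. */
typedef struct IndexPage IndexPage;
extern size_t     index_num_pages(const char *index);
extern IndexPage *index_read_page(const char *index, size_t pageno);
extern size_t     index_page_num_items(const IndexPage *page);
extern HeapTid    index_page_item_heap_tid(const IndexPage *page, size_t item);
extern void       index_page_delete_item(IndexPage *page, size_t item);
extern void       index_write_page(const char *index, size_t pageno, IndexPage *page);

/*
 * One sequential pass over the index, deleting entries that point at dead
 * heap tuples.  `dead` must be sorted by (block, offset).
 */
void
index_bulk_delete(const char *index, const HeapTid *dead, size_t ndead)
{
    size_t npages = index_num_pages(index);

    for (size_t pageno = 0; pageno < npages; pageno++)
    {
        IndexPage *page = index_read_page(index, pageno);

        /* Walk items backwards so deletions don't disturb the loop index. */
        for (size_t item = index_page_num_items(page); item-- > 0; )
        {
            HeapTid tid = index_page_item_heap_tid(page, item);

            if (tid_is_dead(&tid, dead, ndead))
                index_page_delete_item(page, item);
        }
        index_write_page(index, pageno, page);
    }
}
```

For btree, this page-at-a-time walk would read the index sequentially, which is where the additional speedup over an index-order scan comes from.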
There you have it. If people like this, I'm prepared to commit to making it happen for 7.2. Comments, objections, better ideas?

regards, tom lane