BTree free pages again - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | BTree free pages again |
Date | |
Msg-id | 20021022031229.GC20564@dcc.uchile.cl Whole thread Raw |
Responses |
Re: BTree free pages again
|
List | pgsql-hackers |
Hello hackers, I've been toying around with freelist in btrees, getting lots of deadlocks and bootstrap problems. I think I've learned something about the code, though. Just for the record: I already read Lehman and Yao's paper and I think I grok it. There's few ideas I'd like to present and get some feedback about the implementation, hoping that something that I miss may be catched by someone with more experience in these things. First is about freelist structure. I had originally proposed that the freelist should keep an array of BlockNumbers in the metapage. Tom argued that the number that fit in there was too small, and proposed that the first 2000 or so freepages be recorded in the metapage, and the rest should have a link to the next freepage each. I propose instead an hybrid approach: the metapage has an array of BlockNumbers, _and_ a pointer to a "next freelist page". That page has another array of BlockNumbers and another link to next freelist page. This allows for easier "compaction" of the freelist, an operation which should be done on a regular basis (with each VACUUM FULL, for example). The list of freelist-pages should actually be double linked; that way, the compaction process can take the BlockNumbers from the last page and put it on the first to fill it up, etc. (Remember that each newpage operation has to sequentially scan the freelist, and put a zero when it takes one). The second idea is about how to do the free page detection. Each time a tuple is deleted the page is checked to see if it's empty. If it is, it's added to a "candidate-empty" list. At the end of the btbulkdelete() operation, we have a list of pages that are candidates for adding into the freelist. For each one, the page is exclusive-locked, and rechecked if it's empty. If it is, the parent is also exclusive-locked (beware of deadlock here!) and also its left and right siblings. In the parent, the item pointing to this page is deleted; in the siblings, the side pointers are updated (high key on the left sibling also?). Then this page is added to the freelist. For the btree_xlog_delete() operation the recovery should be similar, but the state should be checked with every operation, not with a candidate-empty list. On the "add-to-free-list" operation, the page is checked to see if it's the last page of the relation. If it is, the page just before is checked for emptyness (using the BTP_FREE flag) iteratively until a nonfree page is found. All those pages are deleted from the freelist. Then the relation is shrinked in that many pages. Third and last idea: how the _bt_getbuf() operation is modified. I think the best way is to create a new _bt_newbuf() operation that grabs one from the freelist, and use that whenever _bt_getbuf() is called with P_NEW as BlockNumber in the current code (current _bt_getbuf() will have and Assert(blkno != P_NEW). To prevent deadlocks, the newroot operation does not get a page from the freelist; it always extends the relation. Comments? I think I've put too many things in one email. Sorry for this. -- Alvaro Herrera (<alvherre[a]atentus.com>) "The eagle never lost so much time as when he submitted to learn from the crow." (William Blake, citado por Nobody)
pgsql-hackers by date: