Re: Faster inserts with mostly-monotonically increasing values - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Faster inserts with mostly-monotonically increasing values |
Date | |
Msg-id | CAH2-WznYxqHenGJbXbD3QdTE3BWjbgcBjcx8Ey5PwHvNgu7-Dw@mail.gmail.com Whole thread Raw |
In response to | Re: Faster inserts with mostly-monotonically increasing values (Claudio Freire <klaussfreire@gmail.com>) |
Responses |
Re: Faster inserts with mostly-monotonically increasing values
|
List | pgsql-hackers |
On Thu, Mar 1, 2018 at 3:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > given... > > + if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) && > + !P_INCOMPLETE_SPLIT(lpageop) && > + !P_IGNORE(lpageop) && > + (PageGetFreeSpace(page) > itemsz) && > + _bt_compare(rel, natts, itup_scankey, page, > PageGetMaxOffsetNumber(page)) > 0) One simple problem with this code is that it assumes that there must be at least one item on a leaf page, which isn't guaranteed. There has to be a highkey on non-righmost pages, but rightmost leaf pages can be totally empty. While btvacuumpage() is very unlikely to not be able to begin deleting the page there and then, it can happen (see remarks at the end of btvacuumpage() about recursing). Other issues spotted: * The variable name "lpageop" seems like is was lifted from somewhere dealing with a page to the left of some other page, which is not the case here. * P_INCOMPLETE_SPLIT() checks a bit that can only be set on a non-rightmost page -- a page that has a sibling to its right that doesn't have a downlink in parent. The bit is only ever set on the "target" of a (still unfinished) page split. This is why _bt_moveright() doesn't bother with completing a page split when it reaches a rightmost page. I don't see why you need this part of the test at all. > The key there is strictly greater than the rightmost key. As such, it > must precede the first page with an equal key, and since it's the > rightmost page... there's no such key. But if there was, the check for > uniqueness only needs the insert point to point to the first such > page, and this guarantees it. This code assumes that it can use block number as a stable way of finding the same point in the B-Tree over time, without any interlock. That may actually work out in practice, since even if the page is recycled there can only ever be one rightmost leaf page, but it seems like a brittle approach to me. The page image may be recycled by the FSM already, and we really shouldn't be depending on the page not looking a particular way once that has happened. I guess that you can say the same thing about using page LSN to determine cached block staleness instead, but that still seems a lot safer. Basically, I'm worried that we could do something entirely unpredictable when a page has actually been recycled, since we're entirely opting out of the RecentGlobalXmin interlock business on the assumption that we can be sure that recycling hasn't occurred in some other way. Imagine what could happen if we ever teach the FSM to share freespace among small relations, which seems quite possible to me. > You could allow for some slack, by comparing against the leftmost key > instead. You just need to know whether the key fits in the page. Then, > the insert can proceed as usual, doing the binsearch to find the > proper insertion point, etc. The whole way that this patch opts out of using an exclusive buffer lock on the "first page the value could be on" (the _bt_check_unique() + _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't terribly concrete. -- Peter Geoghegan
pgsql-hackers by date: