Re: Musings - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Musings |
Date | |
Msg-id | 29182.1020657054@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Musings (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
I said: > On the other hand --- if the index *is* unique, and we are checking > equality on all columns (a fairly easily checked condition), then we > know we should retrieve at most one visible tuple. So, without making > any incorrect assumptions, we could terminate the indexscan after the > first successful match. Hmm ... you might be right that there's a > cheap win to be had there. I tried this out on a quick-hack basis of just teaching IndexNext to terminate an indexscan once it's gotten a single visible tuple out of a unique index. It works pretty much as expected, but I didn't see any noticeable improvement in pgbench speed. Investigation with gprof led to the following conclusions: 1. pgbench spends an awful lot of its time in _bt_check_unique, which I hadn't touched. (AFAICS it couldn't be sped up anyway with this technique, since in the non-error case it won't find any visible tuples.) I think the only real hope for speeding up _bt_check_unique is to mark dead index entries so that we can avoid repeated heap_fetches. 2. When I said that new index entries would be visited first because they're inserted to the left of existing entries of the same key, I forgot that that's only true when there's room for them there. The comments in nbtinsert.c give a more complete picture: * NOTE: if the new key is equal to one or more existing keys, we can* legitimately place it anywhere in the seriesof equal keys --- in fact,* if the new key is equal to the page's "high key" we can place it on* the next page. If it is equal to the high key, and there's not room* to insert the new tuple on the current page without splitting,then* we can move right hoping to find more free space and avoid a split.* (We should not move right indefinitely,however, since that leads to* O(N^2) insertion behavior in the presence of many equal keys.)* Once wehave chosen the page to put the key on, we'll insert it before* any existing equal keys because of the way _bt_binsrch()works. If we repeatedly update the same tuple (keeping its index key the same), after awhile the first btree page containing that index key will fill up, and subsequently we will tend to insert duplicate entries somewhere in the middle of the multiple-page sequence of duplicates. We could guarantee that the newest tuple is visited first only if we were prepared to split the first btree page in the above-described case, rather than looking for subsequent pages with sufficient room to insert the index entry. This seems like a bad idea --- it would guarantee inefficient space usage in the index, because as soon as we had a series of duplicate keys spanning multiple pages, we would *never* attempt to insert into the middle of that series, and thus freed space within the series of pages would go forever unused. So my conclusion is that this idea is probably worth doing, but it's not earthshaking by itself. The major difficulty with doing it in a non-hack fashion is that there isn't a clean place to insert the logic. index_getnext and subroutines cannot apply the test, because they don't know anything about tuple visibility (they don't get passed the snapshot being used). So without restructuring, we'd have to teach all the couple-dozen callers of index_getnext what to do. I have been thinking for awhile that we need to restructure the indexscan API, however. There is no good reason why index_getnext *shouldn't* be responsible for time qual checking, and if it were then it could correctly apply the one-returned-tuple rule. Even more important, it would then become possible for index_getnext to also be the place that detects completely-dead tuples and notifies the index AMs to mark those index entries as uninteresting. Basically I'd like to make index_getnext have an API essentially the same as heap_getnext. There are a few callers that actually want index_getnext's behavior (fetch index tuples but not heap tuples), but we could create an alternative subroutine for them to use. A more detailed proposal will follow by and by... regards, tom lane
pgsql-hackers by date: