Re: Yet another fast GiST build - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: Yet another fast GiST build |
Date | |
Msg-id | CAF4Au4wCj6W2JZUQD3BU+Qca5e853Om6QTgmm+TsiBUzePM-bg@mail.gmail.com Whole thread Raw |
In response to | Re: Yet another fast GiST build (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Yet another fast GiST build
|
List | pgsql-hackers |
On Mon, Sep 7, 2020 at 7:50 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 07/09/2020 13:59, Pavel Borisov wrote: > >>>> I suppose there is a big jump in integer value (whether signed or > >>>> unsigned) as you cross from positive to negative floats, and then the > >>>> sort order is reversed. I have no idea if either of those things is a > >>>> problem worth fixing. That made me wonder if there might also be an > >> > >> I took a stab at fixing this, see attached patch (applies on top of your > >> patch v14). > >> > >> To evaluate this, I used the other attached patch to expose the zorder > >> function to SQL, and plotted points around zero with gnuplot. See the > >> attached two images, one with patch v14, and the other one with this patch. > > > > I'd made testing of sorted SpGist build in cases of points distributed only > > in 2d quadrant and points in all 4 quadrants and it appears that this > > abnormality doesn't affect as much as Andrey supposed. But Heikki's patch > > is really nice way to avoid what can be avoided and I'd like it is included > > together with Andrey's patch. > > Thanks! Did you measure the quality of the built index somehow? The > ordering shouldn't make any difference to the build speed, but it > affects the shape of the resulting index and the speed of queries > against it. > > I played with some simple queries like this: > > explain (analyze, buffers) select count(*) from points_good where p <@ > box(point(50, 50), point(75, 75)); > > and looking at the "Buffers" line for how many pages were accessed. > There doesn't seem to be any consistent difference between v14 and my > fix. So I concur it doesn't seem to matter much. > > > I played some more with plotting the curve. I wrote a little python > program to make an animation of it, and also simulated how the points > would be divided into pages, assuming that each GiST page can hold 200 > tuples (I think the real number is around 150 with default page size). > In the animation, the leaf pages appear as rectangles as it walks > through the Z-order curve. This is just a simulation by splitting all > the points into batches of 200 and drawing a bounding box around each > batch. I haven't checked the actual pages as the GiST creates, but I > think this gives a good idea of how it works. Heikki, you may use our gevel extension to visualize index tree http://www.sai.msu.su/~megera/wiki/Gevel I used it to investigate rtree index http://www.sai.msu.su/~megera/wiki/Rtree_Index > > The animation shows that there's quite a lot of overlap between the > pages. It's not necessarily this patch's business to try to improve > that, and the non-sorting index build isn't perfect either. But it > occurs to me that there's maybe one pretty simple trick we could do: > instead of blindly filling the leaf pages in Z-order, collect tuples > into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples, > or something in that ballpark, or maybe go all the way up to work_mem. > When the buffer fills up, call the picksplit code to divide the buffer > into the actual pages, and flush them to disk. If you look at the > animation and imagine that you would take a handful of pages in the > order they're created, and re-divide the points with the split > algorithm, there would be much less overlap. Interesting to see also the size of index, it should be several times less. > > - Heikki -- Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
pgsql-hackers by date: