Re: GiST penalty functions [PoC] - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: GiST penalty functions [PoC] |
Date | |
Msg-id | 236d7d60-4415-f4c9-34d7-656a8314d1bd@iki.fi Whole thread Raw |
In response to | GiST penalty functions [PoC] (Andrew Borodin <borodin@octonica.com>) |
Responses |
Re: GiST penalty functions [PoC]
|
List | pgsql-hackers |
On 08/29/2016 09:04 AM, Andrew Borodin wrote: > In this message I address only first problem. I want to construct a > penalty function that will choose: > 1. Entry with a zero volume and smallest margin, that can > accommodate insertion tuple without extension, if one exists; > 2. Entry with smallest volume, that can accommodate insertion tuple > without extension, if one exists; > 3. Entry with zero volume increase after insert and smallest margin > increase, if one exists; > 4. Entry with minimal volume increase after insert. Looking at the code, by "margin", you mean the sum of all "edges", i.e. of each dimension, of the cube. I guess the point of that is to differentiate between cubes that have the same volume, but a different shape. So, let's try to come up with a scheme that doesn't require IEEE 754 floats. Those above cases can be split into two categories, depending on whether the new value has zero volume or not. We can use a completely different scheme for the two cases, if we want, because when we're choosing a target page to insert to, penalty gets called for every original tuple, but with the same new tuple. Here's a scheme I just came up with. There might be better ones, but it's something. Let's have: oX: Original tuple's edge sum nX: New tuple's edge sum dX: Edge increase oV: Original tuple's volume nX: New tuple's volume dX: Volume increase Case A: New entry has zero volume. ------ Two sub-cases: A1: if dE > 0, use dE. dE must be in the range [0, nE] A2: otherwise, use oE. So how do we cram those two into a single float? If we offset case A1 by n, we can use the range between [0, nE] for A2. Something like this pseudocode: if (dE > 0) return nE + dE; /* A1, offset dE into range [nE, inf] */ else return nE - (nE/oE); /* A2, scale oE into range [0, nE] */ Case B: New entry has non-zero volume ------ B1: if dV > 0. use dV. dV must be in the range [0, nV]. B2: if dE > 0, use dE. dE must be in the range [0, nE]. B3: oV, otherwise. By offsetting cases B1 and B2, we can again divide the range into three parts, with pseudocode like: if (dV > 0) return nV + nE + dV; /* B1, offset dV into range [nV+nE, inf] */ else if (dE > 0) return nV + dE; /* B2, offset dE into range [nV, nV+nE] */ else return nV - (nV/oV) /* B3, scale oV into range [0, nV] > I’ve tested patch performance with attached test. On my machine patch > slows index construction time from 60 to 76 seconds, reduces size of > the index from 85Mb to 82Mb, reduces time of aggregates computation > from 36 seconds to 29 seconds. Hmm. That's a pretty large increase in construction time. Have you done any profiling on where the time goes? > I do not know: should I continue this work in cube, or it’d be better > to fork cube? Should definitely continue work within cube, IMHO. I don't think forking it to a new datatype would make this any easier. - Heikki
pgsql-hackers by date: