Thread: Index AM change proposals, redux
I just finished looking through the various threads about index AM changes that Bruce has been holding onto in the commit-fest queue. There seem to be the following issues:

* Proposed change to have amgetmulti return its results by ORing bits into a caller-supplied TIDBitmap, rather than the current interface involving an intermediate TID array. The TID-array design was intended to be agnostic about what would happen on either side of the API, but it seems that it's really just equally inconvenient for everybody :-(.

Although the original motivation for this was to open a door for bitmap indexes, it seems to me it's worth doing whether or not bitmap indexes ever happen. It's logically simpler (only one amgetmulti call per scan) and should save at least a few cycles. I recommend we just go ahead with this.

* Proposed change to let both amgetnext and amgetmulti mark returned tuples as "candidate matches", that is in need of rechecking of quals against the real heap tuple. I had originally objected to this on the grounds that it would require setup work that doesn't happen now, but looking at the code I see that's wrong --- the setup work *does* happen anyway, see indexqualorig. nodeIndexscan.c comments "The indexqualorig expression is always initialized even though it will only be used in some uncommon cases --- would be nice to improve that". So this complaint is probably a red herring. I'm still a bit concerned by the fact that the patch only tracks this on a page basis in the amgetmulti case: if any of the tuples on a page are in-doubt then everything on the page will have to be rechecked. Is that good enough?

A bigger issue is whether this is worth applying when nobody seems to be working on either of the main uses for it (bitmap indexes and GIT indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.
* Proposed change to let amgetnext mark returned tuples as not necessarily in order, requiring somebody downstream to sort them again. I was pretty desperately unhappy with that because it required injection of sorting knowledge into code that really shouldn't have anything to do with it --- not least because not all indexscans even have a defined ordering. Given that it is only needed for one particular possible implementation of GIT, and Heikki was leaning away from that implementation anyway at last report, I vote against considering this any further.

* Proposed changes to refactor the TIDBitmap support to include a concept of a "stream bitmap" being read from disk. (Not strictly an index AM change, but closely related.) While this is clean and nice on its own, it doesn't seem to have any use unless bitmap indexes happen. If we leave the code sit it'll probably acquire bit rot, but I'm disinclined to add a bunch of unused code, too. Thoughts?

* Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry. Since certain types of indexes (GIN now, possibly hash in future) are incapable of doing this, there'd need to be a way of marking index AMs (or perhaps individual indexes) as able or not able to return their keys, so that the planner could know whether quals could be pushed into the indexscan.

We don't have any actual submitted patch for this, but AFAIR everyone agrees it's a good idea.

* Bitmap indexes themselves. As far as I can tell, this development effort has stalled because of intractable problems with VACUUM. Can anyone provide a status update on that?

* GIT (Grouped Index Tuple) indexes, which achieve index space savings in btrees by having a single index tuple represent multiple heap tuples (on a single heap page) containing a range of key values.
I am not sure what the development status is --- Heikki had submitted a completed patch but there seemed to be agreement on making changes, and that's not been done AFAIK. The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price. Another issue is that we'd need to check how much of the use-case for GIT has been taken over by HOT.

* "Thick" indexes containing copies of tuple visibility info. As far as I'm concerned, this patch isn't going anywhere :-(. It's got the same showstopper problem as retail vacuum and some of the earlier forms of the HOT patch: it assumes that during tuple update or delete, it can re-find the tuple's index entries by re-computing the indexed expressions. That's just not reliable enough.

Is that a reasonable summary of where we are?

			regards, tom lane
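The first proposal above, replacing the chunked TID-array interface with a single call that ORs everything into a caller-supplied TIDBitmap, can be sketched as a toy model. All type and function names below are simplified stand-ins for the real PostgreSQL structures, not the actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a TIDBitmap: one membership flag per (toy) tuple id. */
#define MAX_TIDS 64

typedef struct TIDBitmap {
    bool member[MAX_TIDS];
    int  nmembers;
} TIDBitmap;

typedef struct IndexScan {
    const int *matches;      /* tuple ids the index says match */
    int nmatches;
    int pos;                 /* cursor used by the chunked interface */
} IndexScan;

/* Current interface: fill a caller-supplied TID array, possibly over
 * several calls; returns the number of TIDs produced this call. */
static int amgetmulti(IndexScan *scan, int *tids, int max_tids)
{
    int n = 0;
    while (scan->pos < scan->nmatches && n < max_tids)
        tids[n++] = scan->matches[scan->pos++];
    return n;
}

/* Proposed interface: one call per scan, ORing all matches directly
 * into the caller's bitmap (duplicates collapse for free). */
static void amgetbitmap(IndexScan *scan, TIDBitmap *tbm)
{
    for (int i = 0; i < scan->nmatches; i++) {
        if (!tbm->member[scan->matches[i]]) {
            tbm->member[scan->matches[i]] = true;
            tbm->nmembers++;
        }
    }
}
```

The bitmap form avoids the intermediate array and the per-chunk call loop on the executor side, which is the "logically simpler, saves a few cycles" point made above.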
I am glad you have summarized this. The details of exactly what was being proposed were too complex for me to understand before.

---------------------------------------------------------------------------

Tom Lane wrote:
> I just finished looking through the various threads about index AM changes that Bruce has been holding onto in the commit-fest queue. There seem to be the following issues:
> [full summary quoted above]

-- 
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
> * Proposed change to let both amgetnext and amgetmulti mark returned tuples as "candidate matches", that is in need of rechecking of quals against the real heap tuple. I had originally objected to this on the grounds that it would require setup work that doesn't happen now, but looking at the code I see that's wrong --- the setup work *does* happen anyway, see indexqualorig. nodeIndexscan.c comments "The indexqualorig expression is always initialized even though it will only be used in some uncommon cases --- would be nice to improve that". So this complaint is probably a red herring. I'm still a bit concerned by the fact that the patch only tracks this on a page basis in the amgetmulti case: if any of the tuples on a page are in-doubt then everything on the page will have to be rechecked. Is that good enough?

I think it's good enough. If we wanted to track it on a per-tuple basis, we'd need to store two bits per tuple, AFAICS, doubling the size of the bitmaps.

> A bigger issue is whether this is worth applying when nobody seems to be working on either of the main uses for it (bitmap indexes and GIT indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.

It has some merit on its own. Consider the case where you bitmap AND a lossy page and a non-lossy one. We currently make the result lossy, because we can't know which tuples on the page match, and then we need to recheck all tuples on that page. With the support for candidate matches, we could instead keep the non-lossy version of the bitmap and mark the matches as candidates. In the very best case, we might even apply a further AND to the page, which eliminates all the remaining candidate matches, and skip the heap access of that page altogether.
> * Proposed change to let amgetnext mark returned tuples as not necessarily in order, requiring somebody downstream to sort them again. I was pretty desperately unhappy with that because it required injection of sorting knowledge into code that really shouldn't have anything to do with it --- not least because not all indexscans even have a defined ordering. Given that it is only needed for one particular possible implementation of GIT, and Heikki was leaning away from that implementation anyway at last report, I vote against considering this any further.

Agreed, there's no use for that without GIT.

> * Proposed changes to refactor the TIDBitmap support to include a concept of a "stream bitmap" being read from disk. (Not strictly an index AM change, but closely related.) While this is clean and nice on its own, it doesn't seem to have any use unless bitmap indexes happen. If we leave the code sit it'll probably acquire bit rot, but I'm disinclined to add a bunch of unused code, too. Thoughts?

If the code is unused, it can easily bit rot even if it's in CVS. Let's not apply it. The patch is out there if/when someone picks up the bitmap index patch again.

> * Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry. Since certain types of indexes (GIN now, possibly hash in future) are incapable of doing this, there'd need to be a way of marking index AMs (or perhaps individual indexes) as able or not able to return their keys, so that the planner could know whether quals could be pushed into the indexscan.
>
> We don't have any actual submitted patch for this, but AFAIR everyone agrees it's a good idea.

Agreed :-).
> * GIT (Grouped Index Tuple) indexes, which achieve index space savings in btrees by having a single index tuple represent multiple heap tuples (on a single heap page) containing a range of key values. I am not sure what the development status is --- Heikki had submitted a completed patch but there seemed to be agreement on making changes, and that's not been done AFAIK. The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price. Another issue is that we'd need to check how much of the use-case for GIT has been taken over by HOT.

I don't think there's much overlap with HOT. On the contrary, HOT makes GIT much more useful, because GIT was very sensitive to the cluster order of the heap, and HOT helps with that. There is, however, a ton of overlap with index-only scans, and the possibility to return keys from indexes, as you pointed out.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
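The lossy-AND case Heikki describes can be modeled in miniature. PageBits below is a hypothetical stand-in for one page's entry in a TIDBitmap, not the real structure; the point is only the logic of keeping the exact side and flagging it for recheck instead of degrading to lossy:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of a per-page bitmap entry: a small fixed bitmap of
 * tuple offsets plus a "whole page maybe matches" (lossy) flag and a
 * "candidate matches, recheck quals" flag. */
typedef struct PageBits {
    unsigned bits;      /* one bit per tuple offset on the page */
    bool lossy;         /* whole page is a "maybe" match        */
    bool recheck;       /* tuples are candidates needing recheck */
} PageBits;

/* AND two pages.  Without candidate-match support, exact AND lossy
 * must degrade to lossy; with it, we keep the exact side's tuple list
 * and just mark those tuples as candidates. */
static PageBits page_and(PageBits a, PageBits b)
{
    PageBits r;
    if (a.lossy && b.lossy) {
        r.bits = 0; r.lossy = true; r.recheck = true;
    } else if (a.lossy || b.lossy) {
        r.bits = a.lossy ? b.bits : a.bits;   /* keep the exact side */
        r.lossy = false;
        r.recheck = true;
    } else {
        r.bits = a.bits & b.bits;
        r.lossy = false;
        r.recheck = a.recheck || b.recheck;
    }
    return r;
}
```

A further exact AND can then zero out all remaining candidate bits, which is the "skip the heap access of that page altogether" best case mentioned above.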
>> * Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry.

In the extreme we could build tuples and push them up several nodes -- even including joins -- before fetching the rest of the attributes from the heap.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
> * Proposed change to let both amgetnext and amgetmulti mark returned tuples as "candidate matches", that is in need of rechecking of quals
> ...
> indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.

This is a good way to eliminate the awful operation '@@@' without performance loss.

> * Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry. Since certain types of indexes (GIN now, possibly hash in future) are incapable of doing this, there'd

GiST too, because the type of storage may differ from the type being indexed. The same applies to GIN too.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru   WWW: http://www.sigaev.ru/
> * GIT (Grouped Index Tuple) indexes, which achieve index space savings in btrees by having a single index tuple represent multiple heap tuples (on a single heap page) containing a range of key values. I am not sure what the development status is --- Heikki had submitted a completed patch but there seemed to be agreement on making changes, and that's not been done AFAIK. The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price. Another issue is that we'd need to check how much of the use-case for GIT has been taken over by HOT.

I do not see the serious problem? The one key that is stored would represent all tuples it points to. So the interface could either be designed to allow skipping the key in one go, or return the same key repeatedly. All that the second approach would need is to return the key and the heap tuple pointer in two different vars.

Andreas
"Zeugswetter Andreas OSB SD" <Andreas.Zeugswetter@s-itsolutions.at> writes:

>> ... The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price.

> I do not see the serious problem? The one key that is stored would represent all tuples it points to.

No, the entry represents a range of values for which the one key is the lower bound. You don't know just what the keys are for the other tuples, unless you go to the heap and look.

We could restrict GIT to only represent tuples with exactly the same key, but that takes away a whole lot of its use-case (especially so now that HOT is in there).

			regards, tom lane
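Tom's lower-bound point can be illustrated with a toy model of a grouped index tuple. All names and layout below are illustrative only: one entry covers several heap tuples on a page but stores only the lower bound of their key range, so the index alone cannot report the actual key of every covered tuple:

```c
#include <assert.h>

/* Toy heap tuple: just a key. */
typedef struct HeapTuple { int key; } HeapTuple;

/* Toy grouped index tuple (GIT): covers several tuples on one heap
 * page, storing only the lower bound of the covered key range. */
typedef struct GroupedIndexTuple {
    int lower_key;           /* lower bound of the covered key range */
    int heap_offsets[4];     /* offsets of covered tuples on the page */
    int ncovered;
} GroupedIndexTuple;

/* The only key the index itself can produce for a covered tuple is the
 * group's lower bound: a bound, not the actual key, for all but the
 * first tuple in the range. */
static int index_reported_key(const GroupedIndexTuple *git)
{
    return git->lower_key;
}

/* Getting the real key requires visiting the heap page. */
static int actual_key(const HeapTuple *heap_page, int offset)
{
    return heap_page[offset].key;
}
```

This is why GIT, in this form, forecloses returning actual index keys from btree: the recoverable key matches the heap only for the lowest tuple of each group.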
Teodor Sigaev <teodor@sigaev.ru> writes:

>> * Proposed change to let both amgetnext and amgetmulti mark returned tuples as "candidate matches", that is in need of rechecking of quals
>> ...
>> indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.

> This is a good way to eliminate the awful operation '@@@' without performance loss.

Oh yeah, that kluge :-(. Okay, that's probably a sufficient reason all by itself.

>> * Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry. Since certain types of indexes (GIN now, possibly hash in future) are incapable of doing this, there'd

> GiST too, because the type of storage may differ from the type being indexed. The same applies to GIN too.

Is this the case for *all* GiST and GIN indexes, or might we need a per-index (rather than per-AM) flag to tell whether the original keys are available?

			regards, tom lane
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Tom Lane wrote:
>> A bigger issue is whether this is worth applying when nobody seems to be working on either of the main uses for it (bitmap indexes and GIT indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.

> It has some merit on its own.

Yeah, and Teodor's point about cleaning up the @@@ hack pretty much seals the deal for me.

Unless anyone has objections, I will review and apply Heikki's patch http://archives.postgresql.org/pgsql-patches/2007-03/msg00163.php which covers both the amgetmulti return-a-bitmap change and the candidate-matches change. (Heikki, you don't have a later version of that do you?)

The remaining topics associated with index AMs are closed for this commit fest, unless anyone has specific questions they want discussed right now...

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> A bigger issue is whether this is worth applying when nobody seems to be working on either of the main uses for it (bitmap indexes and GIT indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.
>
>> It has some merit on its own.
>
> Yeah, and Teodor's point about cleaning up the @@@ hack pretty much seals the deal for me.
>
> Unless anyone has objections, I will review and apply Heikki's patch http://archives.postgresql.org/pgsql-patches/2007-03/msg00163.php which covers both the amgetmulti return-a-bitmap change and the candidate-matches change. (Heikki, you don't have a later version of that do you?)
>
> The remaining topics associated with index AMs are closed for this commit fest, unless anyone has specific questions they want discussed right now...

OK, do you want the items moved to the next commit-fest, discarded, or made into TODOs? And which ones? Or do you want to do it?

-- 
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
>> GiST too, because the type of storage may differ from the type being indexed. The same applies to GIN too.
>
> Is this the case for *all* GiST and GIN indexes, or might we need a per-index (rather than per-AM) flag to tell whether the original keys are available?

Ughm. GiST and GIN are different here. For GiST it is clear that this is a per-index flag:
- rtree emulation over boxes stores the original values
- rtree emulation over points or circles, btree_gist, and ltree store a modified original value which can be restored from the index with a call of a specific function
- the tsvector opclass doesn't have this possibility at all

So, only rtree emulation over boxes is able to return the original value from the index.

For GIN indexes I know only one opclass where it's possible to get the original value, and that's wildspeed, but in any case that requires some transformation first. However, it's possible to develop an opclass for GIN which would be similar to classic btree, for indexing scalar values.

Both GIN and GiST make a call of a transformation function before indexing a value. I suppose there is no automatic way to set this flag even in the case when the type of storage and the type of the indexed value are the same.

So, I see two variants:
- add a flag in pg_am
- add a flag to CREATE OPERATOR CLASS, and by default it should point to the impossibility of getting the value from the index. At least for GIN and GiST.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru   WWW: http://www.sigaev.ru/
Bruce Momjian <bruce@momjian.us> writes:

> Tom Lane wrote:
>> The remaining topics associated with index AMs are closed for this commit fest, unless anyone has specific questions they want discussed right now...

> OK, do you want the items moved to the next commit-fest, discarded, or made into TODOs? And which ones? Or do you want to do it?

TODOs for each of the open items I listed in my summary, please. They aren't material for the next commit fest until someone does more work.

			regards, tom lane
Teodor Sigaev <teodor@sigaev.ru> writes:

> Both GIN and GiST make a call of a transformation function before indexing a value. I suppose there is no automatic way to set this flag even in the case when the type of storage and the type of the indexed value are the same.
> So, I see two variants:
> - add a flag in pg_am
> - add a flag to CREATE OPERATOR CLASS, and by default it should point to the impossibility of getting the value from the index. At least for GIN and GiST.

Yeah, just as messy as I feared :-(. My inclination for the first pass would be to just make it a per-AM flag in pg_am. We could always complicate matters later. (Of course, this is all hypothetical since no patch is on the horizon...)

			regards, tom lane
> Yeah, just as messy as I feared :-(. My inclination for the first pass would be to just make it a per-AM flag in pg_am. We could always complicate matters later.

Agreed, and the single existing suitable opclass has no operation marked with the RECHECK flag.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru   WWW: http://www.sigaev.ru/
Tom Lane wrote:
> Yeah, and Teodor's point about cleaning up the @@@ hack pretty much seals the deal for me.

Note that we'll need to add candidate match support to the regular index scan API as well for that.

> Unless anyone has objections, I will review and apply Heikki's patch http://archives.postgresql.org/pgsql-patches/2007-03/msg00163.php which covers both the amgetmulti return-a-bitmap change and the candidate-matches change.

Ok. Thank you!

> (Heikki, you don't have a later version of that do you?)

Nope.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Tom Lane wrote:
>> Yeah, and Teodor's point about cleaning up the @@@ hack pretty much seals the deal for me.

> Note that we'll need to add candidate match support to the regular index scan API as well for that.

[ confused... ] Thought you'd done that in this patch?

>> (Heikki, you don't have a later version of that do you?)

> Nope.

Going through my mailbox, I found Alexey Klyukin's update: http://archives.postgresql.org/pgsql-patches/2007-06/msg00204.php Did you look at his version at all?

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Note that we'll need to add candidate match support to the regular index scan API as well for that.
>
> [ confused... ] Thought you'd done that in this patch?

No, I only modified the bitmap scan API.

>>> (Heikki, you don't have a later version of that do you?)
>
>> Nope.
>
> Going through my mailbox, I found Alexey Klyukin's update: http://archives.postgresql.org/pgsql-patches/2007-06/msg00204.php Did you look at his version at all?

Oh, I didn't remember that existed at all. It looks good at first glance, though it's missing the doc changes of the original.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Tom Lane wrote:
>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>> Note that we'll need to add candidate match support to the regular index scan API as well for that.
>>
>> [ confused... ] Thought you'd done that in this patch?

> No, I only modified the bitmap scan API.

Okay. I applied what you'd done (with minor revisions), but we'll have to fix up the regular indexscan API before anything can be done about @@@. I'll take a look at that part tomorrow.

Teodor, do you have any thoughts about exactly how you'd fix @@@ ? I suppose that the recheck-need is not really a property of specific tuples, but of a particular query, for that case. Where would you want to detect that?

			regards, tom lane
>>> ... The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price.

>> I do not see the serious problem? The one key that is stored would represent all tuples it points to.

> No, the entry represents a range of values for which the one key is the lower bound. You don't know just what the keys are for the other tuples, unless you go to the heap and look.

Thanks for explaining. I think that idea stands in the way of future improvements and should be dropped, yes.

> We could restrict GIT to only represent tuples with exactly the same key, but that takes away a whole lot of its use-case (especially so now that HOT is in there).

Ok, I was forgetting pg's outstanding "partial index" feature. In pg you will most likely exclude highly duplicate keys from the index. Since other dbs don't have this feature, it is common to have highly duplicate keys (millions) there (unless you use very ugly workarounds).

I agree that the possibility of returning actual index keys and filtering rows early has more merit. It could also be used to skip forward when the qual misses middle key columns.

I do still see compressing exactly equal keys (or exactly equal parts), but not restricted to the same heap page, as a useful btree TODO (especially for long non-unique keys). But it may really not be so important in pg.

Andreas
> Teodor, do you have any thoughts about exactly how you'd fix @@@ ? I suppose that the recheck-need is not really a property of specific tuples, but of a particular query, for that case. Where would you want to detect that?

A tsquery may include a restriction by weight of search terms: 'sea & port:A'. The GIN index doesn't store information about weights, so the only difference between @@ and @@@ is that @@@ is marked with the RECHECK flag.

I think the better way is to set the required-recheck flag by looking at the value from the index, not at the tsquery. It gives us more flexibility. So, I planned to add a pointer to bool to the consistent method, so the signature will be

bool consistent( bool check[], StrategyNumber n, Datum query, bool *needRecheck)

The returned value of needRecheck should be ignored for operations not marked by the RECHECK flag in the opclass. needRecheck should be initialized to true before the call of the consistent method to keep compatibility with old opclasses.

To decide whether a recheck is needed or not, the better way is to check the actually needed values. For example, let the tsquery be equal to 'foo | bar | qq:A' and the tsvector = 'foo:1,2,3 asdasdasd:4'. Obviously recheck is not needed.
So the patch is close to trivial:

*** tsginidx.c.orig	2008-04-11 17:08:37.000000000 +0400
--- tsginidx.c	2008-04-11 17:18:45.000000000 +0400
***************
*** 109,114 ****
--- 109,115 ----
  {
  	QueryItem  *frst;
  	bool	   *mapped_check;
+ 	bool	   *needRecheck;
  } GinChkVal;
  
  static bool
***************
*** 116,121 ****
--- 117,125 ----
  {
  	GinChkVal  *gcv = (GinChkVal *) checkval;
  
+ 	if ( val->weight )
+ 		*(gcv->needRecheck) = true;
+ 
  	return gcv->mapped_check[((QueryItem *) val) - gcv->frst];
  }
***************
*** 144,149 ****
--- 148,155 ----
  	gcv.frst = item = GETQUERY(query);
  	gcv.mapped_check = (bool *) palloc(sizeof(bool) * query->size);
  
+ 	gcv.needRecheck = PG_GETARG_POINTER(3);
+ 	*(gcv.needRecheck) = false;
  
  	for (i = 0; i < query->size; i++)
  		if (item[i].type == QI_VAL)

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru   WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:

> So, I planned to add a pointer to bool to the consistent method, so the signature will be
> bool consistent( bool check[], StrategyNumber n, Datum query, bool *needRecheck)
> The returned value of needRecheck should be ignored for operations not marked by the RECHECK flag in the opclass. needRecheck should be initialized to true before the call of the consistent method to keep compatibility with old opclasses.

Perhaps it would be better to initialize needRecheck to the opclass RECHECK flag value? If the consistent function does nothing, the behavior is the same as before, but it can flip the flag in either direction if it wants.

			regards, tom lane
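Tom's suggestion amounts to a small wrapper around the consistent call. The sketch below uses hypothetical names and a deliberately simplified signature (no Datum query argument): an old-style consistent function that never touches the flag behaves exactly as the opclass RECHECK marking dictates, while a new-style one may flip it either way:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified consistent-function type: hypothetical, not the real
 * GIN signature. */
typedef bool (*ConsistentFn)(bool check[], int strategy, bool *needRecheck);

/* Caller-side convention: seed *needRecheck from the opclass RECHECK
 * flag before invoking consistent, so doing nothing preserves the old
 * behavior. */
static bool call_consistent(ConsistentFn fn, bool check[], int strategy,
                            bool opclass_recheck_flag, bool *needRecheck)
{
    *needRecheck = opclass_recheck_flag;   /* default = old behavior */
    return fn(check, strategy, needRecheck);
}

/* Old-style opclass: ignores needRecheck entirely. */
static bool old_consistent(bool check[], int strategy, bool *needRecheck)
{
    (void) strategy; (void) needRecheck;
    return check[0];
}

/* New-style opclass: decides at runtime that no recheck is required,
 * e.g. because no weighted query terms were consulted. */
static bool smart_consistent(bool check[], int strategy, bool *needRecheck)
{
    (void) strategy;
    *needRecheck = false;
    return check[0];
}
```

Under this convention the flag can move in both directions, which is what would eventually let the static RECHECK marking become unnecessary.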
Tom Lane wrote: > Perhaps it would be better to initialize needRecheck to the opclass > RECHECK flag value? If the consistent function does nothing, the > behavior is the same as before, but it can flip the flag in either > direction if it wants. I remember that last spring, in the context of GIT, you were worried about the performance implication of preparing to recheck rows when no rechecks are needed. I didn't quite buy that back then, but this would have the same issue. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Perhaps it would be better to initialize needRecheck to the opclass >> RECHECK flag value? If the consistent function does nothing, the >> behavior is the same as before, but it can flip the flag in either >> direction if it wants. > I remember that last spring, in the context of GIT, you were worried > about the performance implication of preparing to recheck rows when no > rechecks are needed. I didn't quite buy that back then, but this would > have the same issue. As I mentioned upthread, it appears that we're paying that overhead anyway --- at least nodeIndexscan.c thinks we are. I need to dig into the planner a bit today and see whether it's taking any shortcuts for non-RECHECK operators. If it really is saving anything, then I'd agree that only RECHECK-marked operators should be allowed to adjust the flag. regards, tom lane
> Perhaps it would be better to initialize needRecheck to the opclass
> RECHECK flag value? If the consistent function does nothing, the
> behavior is the same as before, but it can flip the flag in either
> direction if it wants.

I don't have any objections. In that case (and since preparing for rechecking is cheap) the RECHECK flag could be removed.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                   WWW: http://www.sigaev.ru/
I wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> I remember that last spring, in the context of GIT, you were worried >> about the performance implication of preparing to recheck rows when no >> rechecks are needed. I didn't quite buy that back then, but this would >> have the same issue. > As I mentioned upthread, it appears that we're paying that overhead > anyway --- at least nodeIndexscan.c thinks we are. I need to dig into > the planner a bit today and see whether it's taking any shortcuts for > non-RECHECK operators. Yeah, we are paying that overhead. The reason is that the planner always constructs an "indexqualorig" expression and the executor always initializes it. The only place where it's used currently in a plain indexscan is for EvalPlanQual rechecking, but we could certainly use it for lossy-operator rechecking. (The implication of this is that if any of the operators in a multi-index-qual indexscan are lossy, we'd recheck all of them upon fetching the heap tuple. Does anyone feel that's not good enough? Most of the practical cases I can think of for multi-operator scans are for btree anyway, so it's not clear that there's much value in complicating matters to evaluate just some of the indexqualorig clauses.) This would effectively move *all* management of lossy operators to runtime; the planner wouldn't really think about it at all. We could simplify createplan.c a bit. regards, tom lane
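[Editor's note: the runtime scheme Tom describes — treat lossy index output as candidate matches and re-evaluate the original quals when the heap tuple is fetched — can be illustrated with a toy standalone C sketch. The types and functions are hypothetical stand-ins, not executor code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A stand-in for a heap tuple: just the indexed value. */
typedef struct { int value; } HeapTupleStub;

/* The original qual, as indexqualorig would evaluate it. */
static bool qual_orig(const HeapTupleStub *t)
{
    return t->value == 42;
}

/* A lossy index probe: the "index" only stored value / 10, so
 * 40..49 all come back as candidates, flagged recheck = true. */
static bool index_probe(const HeapTupleStub *t, bool *recheck)
{
    *recheck = true;
    return t->value / 10 == 42 / 10;
}

/* The scan loop: fetch candidates, then recheck against the heap
 * tuple only when the index marked the result as lossy. */
static int scan_count(const HeapTupleStub *heap, size_t n)
{
    int matches = 0;
    for (size_t i = 0; i < n; i++)
    {
        bool recheck;
        if (!index_probe(&heap[i], &recheck))
            continue;                    /* index says no */
        if (recheck && !qual_orig(&heap[i]))
            continue;                    /* candidate failed recheck */
        matches++;
    }
    return matches;
}
```

Note that, per Tom's parenthetical, this rechecks *all* the original quals whenever the flag is set, rather than only the lossy ones.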
Slightly offtopic. How to get benefit on tuple level ? For example, we mark GiST tsearch index as lossy, while for not very big documents it's actually exact and we could save a lot not rechecking them. Oleg On Fri, 11 Apr 2008, Teodor Sigaev wrote: >> Teodor, do you have any thoughts about exactly how you'd fix @@@ ? >> I suppose that the recheck-need is not really a property of specific >> tuples, but of a particular query, for that case. Where would you >> want to detect that? > > tsquery may include restriction by weight of search terms: 'sea & port:A'. > GIN index doesn't store information about weights, so the only difference > between @@ and @@@ is that @@@ is marked with RECHECK flag. I think, the > better way is set flag about required recheck by looking value from index, > not for tsquery. It gives to us more flexibility. > > So, I planned to add pointer to bool to consistent method, so signature will > be > bool consistent( bool check[], StrategyNumber n, Datum query, bool > *needRecheck) > > Returning value of needRecheck should be ignored for operation not marked by > RECHECK flag in opclass. needRecheck should be initialized to true before > call of consistent method to keep compatibility with old opclasses. > > To define, is recheck needed or not, the better way is to check actually > needed values. For example, let tsquery is equal to > 'foo | bar | qq:A' and tsvetor = 'foo:1,2,3 asdasdasd:4'. Obviously recheck > is not needed. 
> So patch is close to trivial:
>
> *** tsginidx.c.orig	2008-04-11 17:08:37.000000000 +0400
> --- tsginidx.c	2008-04-11 17:18:45.000000000 +0400
> ***************
> *** 109,114 ****
> --- 109,115 ----
>   {
>   	QueryItem  *frst;
>   	bool	   *mapped_check;
> + 	bool	   *needRecheck;
>   } GinChkVal;
>
>   static bool
> ***************
> *** 116,121 ****
> --- 117,125 ----
>   {
>   	GinChkVal  *gcv = (GinChkVal *) checkval;
>
> + 	if ( val->weight )
> + 		*(gcv->needRecheck) = true;
> +
>   	return gcv->mapped_check[((QueryItem *) val) - gcv->frst];
>   }
>
> ***************
> *** 144,149 ****
> --- 148,155 ----
>
>   	gcv.frst = item = GETQUERY(query);
>   	gcv.mapped_check = (bool *) palloc(sizeof(bool) * query->size);
> + 	gcv.needRecheck = PG_GETARG_POINTER(3);
> + 	*(gcv.needRecheck) = false;
>
>   	for (i = 0; i < query->size; i++)
>   		if (item[i].type == QI_VAL)

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes: > Slightly offtopic. How to get benefit on tuple level ? For example, > we mark GiST tsearch index as lossy, while for not very big documents it's > actually exact and we could save a lot not rechecking them. Won't that just fall out of this? Assuming the consistent() function knows when the match is exact, it can set the flag properly. regards, tom lane
On Fri, 11 Apr 2008, Tom Lane wrote: > Oleg Bartunov <oleg@sai.msu.su> writes: >> Slightly offtopic. How to get benefit on tuple level ? For example, >> we mark GiST tsearch index as lossy, while for not very big documents it's >> actually exact and we could save a lot not rechecking them. > > Won't that just fall out of this? Assuming the consistent() function > knows when the match is exact, it can set the flag properly. Ah, yes. Looks like a new life for GiST tsearch index. > > regards, tom lane > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Heikki Linnakangas wrote:
>> * GIT (Grouped Index Tuple) indexes, which achieve index space savings
>> in btrees by having a single index tuple represent multiple heap tuples
>> [...]
>> Another issue is that we'd need to check how much of the use-case for
>> GIT has been taken over by HOT.
>
> There is, however, a ton of overlap with index-only scans, and the
> possibility to return keys from indexes, as you pointed out.

One use case that I think GIT would help a lot with is my large address tables that are clustered by zip-code but often queried by State, City, County, School District, Police Beat, etc.

I imagine a GIT index on "state" would just occupy a couple of pages at most, regardless of how large the table gets. And likewise, even an index on City would be orders of magnitude smaller than the existing ones, since all records for any given city are on the same few disk pages. Or am I misunderstanding how GIT works?
Ron Mayer wrote: > One use case that I think GIT would help a lot with are my > large address tables that are clustered by zip-code but > often queried by State, City, County, School District, > Police Beat, etc. Yep, GIT would shrink the index on zip-code tremendously... > I imagine a GIT index on "state" would just occupy > a couple pages at most regardless of how large the > table gets. .. Not quite that much, though. GIT still stores one index pointer per heap page even on a fully clustered table. Otherwise it's not much good for searches. > And likewise, even an index on City > would be orders of magnitude smaller than the existing > ones; since all records for any given city are all > on the same few disk pages. Yep, it would help with that as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
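[Editor's note: Heikki's one-pointer-per-heap-page point can be put into rough numbers. A back-of-the-envelope sketch in C, using made-up figures (10M rows, ~50 rows per heap page) rather than anything measured.]

```c
#include <assert.h>

/* A plain btree keeps one entry per heap row. */
static long btree_entries(long rows)
{
    return rows;
}

/* GIT on a fully clustered table keeps roughly one entry per heap
 * page (rounded up), regardless of how many distinct key values
 * those rows carry. */
static long git_entries(long rows, long rows_per_page)
{
    return (rows + rows_per_page - 1) / rows_per_page;
}
```

So the index shrinks by about the rows-per-page factor (~50x here), not to "a couple of pages" as Ron hoped, but still a large saving.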
I looked over the issue of making the regular-indexscan code path able to handle runtime determination of operator lossiness. Here's my implementation plan:

* Extend the amgettuple API to allow a boolean recheck flag to be passed back.

* Remove index_getnext_indexitem, which is dead code and has been for awhile. (It seems to have been put in to help gist and hash vacuuming, which before that were using index_getnext loops and thus thrashing the heap on every call ... but they were both rewritten entirely since then.)

* Have index_getnext() just pass the recheck flag back to its caller, without doing any extra work. The existing callers are nodeIndexscan.c, CLUSTER, systable_getnext, GetNewOidWithIndex (which should use systable_getnext), and about ten places that use index_getnext directly, instead of going through systable_getnext, because they rely on seeing ordered results, which systable_getnext doesn't promise.

* nodeIndexscan can execute rechecks using the indexqualorig expressions, which it has set up and ready to go already, so there's basically no added overhead for non-lossy cases.

* CLUSTER doesn't pass any scankeys, so it should never see recheck=true, and can/should error out.

* systable_getnext and the other direct callers are only used with system catalogs, which should never have lossy operators anyway, so for the moment we can just have them error out on recheck=true. But I think it's important to have a plan for fixing that later. We know that these functions are used only with "simple" scan keys (no index expressions), so it would be easy enough to apply the recheck using the scan keys ... if we only had the heap attnums at hand rather than the index attnums. In the case of systable_getnext it'd be very simple to extend systable_beginscan and the SysScanDesc data structure to preserve the heap attnums.
What I propose doing about the other direct callers is to make a variant of the systable scan code that *does* promise to preserve ordering, but otherwise has a similar API to systable_beginscan/systable_getnext. If they all go through that then we'll have a single place to fix to support lossy results. * As discussed yesterday, eliminate fixed RECHECK markings on operators, and change the API for GIST/GIN "consistent" functions to pass back a recheck indicator. * The planner will no longer care whether an index operator is lossy. This means that as far as the planner is concerned there isn't any strong reason for fix_indexqual_references to call get_op_opfamily_properties --- the only reason it's doing that is to build the indexstrategy and indexsubtype lists to attach to the generated indexscan plan node. It's tempting to drop those lists from the plan tree altogether and instead do the get_op_opfamily_properties lookup during ExecIndexBuildScanKeys. For a plan executed only once, this would be a clear win --- it's the same total number of catcache lookups and we avoid some palloc overhead to construct the lists. For a plan that's cached and reused many times, at some point the extra lookups would start to cost more than the list storage costs. But it's not like executor startup doesn't involve a lot of catcache fetches anyway. Any thoughts about that tradeoff? BTW, it occurs to me that this'll make it trivially easy to experiment with hash indexes that only store the hash values ... they'll just always set the recheck flag to true. regards, tom lane
Heikki Linnakangas wrote: > Ron Mayer wrote: >> One use case that I think GIT would help a lot with are my >> large address tables that are clustered by zip-code but >> often queried by State, City, County, School District, >> Police Beat, etc. > >> I imagine a GIT index on "state" would just occupy >> a couple pages at most regardless of how large the >> table gets. > > .. Not quite that much, though. GIT still stores one index pointer per > heap page even on a fully clustered table. Otherwise it's not much good > for searches. Then I wonder if I can conceive of yet another related index type that'd be useful for such clustered tables. If I had something like GIT that stored something like "values State='CA' can be found on pages 1000 through 10000 and 20000 through 21000" would it be even more effective on such a table than GIT? If so, it seems it'd give many advantages that partitioning by state could give (at least for querying).
Ron Mayer wrote: > Then I wonder if I can conceive of yet another related > index type that'd be useful for such clustered tables. If > I had something like GIT that stored something like > "values State='CA' can be found on pages 1000 through 10000 > and 20000 through 21000" would it be even more effective on > such a table than GIT? Yep, a bitmap index. Bitmap indexes don't actually work quite like that, but the point is that it is very efficient on a column with few distinct values, and even more so if the table is clustered on that column. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
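[Editor's note: a toy illustration of why a bitmap index stays small on a clustered, low-cardinality column — the per-page match bitmap collapses into a few contiguous runs, which compress to almost nothing. This run counter only sketches the idea; it is not how an actual bitmap index, or PostgreSQL's TIDBitmap, stores data.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count contiguous runs of matching pages in a per-page bitmap.
 * On a table clustered by the indexed column, all the pages for a
 * given value sit together, so the number of runs stays tiny no
 * matter how big the table gets. */
static int count_runs(const bool *page_matches, size_t npages)
{
    int runs = 0;
    for (size_t i = 0; i < npages; i++)
        if (page_matches[i] && (i == 0 || !page_matches[i - 1]))
            runs++;
    return runs;
}
```

In Ron's example, "State='CA' is on pages 1000-10000 and 20000-21000" is two runs, however many pages that spans.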
On Wed, 2008-04-09 at 20:30 -0400, Tom Lane wrote:
> * GIT (Grouped Index Tuple) indexes, which achieve index space savings
> in btrees by having a single index tuple represent multiple heap tuples
> (on a single heap page) containing a range of key values. I am not sure
> what the development status is --- Heikki had submitted a completed
> patch but there seemed to be agreement on making changes, and that's not
> been done AFAIK. The really serious problem I've got with it is that
> it'd foreclose the possibility of returning actual index keys from btree
> indexes, thus basically killing the usefulness of that idea. I'm not
> convinced it would offer enough gain to be worth paying that price.
> Another issue is that we'd need to check how much of the use-case for
> GIT has been taken over by HOT.

That seems to be a misunderstanding about HOT and GIT. HOT is an important requirement for GIT, but other than that, they are unrelated. Testing in 2006/2007 showed that HOT stabilised the effects of repeated updates, which then showed as a "gain" in performance. But GIT did show considerable actual performance gains in its target use case.

GIT significantly reduces the size of clustered indexes, greatly improving the number of index pointers that can be held in memory for very large indexes. That translates directly into a reduction in I/O for large databases on typical hardware, for primary operations, file backups and recovery (and thus, log replication). Test results validated that and showed increased performance, over and above that experienced with HOT, when tested together.

Now there may be problems with the GIT code as it stands, but we should acknowledge that the general technique has been proven to improve performance on a recent PostgreSQL codebase. This is an unsurprising result, since SQLServer, Sybase, DB2, Oracle and Teradata (at least) all use indexes of this category to improve real-world performance. The idea is definitely not a benchmark-only feature.
Many users would be very interested if we could significantly reduce the size of the main index on their largest tables. I would at least like to see clustered indexes acknowledged as a TODO item, so we keep the door open for a future implementation based around the basic concept of GIT. Nobody is going to waste their time flogging a dead horse, which is why the patch isn't ready. Maybe *that* horse is dead, not really for me to say, but if we can at least agree on a basic statement that equine animals are fast we may find a rider willing to invest time in them. I don't see the "returns index keys" idea as being killed by or killing this concept. Returning keys is valid and useful when we can, but there are other considerations that, in some use cases, will be a dominant factor. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs <simon@2ndquadrant.com> writes: > I don't see the "returns index keys" idea as being killed by or killing > this concept. Returning keys is valid and useful when we can, but there > are other considerations that, in some use cases, will be a dominant > factor. The patch as-submitted was a killer for the concept, because it automatically discarded information and there was no way to prevent that. To be acceptable, a GIT patch would have to be optional and it would have to expose in the catalogs whether a given index was lossy in this way or not (so that the planner could know whether a plan based on returning index keys would work). regards, tom lane
On Wed, 2008-04-23 at 12:07 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I don't see the "returns index keys" idea as being killed by or killing > > this concept. Returning keys is valid and useful when we can, but there > > are other considerations that, in some use cases, will be a dominant > > factor. > > The patch as-submitted was a killer for the concept, because it > automatically discarded information and there was no way to prevent > that. Understood. > To be acceptable, a GIT patch would have to be optional and it > would have to expose in the catalogs whether a given index was lossy > in this way or not (so that the planner could know whether a plan based > on returning index keys would work). Would you see it as a separate index type, or a modification of the b-tree (with option enabled via a "storage parameter")? If it was the latter, then perhaps there could be a future for the GIT patch after all. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2008-04-23 at 12:07 -0400, Tom Lane wrote: >> To be acceptable, a GIT patch would have to be optional and it >> would have to expose in the catalogs whether a given index was lossy >> in this way or not (so that the planner could know whether a plan based >> on returning index keys would work). > Would you see it as a separate index type, or a modification of the > b-tree (with option enabled via a "storage parameter")? If it was the > latter, then perhaps there could be a future for the GIT patch after > all. Hmm, well, separate index type doesn't seem real nice because most all the places that currently know special things about btree would need to be hacked to recognize the other type too; plus we'd have to double all the btree entries in pg_amop/pg_amproc/pg_opclass/pg_opfamily. I think storage parameter is no good also, given the current design that assumes those can be changed on-the-fly. It'd be okay to GIT-ify an existing index, perhaps, but not the other way round. I was considering a new pg_index column. Or else we'd have to fix the storage-parameter infrastructure to support restricting changes of some parameters. regards, tom lane
On Wed, 2008-04-23 at 14:06 -0400, Tom Lane wrote: > I think storage parameter is no good also, given the current design that > assumes those can be changed on-the-fly. It'd be okay to GIT-ify an > existing index, perhaps, but not the other way round. > > I was considering a new pg_index column. Or else we'd have to fix > the storage-parameter infrastructure to support restricting changes > of some parameters. The latter seems best, since we're likely to need to do it anyway one day. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > GIT significantly reduces the size of clustered indexes, greatly > improving the number of index pointers that can be held in memory for > very large indexes. That translates directly into a reduction in I/O for > large databases on typical hardware, for primary operations, file > backups and recovery (and this, log replication). Test results validated > that and showed increased performance, over and above that experienced > with HOT, when tested together. > > Now there may be problems with the GIT code as it stands, but we should > acknowledge that the general technique has been proven to improve > performance on a recent PostgreSQL codebase. This is an unsurprising > result, since SQLServer, Sybase, DB2, Oracle and Teradata (at least) all > use indexes of this category to improve real-world performance. The idea > is definitely not a benchmark-only feature. > > Many users would be very interested if we could significantly reduce the > size of the main index on their largest tables. Yes, basically GIT allows index compression for clustered tables, and stated that way it is clear it would be a big feature if we had it for 8.4. > I would at least like to see clustered indexes acknowledged as a TODO > item, so we keep the door open for a future implementation based around > the basic concept of GIT. Yes, it is already a TODO: * Consider compressing indexes by storing key values duplicated in several rows as a single index entry -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On Wed, 2008-04-23 at 12:07 -0400, Tom Lane wrote:
>>> To be acceptable, a GIT patch would have to be optional and it
>>> ...
> I was considering a new pg_index column. Or else we'd have to fix
> the storage-parameter infrastructure to support restricting changes
> of some parameters.

Interesting. Does this mean that down the road a postgis index might be GIT-ified? On my biggest tables (clustered by zip/postal code), the index on the geometry column with a postgis index probably sees all the rows on each of its pages pointing to the same few heap pages. If I understand right, that's the kind of pattern that GIT helps.
> Interesting. Does this mean that down the road a postgis index
> might be GIT-ified?

Only if GiST supports GIT, but I haven't followed this discussion. In any case, GIT on GiST would not be as effective as on btree, because GiST doesn't guarantee that all close values will be close together in the index: it's possible to have several clusters (groups) of close values in different parts of the index.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                   WWW: http://www.sigaev.ru/
On Wed, 2008-04-23 at 22:26 -0400, Bruce Momjian wrote: > Simon Riggs wrote: > > GIT significantly reduces the size of clustered indexes, greatly > > improving the number of index pointers that can be held in memory for > > very large indexes. That translates directly into a reduction in I/O for > > large databases on typical hardware, for primary operations, file > > backups and recovery (and this, log replication). Test results validated > > that and showed increased performance, over and above that experienced > > with HOT, when tested together. > > > > Now there may be problems with the GIT code as it stands, but we should > > acknowledge that the general technique has been proven to improve > > performance on a recent PostgreSQL codebase. This is an unsurprising > > result, since SQLServer, Sybase, DB2, Oracle and Teradata (at least) all > > use indexes of this category to improve real-world performance. The idea > > is definitely not a benchmark-only feature. > > > > Many users would be very interested if we could significantly reduce the > > size of the main index on their largest tables. > > Yes, basically GIT allows index compression for clustered tables, and > stated that way it is clear it would be a big feature if we had it for > 8.4. > > I would at least like to see clustered indexes acknowledged as a TODO > > item, so we keep the door open for a future implementation based around > > the basic concept of GIT. > > Yes, it is already a TODO: > > * Consider compressing indexes by storing key values duplicated in > several rows as a single index entry That sounds similar, but definitely does not describe what GIT does. GIT holds one index pointer for a *range* of values held on one block. e.g. if we have a table with values 1-200 in it then GIT would store *one* index pointer for all 200 rows, even though the values are all different. 
That property makes it very different from the TODO item as stated, since GIT can work very well on unique indexes, which clearly do not have duplicated keys.

I'd prefer the term "Clustered Indexes" rather than the name GIT, which is also a slang insult.

Index compression is possible in many ways, depending upon the situation. All of the following sound similar at a high level, but each covers a different use case:

* For long, similar data, e.g. text, we can use Prefix Compression. We still store one pointer per row, but we reduce the size of the index by reducing the size of the key values. This requires us to reach inside datatypes, so it isn't a very general solution, but is probably an important one in the future for text.

* For unique/nearly-unique indexes we can use Range Compression. We reduce the size of the index by holding one index pointer per range of values, thus removing both keys and pointers. It's more efficient than prefix compression and isn't datatype-dependent.

* For highly non-unique data we can use Duplicate Compression. This is the technique used by bitmap indexes. Efficient, but not useful for unique/nearly-unique data.

* Multi-Column Leading Value Compression: if you have a multi-column index, then leading columns are usually duplicated between rows inserted at the same time. Using an on-block dictionary we can remove the duplicates. Only useful for multi-column indexes; possibly an overlapping/contained subset of the GIT use case.

As with HOT, all of these techniques need to be controlled by heuristics. Taken to extremes, these techniques can hurt other aspects of performance, so it's important that we don't just write the code but also investigate reasonable limits for behaviour.

-- 
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
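[Editor's note: the prefix-compression idea in Simon's first bullet can be sketched as follows. This ignores real btree constraints (page boundaries, locales, keeping leaf pages independently decodable) and just shows the core arithmetic; the helper names are invented.]

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* How many leading bytes does `cur` share with its predecessor
 * `prev`?  On sorted keys, adjacent entries tend to share long
 * prefixes, which is what makes the scheme pay off. */
static size_t shared_prefix(const char *prev, const char *cur)
{
    size_t n = 0;
    while (prev[n] != '\0' && prev[n] == cur[n])
        n++;
    return n;
}

/* Bytes needed to store `cur` given `prev`: one count byte for the
 * shared-prefix length, plus the differing suffix. */
static size_t compressed_len(const char *prev, const char *cur)
{
    return 1 + strlen(cur + shared_prefix(prev, cur));
}
```

So in a sorted run like "application", "applied", "apply", each key after the first costs only a count byte plus a short suffix.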
On Wed, 2008-04-23 at 19:43 -0700, Ron Mayer wrote: > Tom Lane wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > >> On Wed, 2008-04-23 at 12:07 -0400, Tom Lane wrote: > >>> To be acceptable, a GIT patch would have to be optional and it > >>> ... > > I was considering a new pg_index column. Or else we'd have to fix > > the storage-parameter infrastructure to support restricting changes > > of some parameters. > > > Interesting. Does this mean that down the road a postgis index > might be GIT-ified? Yes, all types of index can benefit from some form of compression. The only issue is that there isn't one technique that covers all use cases, so it will take time and we should be careful to do the most important ones first. > On my biggest tables (clustered by zip/postal code) the index on > the geometry column with a postgis index probably sees all the > rows on each of it's pages pointing to the same few heap pages. > If I understand right, that's the kind of pattern that GIT helps. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Thu, Apr 24, 2008 at 10:11:02AM +0100, Simon Riggs wrote:
> Index compression is possible in many ways, depending upon the
> situation. All of the following sound similar at a high level, but each
> covers a different use case.

True, but there is one significant difference:

> * For Long, Similar data e.g. Text we can use Prefix Compression
> * For Highly Non-Unique Data we can use Duplicate Compression
> * Multi-Column Leading Value Compression - if you have a multi-column

These are all not lossy, and so are candidates for use on any b-tree, even by default. They don't affect plan construction materially, except perhaps in cost calculations. Given the index tuple overhead, I don't see how you could lose.

> * For Unique/nearly-Unique indexes we can use Range Compression

This one is lossy and so does affect possible plans. I believe the term for this is a "sparse" index. Also, it seems harder to optimise, since it would seem to me the index would have to have some idea of how "close" values are together.

> As with HOT, all of these techniques need to be controlled by
> heuristics. Taken to extremes, these techniques can hurt other aspects
> of performance, so its important we don't just write the code but also
> investigate reasonable limits for behaviour also.

For the first three I can't imagine what the costs would be, except the memory usage to store the unduplicated data once when it's read in.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Thu, 2008-04-24 at 13:13 +0200, Martijn van Oosterhout wrote:
> On Thu, Apr 24, 2008 at 10:11:02AM +0100, Simon Riggs wrote:
> > Index compression is possible in many ways, depending upon the
> > situation. All of the following sound similar at a high level, but each
> > covers a different use case.
>
> True, but there is one significant difference:
>
> > * For Long, Similar data e.g. Text we can use Prefix Compression
> > * For Highly Non-Unique Data we can use Duplicate Compression
> > * Multi-Column Leading Value Compression - if you have a multi-column
>
> These are all not lossy and so are candidate to use on any b-tree even
> by default. They don't affect plan-construction materially, except
> perhaps in cost calculations. Given the index tuple overhead I don't
> see how you could lose.

True, but it is important that this is not a discussion about the one best technique for index compression. I think there are many techniques, covering many use cases. If we think there is only one, the different use cases get thrown out the door because "there was no consensus". If the TODO list does not say clearly what we want, it will never happen. We know no sponsor will pay for an idea that didn't make the list. Worse, they will pay for what is on the list, even if it wasn't exactly what we wanted.

> > * For Unique/nearly-Unique indexes we can use Range Compression
>
> This one is lossy and so does affect possible plans. I beleive the term
> for this is "sparse" index. Also, it's seems harder to optimise, since
> it would seem to me the index would have to have some idea on how
> "close" values are together.

Value-lossy, true, but we already support lossy searches with BitmapHeapScans, so let's not be fooled by how that word sounds. That was Tom's original complaint against Block Indexes in Sep 2006, which is why the non-lossy GIT design was conceived. GIT knows exactly which tuples are indexed, and which are not.
There is no need for the index to know how close the values are. Only that values in that range live on that block. > > As with HOT, all of these techniques need to be controlled by > > heuristics. Taken to extremes, these techniques can hurt other aspects > > of performance, so its important we don't just write the code but also > > investigate reasonable limits for behaviour also. > > For the first three I can't imagine what the costs would be, except the > memory usage to store the unduplicated data once when its read in. > > Have a nice day, -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
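[Editor's note: Simon's range compression can be sketched as a toy sparse index — one entry per heap block holding the lowest key on that block, so a probe narrows the search to one candidate block, which must then be scanned for exact matches (the value-lossy part). All names and the layout are invented for illustration.]

```c
#include <assert.h>
#include <stddef.h>

/* One sparse-index entry: the minimum key found on a heap block. */
typedef struct { int min_key; int block; } RangeEntry;

/* Entries are sorted by min_key.  Return the block whose key range
 * covers `key` on a clustered table, or -1 if the key precedes the
 * first entry.  Note the index never records exactly which values
 * are on the block, only that this range lives there. */
static int candidate_block(const RangeEntry *ix, size_t n, int key)
{
    int blk = -1;
    for (size_t i = 0; i < n && ix[i].min_key <= key; i++)
        blk = ix[i].block;
    return blk;
}
```

This shows Simon's point: the index needs no notion of how "close" values are, only that values in a given range live on a given block.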
Simon Riggs <simon@2ndquadrant.com> wrote:
> * For Highly Non-Unique Data we can use Duplicate Compression
> The latter is the technique used by Bitmap Indexes. Efficient, but not
> useful for unique/nearly-unique data

I heard that GIN already has a duplicate-compression feature.

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin
| Gin consists of a B-tree index constructed over entries (ET, entries tree),
| where each entry is an element of the indexed value (element of array,
| lexeme for tsvector) and where each tuple in a leaf page is either a
| pointer to a B-tree over item pointers (PT, posting tree), or a list of
| item pointers (PL, posting list) if the tuple is small enough.

If GIT comes, can we merge or share some modules between btree and GIN? I guess the page layout of GIT is better than the ET/PT pair when the index size is larger than main memory, because the keys and item pointers are placed in nearby pages. GIN-over-btree might be useful for some usages of inverted indexes.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
"Martijn van Oosterhout" <kleptog@svana.org> writes:
> On Thu, Apr 24, 2008 at 10:11:02AM +0100, Simon Riggs wrote:
>> Index compression is possible in many ways, depending upon the
>> situation. All of the following sound similar at a high level, but each
>> covers a different use case.
>
> True, but there is one significant difference:
>
>> * For Long, Similar data e.g. Text we can use Prefix Compression
>> * For Highly Non-Unique Data we can use Duplicate Compression
>> * Multi-Column Leading Value Compression - if you have a multi-column
>
> These are all not lossy and so are candidates for use on any b-tree, even
> by default. They don't affect plan construction materially, except
> perhaps in cost calculations. Given the index tuple overhead I don't
> see how you could lose.

The problem is that while it's easy to see what to do for text (and
even then perhaps not so easy in non-C locales), Postgres's
extensibility makes it quite tricky to see what to do from there.
Perhaps we need a btree support procedure which takes two parameters, a
"parent" and a "child", and returns a compressed value. There would
have to be a second procedure to decompress values given their full
parent.

There are a lot of tricky bits here, like: what do you do on leaf
pages? You have to be able to follow leaf pages down the chain without
consulting their "parent".

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
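Greg's proposed compress/decompress support-procedure pair might have a contract along these lines -- a Python sketch of the idea only, with invented names, and nothing like an actual btree support function signature:

```python
import os


def compress(parent: str, child: str):
    """Store the child as (shared-prefix length, remaining suffix)."""
    n = len(os.path.commonprefix([parent, child]))
    return (n, child[n:])


def decompress(parent: str, compressed):
    """Rebuild the child, which requires having the full parent value."""
    n, suffix = compressed
    return parent[:n] + suffix


c = compress("application", "applied")          # -> (5, "ed")
roundtrip = decompress("application", c)        # -> "applied"
```

The sketch also makes Greg's leaf-page objection visible: `decompress` needs the parent value, so any page that must be readable on its own -- a leaf page reached via the right-link chain, say -- has to store enough uncompressed context to bootstrap decompression.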
Simon Riggs wrote:
> > > Many users would be very interested if we could significantly reduce the
> > > size of the main index on their largest tables.
> >
> > Yes, basically GIT allows index compression for clustered tables, and
> > stated that way it is clear it would be a big feature if we had it for
> > 8.4.
>
> > > I would at least like to see clustered indexes acknowledged as a TODO
> > > item, so we keep the door open for a future implementation based around
> > > the basic concept of GIT.
> >
> > Yes, it is already a TODO:
> >
> >     * Consider compressing indexes by storing key values duplicated in
> >       several rows as a single index entry
>
> That sounds similar, but definitely does not describe what GIT does.
>
> GIT holds one index pointer for a *range* of values held on one block,
> e.g. if we have a table with values 1-200 in it then GIT would store
> *one* index pointer for all 200 rows, even though the values are all
> different.
>
> That property makes it very different from the TODO item stated, since
> GIT can work very well on unique indexes, which clearly do not have
> duplicated keys.

Agreed, TODO updated:

* Consider smaller indexes that record a range of values per heap page,
  rather than having one index entry for every heap row

  This is useful if the heap is clustered by the indexed values.

  http://archives.postgresql.org/pgsql-hackers/2006-12/msg00341.php
  http://archives.postgresql.org/pgsql-hackers/2007-02/msg01264.php
  http://archives.postgresql.org/pgsql-hackers/2007-03/msg00465.php
  http://archives.postgresql.org/pgsql-patches/2007-03/msg00163.php
  http://archives.postgresql.org/pgsql-hackers/2007-08/msg00014.php
  http://archives.postgresql.org/pgsql-hackers/2007-08/msg00487.php
  http://archives.postgresql.org/pgsql-hackers/2008-04/msg01589.php

> I'd prefer the term "Clustered Indexes" rather than the name GIT, which
> is also a slang insult.

I try to avoid technical terms that might not be familiar, and instead
describe exactly what is requested.
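The updated TODO item's idea -- one index entry covering a range of values on one heap page -- can be sketched like this (Python, illustrative only; the names and structure are invented, not GIT's actual design):

```python
import bisect


class GroupedIndex:
    """One (first_key, heap_block) entry per heap page of a clustered table."""

    def __init__(self):
        self.groups = []  # kept sorted by first_key

    def add_group(self, first_key, heap_block):
        bisect.insort(self.groups, (first_key, heap_block))

    def probe(self, key):
        # Find the last group whose first_key <= key; if the key exists
        # at all, its row must be on that group's heap page.
        i = bisect.bisect_right(self.groups, (key, float("inf"))) - 1
        return None if i < 0 else self.groups[i][1]


git = GroupedIndex()
git.add_group(1, heap_block=0)    # keys 1..200 live on heap block 0
git.add_group(201, heap_block=1)  # keys 201.. live on heap block 1

block = git.probe(150)  # directs the scan to heap block 0
```

Two index entries stand in for hundreds of per-row pointers here, which is the space saving -- and it works on a unique index, since nothing about it depends on duplicated keys. The cost is that the probe only names a heap page; the page itself must be scanned for the matching row.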
> Index compression is possible in many ways, depending upon the
> situation. All of the following sound similar at a high level, but each
> covers a different use case.
>
> * For Long, Similar data e.g. Text we can use Prefix Compression
> We still store one pointer per row, but we reduce the size of the index
> by reducing the size of the key values. This requires us to reach inside
> datatypes, so it isn't a very general solution, but it is probably an
> important one in the future for Text.
>
> * For Unique/nearly-Unique indexes we can use Range Compression
> We reduce the size of the index by holding one index pointer per range
> of values, thus removing both keys and pointers. It's more efficient
> than prefix compression and isn't datatype-dependent.
>
> * For Highly Non-Unique Data we can use Duplicate Compression
> This is the technique used by Bitmap Indexes. Efficient, but not
> useful for unique/nearly-unique data.
>
> * Multi-Column Leading Value Compression - if you have a multi-column
> index, then leading columns are usually duplicated between rows inserted
> at the same time. Using an on-block dictionary we can remove duplicates.
> Only useful for multi-column indexes; possibly an overlapping/contained
> subset of the GIT use case.

Should any of these others be TODOs?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
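The fourth technique above -- an on-block dictionary for duplicated leading columns -- might look like this toy sketch (Python, names and layout invented for illustration, not any proposed on-disk format):

```python
def compress_page(tuples):
    """Replace duplicated leading-column values with dictionary indexes.

    tuples: list of (leading_col, trailing_col, tid), sorted by key,
    as they would appear on one index page.
    """
    dictionary, out = [], []
    for lead, trail, tid in tuples:
        if not dictionary or dictionary[-1] != lead:
            dictionary.append(lead)  # first time we see this leading value
        out.append((len(dictionary) - 1, trail, tid))  # dict index, not value
    return dictionary, out


d, rows = compress_page([
    ("2008-04-24", "a", 1),
    ("2008-04-24", "b", 2),   # leading value repeated -> stored once
    ("2008-04-25", "a", 3),
])
```

Because the dictionary lives on the same block as the compressed tuples, the page stays self-contained -- no parent lookup is needed to decode it, sidestepping the leaf-page problem raised for prefix compression.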
On Thu, 2008-04-24 at 11:43 -0400, Bruce Momjian wrote:
> > Index compression is possible in many ways, depending upon the
> > situation. All of the following sound similar at a high level, but each
> > covers a different use case.
> >
> > * For Long, Similar data e.g. Text we can use Prefix Compression
> > We still store one pointer per row, but we reduce the size of the index
> > by reducing the size of the key values. This requires us to reach inside
> > datatypes, so isn't a very general solution but is probably an important
> > one in the future for Text.
> >
> > * For Unique/nearly-Unique indexes we can use Range Compression
> > We reduce the size of the index by holding one index pointer per range
> > of values, thus removing both keys and pointers. It's more efficient
> > than prefix compression and isn't datatype-dependent.
> >
> > * For Highly Non-Unique Data we can use Duplicate Compression
> > The latter is the technique used by Bitmap Indexes. Efficient, but not
> > useful for unique/nearly-unique data
> >
> > * Multi-Column Leading Value Compression - if you have a multi-column
> > index, then leading columns are usually duplicated between rows inserted
> > at the same time. Using an on-block dictionary we can remove duplicates.
> > Only useful for multi-column indexes, possibly overlapping/contained
> > subset of the GIT use case.
>
> Should any of these others be TODOs?

Possibly, though I think we have the priority items covered now.

Greg has commented on this thread about the difficulties inherent in
approach 1. I've previously argued that technique 1 can be handled much
better by having "implied functional index use", e.g. an index can be
defined on SUBSTR(col, 10) rather than on the whole column, yet still
be used automatically without needing to mention SUBSTR. Link: Hackers,
11 Jul 2006.

There's a variety of ways to improve multi-column indexes beyond what
we currently support, though those haven't ever been discussed on
hackers to my knowledge.

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
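Simon's "implied functional index use" can be mimicked in miniature (Python, purely a sketch of the concept -- the functions, the prefix length, and the recheck step are all invented for illustration): index only a prefix of each value, probe with the query value's prefix, then recheck full values against the heap:

```python
def build_prefix_index(rows, n=10):
    """Index only the first n characters of each value (the SUBSTR idea)."""
    idx = {}
    for tid, value in rows:
        idx.setdefault(value[:n], []).append(tid)
    return idx


def lookup(idx, rows, value, n=10):
    """Answer an equality query on the full column via the prefix index."""
    candidates = idx.get(value[:n], [])
    heap = dict(rows)
    # The index is lossy over the trailing characters, so recheck the
    # full value against the heap, as the planner would have to arrange.
    return [tid for tid in candidates if heap[tid] == value]


rows = [(1, "abcdefghijKLM"), (2, "abcdefghijXYZ")]
idx = build_prefix_index(rows)
match = lookup(idx, rows, "abcdefghijXYZ")  # both rows are candidates
```

The point of "implied" use is that the query mentions only the plain column; the planner, not the user, would supply the prefix probe and the recheck.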
On Apr 24, 2008, at 10:43 AM, Bruce Momjian wrote:

Bruce asked if these should be TODOs...

>> Index compression is possible in many ways, depending upon the
>> situation. All of the following sound similar at a high level, but each
>> covers a different use case.
>>
>> * For Long, Similar data e.g. Text we can use Prefix Compression
>> We still store one pointer per row, but we reduce the size of the index
>> by reducing the size of the key values. This requires us to reach inside
>> datatypes, so isn't a very general solution but is probably an important
>> one in the future for Text.

I think what would be even more useful is doing this within the table
itself, and then bubbling that up to the index.

>> * For Unique/nearly-Unique indexes we can use Range Compression
>> We reduce the size of the index by holding one index pointer per range
>> of values, thus removing both keys and pointers. It's more efficient
>> than prefix compression and isn't datatype-dependent.

Definitely.

>> * For Highly Non-Unique Data we can use Duplicate Compression
>> The latter is the technique used by Bitmap Indexes. Efficient, but not
>> useful for unique/nearly-unique data

Also definitely. This would be hugely useful for things like "status"
or "type" fields.

>> * Multi-Column Leading Value Compression - if you have a multi-column
>> index, then leading columns are usually duplicated between rows inserted
>> at the same time. Using an on-block dictionary we can remove duplicates.
>> Only useful for multi-column indexes, possibly overlapping/contained
>> subset of the GIT use case.

Also useful, though I generally try to put the most diverse values
first in indexes to increase the odds of them being used. Perhaps if we
had compression this would change.

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Well, for these two:

> >> * For Highly Non-Unique Data we can use Duplicate Compression
> >> * Multi-Column Leading Value Compression - if you have a multi-column

You don't need any new logic at all. If _bt_compare says they're equal
you can compact them. The long-similar case is harder; ISTM there are
better ways to handle that anyway (tsearch? hashing?).

> There are a lot of tricky bits here, like what do you do on leaf pages? You
> have to be able to follow leaf pages down the chain without consulting their
> "parent".

I was thinking the compression would only occur on a single page;
anything else seems to be asking for trouble. Basically, on a single
page you compact:

  A B -> 2
  A B -> 3
  A C -> 4
  A D -> 1
  B F -> 2

to:

  A B -> [2, 3]
  (1) C -> 4
  (1) D -> 1
  B F -> 2

Within the context of a single page it's perfectly clear.

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
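The duplicate-merging half of Martijn's single-page compaction is mechanical enough to sketch directly (Python, illustrative only; real code would use _bt_compare rather than `==`, and a real page layout rather than tuples):

```python
def compact(page):
    """Merge adjacent entries with equal keys into one posting-list entry.

    page: list of (key, pointer) pairs in sorted order, as on one page.
    """
    out = []
    for key, ptr in page:
        if out and out[-1][0] == key:
            out[-1][1].append(ptr)   # duplicate key: extend the pointer list
        else:
            out.append([key, [ptr]])
    return out


page = [(("A", "B"), 2), (("A", "B"), 3), (("A", "C"), 4),
        (("A", "D"), 1), (("B", "F"), 2)]
compacted = compact(page)  # ("A", "B") collapses to one entry
```

The leading-value step in Martijn's example -- rewriting the repeated "A" as a back-reference like "(1)" -- would be a second pass over the same page, again needing nothing outside it.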
Martijn van Oosterhout <kleptog@svana.org> writes:
> Well, for these two:
>> * For Highly Non-Unique Data we can use Duplicate Compression
>> * Multi-Column Leading Value Compression - if you have a multi-column
> You don't need any new logic at all. If _bt_compare says they're equal
> you can compact them.

Not if you'd like to support index key retrieval. (The equality
function that btree is using doesn't necessarily mean "equal for all
purposes".)

			regards, tom lane