Thread: Index AM change proposals, redux
I just finished looking through the various threads about index AM changes that Bruce has been holding onto in the commit-fest queue. There seem to be the following issues:

* Proposed change to have amgetmulti return its results by ORing bits into a caller-supplied TIDBitmap, rather than the current interface involving an intermediate TID array. The TID-array design was intended to be agnostic about what would happen on either side of the API, but it seems that it's really just equally inconvenient for everybody :-(.

Although the original motivation for this was to open a door for bitmap indexes, it seems to me it's worth doing whether or not bitmap indexes ever happen. It's logically simpler (only one amgetmulti call per scan) and should save at least a few cycles. I recommend we just go ahead with this.

* Proposed change to let both amgetnext and amgetmulti mark returned tuples as "candidate matches", that is in need of rechecking of quals against the real heap tuple. I had originally objected to this on the grounds that it would require setup work that doesn't happen now, but looking at the code I see that's wrong --- the setup work *does* happen anyway, see indexqualorig. nodeIndexscan.c comments "The indexqualorig expression is always initialized even though it will only be used in some uncommon cases --- would be nice to improve that". So this complaint is probably a red herring. I'm still a bit concerned by the fact that the patch only tracks this on a page basis in the amgetmulti case: if any of the tuples on a page are in-doubt then everything on the page will have to be rechecked. Is that good enough?

A bigger issue is whether this is worth applying when nobody seems to be working on either of the main uses for it (bitmap indexes and GIT indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.
* Proposed change to let amgetnext mark returned tuples as not necessarily in order, requiring somebody downstream to sort them again. I was pretty desperately unhappy with that because it required injection of sorting knowledge into code that really shouldn't have anything to do with it --- not least because not all indexscans even have a defined ordering. Given that it is only needed for one particular possible implementation of GIT, and Heikki was leaning away from that implementation anyway at last report, I vote against considering this any further.

* Proposed changes to refactor the TIDBitmap support to include a concept of a "stream bitmap" being read from disk. (Not strictly an index AM change, but closely related.) While this is clean and nice on its own, it doesn't seem to have any use unless bitmap indexes happen. If we leave the code sit it'll probably acquire bit rot, but I'm disinclined to add a bunch of unused code, too. Thoughts?

* Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry. Since certain types of indexes (GIN now, possibly hash in future) are incapable of doing this, there'd need to be a way of marking index AMs (or perhaps individual indexes) as able or not able to return their keys, so that the planner could know whether quals could be pushed into the indexscan.

We don't have any actual submitted patch for this, but AFAIR everyone agrees it's a good idea.

* Bitmap indexes themselves. As far as I can tell, this development effort has stalled because of intractable problems with VACUUM. Can anyone provide a status update on that?

* GIT (Grouped Index Tuple) indexes, which achieve index space savings in btrees by having a single index tuple represent multiple heap tuples (on a single heap page) containing a range of key values.
I am not sure what the development status is --- Heikki had submitted a completed patch but there seemed to be agreement on making changes, and that's not been done AFAIK. The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price. Another issue is that we'd need to check how much of the use-case for GIT has been taken over by HOT.

* "Thick" indexes containing copies of tuple visibility info. As far as I'm concerned, this patch isn't going anywhere :-(. It's got the same showstopper problem as retail vacuum and some of the earlier forms of the HOT patch: it assumes that during tuple update or delete, it can re-find the tuple's index entries by re-computing the indexed expressions. That's just not reliable enough.

Is that a reasonable summary of where we are?

			regards, tom lane
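The first proposal above, replacing the chunked TID-array interface with a single call that ORs everything into a caller-supplied TIDBitmap, can be sketched as a toy model. All type and function names below are simplified stand-ins for the real PostgreSQL structures, not the actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a TIDBitmap: one membership flag per (toy) tuple id. */
#define MAX_TIDS 64

typedef struct TIDBitmap {
    bool member[MAX_TIDS];
    int  nmembers;
} TIDBitmap;

typedef struct IndexScan {
    const int *matches;      /* tuple ids the index says match */
    int nmatches;
    int pos;                 /* cursor used by the chunked interface */
} IndexScan;

/* Current interface: fill a caller-supplied TID array, possibly over
 * several calls; returns the number of TIDs produced this call. */
static int amgetmulti(IndexScan *scan, int *tids, int max_tids)
{
    int n = 0;
    while (scan->pos < scan->nmatches && n < max_tids)
        tids[n++] = scan->matches[scan->pos++];
    return n;
}

/* Proposed interface: one call per scan, ORing all matches directly
 * into the caller's bitmap (duplicates collapse for free). */
static void amgetbitmap(IndexScan *scan, TIDBitmap *tbm)
{
    for (int i = 0; i < scan->nmatches; i++) {
        if (!tbm->member[scan->matches[i]]) {
            tbm->member[scan->matches[i]] = true;
            tbm->nmembers++;
        }
    }
}
```

The bitmap form avoids the intermediate array and the per-chunk call loop on the executor side, which is the "logically simpler, saves a few cycles" point made above.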
I am glad you have summarized this. The details of exactly what was being proposed were too complex for me to understand before.

---------------------------------------------------------------------------

Tom Lane wrote:
> I just finished looking through the various threads about index AM changes that Bruce has been holding onto in the commit-fest queue. There seem to be the following issues:
> [full summary quoted above]

-- 
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
> * Proposed change to let both amgetnext and amgetmulti mark returned tuples as "candidate matches", that is in need of rechecking of quals against the real heap tuple. I had originally objected to this on the grounds that it would require setup work that doesn't happen now, but looking at the code I see that's wrong --- the setup work *does* happen anyway, see indexqualorig. nodeIndexscan.c comments "The indexqualorig expression is always initialized even though it will only be used in some uncommon cases --- would be nice to improve that". So this complaint is probably a red herring. I'm still a bit concerned by the fact that the patch only tracks this on a page basis in the amgetmulti case: if any of the tuples on a page are in-doubt then everything on the page will have to be rechecked. Is that good enough?

I think it's good enough. If we wanted to track it on a per-tuple basis, we'd need to store two bits per tuple, AFAICS, doubling the size of the bitmaps.

> A bigger issue is whether this is worth applying when nobody seems to be working on either of the main uses for it (bitmap indexes and GIT indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.

It has some merit on its own. Consider the case where you bitmap AND a lossy page and a non-lossy one. We currently make the result lossy, because we can't know which tuples on the page match, and then we need to recheck all tuples on that page. With the support for candidate matches, we could instead keep the non-lossy version of the bitmap and mark the matches as candidates. In the very best case, we might even apply a further AND to the page, which eliminates all the remaining candidate matches, and skip the heap access of that page altogether.
> * Proposed change to let amgetnext mark returned tuples as not necessarily in order, requiring somebody downstream to sort them again. I was pretty desperately unhappy with that because it required injection of sorting knowledge into code that really shouldn't have anything to do with it --- not least because not all indexscans even have a defined ordering. Given that it is only needed for one particular possible implementation of GIT, and Heikki was leaning away from that implementation anyway at last report, I vote against considering this any further.

Agreed, there's no use for that without GIT.

> * Proposed changes to refactor the TIDBitmap support to include a concept of a "stream bitmap" being read from disk. (Not strictly an index AM change, but closely related.) While this is clean and nice on its own, it doesn't seem to have any use unless bitmap indexes happen. If we leave the code sit it'll probably acquire bit rot, but I'm disinclined to add a bunch of unused code, too. Thoughts?

If the code is unused, it can easily bit rot even if it's in CVS. Let's not apply it. The patch is out there if/when someone picks up the bitmap index patch again.

> * Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry. Since certain types of indexes (GIN now, possibly hash in future) are incapable of doing this, there'd need to be a way of marking index AMs (or perhaps individual indexes) as able or not able to return their keys, so that the planner could know whether quals could be pushed into the indexscan.
>
> We don't have any actual submitted patch for this, but AFAIR everyone agrees it's a good idea.

Agreed :-).
> * GIT (Grouped Index Tuple) indexes, which achieve index space savings in btrees by having a single index tuple represent multiple heap tuples (on a single heap page) containing a range of key values. I am not sure what the development status is --- Heikki had submitted a completed patch but there seemed to be agreement on making changes, and that's not been done AFAIK. The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price. Another issue is that we'd need to check how much of the use-case for GIT has been taken over by HOT.

I don't think there's much overlap with HOT. On the contrary, HOT makes GIT much more useful, because GIT was very sensitive to the cluster order of the heap, and HOT helps with that. There is, however, a ton of overlap with index-only scans, and the possibility to return keys from indexes, as you pointed out.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
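The lossy-AND case Heikki describes can be modeled in miniature. PageBits below is a hypothetical stand-in for one page's entry in a TIDBitmap, not the real structure; the point is only the logic of keeping the exact side and flagging it for recheck instead of degrading to lossy:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of a per-page bitmap entry: a small fixed bitmap of
 * tuple offsets plus a "whole page maybe matches" (lossy) flag and a
 * "candidate matches, recheck quals" flag. */
typedef struct PageBits {
    unsigned bits;      /* one bit per tuple offset on the page */
    bool lossy;         /* whole page is a "maybe" match        */
    bool recheck;       /* tuples are candidates needing recheck */
} PageBits;

/* AND two pages.  Without candidate-match support, exact AND lossy
 * must degrade to lossy; with it, we keep the exact side's tuple list
 * and just mark those tuples as candidates. */
static PageBits page_and(PageBits a, PageBits b)
{
    PageBits r;
    if (a.lossy && b.lossy) {
        r.bits = 0; r.lossy = true; r.recheck = true;
    } else if (a.lossy || b.lossy) {
        r.bits = a.lossy ? b.bits : a.bits;   /* keep the exact side */
        r.lossy = false;
        r.recheck = true;
    } else {
        r.bits = a.bits & b.bits;
        r.lossy = false;
        r.recheck = a.recheck || b.recheck;
    }
    return r;
}
```

A further exact AND can then zero out all remaining candidate bits, which is the "skip the heap access of that page altogether" best case mentioned above.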
>> * Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry.

In the extreme we could build tuples and push them up several nodes -- even including joins -- before fetching the rest of the attributes from the heap.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
> * Proposed change to let both amgetnext and amgetmulti mark returned tuples as "candidate matches", that is in need of rechecking of quals
> ...
> indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.

This is a good way to eliminate the awful operation '@@@' without performance loss.

> * Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry. Since certain types of indexes (GIN now, possibly hash in future) are incapable of doing this, there'd

GiST too, because the type of storage may differ from the type being indexed. The same applies to GIN too.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru   WWW: http://www.sigaev.ru/
> * GIT (Grouped Index Tuple) indexes, which achieve index space savings in btrees by having a single index tuple represent multiple heap tuples (on a single heap page) containing a range of key values. I am not sure what the development status is --- Heikki had submitted a completed patch but there seemed to be agreement on making changes, and that's not been done AFAIK. The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price. Another issue is that we'd need to check how much of the use-case for GIT has been taken over by HOT.

I do not see the serious problem? The one key that is stored would represent all tuples it points to. So the interface could either be designed to allow skipping the key in one go, or return the same key repeatedly. All that the second approach would need is to return the key and the heap tuple pointer in two different vars.

Andreas
"Zeugswetter Andreas OSB SD" <Andreas.Zeugswetter@s-itsolutions.at> writes:

>> ... The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price.

> I do not see the serious problem? The one key that is stored would represent all tuples it points to.

No, the entry represents a range of values for which the one key is the lower bound. You don't know just what the keys are for the other tuples, unless you go to the heap and look.

We could restrict GIT to only represent tuples with exactly the same key, but that takes away a whole lot of its use-case (especially so now that HOT is in there).

			regards, tom lane
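Tom's lower-bound point can be illustrated with a toy model of a grouped index tuple. All names and layout below are illustrative only: one entry covers several heap tuples on a page but stores only the lower bound of their key range, so the index alone cannot report the actual key of every covered tuple:

```c
#include <assert.h>

/* Toy heap tuple: just a key. */
typedef struct HeapTuple { int key; } HeapTuple;

/* Toy grouped index tuple (GIT): covers several tuples on one heap
 * page, storing only the lower bound of the covered key range. */
typedef struct GroupedIndexTuple {
    int lower_key;           /* lower bound of the covered key range */
    int heap_offsets[4];     /* offsets of covered tuples on the page */
    int ncovered;
} GroupedIndexTuple;

/* The only key the index itself can produce for a covered tuple is the
 * group's lower bound: a bound, not the actual key, for all but the
 * first tuple in the range. */
static int index_reported_key(const GroupedIndexTuple *git)
{
    return git->lower_key;
}

/* Getting the real key requires visiting the heap page. */
static int actual_key(const HeapTuple *heap_page, int offset)
{
    return heap_page[offset].key;
}
```

This is why GIT, in this form, forecloses returning actual index keys from btree: the recoverable key matches the heap only for the lowest tuple of each group.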
Teodor Sigaev <teodor@sigaev.ru> writes:

>> * Proposed change to let both amgetnext and amgetmulti mark returned tuples as "candidate matches", that is in need of rechecking of quals
>> ...
>> indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.

> This is a good way to eliminate the awful operation '@@@' without performance loss.

Oh yeah, that kluge :-(. Okay, that's probably a sufficient reason all by itself.

>> * Proposed changes to allow amgetnext to return the actual index keys, allowing certain types of "unindexable" quals to be checked without having to fetch the heap entry. For example a LIKE condition could be fully checked against the index entry. Since certain types of indexes (GIN now, possibly hash in future) are incapable of doing this, there'd

> GiST too, because the type of storage may differ from the type being indexed. The same applies to GIN too.

Is this the case for *all* GiST and GIN indexes, or might we need a per-index (rather than per-AM) flag to tell whether the original keys are available?

			regards, tom lane
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Tom Lane wrote:
>> A bigger issue is whether this is worth applying when nobody seems to be working on either of the main uses for it (bitmap indexes and GIT indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.

> It has some merit on its own.

Yeah, and Teodor's point about cleaning up the @@@ hack pretty much seals the deal for me.

Unless anyone has objections, I will review and apply Heikki's patch http://archives.postgresql.org/pgsql-patches/2007-03/msg00163.php which covers both the amgetmulti return-a-bitmap change and the candidate-matches change. (Heikki, you don't have a later version of that do you?)

The remaining topics associated with index AMs are closed for this commit fest, unless anyone has specific questions they want discussed right now...

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> A bigger issue is whether this is worth applying when nobody seems to be working on either of the main uses for it (bitmap indexes and GIT indexes). There seemed to be some possible marginal use for it in GIST indexes, but I'm not convinced that's a sufficient reason to complicate the APIs.
>
>> It has some merit on its own.
>
> Yeah, and Teodor's point about cleaning up the @@@ hack pretty much seals the deal for me.
>
> Unless anyone has objections, I will review and apply Heikki's patch http://archives.postgresql.org/pgsql-patches/2007-03/msg00163.php which covers both the amgetmulti return-a-bitmap change and the candidate-matches change. (Heikki, you don't have a later version of that do you?)
>
> The remaining topics associated with index AMs are closed for this commit fest, unless anyone has specific questions they want discussed right now...

OK, do you want the items moved to the next commit-fest, discarded, or made into TODOs? And which ones? Or do you want to do it?

-- 
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
>> GiST too, because the type of storage may differ from the type being indexed. The same applies to GIN too.
>
> Is this the case for *all* GiST and GIN indexes, or might we need a per-index (rather than per-AM) flag to tell whether the original keys are available?

Ughm. GiST and GIN are different here. For GiST it is clear that this is a per-index flag:
- rtree emulation over boxes stores the original values
- rtree emulation over points or circles, btree_gist, and ltree store a modified original value which can be restored from the index with a call of a specific function
- the tsvector opclass doesn't have this possibility at all

So, only rtree emulation over boxes is able to return the original value from the index.

For GIN indexes I know only one opclass where it's possible to get the original value, and that's wildspeed, but in any case that requires some transformation first. However, it's possible to develop an opclass for GIN which would be similar to classic btree, for indexing scalar values.

Both GIN and GiST make a call of a transformation function before indexing a value. I suppose there is no automatic way to set this flag even in the case when the type of storage and the type of the indexed value are the same.

So, I see two variants:
- add a flag in pg_am
- add a flag to CREATE OPERATOR CLASS, and by default it should point to the impossibility of getting the value from the index. At least for GIN and GiST.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru   WWW: http://www.sigaev.ru/
Bruce Momjian <bruce@momjian.us> writes:

> Tom Lane wrote:
>> The remaining topics associated with index AMs are closed for this commit fest, unless anyone has specific questions they want discussed right now...

> OK, do you want the items moved to the next commit-fest, discarded, or made into TODOs? And which ones? Or do you want to do it?

TODOs for each of the open items I listed in my summary, please. They aren't material for the next commit fest until someone does more work.

			regards, tom lane
Teodor Sigaev <teodor@sigaev.ru> writes:

> Both GIN and GiST make a call of a transformation function before indexing a value. I suppose there is no automatic way to set this flag even in the case when the type of storage and the type of the indexed value are the same.
> So, I see two variants:
> - add a flag in pg_am
> - add a flag to CREATE OPERATOR CLASS, and by default it should point to the impossibility of getting the value from the index. At least for GIN and GiST.

Yeah, just as messy as I feared :-(. My inclination for the first pass would be to just make it a per-AM flag in pg_am. We could always complicate matters later. (Of course, this is all hypothetical since no patch is on the horizon...)

			regards, tom lane
> Yeah, just as messy as I feared :-(. My inclination for the first pass would be to just make it a per-AM flag in pg_am. We could always complicate matters later.

Agreed, and the single existing suitable opclass has no operation marked with the RECHECK flag.

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru   WWW: http://www.sigaev.ru/
Tom Lane wrote:
> Yeah, and Teodor's point about cleaning up the @@@ hack pretty much seals the deal for me.

Note that we'll need to add candidate match support to the regular index scan API as well for that.

> Unless anyone has objections, I will review and apply Heikki's patch http://archives.postgresql.org/pgsql-patches/2007-03/msg00163.php which covers both the amgetmulti return-a-bitmap change and the candidate-matches change.

Ok. Thank you!

> (Heikki, you don't have a later version of that do you?)

Nope.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Tom Lane wrote:
>> Yeah, and Teodor's point about cleaning up the @@@ hack pretty much seals the deal for me.

> Note that we'll need to add candidate match support to the regular index scan API as well for that.

[ confused... ] Thought you'd done that in this patch?

>> (Heikki, you don't have a later version of that do you?)

> Nope.

Going through my mailbox, I found Alexey Klyukin's update: http://archives.postgresql.org/pgsql-patches/2007-06/msg00204.php Did you look at his version at all?

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Note that we'll need to add candidate match support to the regular index scan API as well for that.
>
> [ confused... ] Thought you'd done that in this patch?

No, I only modified the bitmap scan API.

>>> (Heikki, you don't have a later version of that do you?)
>
>> Nope.
>
> Going through my mailbox, I found Alexey Klyukin's update: http://archives.postgresql.org/pgsql-patches/2007-06/msg00204.php Did you look at his version at all?

Oh, I didn't remember that existed at all. It looks good at first glance, though it's missing the doc changes of the original.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Tom Lane wrote:
>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>> Note that we'll need to add candidate match support to the regular index scan API as well for that.
>>
>> [ confused... ] Thought you'd done that in this patch?

> No, I only modified the bitmap scan API.

Okay. I applied what you'd done (with minor revisions), but we'll have to fix up the regular indexscan API before anything can be done about @@@. I'll take a look at that part tomorrow.

Teodor, do you have any thoughts about exactly how you'd fix @@@ ? I suppose that the recheck-need is not really a property of specific tuples, but of a particular query, for that case. Where would you want to detect that?

			regards, tom lane
>>> ... The really serious problem I've got with it is that it'd foreclose the possibility of returning actual index keys from btree indexes, thus basically killing the usefulness of that idea. I'm not convinced it would offer enough gain to be worth paying that price.

>> I do not see the serious problem? The one key that is stored would represent all tuples it points to.

> No, the entry represents a range of values for which the one key is the lower bound. You don't know just what the keys are for the other tuples, unless you go to the heap and look.

Thanks for explaining. I think that idea stands in the way of future improvements and should be dropped, yes.

> We could restrict GIT to only represent tuples with exactly the same key, but that takes away a whole lot of its use-case (especially so now that HOT is in there).

Ok, I was forgetting pg's outstanding "partial index" feature. In pg you will most likely exclude highly duplicate keys from the index. Since other dbs don't have this feature, it is common to have highly duplicate keys (millions) there (unless you use very ugly workarounds).

I agree that the possibility of returning actual index keys and filtering rows early has more merit. It could also be used to skip forward when the qual misses middle key columns.

I do still see compressing exactly equal keys (or exactly equal parts), but not restricted to the same heap page, as a useful btree TODO (especially for long non-unique keys). But it may really not be so important in pg.

Andreas
> Teodor, do you have any thoughts about exactly how you'd fix @@@ ? I suppose that the recheck-need is not really a property of specific tuples, but of a particular query, for that case. Where would you want to detect that?

A tsquery may include a restriction by weight of search terms: 'sea & port:A'. The GIN index doesn't store information about weights, so the only difference between @@ and @@@ is that @@@ is marked with the RECHECK flag.

I think the better way is to set the required-recheck flag by looking at the value from the index, not at the tsquery. It gives us more flexibility. So, I planned to add a pointer to bool to the consistent method, so the signature will be

bool consistent( bool check[], StrategyNumber n, Datum query, bool *needRecheck)

The returned value of needRecheck should be ignored for operations not marked by the RECHECK flag in the opclass. needRecheck should be initialized to true before the call of the consistent method to keep compatibility with old opclasses.

To decide whether a recheck is needed or not, the better way is to check the actually needed values. For example, let the tsquery be equal to 'foo | bar | qq:A' and the tsvector = 'foo:1,2,3 asdasdasd:4'. Obviously recheck is not needed.
So the patch is close to trivial:

*** tsginidx.c.orig	2008-04-11 17:08:37.000000000 +0400
--- tsginidx.c	2008-04-11 17:18:45.000000000 +0400
***************
*** 109,114 ****
--- 109,115 ----
  {
  	QueryItem  *frst;
  	bool	   *mapped_check;
+ 	bool	   *needRecheck;
  } GinChkVal;
  
  static bool
***************
*** 116,121 ****
--- 117,125 ----
  {
  	GinChkVal  *gcv = (GinChkVal *) checkval;
  
+ 	if ( val->weight )
+ 		*(gcv->needRecheck) = true;
+ 
  	return gcv->mapped_check[((QueryItem *) val) - gcv->frst];
  }
***************
*** 144,149 ****
--- 148,155 ----
  	gcv.frst = item = GETQUERY(query);
  	gcv.mapped_check = (bool *) palloc(sizeof(bool) * query->size);
  
+ 	gcv.needRecheck = PG_GETARG_POINTER(3);
+ 	*(gcv.needRecheck) = false;
  
  	for (i = 0; i < query->size; i++)
  		if (item[i].type == QI_VAL)

-- 
Teodor Sigaev   E-mail: teodor@sigaev.ru   WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:

> So, I planned to add a pointer to bool to the consistent method, so the signature will be
> bool consistent( bool check[], StrategyNumber n, Datum query, bool *needRecheck)
> The returned value of needRecheck should be ignored for operations not marked by the RECHECK flag in the opclass. needRecheck should be initialized to true before the call of the consistent method to keep compatibility with old opclasses.

Perhaps it would be better to initialize needRecheck to the opclass RECHECK flag value? If the consistent function does nothing, the behavior is the same as before, but it can flip the flag in either direction if it wants.

			regards, tom lane
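Tom's suggestion amounts to a small wrapper around the consistent call. The sketch below uses hypothetical names and a deliberately simplified signature (no Datum query argument): an old-style consistent function that never touches the flag behaves exactly as the opclass RECHECK marking dictates, while a new-style one may flip it either way:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified consistent-function type: hypothetical, not the real
 * GIN signature. */
typedef bool (*ConsistentFn)(bool check[], int strategy, bool *needRecheck);

/* Caller-side convention: seed *needRecheck from the opclass RECHECK
 * flag before invoking consistent, so doing nothing preserves the old
 * behavior. */
static bool call_consistent(ConsistentFn fn, bool check[], int strategy,
                            bool opclass_recheck_flag, bool *needRecheck)
{
    *needRecheck = opclass_recheck_flag;   /* default = old behavior */
    return fn(check, strategy, needRecheck);
}

/* Old-style opclass: ignores needRecheck entirely. */
static bool old_consistent(bool check[], int strategy, bool *needRecheck)
{
    (void) strategy; (void) needRecheck;
    return check[0];
}

/* New-style opclass: decides at runtime that no recheck is required,
 * e.g. because no weighted query terms were consulted. */
static bool smart_consistent(bool check[], int strategy, bool *needRecheck)
{
    (void) strategy;
    *needRecheck = false;
    return check[0];
}
```

Under this convention the flag can move in both directions, which is what would eventually let the static RECHECK marking become unnecessary.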
Tom Lane wrote: > Perhaps it would be better to initialize needRecheck to the opclass > RECHECK flag value? If the consistent function does nothing, the > behavior is the same as before, but it can flip the flag in either > direction if it wants. I remember that last spring, in the context of GIT, you were worried about the performance implication of preparing to recheck rows when no rechecks are needed. I didn't quite buy that back then, but this would have the same issue. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Perhaps it would be better to initialize needRecheck to the opclass >> RECHECK flag value? If the consistent function does nothing, the >> behavior is the same as before, but it can flip the flag in either >> direction if it wants. > I remember that last spring, in the context of GIT, you were worried > about the performance implication of preparing to recheck rows when no > rechecks are needed. I didn't quite buy that back then, but this would > have the same issue. As I mentioned upthread, it appears that we're paying that overhead anyway --- at least nodeIndexscan.c thinks we are. I need to dig into the planner a bit today and see whether it's taking any shortcuts for non-RECHECK operators. If it really is saving anything, then I'd agree that only RECHECK-marked operators should be allowed to adjust the flag. regards, tom lane
> Perhaps it would be better to initialize needRecheck to the opclass
> RECHECK flag value? If the consistent function does nothing, the
> behavior is the same as before, but it can flip the flag in either
> direction if it wants.

I don't have any objections. In that case (and since preparing for rechecking is cheap) the RECHECK flag could be removed.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                   WWW: http://www.sigaev.ru/
I wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> I remember that last spring, in the context of GIT, you were worried >> about the performance implication of preparing to recheck rows when no >> rechecks are needed. I didn't quite buy that back then, but this would >> have the same issue. > As I mentioned upthread, it appears that we're paying that overhead > anyway --- at least nodeIndexscan.c thinks we are. I need to dig into > the planner a bit today and see whether it's taking any shortcuts for > non-RECHECK operators. Yeah, we are paying that overhead. The reason is that the planner always constructs an "indexqualorig" expression and the executor always initializes it. The only place where it's used currently in a plain indexscan is for EvalPlanQual rechecking, but we could certainly use it for lossy-operator rechecking. (The implication of this is that if any of the operators in a multi-index-qual indexscan are lossy, we'd recheck all of them upon fetching the heap tuple. Does anyone feel that's not good enough? Most of the practical cases I can think of for multi-operator scans are for btree anyway, so it's not clear that there's much value in complicating matters to evaluate just some of the indexqualorig clauses.) This would effectively move *all* management of lossy operators to runtime; the planner wouldn't really think about it at all. We could simplify createplan.c a bit. regards, tom lane
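[Editor's note: the runtime scheme Tom describes — treat lossy index output as candidate matches and re-evaluate the original quals when the heap tuple is fetched — can be illustrated with a toy standalone C sketch. The types and functions are hypothetical stand-ins, not executor code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A stand-in for a heap tuple: just the indexed value. */
typedef struct { int value; } HeapTupleStub;

/* The original qual, as indexqualorig would evaluate it. */
static bool qual_orig(const HeapTupleStub *t)
{
    return t->value == 42;
}

/* A lossy index probe: the "index" only stored value / 10, so
 * 40..49 all come back as candidates, flagged recheck = true. */
static bool index_probe(const HeapTupleStub *t, bool *recheck)
{
    *recheck = true;
    return t->value / 10 == 42 / 10;
}

/* The scan loop: fetch candidates, then recheck against the heap
 * tuple only when the index marked the result as lossy. */
static int scan_count(const HeapTupleStub *heap, size_t n)
{
    int matches = 0;
    for (size_t i = 0; i < n; i++)
    {
        bool recheck;
        if (!index_probe(&heap[i], &recheck))
            continue;                    /* index says no */
        if (recheck && !qual_orig(&heap[i]))
            continue;                    /* candidate failed recheck */
        matches++;
    }
    return matches;
}
```

Note that, per Tom's parenthetical, this rechecks *all* the original quals whenever the flag is set, rather than only the lossy ones.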
Slightly offtopic. How to get benefit on tuple level ? For example, we mark GiST tsearch index as lossy, while for not very big documents it's actually exact and we could save a lot not rechecking them. Oleg On Fri, 11 Apr 2008, Teodor Sigaev wrote: >> Teodor, do you have any thoughts about exactly how you'd fix @@@ ? >> I suppose that the recheck-need is not really a property of specific >> tuples, but of a particular query, for that case. Where would you >> want to detect that? > > tsquery may include restriction by weight of search terms: 'sea & port:A'. > GIN index doesn't store information about weights, so the only difference > between @@ and @@@ is that @@@ is marked with RECHECK flag. I think, the > better way is set flag about required recheck by looking value from index, > not for tsquery. It gives to us more flexibility. > > So, I planned to add pointer to bool to consistent method, so signature will > be > bool consistent( bool check[], StrategyNumber n, Datum query, bool > *needRecheck) > > Returning value of needRecheck should be ignored for operation not marked by > RECHECK flag in opclass. needRecheck should be initialized to true before > call of consistent method to keep compatibility with old opclasses. > > To define, is recheck needed or not, the better way is to check actually > needed values. For example, let tsquery is equal to > 'foo | bar | qq:A' and tsvetor = 'foo:1,2,3 asdasdasd:4'. Obviously recheck > is not needed. 
> So patch is close to trivial:
>
> *** tsginidx.c.orig	2008-04-11 17:08:37.000000000 +0400
> --- tsginidx.c	2008-04-11 17:18:45.000000000 +0400
> ***************
> *** 109,114 ****
> --- 109,115 ----
>   {
>   	QueryItem  *frst;
>   	bool	   *mapped_check;
> + 	bool	   *needRecheck;
>   } GinChkVal;
>
>   static bool
> ***************
> *** 116,121 ****
> --- 117,125 ----
>   {
>   	GinChkVal  *gcv = (GinChkVal *) checkval;
>
> + 	if ( val->weight )
> + 		*(gcv->needRecheck) = true;
> +
>   	return gcv->mapped_check[((QueryItem *) val) - gcv->frst];
>   }
>
> ***************
> *** 144,149 ****
> --- 148,155 ----
>
>   	gcv.frst = item = GETQUERY(query);
>   	gcv.mapped_check = (bool *) palloc(sizeof(bool) * query->size);
> + 	gcv.needRecheck = PG_GETARG_POINTER(3);
> + 	*(gcv.needRecheck) = false;
>
>   	for (i = 0; i < query->size; i++)
>   		if (item[i].type == QI_VAL)

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes: > Slightly offtopic. How to get benefit on tuple level ? For example, > we mark GiST tsearch index as lossy, while for not very big documents it's > actually exact and we could save a lot not rechecking them. Won't that just fall out of this? Assuming the consistent() function knows when the match is exact, it can set the flag properly. regards, tom lane
On Fri, 11 Apr 2008, Tom Lane wrote: > Oleg Bartunov <oleg@sai.msu.su> writes: >> Slightly offtopic. How to get benefit on tuple level ? For example, >> we mark GiST tsearch index as lossy, while for not very big documents it's >> actually exact and we could save a lot not rechecking them. > > Won't that just fall out of this? Assuming the consistent() function > knows when the match is exact, it can set the flag properly. Ah, yes. Looks like a new life for GiST tsearch index. > > regards, tom lane > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Heikki Linnakangas wrote:
>> * GIT (Grouped Index Tuple) indexes, which achieve index space savings
>> in btrees by having a single index tuple represent multiple heap tuples
>> [...]
>> Another issue is that we'd need to check how much of the use-case for
>> GIT has been taken over by HOT.
>
> There is, however, a ton of overlap with index-only scans, and the
> possibility to return keys from indexes, as you pointed out.

One use case that I think GIT would help a lot with is my large address tables that are clustered by zip-code but often queried by State, City, County, School District, Police Beat, etc.

I imagine a GIT index on "state" would just occupy a couple of pages at most, regardless of how large the table gets. And likewise, even an index on City would be orders of magnitude smaller than the existing ones, since all records for any given city are on the same few disk pages. Or am I misunderstanding how GIT works?
Ron Mayer wrote: > One use case that I think GIT would help a lot with are my > large address tables that are clustered by zip-code but > often queried by State, City, County, School District, > Police Beat, etc. Yep, GIT would shrink the index on zip-code tremendously... > I imagine a GIT index on "state" would just occupy > a couple pages at most regardless of how large the > table gets. .. Not quite that much, though. GIT still stores one index pointer per heap page even on a fully clustered table. Otherwise it's not much good for searches. > And likewise, even an index on City > would be orders of magnitude smaller than the existing > ones; since all records for any given city are all > on the same few disk pages. Yep, it would help with that as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
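[Editor's note: Heikki's one-pointer-per-heap-page point can be put into rough numbers. A back-of-the-envelope sketch in C, using made-up figures (10M rows, ~50 rows per heap page) rather than anything measured.]

```c
#include <assert.h>

/* A plain btree keeps one entry per heap row. */
static long btree_entries(long rows)
{
    return rows;
}

/* GIT on a fully clustered table keeps roughly one entry per heap
 * page (rounded up), regardless of how many distinct key values
 * those rows carry. */
static long git_entries(long rows, long rows_per_page)
{
    return (rows + rows_per_page - 1) / rows_per_page;
}
```

So the index shrinks by about the rows-per-page factor (~50x here), not to "a couple of pages" as Ron hoped, but still a large saving.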
I looked over the issue of making the regular-indexscan code path able to handle runtime determination of operator lossiness. Here's my implementation plan:

* Extend the amgettuple API to allow a boolean recheck flag to be passed back.

* Remove index_getnext_indexitem, which is dead code and has been for awhile. (It seems to have been put in to help gist and hash vacuuming, which before that were using index_getnext loops and thus thrashing the heap on every call ... but they were both rewritten entirely since then.)

* Have index_getnext() just pass the recheck flag back to its caller, without doing any extra work. The existing callers are nodeIndexscan.c, CLUSTER, systable_getnext, GetNewOidWithIndex (which should use systable_getnext), and about ten places that use index_getnext directly, instead of going through systable_getnext, because they rely on seeing ordered results, which systable_getnext doesn't promise.

* nodeIndexscan can execute rechecks using the indexqualorig expressions, which it has set up and ready to go already, so there's basically no added overhead for non-lossy cases.

* CLUSTER doesn't pass any scankeys, so it should never see recheck=true, and can/should error out.

* systable_getnext and the other direct callers are only used with system catalogs, which should never have lossy operators anyway, so for the moment we can just have them error out on recheck=true. But I think it's important to have a plan for fixing that later. We know that these functions are used only with "simple" scan keys (no index expressions), so it would be easy enough to apply the recheck using the scan keys ... if we only had the heap attnums at hand rather than the index attnums. In the case of systable_getnext it'd be very simple to extend systable_beginscan and the SysScanDesc data structure to preserve the heap attnums.
What I propose doing about the other direct callers is to make a variant of the systable scan code that *does* promise to preserve ordering, but otherwise has a similar API to systable_beginscan/systable_getnext. If they all go through that then we'll have a single place to fix to support lossy results. * As discussed yesterday, eliminate fixed RECHECK markings on operators, and change the API for GIST/GIN "consistent" functions to pass back a recheck indicator. * The planner will no longer care whether an index operator is lossy. This means that as far as the planner is concerned there isn't any strong reason for fix_indexqual_references to call get_op_opfamily_properties --- the only reason it's doing that is to build the indexstrategy and indexsubtype lists to attach to the generated indexscan plan node. It's tempting to drop those lists from the plan tree altogether and instead do the get_op_opfamily_properties lookup during ExecIndexBuildScanKeys. For a plan executed only once, this would be a clear win --- it's the same total number of catcache lookups and we avoid some palloc overhead to construct the lists. For a plan that's cached and reused many times, at some point the extra lookups would start to cost more than the list storage costs. But it's not like executor startup doesn't involve a lot of catcache fetches anyway. Any thoughts about that tradeoff? BTW, it occurs to me that this'll make it trivially easy to experiment with hash indexes that only store the hash values ... they'll just always set the recheck flag to true. regards, tom lane
Heikki Linnakangas wrote: > Ron Mayer wrote: >> One use case that I think GIT would help a lot with are my >> large address tables that are clustered by zip-code but >> often queried by State, City, County, School District, >> Police Beat, etc. > >> I imagine a GIT index on "state" would just occupy >> a couple pages at most regardless of how large the >> table gets. > > .. Not quite that much, though. GIT still stores one index pointer per > heap page even on a fully clustered table. Otherwise it's not much good > for searches. Then I wonder if I can conceive of yet another related index type that'd be useful for such clustered tables. If I had something like GIT that stored something like "values State='CA' can be found on pages 1000 through 10000 and 20000 through 21000" would it be even more effective on such a table than GIT? If so, it seems it'd give many advantages that partitioning by state could give (at least for querying).
Ron Mayer wrote: > Then I wonder if I can conceive of yet another related > index type that'd be useful for such clustered tables. If > I had something like GIT that stored something like > "values State='CA' can be found on pages 1000 through 10000 > and 20000 through 21000" would it be even more effective on > such a table than GIT? Yep, a bitmap index. Bitmap indexes don't actually work quite like that, but the point is that it is very efficient on a column with few distinct values, and even more so if the table is clustered on that column. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
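[Editor's note: a toy illustration of why a bitmap index stays small on a clustered, low-cardinality column — the per-page match bitmap collapses into a few contiguous runs, which compress to almost nothing. This run counter only sketches the idea; it is not how an actual bitmap index, or PostgreSQL's TIDBitmap, stores data.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count contiguous runs of matching pages in a per-page bitmap.
 * On a table clustered by the indexed column, all the pages for a
 * given value sit together, so the number of runs stays tiny no
 * matter how big the table gets. */
static int count_runs(const bool *page_matches, size_t npages)
{
    int runs = 0;
    for (size_t i = 0; i < npages; i++)
        if (page_matches[i] && (i == 0 || !page_matches[i - 1]))
            runs++;
    return runs;
}
```

In Ron's example, "State='CA' is on pages 1000-10000 and 20000-21000" is two runs, however many pages that spans.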
On Wed, 2008-04-09 at 20:30 -0400, Tom Lane wrote:
> * GIT (Grouped Index Tuple) indexes, which achieve index space savings
> in btrees by having a single index tuple represent multiple heap tuples
> (on a single heap page) containing a range of key values. I am not sure
> what the development status is --- Heikki had submitted a completed
> patch but there seemed to be agreement on making changes, and that's not
> been done AFAIK. The really serious problem I've got with it is that
> it'd foreclose the possibility of returning actual index keys from btree
> indexes, thus basically killing the usefulness of that idea. I'm not
> convinced it would offer enough gain to be worth paying that price.
> Another issue is that we'd need to check how much of the use-case for
> GIT has been taken over by HOT.

That seems to be a misunderstanding about HOT and GIT. HOT is an important requirement for GIT, but other than that, they are unrelated. Testing in 2006/2007 showed that HOT stabilised the effects of repeated updates, which then showed as a "gain" in performance. But GIT did show considerable actual performance gains in its target use case.

GIT significantly reduces the size of clustered indexes, greatly improving the number of index pointers that can be held in memory for very large indexes. That translates directly into a reduction in I/O for large databases on typical hardware, for primary operations, file backups and recovery (and thus, log replication). Test results validated that and showed increased performance, over and above that experienced with HOT, when tested together.

Now there may be problems with the GIT code as it stands, but we should acknowledge that the general technique has been proven to improve performance on a recent PostgreSQL codebase. This is an unsurprising result, since SQLServer, Sybase, DB2, Oracle and Teradata (at least) all use indexes of this category to improve real-world performance. The idea is definitely not a benchmark-only feature.
Many users would be very interested if we could significantly reduce the size of the main index on their largest tables. I would at least like to see clustered indexes acknowledged as a TODO item, so we keep the door open for a future implementation based around the basic concept of GIT. Nobody is going to waste their time flogging a dead horse, which is why the patch isn't ready. Maybe *that* horse is dead, not really for me to say, but if we can at least agree on a basic statement that equine animals are fast we may find a rider willing to invest time in them. I don't see the "returns index keys" idea as being killed by or killing this concept. Returning keys is valid and useful when we can, but there are other considerations that, in some use cases, will be a dominant factor. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs <simon@2ndquadrant.com> writes: > I don't see the "returns index keys" idea as being killed by or killing > this concept. Returning keys is valid and useful when we can, but there > are other considerations that, in some use cases, will be a dominant > factor. The patch as-submitted was a killer for the concept, because it automatically discarded information and there was no way to prevent that. To be acceptable, a GIT patch would have to be optional and it would have to expose in the catalogs whether a given index was lossy in this way or not (so that the planner could know whether a plan based on returning index keys would work). regards, tom lane
On Wed, 2008-04-23 at 12:07 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I don't see the "returns index keys" idea as being killed by or killing > > this concept. Returning keys is valid and useful when we can, but there > > are other considerations that, in some use cases, will be a dominant > > factor. > > The patch as-submitted was a killer for the concept, because it > automatically discarded information and there was no way to prevent > that. Understood. > To be acceptable, a GIT patch would have to be optional and it > would have to expose in the catalogs whether a given index was lossy > in this way or not (so that the planner could know whether a plan based > on returning index keys would work). Would you see it as a separate index type, or a modification of the b-tree (with option enabled via a "storage parameter")? If it was the latter, then perhaps there could be a future for the GIT patch after all. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2008-04-23 at 12:07 -0400, Tom Lane wrote: >> To be acceptable, a GIT patch would have to be optional and it >> would have to expose in the catalogs whether a given index was lossy >> in this way or not (so that the planner could know whether a plan based >> on returning index keys would work). > Would you see it as a separate index type, or a modification of the > b-tree (with option enabled via a "storage parameter")? If it was the > latter, then perhaps there could be a future for the GIT patch after > all. Hmm, well, separate index type doesn't seem real nice because most all the places that currently know special things about btree would need to be hacked to recognize the other type too; plus we'd have to double all the btree entries in pg_amop/pg_amproc/pg_opclass/pg_opfamily. I think storage parameter is no good also, given the current design that assumes those can be changed on-the-fly. It'd be okay to GIT-ify an existing index, perhaps, but not the other way round. I was considering a new pg_index column. Or else we'd have to fix the storage-parameter infrastructure to support restricting changes of some parameters. regards, tom lane
On Wed, 2008-04-23 at 14:06 -0400, Tom Lane wrote: > I think storage parameter is no good also, given the current design that > assumes those can be changed on-the-fly. It'd be okay to GIT-ify an > existing index, perhaps, but not the other way round. > > I was considering a new pg_index column. Or else we'd have to fix > the storage-parameter infrastructure to support restricting changes > of some parameters. The latter seems best, since we're likely to need to do it anyway one day. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > GIT significantly reduces the size of clustered indexes, greatly > improving the number of index pointers that can be held in memory for > very large indexes. That translates directly into a reduction in I/O for > large databases on typical hardware, for primary operations, file > backups and recovery (and this, log replication). Test results validated > that and showed increased performance, over and above that experienced > with HOT, when tested together. > > Now there may be problems with the GIT code as it stands, but we should > acknowledge that the general technique has been proven to improve > performance on a recent PostgreSQL codebase. This is an unsurprising > result, since SQLServer, Sybase, DB2, Oracle and Teradata (at least) all > use indexes of this category to improve real-world performance. The idea > is definitely not a benchmark-only feature. > > Many users would be very interested if we could significantly reduce the > size of the main index on their largest tables. Yes, basically GIT allows index compression for clustered tables, and stated that way it is clear it would be a big feature if we had it for 8.4. > I would at least like to see clustered indexes acknowledged as a TODO > item, so we keep the door open for a future implementation based around > the basic concept of GIT. Yes, it is already a TODO: * Consider compressing indexes by storing key values duplicated in several rows as a single index entry -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On Wed, 2008-04-23 at 12:07 -0400, Tom Lane wrote:
>>> To be acceptable, a GIT patch would have to be optional and it
>>> ...
> I was considering a new pg_index column. Or else we'd have to fix
> the storage-parameter infrastructure to support restricting changes
> of some parameters.

Interesting. Does this mean that down the road a postgis index might be GIT-ified? On my biggest tables (clustered by zip/postal code), the index on the geometry column with a postgis index probably sees all the rows on each of its pages pointing to the same few heap pages. If I understand right, that's the kind of pattern that GIT helps.
> Interesting. Does this mean that down the road a postgis index
> might be GIT-ified?

Only if GiST supports GIT, but I haven't followed this discussion. In any case, GIT on GiST would not be as effective as on btree, because GiST doesn't guarantee that all close values will be close together in the index: it's possible to have several clusters (groups) of close values in different parts of the index.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                   WWW: http://www.sigaev.ru/
On Wed, 2008-04-23 at 22:26 -0400, Bruce Momjian wrote: > Simon Riggs wrote: > > GIT significantly reduces the size of clustered indexes, greatly > > improving the number of index pointers that can be held in memory for > > very large indexes. That translates directly into a reduction in I/O for > > large databases on typical hardware, for primary operations, file > > backups and recovery (and this, log replication). Test results validated > > that and showed increased performance, over and above that experienced > > with HOT, when tested together. > > > > Now there may be problems with the GIT code as it stands, but we should > > acknowledge that the general technique has been proven to improve > > performance on a recent PostgreSQL codebase. This is an unsurprising > > result, since SQLServer, Sybase, DB2, Oracle and Teradata (at least) all > > use indexes of this category to improve real-world performance. The idea > > is definitely not a benchmark-only feature. > > > > Many users would be very interested if we could significantly reduce the > > size of the main index on their largest tables. > > Yes, basically GIT allows index compression for clustered tables, and > stated that way it is clear it would be a big feature if we had it for > 8.4. > > I would at least like to see clustered indexes acknowledged as a TODO > > item, so we keep the door open for a future implementation based around > > the basic concept of GIT. > > Yes, it is already a TODO: > > * Consider compressing indexes by storing key values duplicated in > several rows as a single index entry That sounds similar, but definitely does not describe what GIT does. GIT holds one index pointer for a *range* of values held on one block. e.g. if we have a table with values 1-200 in it then GIT would store *one* index pointer for all 200 rows, even though the values are all different. 
That property makes it very different from the TODO item as stated, since GIT can work very well on unique indexes, which clearly do not have duplicated keys.

I'd prefer the term "Clustered Indexes" rather than the name GIT, which is also a slang insult.

Index compression is possible in many ways, depending upon the situation. All of the following sound similar at a high level, but each covers a different use case:

* For long, similar data, e.g. text, we can use Prefix Compression. We still store one pointer per row, but we reduce the size of the index by reducing the size of the key values. This requires us to reach inside datatypes, so it isn't a very general solution, but is probably an important one in the future for text.

* For unique/nearly-unique indexes we can use Range Compression. We reduce the size of the index by holding one index pointer per range of values, thus removing both keys and pointers. It's more efficient than prefix compression and isn't datatype-dependent.

* For highly non-unique data we can use Duplicate Compression. This is the technique used by bitmap indexes. Efficient, but not useful for unique/nearly-unique data.

* Multi-Column Leading Value Compression: if you have a multi-column index, then leading columns are usually duplicated between rows inserted at the same time. Using an on-block dictionary we can remove the duplicates. Only useful for multi-column indexes; possibly an overlapping/contained subset of the GIT use case.

As with HOT, all of these techniques need to be controlled by heuristics. Taken to extremes, these techniques can hurt other aspects of performance, so it's important that we don't just write the code but also investigate reasonable limits for behaviour.

-- 
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
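[Editor's note: the prefix-compression idea in Simon's first bullet can be sketched as follows. This ignores real btree constraints (page boundaries, locales, keeping leaf pages independently decodable) and just shows the core arithmetic; the helper names are invented.]

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* How many leading bytes does `cur` share with its predecessor
 * `prev`?  On sorted keys, adjacent entries tend to share long
 * prefixes, which is what makes the scheme pay off. */
static size_t shared_prefix(const char *prev, const char *cur)
{
    size_t n = 0;
    while (prev[n] != '\0' && prev[n] == cur[n])
        n++;
    return n;
}

/* Bytes needed to store `cur` given `prev`: one count byte for the
 * shared-prefix length, plus the differing suffix. */
static size_t compressed_len(const char *prev, const char *cur)
{
    return 1 + strlen(cur + shared_prefix(prev, cur));
}
```

So in a sorted run like "application", "applied", "apply", each key after the first costs only a count byte plus a short suffix.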
On Wed, 2008-04-23 at 19:43 -0700, Ron Mayer wrote: > Tom Lane wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > >> On Wed, 2008-04-23 at 12:07 -0400, Tom Lane wrote: > >>> To be acceptable, a GIT patch would have to be optional and it > >>> ... > > I was considering a new pg_index column. Or else we'd have to fix > > the storage-parameter infrastructure to support restricting changes > > of some parameters. > > > Interesting. Does this mean that down the road a postgis index > might be GIT-ified? Yes, all types of index can benefit from some form of compression. The only issue is that there isn't one technique that covers all use cases, so it will take time and we should be careful to do the most important ones first. > On my biggest tables (clustered by zip/postal code) the index on > the geometry column with a postgis index probably sees all the > rows on each of it's pages pointing to the same few heap pages. > If I understand right, that's the kind of pattern that GIT helps. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Thu, Apr 24, 2008 at 10:11:02AM +0100, Simon Riggs wrote:
> Index compression is possible in many ways, depending upon the
> situation. All of the following sound similar at a high level, but each
> covers a different use case.

True, but there is one significant difference:

> * For Long, Similar data e.g. Text we can use Prefix Compression
> * For Highly Non-Unique Data we can use Duplicate Compression
> * Multi-Column Leading Value Compression - if you have a multi-column

These are all not lossy, and so are candidates for use on any b-tree, even by default. They don't affect plan construction materially, except perhaps in cost calculations. Given the index tuple overhead, I don't see how you could lose.

> * For Unique/nearly-Unique indexes we can use Range Compression

This one is lossy and so does affect possible plans. I believe the term for this is a "sparse" index. Also, it seems harder to optimise, since it would seem to me the index would have to have some idea of how "close" values are together.

> As with HOT, all of these techniques need to be controlled by
> heuristics. Taken to extremes, these techniques can hurt other aspects
> of performance, so its important we don't just write the code but also
> investigate reasonable limits for behaviour also.

For the first three I can't imagine what the costs would be, except the memory usage to store the unduplicated data once when it's read in.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Thu, 2008-04-24 at 13:13 +0200, Martijn van Oosterhout wrote:
> On Thu, Apr 24, 2008 at 10:11:02AM +0100, Simon Riggs wrote:
> > Index compression is possible in many ways, depending upon the
> > situation. All of the following sound similar at a high level, but each
> > covers a different use case.
>
> True, but there is one significant difference:
>
> > * For Long, Similar data e.g. Text we can use Prefix Compression
> > * For Highly Non-Unique Data we can use Duplicate Compression
> > * Multi-Column Leading Value Compression - if you have a multi-column
>
> These are all not lossy and so are candidate to use on any b-tree even
> by default. They don't affect plan-construction materially, except
> perhaps in cost calculations. Given the index tuple overhead I don't
> see how you could lose.

True, but it is important that this is not a discussion about the one best technique for index compression. I think there are many techniques, covering many use cases. If we think there is only one, the different use cases get thrown out the door because "there was no consensus". If the TODO list does not say clearly what we want, it will never happen. We know no sponsor will pay for an idea that didn't make the list. Worse, they will pay for what is on the list, even if it wasn't exactly what we wanted.

> > * For Unique/nearly-Unique indexes we can use Range Compression
>
> This one is lossy and so does affect possible plans. I beleive the term
> for this is "sparse" index. Also, it's seems harder to optimise, since
> it would seem to me the index would have to have some idea on how
> "close" values are together.

Value-lossy, true, but we already support lossy searches with BitmapHeapScans, so let's not be fooled by how that word sounds. That was Tom's original complaint against Block Indexes in Sep 2006, which is why the non-lossy GIT design was conceived. GIT knows exactly which tuples are indexed, and which are not.
There is no need for the index to know how close the values are. Only that values in that range live on that block. > > As with HOT, all of these techniques need to be controlled by > > heuristics. Taken to extremes, these techniques can hurt other aspects > > of performance, so its important we don't just write the code but also > > investigate reasonable limits for behaviour also. > > For the first three I can't imagine what the costs would be, except the > memory usage to store the unduplicated data once when its read in. > > Have a nice day, -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
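[Editor's note: Simon's range compression can be sketched as a toy sparse index — one entry per heap block holding the lowest key on that block, so a probe narrows the search to one candidate block, which must then be scanned for exact matches (the value-lossy part). All names and the layout are invented for illustration.]

```c
#include <assert.h>
#include <stddef.h>

/* One sparse-index entry: the minimum key found on a heap block. */
typedef struct { int min_key; int block; } RangeEntry;

/* Entries are sorted by min_key.  Return the block whose key range
 * covers `key` on a clustered table, or -1 if the key precedes the
 * first entry.  Note the index never records exactly which values
 * are on the block, only that this range lives there. */
static int candidate_block(const RangeEntry *ix, size_t n, int key)
{
    int blk = -1;
    for (size_t i = 0; i < n && ix[i].min_key <= key; i++)
        blk = ix[i].block;
    return blk;
}
```

This shows Simon's point: the index needs no notion of how "close" values are, only that values in a given range live on a given block.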
Simon Riggs <simon@2ndquadrant.com> wrote:
> * For Highly Non-Unique Data we can use Duplicate Compression
> The latter is the technique used by Bitmap Indexes. Efficient, but not
> useful for unique/nearly-unique data

I heard that GIN already has a duplicate-compression feature.

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin
| Gin consists of a B-tree index constructed over entries (ET, entries tree),
| where each entry is an element of the indexed value (element of array,
| lexeme for tsvector) and where each tuple in a leaf page is either a
| pointer to a B-tree over item pointers (PT, posting tree), or a list of
| item pointers (PL, posting list) if the tuple is small enough.

If GIT comes, can we merge or share some modules between btree and GIN? I guess the page layout of GIT is better than the ET/PT pair when the index size is larger than main memory, because the keys and item pointers are placed in nearby pages. GIN-over-btree might be useful for some usages of inverted indexes.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
"Martijn van Oosterhout" <kleptog@svana.org> writes:
> On Thu, Apr 24, 2008 at 10:11:02AM +0100, Simon Riggs wrote:
>> Index compression is possible in many ways, depending upon the
>> situation. All of the following sound similar at a high level, but each
>> covers a different use case.
>
> True, but there is one significant difference:
>
>> * For Long, Similar data e.g. Text we can use Prefix Compression
>> * For Highly Non-Unique Data we can use Duplicate Compression
>> * Multi-Column Leading Value Compression - if you have a multi-column
>
> These are all not lossy and so are candidates for use on any b-tree, even
> by default. They don't affect plan construction materially, except
> perhaps in cost calculations. Given the index tuple overhead I don't
> see how you could lose.

The problem is that while it's easy to see what to do for text (and
even then perhaps not so easy in non-C locales), Postgres's
extensibility makes it quite tricky to see what to do from there.
Perhaps we need a btree support procedure which takes two parameters, a
"parent" and a "child", and returns a compressed value. There would
have to be a second procedure to decompress values given their full
parent.

There are a lot of tricky bits here, like: what do you do on leaf
pages? You have to be able to follow leaf pages down the chain without
consulting their "parent".

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
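Greg's proposed compress/decompress support-procedure pair might have a contract along these lines -- a Python sketch of the idea only, with invented names, and nothing like an actual btree support function signature:

```python
import os


def compress(parent: str, child: str):
    """Store the child as (shared-prefix length, remaining suffix)."""
    n = len(os.path.commonprefix([parent, child]))
    return (n, child[n:])


def decompress(parent: str, compressed):
    """Rebuild the child, which requires having the full parent value."""
    n, suffix = compressed
    return parent[:n] + suffix


c = compress("application", "applied")          # -> (5, "ed")
roundtrip = decompress("application", c)        # -> "applied"
```

The sketch also makes Greg's leaf-page objection visible: `decompress` needs the parent value, so any page that must be readable on its own -- a leaf page reached via the right-link chain, say -- has to store enough uncompressed context to bootstrap decompression.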
Simon Riggs wrote:
> > > Many users would be very interested if we could significantly reduce the
> > > size of the main index on their largest tables.
> >
> > Yes, basically GIT allows index compression for clustered tables, and
> > stated that way it is clear it would be a big feature if we had it for
> > 8.4.
>
> > > I would at least like to see clustered indexes acknowledged as a TODO
> > > item, so we keep the door open for a future implementation based around
> > > the basic concept of GIT.
> >
> > Yes, it is already a TODO:
> >
> >     * Consider compressing indexes by storing key values duplicated in
> >       several rows as a single index entry
>
> That sounds similar, but definitely does not describe what GIT does.
>
> GIT holds one index pointer for a *range* of values held on one block,
> e.g. if we have a table with values 1-200 in it then GIT would store
> *one* index pointer for all 200 rows, even though the values are all
> different.
>
> That property makes it very different from the TODO item stated, since
> GIT can work very well on unique indexes, which clearly do not have
> duplicated keys.

Agreed, TODO updated:

* Consider smaller indexes that record a range of values per heap page,
  rather than having one index entry for every heap row

  This is useful if the heap is clustered by the indexed values.

  http://archives.postgresql.org/pgsql-hackers/2006-12/msg00341.php
  http://archives.postgresql.org/pgsql-hackers/2007-02/msg01264.php
  http://archives.postgresql.org/pgsql-hackers/2007-03/msg00465.php
  http://archives.postgresql.org/pgsql-patches/2007-03/msg00163.php
  http://archives.postgresql.org/pgsql-hackers/2007-08/msg00014.php
  http://archives.postgresql.org/pgsql-hackers/2007-08/msg00487.php
  http://archives.postgresql.org/pgsql-hackers/2008-04/msg01589.php

> I'd prefer the term "Clustered Indexes" rather than the name GIT, which
> is also a slang insult.

I try to avoid technical terms that might not be familiar, and instead
describe exactly what is requested.
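The updated TODO item's idea -- one index entry covering a range of values on one heap page -- can be sketched like this (Python, illustrative only; the names and structure are invented, not GIT's actual design):

```python
import bisect


class GroupedIndex:
    """One (first_key, heap_block) entry per heap page of a clustered table."""

    def __init__(self):
        self.groups = []  # kept sorted by first_key

    def add_group(self, first_key, heap_block):
        bisect.insort(self.groups, (first_key, heap_block))

    def probe(self, key):
        # Find the last group whose first_key <= key; if the key exists
        # at all, its row must be on that group's heap page.
        i = bisect.bisect_right(self.groups, (key, float("inf"))) - 1
        return None if i < 0 else self.groups[i][1]


git = GroupedIndex()
git.add_group(1, heap_block=0)    # keys 1..200 live on heap block 0
git.add_group(201, heap_block=1)  # keys 201.. live on heap block 1

block = git.probe(150)  # directs the scan to heap block 0
```

Two index entries stand in for hundreds of per-row pointers here, which is the space saving -- and it works on a unique index, since nothing about it depends on duplicated keys. The cost is that the probe only names a heap page; the page itself must be scanned for the matching row.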
> Index compression is possible in many ways, depending upon the
> situation. All of the following sound similar at a high level, but each
> covers a different use case.
>
> * For Long, Similar data e.g. Text we can use Prefix Compression
> We still store one pointer per row, but we reduce the size of the index
> by reducing the size of the key values. This requires us to reach inside
> datatypes, so it isn't a very general solution, but it is probably an
> important one in the future for Text.
>
> * For Unique/nearly-Unique indexes we can use Range Compression
> We reduce the size of the index by holding one index pointer per range
> of values, thus removing both keys and pointers. It's more efficient
> than prefix compression and isn't datatype-dependent.
>
> * For Highly Non-Unique Data we can use Duplicate Compression
> This is the technique used by Bitmap Indexes. Efficient, but not
> useful for unique/nearly-unique data.
>
> * Multi-Column Leading Value Compression - if you have a multi-column
> index, then leading columns are usually duplicated between rows inserted
> at the same time. Using an on-block dictionary we can remove duplicates.
> Only useful for multi-column indexes; possibly an overlapping/contained
> subset of the GIT use case.

Should any of these others be TODOs?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
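The fourth technique above -- an on-block dictionary for duplicated leading columns -- might look like this toy sketch (Python, names and layout invented for illustration, not any proposed on-disk format):

```python
def compress_page(tuples):
    """Replace duplicated leading-column values with dictionary indexes.

    tuples: list of (leading_col, trailing_col, tid), sorted by key,
    as they would appear on one index page.
    """
    dictionary, out = [], []
    for lead, trail, tid in tuples:
        if not dictionary or dictionary[-1] != lead:
            dictionary.append(lead)  # first time we see this leading value
        out.append((len(dictionary) - 1, trail, tid))  # dict index, not value
    return dictionary, out


d, rows = compress_page([
    ("2008-04-24", "a", 1),
    ("2008-04-24", "b", 2),   # leading value repeated -> stored once
    ("2008-04-25", "a", 3),
])
```

Because the dictionary lives on the same block as the compressed tuples, the page stays self-contained -- no parent lookup is needed to decode it, sidestepping the leaf-page problem raised for prefix compression.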
On Thu, 2008-04-24 at 11:43 -0400, Bruce Momjian wrote:
> > Index compression is possible in many ways, depending upon the
> > situation. All of the following sound similar at a high level, but each
> > covers a different use case.
> >
> > * For Long, Similar data e.g. Text we can use Prefix Compression
> > We still store one pointer per row, but we reduce the size of the index
> > by reducing the size of the key values. This requires us to reach inside
> > datatypes, so isn't a very general solution but is probably an important
> > one in the future for Text.
> >
> > * For Unique/nearly-Unique indexes we can use Range Compression
> > We reduce the size of the index by holding one index pointer per range
> > of values, thus removing both keys and pointers. It's more efficient
> > than prefix compression and isn't datatype-dependent.
> >
> > * For Highly Non-Unique Data we can use Duplicate Compression
> > The latter is the technique used by Bitmap Indexes. Efficient, but not
> > useful for unique/nearly-unique data
> >
> > * Multi-Column Leading Value Compression - if you have a multi-column
> > index, then leading columns are usually duplicated between rows inserted
> > at the same time. Using an on-block dictionary we can remove duplicates.
> > Only useful for multi-column indexes, possibly overlapping/contained
> > subset of the GIT use case.
>
> Should any of these others be TODOs?

Possibly, though I think we have the priority items covered now.

Greg has commented on this thread about the difficulties inherent in
approach 1. I've previously argued that technique 1 can be handled much
better by having "implied functional index use", e.g. an index can be
defined on SUBSTR(col, 10) rather than on the whole column, yet still
be used automatically without needing to mention SUBSTR. Link: Hackers,
11 Jul 2006.

There's a variety of ways to improve multi-column indexes beyond what
we currently support, though those haven't ever been discussed on
hackers to my knowledge.

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
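Simon's "implied functional index use" can be mimicked in miniature (Python, purely a sketch of the concept -- the functions, the prefix length, and the recheck step are all invented for illustration): index only a prefix of each value, probe with the query value's prefix, then recheck full values against the heap:

```python
def build_prefix_index(rows, n=10):
    """Index only the first n characters of each value (the SUBSTR idea)."""
    idx = {}
    for tid, value in rows:
        idx.setdefault(value[:n], []).append(tid)
    return idx


def lookup(idx, rows, value, n=10):
    """Answer an equality query on the full column via the prefix index."""
    candidates = idx.get(value[:n], [])
    heap = dict(rows)
    # The index is lossy over the trailing characters, so recheck the
    # full value against the heap, as the planner would have to arrange.
    return [tid for tid in candidates if heap[tid] == value]


rows = [(1, "abcdefghijKLM"), (2, "abcdefghijXYZ")]
idx = build_prefix_index(rows)
match = lookup(idx, rows, "abcdefghijXYZ")  # both rows are candidates
```

The point of "implied" use is that the query mentions only the plain column; the planner, not the user, would supply the prefix probe and the recheck.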
On Apr 24, 2008, at 10:43 AM, Bruce Momjian wrote:

Bruce asked if these should be TODOs...

>> Index compression is possible in many ways, depending upon the
>> situation. All of the following sound similar at a high level, but each
>> covers a different use case.
>>
>> * For Long, Similar data e.g. Text we can use Prefix Compression
>> We still store one pointer per row, but we reduce the size of the index
>> by reducing the size of the key values. This requires us to reach inside
>> datatypes, so isn't a very general solution but is probably an important
>> one in the future for Text.

I think what would be even more useful is doing this within the table
itself, and then bubbling that up to the index.

>> * For Unique/nearly-Unique indexes we can use Range Compression
>> We reduce the size of the index by holding one index pointer per range
>> of values, thus removing both keys and pointers. It's more efficient
>> than prefix compression and isn't datatype-dependent.

Definitely.

>> * For Highly Non-Unique Data we can use Duplicate Compression
>> The latter is the technique used by Bitmap Indexes. Efficient, but not
>> useful for unique/nearly-unique data

Also definitely. This would be hugely useful for things like "status"
or "type" fields.

>> * Multi-Column Leading Value Compression - if you have a multi-column
>> index, then leading columns are usually duplicated between rows inserted
>> at the same time. Using an on-block dictionary we can remove duplicates.
>> Only useful for multi-column indexes, possibly overlapping/contained
>> subset of the GIT use case.

Also useful, though I generally try to put the most diverse values
first in indexes to increase the odds of them being used. Perhaps if we
had compression this would change.

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Well, for these two:

> >> * For Highly Non-Unique Data we can use Duplicate Compression
> >> * Multi-Column Leading Value Compression - if you have a multi-column

You don't need any new logic at all. If _bt_compare says they're equal
you can compact them. The long-similar case is harder; ISTM there are
better ways to handle that anyway (tsearch? hashing?).

> There are a lot of tricky bits here, like what do you do on leaf pages? You
> have to be able to follow leaf pages down the chain without consulting their
> "parent".

I was thinking the compression would only occur on a single page;
anything else seems to be asking for trouble. Basically, on a single
page you compact:

  A B -> 2
  A B -> 3
  A C -> 4
  A D -> 1
  B F -> 2

to:

  A B -> [2, 3]
  (1) C -> 4
  (1) D -> 1
  B F -> 2

Within the context of a single page it's perfectly clear.

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
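The duplicate-merging half of Martijn's single-page compaction is mechanical enough to sketch directly (Python, illustrative only; real code would use _bt_compare rather than `==`, and a real page layout rather than tuples):

```python
def compact(page):
    """Merge adjacent entries with equal keys into one posting-list entry.

    page: list of (key, pointer) pairs in sorted order, as on one page.
    """
    out = []
    for key, ptr in page:
        if out and out[-1][0] == key:
            out[-1][1].append(ptr)   # duplicate key: extend the pointer list
        else:
            out.append([key, [ptr]])
    return out


page = [(("A", "B"), 2), (("A", "B"), 3), (("A", "C"), 4),
        (("A", "D"), 1), (("B", "F"), 2)]
compacted = compact(page)  # ("A", "B") collapses to one entry
```

The leading-value step in Martijn's example -- rewriting the repeated "A" as a back-reference like "(1)" -- would be a second pass over the same page, again needing nothing outside it.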
Martijn van Oosterhout <kleptog@svana.org> writes:
> Well, for these two:
>> * For Highly Non-Unique Data we can use Duplicate Compression
>> * Multi-Column Leading Value Compression - if you have a multi-column
> You don't need any new logic at all. If _bt_compare says they're equal
> you can compact them.

Not if you'd like to support index key retrieval. (The equality
function that btree is using doesn't necessarily mean "equal for all
purposes".)

			regards, tom lane