Re: the big picture for index-only scans - Mailing list pgsql-hackers
From | Cédric Villemain |
---|---|
Subject | Re: the big picture for index-only scans |
Date | |
Msg-id | BANLkTin1pFF+4oCxiYLNTFtwqqvYFQ6dww@mail.gmail.com Whole thread Raw |
In response to | the big picture for index-only scans (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: the big picture for index-only scans
|
List | pgsql-hackers |
2011/5/10 Robert Haas <robertmhaas@gmail.com>: > So, what do we need in order to find our way to index-only scans? > > 3. Statistics. I believe that in order to accurately estimate the > cost of an index-only scan, we're going to need to know the fraction > of tuples that are on pages whose visibility map bits are set. I > believe it should be fairly straightforward to have ANALYZE collect > this information; and I'm inclined to do that as a separate patch. It > seems like it would also be nice to know what fraction of tuples are > on pages that don't have the visibility map set but where, in fact, > all tuples on the page are visible to all transactions, so it would be > legal to set the bit. A large discrepancy between these two > percentages might be a good reason to auto-vacuum the table (perhaps > using a "really lazy vacuum"[2]). I'm not sure if this can be added > cheaply, though, and in any case, any change to the set of criteria > that will trigger an auto-vacuum is probably a can of worms. > Thoughts? ANALYZE can do the stats job for 'free' on the pages it collects anyway. So that looks like a good idea. I believe the really lazy vacuum is another topic; even if it will improve the performance of the index only scan to have tables already vacuuumed, the stats should expose that and the function cost_index(_only?)() taking care of that. > > 4. Planner and executor changes. In contrast to Heikki's original > implementation, I'm inclined to not to try to split the Index Scan > node into index scan and heap fetch. Since there are many choices for > where to put the heap fetch node (any level of the join tree between > the index scan and the root), this seems likely to result in a > combinatorial explosion of paths[3], and I'm not real sure that the > payback will be adequate. Furthermore, the idea of allowing user code > to see tuples that will only later be determined not to have been > visible to that MVCC snapshot in the first place seems pretty scary > from a security perspective, though certainly there are possible > benefits[4]. Instead, I'm inclined to just have the planner evaluate > whether the necessary columns can be extracted purely from the index. The temptation is high to estimate the cost of an "index_scan(only) + ordered(by ctid) table pages fetch if heap required". (this is what I understood from heikki suggestion 3-4. and it makes sense). It may be easier to implement both at once but I didn't find the branch in the Heikki's git repos. (probably removed since the long time) > If not, we proceed as now. If so, we can use the "index only" > approach of using the visibility map to decide which heap fetches can > be skipped. It's not clear to me whether we need to compare the cost > of the standard approach with the cost of the "index only" approach: > in theory, if there aren't any visibility map bits anyway, the "index > only" approach could be slower. But I'm not sure whether that problem > is significant or common enough to be worth expending a lot of code > on. Either way, the number of actual paths doesn't need to increase, > because in this design, even if we apply a costing model, one approach > will dominate the other. Heikki also suggested considering index > scans in cases where we don't now[4, again] but I'm inclined to leave > this, too, for a later optimization, again because balancing the > increase in paths against the possible performance benefits of using > indexes in more situations seems finicky. In short, for a first cut > at this, I just want to look at this as a way to get cheaper index > scans, and leave everything else to future work. Based on ANALYZE stats for the visibility, I believe cost_index and cost_index_only should be very similar functions (well, atm, I don't see the point to split it in 2 functions). > > Any thoughts welcome. Incidentally, if anyone else feels like working > on this, feel free to let me know and I'm happy to step away, from all > of it or from whatever part someone else wants to tackle. I'm mostly > working on this because it's something that I think we really need to > get done, more than having a burning desire to be the one who does it. Indexonly scans are welcome! I believe I can help on 3 and 4, but (really) not sure for 1 and 2. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > [1] http://archives.postgresql.org/pgsql-hackers/2011-05/msg00292.php > [2] http://archives.postgresql.org/pgsql-hackers/2011-03/msg00946.php > [3] http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php > [4] http://archives.postgresql.org/pgsql-hackers/2009-07/msg00675.php > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
pgsql-hackers by date: