Thread: kill_prior_tuple and index scan costing
If I run the regression tests so that the "tenk1" table is available, and then create an index on tenk1.twothousand, I notice that simple "where twothousand = ?" queries have query plans that look like the following sample plan:

pg@regression:5432 [17755]=# explain (analyze, buffers, costs off) select * from tenk1 where twothousand = 42;
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                          QUERY PLAN                                          │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│ Bitmap Heap Scan on tenk1 (actual time=0.023..0.032 rows=5 loops=1)                          │
│   Recheck Cond: (twothousand = 42)                                                           │
│   Heap Blocks: exact=5                                                                       │
│   Buffers: shared hit=7                                                                      │
│   ->  Bitmap Index Scan on tenk1_twothousand_idx1 (actual time=0.015..0.015 rows=5 loops=1)  │
│         Index Cond: (twothousand = 42)                                                       │
│         Buffers: shared hit=2                                                                │
│ Planning Time: 0.146 ms                                                                      │
│ Execution Time: 0.065 ms                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

Seems reasonable. There is a bitmap index scan, more or less due to the uncorrelated table access that a plain index scan on tenk1_twothousand_idx1 would require. We return 5 rows, and must access one heap block for each of those 5 rows. I wonder why we don't get the following alternative plan instead, which is slightly faster even with the weak correlation:

pg@regression:5432 [17755]=# explain (analyze, buffers, costs off) select * from tenk1 where twothousand = 42;
┌────────────────────────────────────────────────────────────────────────────────────────────┐
│                                         QUERY PLAN                                          │
├────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using tenk1_twothousand_idx1 on tenk1 (actual time=0.020..0.030 rows=5 loops=1)  │
│   Index Cond: (twothousand = 42)                                                            │
│   Buffers: shared hit=7                                                                     │
│ Planning Time: 0.134 ms                                                                     │
│ Execution Time: 0.058 ms                                                                    │
└────────────────────────────────────────────────────────────────────────────────────────────┘
(5 rows)

Both plans are very similar, really. The number of heap accesses and B-Tree index page accesses is exactly the same in each case. But the index scan plan has one non-obvious advantage that might matter a lot in the real world: it can apply the kill_prior_tuple optimization. (It is never possible to use the kill_prior_tuple optimization during a bitmap index scan.)

It makes sense that the planner determines that a bitmap index scan is faster -- or it used to make sense. Before commit dd299df8, which made heap TID a tiebreaker nbtree index column, we might find ourselves accessing the same heap page multiple times, pinning it a second or a third time within the executor (it depended on very unstable implementation details in the nbtree code). These days we should *reliably* access the same number of heap pages (and index pages) with either plan. (There are a couple of caveats that I'm glossing over for now, like pg_upgrade'd indexes.)

Is it worth teaching the planner about the influence of the tiebreaker heap TID column work, so that we get to use the kill_prior_tuple optimization more often? I'm not planning to work on it myself, but it seems worth considering.

FWIW, the planner requires lots of convincing before it will use the index scan plan right now. I find that I need to set random_page_cost to 1.6 before the planner chooses the latter plan (a CLUSTER that uses the twothousand index works too, of course). If I come up with a similar example that returns 10 rows (i.e. one that indexes the "thousand" column instead), random_page_cost needs to be reduced to 1.1 to produce the equivalent query plan crossover.

--
Peter Geoghegan
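A rough sketch of the reproduction steps described above (an illustration, not from the original mail; it assumes a regression-test database where tenk1 exists, and the auto-generated index name may differ from tenk1_twothousand_idx1):

    -- create the index used in the plans above
    CREATE INDEX ON tenk1 (twothousand);

    -- with default costing the planner picks the bitmap scan plan
    EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
    SELECT * FROM tenk1 WHERE twothousand = 42;

    -- lowering random_page_cost (1.6 was enough here) flips it to the plain index scan
    SET random_page_cost = 1.6;
    EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
    SELECT * FROM tenk1 WHERE twothousand = 42;
    RESET random_page_cost;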
Hi,

(Reply largely based on a quick IM conversation between Peter and me.)

On 2020-03-04 17:13:33 -0800, Peter Geoghegan wrote:
> Both plans are very similar, really. The number of heap accesses and
> B-Tree index page accesses is exactly the same in each case.

Note that bitmap heap scans currently have the big advantage of being able to efficiently prefetch heap data. That can be a *huge* performance boon (I've seen several orders of magnitude on SSDs).

There are also some other benefits to bitmap heap scans. For heap, the "single tid" index->heap lookup path locks the page once for each TID, whereas a bitmap heap scan only does so once per page -- adding more lock cycles obviously can have a noticeable performance impact. Not having I/O interspersed between index and heap accesses can be beneficial too.

I thought we had optimized the non-lossy bitmap path for heap (i.e. heapam_scan_bitmap_next_block()) to perform visibility checks more efficiently than single-TID fetches (i.e. heapam_index_fetch_tuple()). But both use heap_hot_search_buffer(), even though the number of page locks differs.

I'm a bit surprised that neither heap_hot_search_buffer() nor the "lossy path" in heapam_scan_bitmap_next_block() takes advantage of the page's all-visible bit? I don't really see a good reason for that. The HTSV calls do show up noticeably in profiles, in my experience.

While your recent btree work ensures that we get the heap tids for an equality lookup in heap order (right?), I don't think we currently have the planner infrastructure to know that that's the case (since other index types don't guarantee that) / take it into account for planning?

> But the index scan plan has one non-obvious advantage that might
> matter a lot in the real world: it can apply the kill_prior_tuple
> optimization. (It is never possible to use the kill_prior_tuple
> optimization during a bitmap index scan.)

Indeed. I've seen this cause very significant issues a couple of times. Basically, whenever the handful of very common queries that touched most of the data switched to bitmap heap scans, the indexes would explode in size. Due to the index sizes involved there was no way normal vacuum could clean up dead tuples quickly enough to prevent bloat, but with kill_prior_tuple that wasn't a problem.

I have wondered whether we could "just" add some support for kill_prior_tuple to the bitmap index scan infrastructure. Obviously that would require some way of calling "back" into the index code once (several?) tuples on a page are found to be dead during a bitmap heap scan. That in turn would require keeping track of additional metadata for each tuple in the tid bitmap, which is obviously not free and would have to be conditional.

I don't really like the kill_prior_tuple interface much. But I don't immediately see how to do better without increasing the overhead.

> It makes sense that the planner determines that a bitmap index scan is
> faster -- or it used to make sense. Before commit dd299df8, which made
> heap TID a tiebreaker nbtree index column, we might find ourselves
> accessing the same heap page multiple times, pinning it a second or a
> third time within the executor (it depended on very unstable
> implementation details in the nbtree code). These days we should
> *reliably* access the same number of heap pages (and index pages) with
> either plan. (There are a couple of caveats that I'm glossing over for
> now, like pg_upgrade'd indexes.)

Leaving the locking difference pointed out above aside, there is still a significant difference in how many times we indirectly call into the index AM, and how much setup work has to be done, though. There is at least one index_getnext_tid() call for each result tuple, which each time indirectly has to call btgettuple(). And each btgettuple() has to do checks (do array keys need to be advanced, has _bt_first() been called?). Whereas btgetbitmap() can amortize that across all tuples.

And that's without considering that, to me, btgetbitmap() could be significantly optimized by adding multiple tuples to the bitmap at once, rather than doing so one by one.

Greetings,

Andres Freund
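To make the prefetching point above concrete, here is a sketch (not from the original mail) of how one might compare the two plan shapes on an uncached table; big_table and some_col are hypothetical stand-ins. effective_io_concurrency drives heap prefetching for bitmap heap scans, while a plain index scan fetches heap pages one at a time as tuples are returned:

    SET effective_io_concurrency = 64;   -- prefetch depth used by bitmap heap scans
    SET enable_seqscan = off;            -- keep the comparison to the two index-based plans

    SET enable_indexscan = off;          -- force the bitmap plan
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM big_table WHERE some_col BETWEEN 1 AND 1000;
    RESET enable_indexscan;

    SET enable_bitmapscan = off;         -- force the plain index scan
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM big_table WHERE some_col BETWEEN 1 AND 1000;
    RESET enable_bitmapscan;
    RESET enable_seqscan;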
On Sat, Mar 21, 2020 at 07:33:02PM -0700, Andres Freund wrote:
> While your recent btree work ensures that we get the heap tids for an
> equality lookup in heap order (right?),

I think when I tested the TID tiebreaker patch, it didn't help for our case, which is an inequality: (timestamptz >= start AND timestamptz < end). That seems to explain why, although I don't understand why it wouldn't also apply to an inequality comparison?

|template1=# CREATE TABLE t(i int,j int); CREATE INDEX ON t(i); INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM generate_series(1,99999999) a; VACUUM ANALYZE t;

|template1=# explain (analyze,buffers) SELECT * FROM t WHERE i BETWEEN 2000 AND 3000;
| Index Scan using t_i_idx on t  (cost=0.44..277164.86 rows=10026349 width=8) (actual time=0.199..6839.564 rows=10010076 loops=1)
|   Index Cond: ((i >= 2000) AND (i <= 3000))
|   Buffers: shared hit=394701 read=52699

vs.

|template1=# SET enable_seqscan=off; SET enable_indexscan=off; explain (analyze,buffers) SELECT * FROM t WHERE i BETWEEN 2000 AND 3000;
| Bitmap Heap Scan on t  (cost=135038.52..1977571.10 rows=10026349 width=8) (actual time=743.649..3760.643 rows=10010076 loops=1)
|   Recheck Cond: ((i >= 2000) AND (i <= 3000))
|   Heap Blocks: exact=44685
|   Buffers: shared read=52700
|   ->  Bitmap Index Scan on t_i_idx  (cost=0.00..132531.93 rows=10026349 width=0) (actual time=726.474..726.475 rows=10010076 loops=1)
|         Index Cond: ((i >= 2000) AND (i <= 3000))
|         Buffers: shared read=8015

I'm not concerned with the "actual" time or hit vs. cached here, but with the total buffer pages: the index scan accessed ~450k buffers vs. ~52k for the bitmap scan.

> I don't think we currently have
> the planner infrastructure to know that that's the case (since other
> index types don't guarantee that) / take it into account for planning?

Right, since correlation is a property of the table column and not of the index. See also:
https://www.postgresql.org/message-id/14438.1512499811@sss.pgh.pa.us

Years ago I had a patch to make correlation a property of indexes.

--
Justin
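For what it's worth, the column-level correlation the planner does use can be inspected directly; a small sketch against the t(i) example above (pg_stats is populated by the VACUUM ANALYZE already shown):

    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename = 't' AND attname = 'i';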
Hi,

On 2020-03-21 23:53:05 -0500, Justin Pryzby wrote:
> On Sat, Mar 21, 2020 at 07:33:02PM -0700, Andres Freund wrote:
> > While your recent btree work ensures that we get the heap tids for an
> > equality lookup in heap order (right?),
>
> I think when I tested the TID tiebreaker patch, it didn't help for our case,
> which is an inequality: (timestamptz >= start AND timestamptz < end).
>
> That seems to explain why, although I don't understand why it wouldn't also
> apply to an inequality comparison?

Because tids are only ordered within a single lookup key. So if you scan across multiple values, you could visit key:page combinations in the order 1:1 1:2 1:99 2:1 2:7 99:1 or such, i.e. the heap pages would not be monotonically increasing across the whole scan. You can't, however, have 1:17 1:1, because for a specific key value the tid is used as an additional comparison value.

Greetings,

Andres Freund
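A quick way to see this from SQL (a sketch, not from the original mails, reusing the t(i) table from upthread; it assumes an index built on PostgreSQL 12 or later, i.e. with the heap TID tiebreaker, and forces an index scan so rows come back in index order):

    SET enable_seqscan = off;
    SET enable_bitmapscan = off;

    -- within a single key value the returned ctids ascend ...
    SELECT ctid, i FROM t WHERE i = 2500 LIMIT 10;

    -- ... but across key values each key's group starts over at low block
    -- numbers, so the heap page sequence for a range scan is not monotonic
    SELECT ctid, i FROM t WHERE i BETWEEN 2500 AND 2501;

    RESET enable_seqscan;
    RESET enable_bitmapscan;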