Re: bitmaps and correlation - Mailing list pgsql-hackers
From: Justin Pryzby
Subject: Re: bitmaps and correlation
Date:
Msg-id: 20201106175733.GB9839@telsasoft.com
In response to: Re: bitmaps and correlation (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
List: pgsql-hackers
On Fri, Nov 06, 2020 at 01:51:26PM +0000, Anastasia Lubennikova wrote:
> I wonder why this patch hangs so long without a review. Maybe it will help to move the discussion forward if you provide more examples of queries that can benefit from this improvement?

Thanks for looking.

The explanation is that the planner currently gives index scans a cost "discount" for correlation. Jeff Janes has pointed out that there are two discounts: 1) fewer pages are read; and 2) lower cost per page. This patch aims to give bitmap scans the same benefits: a "dense" bitmap will read fewer pages, more sequentially.

Tom pointed out that "correlation" isn't a perfect metric for this, since the index might be "clumpy" without being well-ordered, which doesn't matter for bitmap scans, which access pages in physical order anyway. In those cases, the correlation logic would fail to reduce the estimated cost of a bitmap scan, even though the actual cost is lower (same as now). That's true, but my goal is to give bitmap scans at least the same benefit as index scans, even if there are additional "discounts" which aren't yet being considered.

This was an issue for me in the past when the planner chose to scan an index, but it was slower than projected (for reasons unrelated to this patch). Making bitmap costing account for high correlation was one step towards addressing that. Since then, we switched to BRIN indexes, which force bitmap scans.
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com

Here's an example.
CREATE TABLE t AS SELECT a,b FROM generate_series(1,999) a, generate_series(1,999) b ORDER BY a+b/9.9;
CREATE INDEX ON t(a);

postgres=# SELECT attname, correlation FROM pg_stats WHERE tablename='t';
 a | 0.9951819
 b | 0.10415093

postgres=# explain analyze SELECT * FROM t WHERE a BETWEEN 55 AND 77;
 Index Scan using t_a_idx on t  (cost=0.42..810.89 rows=22683 width=8) (actual time=0.292..66.657 rows=22977 loops=1)

vs (without my patch, with SET enable_indexscan=off):
 Bitmap Heap Scan on t  (cost=316.93..5073.17 rows=22683 width=8) (actual time=10.810..26.633 rows=22977 loops=1)

vs (with my patch, with SET enable_indexscan=off):
postgres=# explain analyze SELECT * FROM t WHERE a BETWEEN 55 AND 77;
 Bitmap Heap Scan on t  (cost=316.93..823.84 rows=22683 width=8) (actual time=10.742..33.279 rows=22977 loops=1)

So the bitmap scan is cheaper, but its cost estimate is a lot higher. My patch improves that but doesn't completely "fix" it: the bitmap scan is still costed as more expensive, even though it happens to be cheaper. This is probably not even a particularly good example, as it's a small table cached in RAM. There will always be cases like this near the costs where the plan changes "shape". I think a cost difference of 10 here is very reasonable (cpu_operator_cost, probably), but a cost difference of 5x is not.

Not many regression tests changed. That's probably partially because bitmap scans have an overhead (the heap scan cannot start until after the index scan finishes), and we avoid large tests.

If there's no interest in the patch, I guess we should just close it rather than letting it rot.

> The first patch is simply a refactoring and I don't see any possible objections against it.
> The second patch also looks fine to me. The logic is understandable and the code is neat.
>
> It wouldn't hurt to add a comment for this computation, though.
> + pages_fetched = pages_fetchedMAX + indexCorrelation*indexCorrelation*(pages_fetchedMIN - pages_fetchedMAX);

You're right.
It's like this:

// interpolate between c==0: pages_fetched=max and c==1: pages_fetched=min
pages_fetched = min + (max-min)*(1-c**2)
pages_fetched = min + max*(1-c**2) - min*(1-c**2)
pages_fetched = max*(1-c**2) + min - min*(1-c**2)
pages_fetched = max*(1-c**2) + min*(c**2)
pages_fetched = max - max*c**2 + min*(c**2)
pages_fetched = max + min*(c**2) - max*c**2
pages_fetched = max + c**2 * (min-max)

I'm not sure if there's a reason why it's written like that, but (min-max) looks odd, so I wrote it like:

pages_fetched = max - c**2 * (max-min)

> The new status of this patch is: Waiting on Author
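The equivalence of the two forms is easy to sanity-check numerically. Here's a small sketch (mine, not from the patch; the function names and page counts are made up for illustration) of the interpolation between the uncorrelated maximum and the perfectly-correlated minimum:

```python
def pages_fetched_patch(c, pages_min, pages_max):
    # Form used in the patch: max + c^2 * (min - max)
    return pages_max + c * c * (pages_min - pages_max)

def pages_fetched_rewritten(c, pages_min, pages_max):
    # Rewritten form: max - c^2 * (max - min)
    return pages_max - c * c * (pages_max - pages_min)

# c == 0 (no correlation) yields the pessimistic maximum;
# c == 1 (perfect correlation) yields the optimistic minimum.
assert pages_fetched_patch(0.0, 100, 5000) == 5000
assert pages_fetched_patch(1.0, 100, 5000) == 100

# Both forms agree for any correlation in between.
for c in (0.0, 0.25, 0.5, 0.9951819, 1.0):
    a = pages_fetched_patch(c, 100, 5000)
    b = pages_fetched_rewritten(c, 100, 5000)
    assert abs(a - b) < 1e-9
```

Note the quadratic weight c**2: a moderately correlated column gets only a small discount, while the estimate approaches the minimum quickly as correlation nears 1.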