Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers
From | a.rybakina |
---|---|
Subject | Re: POC, WIP: OR-clause support for indexes |
Date | |
Msg-id | 3c539d4b-1c3c-4119-a3c9-e335b81c83cf@postgrespro.ru Whole thread Raw |
In response to | Re: POC, WIP: OR-clause support for indexes (Alexander Korotkov <aekorotkov@gmail.com>) |
List | pgsql-hackers |
Hi! On 15.10.2023 01:34, Alexander Korotkov wrote: > Hi, Alena! > > Thank you for your work on the subject. > > On Wed, Oct 4, 2023 at 10:21 PM a.rybakina <a.rybakina@postgrespro.ru> wrote: >> I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I addedmy own regression tests. >> Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff asexpected and real results: >> >> diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out >> --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300 >> +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out 2023-10-04 21:55:41.665422459 +0300 >> @@ -1925,17 +1925,20 @@ >> EXPLAIN (COSTS OFF) >> SELECT count(*) FROM tenk1 >> WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41; >> - QUERY PLAN >> --------------------------------------------------------------------------------------------------------- >> + QUERY PLAN >> +--------------------------------------------------------------------------------------------------------------------------- >> Aggregate >> -> Bitmap Heap Scan on tenk1 >> - Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41)) >> + Recheck Cond: ((((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand =41)) >> -> BitmapOr >> - -> Bitmap Index Scan on tenk1_thous_tenthous >> - Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) >> + -> BitmapOr >> + -> Bitmap Index Scan on tenk1_thous_tenthous >> + Index Cond: ((thousand = 42) AND (tenthous = 1)) >> + -> Bitmap Index Scan on tenk1_thous_tenthous >> + Index Cond: ((thousand = 42) AND (tenthous = 3)) >> -> Bitmap Index Scan on tenk1_thous_tenthous >> Index Cond: (thousand = 41) >> -(8 rows) >> +(11 rows) > I think this query is not converted, because you only convert > top-level ORs in the transform_ors() function. But in the example > given, the target OR lays under AND, which in turn lays under another > OR. I think you need to make transform_ors() recursive to handle > cases like this. > > I wonder about the default value of the parameter or_transform_limit > of 500. In [1] and [2] you show the execution time degradation from 0 > to ~500 OR clauses. I made a simple SQL script with the query "SELECT > * FROM pgbench_accounts a WHERE aid = 1 OR aid = 2 OR ... OR aid = > 100;". The pgbench results for a single connection in prepared mode > are the following. > master: 936 tps > patched (or_transform_limit == 0) :1414 tps > So, transformation to ANY obviously accelerates the execution. > > I think it's important to identify the cases where this patch causes > the degradation. Generally, I don't see why ANY could be executed > slower than the equivalent OR clause. So, the possible degradation > cases are slower plan generation and worse plans. I managed to find > both. > > As you stated before, currently the OR transformation has a quadratic > complexity depending on the number of or-clause-groups. I made a > simple test to evaluate this. containing 10000 or-clause-groups. > SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 * > bid = 1 OR ... OR aid + 10000 * bid = 1; > master: 316ms > patched: 7142ms > Note, that the current or_transform_limit GUC parameter is not capable > of cutting such cases, because it cuts cases lower than the limit not > higher than the limit. In the comment, you mention that we could > invent something like hash to handle this. Hash should be nice, but > the problem is that we currently don't have a generic facility to hash > nodes (or even order them). It would be nice to add this facility, > that would be quite a piece of work. I would propose to limit this > patch for now to handle just a single Var node as a non-const side of > the clause and implement a simple hash for Vars. > > Another problem is the possible generation of worse plans. I made an > example table with two partial indexes. > create table test as (select (random()*10)::int x, (random()*1000) y > from generate_series(1,1000000) i); > create index test_x_1_y on test (y) where x = 1; > create index test_x_2_y on test (y) where x = 2; > vacuum analyze test; > > Without the transformation of ORs to ANY, our planner manages to use > both indexes with a Bitmap scan. > # explain select * from test where (x = 1 or x = 2) and y = 100; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12) > Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y = > '100'::double precision) AND (x = 2))) > -> BitmapOr (cost=8.60..8.60 rows=1 width=0) > -> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0) > Index Cond: (y = '100'::double precision) > -> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0) > Index Cond: (y = '100'::double precision) > (7 rows) > > With transformation, the planner can't use indexes. > # explain select * from test where (x = 1 or x = 2) and y = 100; > QUERY PLAN > ----------------------------------------------------------------------------- > Gather (cost=1000.00..12690.10 rows=1 width=12) > Workers Planned: 2 > -> Parallel Seq Scan on test (cost=0.00..11690.00 rows=1 width=12) > Filter: ((x = ANY (ARRAY[1, 2])) AND (y = '100'::double precision)) > (4 rows) > > The solution I see would be to tech Bitmap scan to handle ANY clause > in the same way as the OR clause. I think the entry point for the > relevant logic is the choose_bitmap_and() function. > > Regarding the GUC parameter, I don't see we need a limit. It's not > yet clear whether a small number or a large number of OR clauses are > more favorable for transformation. I propose to have just a boolean > enable_or_transformation GUC. > I removed the limit from the hook, left the option to enable it or not. I replaced the data structure so that the groups were formed not in a list, but in a hash table. It seems to work fine, but I haven't figured out yet why in some cases the regression test results are different and the function doesn't work. So far, I have formed a patch for the version where the conversion takes place in parsing, since so far this patch looks the most reliable for me For convenience, I have formed a patch for the very first version so far. I have a suspicion that the problem is in the part where we form a hash from a string. I'm still figuring it out.
Attachment
pgsql-hackers by date: