Re: Properly pathify the union planner - Mailing list pgsql-hackers
From | Richard Guo |
---|---|
Subject | Re: Properly pathify the union planner |
Date | |
Msg-id | CAMbWs49J2jv61DsQ_D=JgnuPMjbd1W3WoMw=-DB=zRpGYRh+yA@mail.gmail.com Whole thread Raw |
In response to | Re: Properly pathify the union planner (David Rowley <dgrowleyml@gmail.com>) |
Responses |
Re: Properly pathify the union planner
Re: Properly pathify the union planner |
List | pgsql-hackers |
On Fri, Nov 24, 2023 at 6:29 AM David Rowley <dgrowleyml@gmail.com> wrote:
I've attached the updated patch. This one is probably ready for
someone to test out. There will be more work to do, however.
I just started reviewing this patch and haven't looked through all the
details yet. Here are some feedbacks that came to my mind. Post them
first so that I don’t forget them after the holidays.
* I think we should update truncate_useless_pathkeys() to account for
the ordering requested by the query's set operation; otherwise we may
not get a subquery's path with the expected pathkeys. For instance,
create table t (a int, b int);
create index on t (a, b);
set enable_hashagg to off;
-- on v1 patch
explain (costs off)
(select * from t order by a) UNION (select * from t order by a);
QUERY PLAN
------------------------------------------------------------
Unique
-> Merge Append
Sort Key: t.a, t.b
-> Incremental Sort
Sort Key: t.a, t.b
Presorted Key: t.a
-> Index Only Scan using t_a_b_idx on t
-> Incremental Sort
Sort Key: t_1.a, t_1.b
Presorted Key: t_1.a
-> Index Only Scan using t_a_b_idx on t t_1
(11 rows)
-- after accounting for setop_pathkeys in truncate_useless_pathkeys()
explain (costs off)
(select * from t order by a) UNION (select * from t order by a);
QUERY PLAN
------------------------------------------------------
Unique
-> Merge Append
Sort Key: t.a, t.b
-> Index Only Scan using t_a_b_idx on t
-> Index Only Scan using t_a_b_idx on t t_1
(5 rows)
details yet. Here are some feedbacks that came to my mind. Post them
first so that I don’t forget them after the holidays.
* I think we should update truncate_useless_pathkeys() to account for
the ordering requested by the query's set operation; otherwise we may
not get a subquery's path with the expected pathkeys. For instance,
create table t (a int, b int);
create index on t (a, b);
set enable_hashagg to off;
-- on v1 patch
explain (costs off)
(select * from t order by a) UNION (select * from t order by a);
QUERY PLAN
------------------------------------------------------------
Unique
-> Merge Append
Sort Key: t.a, t.b
-> Incremental Sort
Sort Key: t.a, t.b
Presorted Key: t.a
-> Index Only Scan using t_a_b_idx on t
-> Incremental Sort
Sort Key: t_1.a, t_1.b
Presorted Key: t_1.a
-> Index Only Scan using t_a_b_idx on t t_1
(11 rows)
-- after accounting for setop_pathkeys in truncate_useless_pathkeys()
explain (costs off)
(select * from t order by a) UNION (select * from t order by a);
QUERY PLAN
------------------------------------------------------
Unique
-> Merge Append
Sort Key: t.a, t.b
-> Index Only Scan using t_a_b_idx on t
-> Index Only Scan using t_a_b_idx on t t_1
(5 rows)
* I understand that we need to sort (full or incremental) the paths of
the subqueries to meet the ordering required for setop_pathkeys, so that
MergeAppend has something to work with. Currently in the v1 patch this
sorting is performed during the planning phase of the subqueries (in
grouping_planner).
And we want to add the subquery's cheapest_total_path as-is to allow
that path to be used in hash-based UNIONs, and we also want to add a
sorted path on top of cheapest_total_path. And then we may encounter
the issue you mentioned earlier regarding add_path() potentially freeing
the cheapest_total_path, leaving the Sort's subpath gone.
I'm thinking that maybe it'd be better to move the work of sorting the
subquery's paths to the outer query level, specifically within the
build_setop_child_paths() function, just before we stick SubqueryScanPath
on top of the subquery's paths. I think this is better because:
1. This minimizes the impact on subquery planning and reduces the
footprint within the grouping_planner() function as much as possible.
2. This can help avoid the aforementioned add_path() issue because the
two involved paths will be structured as:
cheapest_path -> subqueryscan
and
cheapest_path -> sort -> subqueryscan
If the two paths cost fuzzily the same and add_path() decides to keep
the second one due to it having better pathkeys and pfree the first one,
it would not be a problem.
BTW, I haven't looked through the part involving partial paths, but I
think we can do the same to partial paths.
* I noticed that in generate_union_paths() we use a new function
build_setop_pathkeys() to build the 'union_pathkeys'. I wonder why we
don't simply utilize make_pathkeys_for_sortclauses() since we already
have the grouplist for the setop's output columns.
the subqueries to meet the ordering required for setop_pathkeys, so that
MergeAppend has something to work with. Currently in the v1 patch this
sorting is performed during the planning phase of the subqueries (in
grouping_planner).
And we want to add the subquery's cheapest_total_path as-is to allow
that path to be used in hash-based UNIONs, and we also want to add a
sorted path on top of cheapest_total_path. And then we may encounter
the issue you mentioned earlier regarding add_path() potentially freeing
the cheapest_total_path, leaving the Sort's subpath gone.
I'm thinking that maybe it'd be better to move the work of sorting the
subquery's paths to the outer query level, specifically within the
build_setop_child_paths() function, just before we stick SubqueryScanPath
on top of the subquery's paths. I think this is better because:
1. This minimizes the impact on subquery planning and reduces the
footprint within the grouping_planner() function as much as possible.
2. This can help avoid the aforementioned add_path() issue because the
two involved paths will be structured as:
cheapest_path -> subqueryscan
and
cheapest_path -> sort -> subqueryscan
If the two paths cost fuzzily the same and add_path() decides to keep
the second one due to it having better pathkeys and pfree the first one,
it would not be a problem.
BTW, I haven't looked through the part involving partial paths, but I
think we can do the same to partial paths.
* I noticed that in generate_union_paths() we use a new function
build_setop_pathkeys() to build the 'union_pathkeys'. I wonder why we
don't simply utilize make_pathkeys_for_sortclauses() since we already
have the grouplist for the setop's output columns.
To assist the discussion I've attached a diff file that includes all the
changes above.
Thanks
Richard
changes above.
Thanks
Richard
Attachment
pgsql-hackers by date: