From: Alexander Korotkov
Subject: Re: PoC: Partial sort
Msg-id: CAPpHfds7j4_=aAQixSDbWyim1588kboB8LOB8wj+Ei0RbcvozQ@mail.gmail.com
In response to: Re: PoC: Partial sort (Marti Raudsepp <marti@juffo.org>)
List: pgsql-hackers
On Thu, Feb 6, 2014 at 12:39 PM, Marti Raudsepp <marti@juffo.org> wrote:
> On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Hmm, sounds a little steep. Why is it so expensive? I'm probably
>> missing something here, because I would have thought that planner
>> support for partial sorts would consist mostly of considering the same
>> sorts we consider today, but with the costs reduced by the batching.
>
> I guess it's because the patch undoes some optimizations in the
> mergejoin planner wrt caching merge clauses and adds a whole lot of
> code to find_mergeclauses_for_pathkeys. In other code paths the
> overhead does seem to be negligible.
>
> Notice the removal of:
>
> /* Select the right mergeclauses, if we didn't already */
>
> /*
>  * Avoid rebuilding clause list if we already made one;
>  * saves memory in big join trees...
>  */
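For context, the pattern those removed comments guarded looks roughly like the following toy, self-contained C sketch (names are made up, this is not PostgreSQL source): the expensive clause list is built once and reused across the many inner paths considered, instead of being recomputed on every iteration.

/*
 * Toy illustration of lazy caching: build the costly mergeclause list
 * once, then reuse it for every candidate path.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct ClauseList { int nclauses; } ClauseList;

/* Stand-in for find_mergeclauses_for_pathkeys(); assume it is costly. */
static ClauseList *
build_mergeclauses(void)
{
    ClauseList *cl = malloc(sizeof(ClauseList));
    cl->nclauses = 3;                    /* placeholder result */
    return cl;
}

int
main(void)
{
    ClauseList *mergeclauses = NULL;     /* lazily built cache */

    for (int innerpath = 0; innerpath < 1000; innerpath++)
    {
        /* Select the right mergeclauses, if we didn't already:
         * avoids rebuilding the list for each candidate path. */
        if (mergeclauses == NULL)
            mergeclauses = build_mergeclauses();

        /* ... cost a mergejoin using mergeclauses ... */
    }

    printf("list built once, reused for 1000 paths\n");
    free(mergeclauses);
    return 0;
}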
This is not the only place that worries me about planning overhead; see get_cheapest_fractional_path_for_pathkeys. There I had to estimate the number of groups for each sorting column in order to pick the right fractional path: for a partial sort path, the cost of the first batch should be included in the startup cost. If we don't do that, the optimizer can pick strange plans based on the assumption that it needs only a few rows from the inner node. See the example below.
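To spell out the startup-cost point first, here is a minimal self-contained C sketch; the function name and the formula are illustrative assumptions, not the patch's actual code. A partial sort over N input rows arriving in G presorted groups must read and sort one whole group (about N/G rows) before it can emit its first row, so that work belongs in the startup cost; charging it only to the total cost makes the path look artificially cheap under a LIMIT.

#include <math.h>
#include <stdio.h>

/* n * log2(n) comparison-sort model, as in PostgreSQL's cost_sort() */
static double
sort_cost(double ntuples, double cpu_per_cmp)
{
    return ntuples <= 1.0 ? 0.0 : cpu_per_cmp * ntuples * log2(ntuples);
}

int
main(void)
{
    double input_total = 1000000.0;   /* cost to read all input rows */
    double ntuples     = 1000000.0;   /* input rows */
    double ngroups     = 100.0;       /* estimated groups on presorted keys */
    double per_group   = sort_cost(ntuples / ngroups, 2 * 0.0025);

    /* First output row requires reading and sorting the first group... */
    double startup = input_total / ngroups + per_group;
    /* ...while producing all rows requires sorting every group. */
    double total   = input_total + ngroups * per_group;

    printf("startup=%.2f total=%.2f\n", startup, total);
    return 0;
}

With these illustrative numbers the startup cost is about 1/100 of the total, which is exactly the shape of the Partial sort cost pairs in the plans below.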
create table test1 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);
create table test2 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);
create index test1_v1_idx on test1 (v1);
Plan without fraction estimation in get_cheapest_fractional_path_for_pathkeys:
postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Limit (cost=198956893.20..198956913.33 rows=10 width=24)
-> Partial sort (cost=198956893.20..19909637942.82 rows=9791031169 width=24)
Sort Key: t1.v1, t1.id
Presorted Key: t1.v1
-> Nested Loop (cost=0.42..19883065506.84 rows=9791031169 width=24)
Join Filter: (t1.v1 = t2.v1)
-> Index Scan using test1_v1_idx on test1 t1 (cost=0.42..47600.84 rows=1000000 width=12)
-> Materialize (cost=0.00..25289.00 rows=1000000 width=12)
-> Seq Scan on test2 t2 (cost=0.00..15406.00 rows=1000000 width=12)
(9 rows)
With the current version of the patch:
postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Limit (cost=3699913.43..3699913.60 rows=10 width=24)
-> Partial sort (cost=3699913.43..173638549.67 rows=9791031169 width=24)
Sort Key: t1.v1, t1.id
Presorted Key: t1.v1
-> Merge Join (cost=150444.79..147066113.70 rows=9791031169 width=24)
Merge Cond: (t1.v1 = t2.v1)
-> Index Scan using test1_v1_idx on test1 t1 (cost=0.42..47600.84 rows=1000000 width=12)
-> Materialize (cost=149244.84..154244.84 rows=1000000 width=12)
-> Sort (cost=149244.84..151744.84 rows=1000000 width=12)
Sort Key: t2.v1
-> Seq Scan on test2 t2 (cost=0.00..15406.00 rows=1000000 width=12)
(11 rows)
I don't compare actual execution times because I didn't wait for the first plan to finish executing :-)
But in any case the costs are extraordinary, and the full sequential scan of 1000000 rows on the inner side is odd.
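For scale, a back-of-envelope check (assumed constants; it ignores several of the planner's other per-tuple charges, so it understates the plan's figure) of why that nested loop is hopeless: every one of the 1000000 outer rows is compared against all 1000000 materialized inner rows.

#include <stdio.h>

int
main(void)
{
    double outer_rows = 1e6;
    double inner_rows = 1e6;
    double cpu_operator_cost = 0.0025;   /* PostgreSQL's default */

    /* Join filter evaluations alone: outer * inner comparisons. */
    double filter_cost = outer_rows * inner_rows * cpu_operator_cost;
    printf("join filter alone: %.2e cost units\n", filter_cost);  /* 2.5e9 */
    return 0;
}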
------
With best regards,
Alexander Korotkov.