Re: Incremental Sort Cost Estimation Instability - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Incremental Sort Cost Estimation Instability
Date
Msg-id CAPpHfduossatgSzBzTm-ENnzhiGEj-xE=pk8MK79XR2cm9=vDg@mail.gmail.com
Whole thread Raw
In response to Incremental Sort Cost Estimation Instability  (Andrei Lepikhov <lepihov@gmail.com>)
Responses Re: Incremental Sort Cost Estimation Instability
List pgsql-hackers
Hi, Andrei!

On Wed, May 14, 2025 at 1:50 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> On 9/12/24 16:57, Tomas Vondra wrote:
> > On 9/12/24 12:12, David Rowley wrote:
> >> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepihov@gmail.com> wrote:
> >>> Initial problem causes wrong cost_sort estimation. Right now I think
> >>> about providing cost_sort() the sort clauses instead of (or in addition
> >>> to) the pathkeys.
> >>
> >> I'm not quite sure why the sort clauses matter any more than the
> >> EquivalenceClass.  If the EquivalanceClass defines that all members
> >> will have the same value for any given row, then, if we had to choose
> >> any single member to drive the n_distinct estimate from, isn't the
> >> most accurate distinct estimate from the member with the smallest
> >> n_distinct estimate?  (That assumes the less distinct member has every
> >> value the more distinct member has, which might not be true)
> > I'm not sure how to fix this, but it seems estimate_num_groups() needs
> > to do things differently. And I agree looking for the minimum ndistinct
> > seems like the right approach. but doesn't estimate_num_groups()
> > supposed to already do that? The comment says:
> Hi,
>
> I have rewritten the code according to the idea of the
> estimate_num_groups responsibility to adjust estimation according to the EC.
> I haven't changed all the places of ngroups estimation - only where the
> planner can access pathkeys to avoid the overhead of passing through all
> the ECs. And added some cache in the EM.
> The most important case for me here is GROUP-BY estimation and
> create_unique_path because I frequently see them as sides of a JOIN.
> It would be also nice to adjust the cost of memoize rescan, but it may
> be a subject of the future improvement.

Thank you for your work on this subject.  I've the following notes
about the last patch.
1. Now groupExprs argument of estimate_num_groups() might contain
either Expr's or Pathkey's.  I think this should be documented.
2. A new argument relids of estimate_num_groups() also should be documented.
3. I doubt estimate_num_groups() is appropriate place to handle "varno
0".  Should we do this in cost_incremental_sort() as before?  I see
that it currently analyzes only first EquivalenceMember.  But could it
be a different answers between first EquivalenceMember and others?

------
Regards,
Alexander Korotkov
Supabase



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: C11 / VS 2019
Next
From: Alexander Korotkov
Date:
Subject: Re: MergeAppend could consider sorting cheapest child path