Re: Incremental Sort Cost Estimation Instability - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Incremental Sort Cost Estimation Instability
Date
Msg-id CAPpHfdsbVczNq-f50zS1xr0kicyA-dPMcaBs0szTVxo67o8GBw@mail.gmail.com
Whole thread Raw
In response to Re: Incremental Sort Cost Estimation Instability  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
On Tue, Jun 3, 2025 at 5:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Wed, May 14, 2025 at 1:50 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> > On 9/12/24 16:57, Tomas Vondra wrote:
> > > On 9/12/24 12:12, David Rowley wrote:
> > >> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepihov@gmail.com> wrote:
> > >>> Initial problem causes wrong cost_sort estimation. Right now I think
> > >>> about providing cost_sort() the sort clauses instead of (or in addition
> > >>> to) the pathkeys.
> > >>
> > >> I'm not quite sure why the sort clauses matter any more than the
> > >> EquivalenceClass.  If the EquivalanceClass defines that all members
> > >> will have the same value for any given row, then, if we had to choose
> > >> any single member to drive the n_distinct estimate from, isn't the
> > >> most accurate distinct estimate from the member with the smallest
> > >> n_distinct estimate?  (That assumes the less distinct member has every
> > >> value the more distinct member has, which might not be true)
> > > I'm not sure how to fix this, but it seems estimate_num_groups() needs
> > > to do things differently. And I agree looking for the minimum ndistinct
> > > seems like the right approach. but doesn't estimate_num_groups()
> > > supposed to already do that? The comment says:
> > Hi,
> >
> > I have rewritten the code according to the idea of the
> > estimate_num_groups responsibility to adjust estimation according to the EC.
> > I haven't changed all the places of ngroups estimation - only where the
> > planner can access pathkeys to avoid the overhead of passing through all
> > the ECs. And added some cache in the EM.
> > The most important case for me here is GROUP-BY estimation and
> > create_unique_path because I frequently see them as sides of a JOIN.
> > It would be also nice to adjust the cost of memoize rescan, but it may
> > be a subject of the future improvement.
>
> Thank you for your work on this subject.  I've the following notes
> about the last patch.
> 1. Now groupExprs argument of estimate_num_groups() might contain
> either Expr's or Pathkey's.  I think this should be documented.
> 2. A new argument relids of estimate_num_groups() also should be documented.
> 3. I doubt estimate_num_groups() is appropriate place to handle "varno
> 0".  Should we do this in cost_incremental_sort() as before?  I see
> that it currently analyzes only first EquivalenceMember.  But could it
> be a different answers between first EquivalenceMember and others?

4. A new EquivalenceMember.em_default_nd field is not documented.  Is
it necessary to keep a dedicated field for this?  Could we use just
special value of em_ndistinct (which would need to be documented as
well)?
5. I think the algorithm of distinct elements estimation needs more
clarification.  Assuming all equivalence members within the relids
should be equal, the cardinality might be lower than cardinality of
any member.  Do we assume the lowest cardinality among the individual
matching members as the conservative worst-case scenario?

------
Regards,
Alexander Korotkov
Supabase



pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: PG 18 release notes draft committed
Next
From: Noah Misch
Date:
Subject: Re: PG 18 release notes draft committed