Re: Incremental Sort Cost Estimation Instability - Mailing list pgsql-hackers
From | Alexander Korotkov |
---|---|
Subject | Re: Incremental Sort Cost Estimation Instability |
Date | |
Msg-id | CAPpHfdsbVczNq-f50zS1xr0kicyA-dPMcaBs0szTVxo67o8GBw@mail.gmail.com Whole thread Raw |
In response to | Re: Incremental Sort Cost Estimation Instability (Alexander Korotkov <aekorotkov@gmail.com>) |
List | pgsql-hackers |
On Tue, Jun 3, 2025 at 5:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Wed, May 14, 2025 at 1:50 PM Andrei Lepikhov <lepihov@gmail.com> wrote: > > On 9/12/24 16:57, Tomas Vondra wrote: > > > On 9/12/24 12:12, David Rowley wrote: > > >> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepihov@gmail.com> wrote: > > >>> Initial problem causes wrong cost_sort estimation. Right now I think > > >>> about providing cost_sort() the sort clauses instead of (or in addition > > >>> to) the pathkeys. > > >> > > >> I'm not quite sure why the sort clauses matter any more than the > > >> EquivalenceClass. If the EquivalanceClass defines that all members > > >> will have the same value for any given row, then, if we had to choose > > >> any single member to drive the n_distinct estimate from, isn't the > > >> most accurate distinct estimate from the member with the smallest > > >> n_distinct estimate? (That assumes the less distinct member has every > > >> value the more distinct member has, which might not be true) > > > I'm not sure how to fix this, but it seems estimate_num_groups() needs > > > to do things differently. And I agree looking for the minimum ndistinct > > > seems like the right approach. but doesn't estimate_num_groups() > > > supposed to already do that? The comment says: > > Hi, > > > > I have rewritten the code according to the idea of the > > estimate_num_groups responsibility to adjust estimation according to the EC. > > I haven't changed all the places of ngroups estimation - only where the > > planner can access pathkeys to avoid the overhead of passing through all > > the ECs. And added some cache in the EM. > > The most important case for me here is GROUP-BY estimation and > > create_unique_path because I frequently see them as sides of a JOIN. > > It would be also nice to adjust the cost of memoize rescan, but it may > > be a subject of the future improvement. > > Thank you for your work on this subject. I've the following notes > about the last patch. > 1. Now groupExprs argument of estimate_num_groups() might contain > either Expr's or Pathkey's. I think this should be documented. > 2. A new argument relids of estimate_num_groups() also should be documented. > 3. I doubt estimate_num_groups() is appropriate place to handle "varno > 0". Should we do this in cost_incremental_sort() as before? I see > that it currently analyzes only first EquivalenceMember. But could it > be a different answers between first EquivalenceMember and others? 4. A new EquivalenceMember.em_default_nd field is not documented. Is it necessary to keep a dedicated field for this? Could we use just special value of em_ndistinct (which would need to be documented as well)? 5. I think the algorithm of distinct elements estimation needs more clarification. Assuming all equivalence members within the relids should be equal, the cardinality might be lower than cardinality of any member. Do we assume the lowest cardinality among the individual matching members as the conservative worst-case scenario? ------ Regards, Alexander Korotkov Supabase
pgsql-hackers by date: