Home > mailing lists

Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

From	Richard Guo
Subject	Re: POC: GROUP BY optimization
Date	February 21, 2024 08:08:12
Msg-id	CAMbWs49oMEFUEy6RRaenx5ak4AuZzuDEDzi0otEBYwJzBBYz7Q@mail.gmail.com Whole thread Raw
In response to	Re: POC: GROUP BY optimization (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
Responses	Re: POC: GROUP BY optimization
List	pgsql-hackers

Tree view

On Fri, Feb 2, 2024 at 12:40 PM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:

On 2/2/2024 11:06, Richard Guo wrote:
>
> On Fri, Feb 2, 2024 at 11:32 AM Richard Guo <guofenglinux@gmail.com
> <mailto:guofenglinux@gmail.com>> wrote:
>
> On Fri, Feb 2, 2024 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us
> <mailto:tgl@sss.pgh.pa.us>> wrote:
>
> One of the test cases added by this commit has not been very
> stable in the buildfarm. Latest example is here:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2024-02-01%2021%3A28%3A04 <https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2024-02-01%2021%3A28%3A04>
>
> and I've seen similar failures intermittently on other machines.
>
> I'd suggest building this test atop a table that is more stable
> than pg_class. You're just waving a red flag in front of a bull
> if you expect stable statistics from that during a regression run.
> Nor do I see any particular reason for pg_class to be especially
> suited to the test.
>
>
> Yeah, it's not a good practice to use pg_class in this place. While
> looking through the test cases added by this commit, I noticed some
> other minor issues that are not great. Such as
>
> * The table 'btg' is inserted with 10000 tuples, which seems a bit
> expensive for a test. I don't think we need such a big table to test
> what we want.
>
> * I don't see why we need to manipulate GUC max_parallel_workers and
> max_parallel_workers_per_gather.
>
> * I think we'd better write the tests with the keywords being all upper
> or all lower. A mixed use of upper and lower is not great. Such as in
>
> explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;
>
> * Some comments for the test queries are not easy to read.
>
> * For this statement
>
> CREATE INDEX idx_y_x_z ON btg(y,x,w);
>
> I think the index name would cause confusion. It creates an index on
> columns y, x and w, but the name indicates an index on y, x and z.
>
> I'd like to write a draft patch for the fixes.
>
>
> Here is the draft patch that fixes the issues I complained about in
> upthread.

I passed through the patch. Looks like it doesn't break anything. Why do
you prefer to use count(*) in EXPLAIN instead of plain targetlist, like
"SELECT x,y,..."?

Nothing special. Just making the test cases consistent as much as
possible.

Also, according to the test mentioned by Tom:
1. I see, PG uses IndexScan on (x,y). So, column x will be already
sorted before the MergeJoin. Why not use Incremental Sort on (x,z,w)
instead of full sort?

I think that's because the planner chooses to use (z, w, x) to perform
the mergejoin. I did not delve into the details, but I guess the cost
estimation decides this is cheaper.

Hi Alexander,

What do you think about the revisions for the test cases?

Thanks
Richard

pgsql-hackers by date:

From: Bertrand Drouvot
Date: 21 February 2024, 08:07:47
Subject: Re: Injection points: some tools to wait and wake

From: Michail Nikolaev
Date: 21 February 2024, 08:35:40
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

Previous

Next