Thread: [HACKERS] Removing useless DISTINCT clauses
In [1] we made a change to process the GROUP BY clause to remove any group by items that are functionally dependent on some other GROUP BY items. This really just checks if a table's PK columns are entirely present in the GROUP BY clause and removes anything else belonging to that table.

All this seems to work well, but I totally failed to consider that the exact same thing applies to DISTINCT too. Over in [2], Rui Liu mentions that the planner could do a better job for his case.

Using Rui Liu's example:

CREATE TABLE test_tbl (k INT PRIMARY KEY, col text);
INSERT INTO test_tbl SELECT generate_series(1,10000000), 'test';

Master:

postgres=# explain analyze verbose select distinct col, k from test_tbl order by k limit 1000;
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1658556.19..1658563.69 rows=1000 width=9) (actual time=8934.962..8935.495 rows=1000 loops=1)
   Output: col, k
   ->  Unique  (cost=1658556.19..1733557.50 rows=10000175 width=9) (actual time=8934.961..8935.460 rows=1000 loops=1)
         Output: col, k
         ->  Sort  (cost=1658556.19..1683556.63 rows=10000175 width=9) (actual time=8934.959..8935.149 rows=1000 loops=1)
               Output: col, k
               Sort Key: test_tbl.k, test_tbl.col
               Sort Method: external merge  Disk: 215128kB
               ->  Seq Scan on public.test_tbl  (cost=0.00..154056.75 rows=10000175 width=9) (actual time=0.062..1901.728 rows=10000000 loops=1)
                     Output: col, k
 Planning time: 0.092 ms
 Execution time: 8958.687 ms
(12 rows)

Patched:

postgres=# explain analyze verbose select distinct col, k from test_tbl order by k limit 1000;
                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..34.31 rows=1000 width=9) (actual time=0.030..0.895 rows=1000 loops=1)
   Output: col, k
   ->  Unique  (cost=0.44..338745.50 rows=10000175 width=9) (actual time=0.029..0.814 rows=1000 loops=1)
         Output: col, k
         ->  Index Scan using test_tbl_pkey on public.test_tbl  (cost=0.44..313745.06 rows=10000175 width=9) (actual time=0.026..0.452 rows=1000 loops=1)
               Output: col, k
 Planning time: 0.152 ms
 Execution time: 0.985 ms
(8 rows)

A patch to implement this is attached. I'll add it to the Jan commitfest. (I don't expect anyone to look at this before then.)

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d4c3a156cb46dcd1f9f97a8011bd94c544079bb5
[2] https://www.postgresql.org/message-id/flat/CAKJS1f9q0j3BgMUsDbtf9%3DecfVLnqvkYB44MXj0gpVuamcN8Xw%40mail.gmail.com#CAKJS1f9q0j3BgMUsDbtf9=ecfVLnqvkYB44MXj0gpVuamcN8Xw@mail.gmail.com

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
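The removal rule described in this message can be modelled in a few lines of Python. This is an illustrative sketch only: the actual patch operates on the planner's clause lists in C, and the function and argument names below are invented for the example.

```python
def remove_useless_distinct_items(distinct_items, table_pks):
    """distinct_items: list of (table, column) pairs appearing in DISTINCT.
    table_pks: {table: set of its primary-key columns}.
    If a table's entire PK is present in the DISTINCT list, that table's
    other columns are functionally dependent on it and can be dropped."""
    present = {}
    for tbl, col in distinct_items:
        present.setdefault(tbl, set()).add(col)
    kept = []
    for tbl, col in distinct_items:
        pk = table_pks.get(tbl)
        if pk and pk <= present[tbl] and col not in pk:
            continue  # functionally dependent on the PK: redundant
        kept.append((tbl, col))
    return kept

# Rui Liu's query: DISTINCT col, k where k is test_tbl's PK, so "col" is
# redundant and only "k" needs to be made distinct.
print(remove_useless_distinct_items([("test_tbl", "col"), ("test_tbl", "k")],
                                    {"test_tbl": {"k"}}))
# -> [('test_tbl', 'k')]
```

With only the PK column left in the DISTINCT clause, an index scan on the PK can satisfy both the DISTINCT and the ORDER BY, which is exactly what the patched plan above shows.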
On Mon, Nov 6, 2017 at 1:16 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> A patch to implement this is attached.
Couldn't the Unique node be removed entirely? If k is a primary key, you can't have duplicates in need of removal.
Or would that be a subject for a different patch?
I think remove_functionally_dependant_groupclauses should have a more generic name, like remove_functionally_dependant_clauses.
Jeff
Hi Jeff,

Thanks for looking at the patch.

On 6 January 2018 at 10:34, Jeff Janes <jeff.janes@gmail.com> wrote:
> Couldn't the Unique node be removed entirely? If k is a primary key, you
> can't have duplicates in need of removal.

It probably could be, but any join could duplicate those rows, so we'd only be able to do that when there are no joins at all, and I'm not sure we'd want to special case that; it seems too naive a check to be worthwhile. We could do a lot better...

> Or would that be a subject for a different patch?

Yeah, I'd rather do that fix properly, and I do have ideas on how to, it's just a lot more work (basically testing if the joins can duplicate rows or not (unique joins++)).

> I think remove_functionally_dependant_groupclauses should have a more
> generic name, like remove_functionally_dependant_clauses.

It's been a while since I looked at this. I remember thinking something similar at the time, but I must not have changed it. I'll look again and post an updated patch.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
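The join point above can be seen with a tiny Python model (an editor's toy illustration, not PostgreSQL code): even when k is unique on one side, an inner join can emit the same k twice, so the Unique node cannot simply be dropped in the general case.

```python
# Left rows keyed by a unique k; the right side has two rows matching k=1.
left = [(1, "a"), (2, "b")]
right = [(1, "x"), (1, "y")]

# A naive nested-loop inner join on k.
joined = [(lk, lc, rv) for (lk, lc) in left for (rk, rv) in right if lk == rk]
print(joined)  # k=1 appears twice, so DISTINCT still has duplicates to remove
```

Proving that no join in the query can duplicate rows (the "unique joins" idea mentioned above) is what would be needed before the Unique node itself could be removed.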
On 6 January 2018 at 23:08, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I think remove_functionally_dependant_groupclauses should have a more
>> generic name, like remove_functionally_dependant_clauses.
>
> It's been a while since I looked at this. I remember thinking
> something similar at the time but I must have not changed it.
>
> I'll look again and post an updated patch.

I've attached a patch that renames the function as you've suggested.

Thanks again for the review.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
> [ remove_useless_distinct_clauses_v2.patch ]

This is a cute idea, but I'm troubled by a couple of points:

1. Once you don't have all the tlist items shown in DISTINCT, it really is more like DISTINCT ON, seems like. I am not sure it's a good idea to set hasDistinctOn, because that engages some planner behaviors we probably don't want, but I'm also not sure we can get away with just ignoring the difference. As an example, in allpaths.c there are assorted assumptions that having a distinctClause but !hasDistinctOn means all output columns of a subquery are listed in the distinctClause.

2. There's a comment in planner.c to the effect that

     * When we have DISTINCT ON, we must sort by the more rigorous of
     * DISTINCT and ORDER BY, else it won't have the desired behavior.
     * Also, if we do have to do an explicit sort, we might as well use
     * the more rigorous ordering to avoid a second sort later.  (Note
     * that the parser will have ensured that one clause is a prefix of
     * the other.)

Removing random elements of the distinctClause will break its correspondence with the sortClause, with probably bad results.

I do not remember for sure at the moment, but it may be that this correspondence is only important for the case of DISTINCT ON, in which case we could dodge the problem by not applying the optimization unless it's plain DISTINCT. That doesn't help us with point 1 though.

BTW, my dictionary says it's "dependent" not "dependant".

			regards, tom lane
Hi Tom,

Thanks for looking at this.

On 10 January 2018 at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is a cute idea, but I'm troubled by a couple of points:
>
> 1. Once you don't have all the tlist items shown in DISTINCT, it really is
> more like DISTINCT ON, seems like. I am not sure it's a good idea to set
> hasDistinctOn, because that engages some planner behaviors we probably
> don't want, but I'm also not sure we can get away with just ignoring the
> difference. As an example, in allpaths.c there are assorted assumptions
> that having a distinctClause but !hasDistinctOn means all output columns
> of a subquery are listed in the distinctClause.

hmm, I see:

    /*
     * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
     * time: all its output columns must be used in the distinctClause.
     */
    if (subquery->distinctClause && !subquery->hasDistinctOn)
        return;

I think that particular one would just fail to remove columns from the subquery if there's a DISTINCT clause (not DISTINCT ON). That seems more like a missed optimisation than a future bug. Although it probably would be a good idea to get rid of the assumption that a non-NIL distinctClause && !hasDistinctOn means that all output target items are part of that distinct clause.

I don't see anything else in allpaths.c that would be affected, keeping in mind that we'll never completely remove a distinctClause, just possibly remove some items from the list. Everything else just seems to be checking distinctClause != NIL, or is only doing something if hasDistinctOn == true.

I'll do some more analysis on the places where distinctClause is being used to check what's safe.

> 2. There's a comment in planner.c to the effect that
>
>     * When we have DISTINCT ON, we must sort by the more rigorous of
>     * DISTINCT and ORDER BY, else it won't have the desired behavior.
>     * Also, if we do have to do an explicit sort, we might as well use
>     * the more rigorous ordering to avoid a second sort later.  (Note
>     * that the parser will have ensured that one clause is a prefix of
>     * the other.)
>
> Removing random elements of the distinctClause will break its
> correspondence with the sortClause, with probably bad results.
>
> I do not remember for sure at the moment, but it may be that this
> correspondence is only important for the case of DISTINCT ON, in which
> case we could dodge the problem by not applying the optimization unless
> it's plain DISTINCT. That doesn't help us with point 1 though.

Yeah, it seems better to only try to apply this for plain DISTINCT.

> BTW, my dictionary says it's "dependent" not "dependant".

Oops. Thanks.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2018-01-10 11:12:17 +1300, David Rowley wrote:
> I'll do some more analysis on places that distinctClause is being used
> to check what's safe.

This patch has been waiting on author since 2018-01-09, and the next & last CF has started. I'm inclined to mark this as returned with feedback.

Greetings,

Andres Freund
On 2 March 2018 at 09:51, Andres Freund <andres@anarazel.de> wrote:
> This patch has been waiting on author since 2018-01-09, the next & last
> CF has started. I'm inclined to mark this as returned with feedback.

I'm planning on making the required changes at the weekend.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 10 January 2018 at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. Once you don't have all the tlist items shown in DISTINCT, it really is
> more like DISTINCT ON, seems like. I am not sure it's a good idea to set
> hasDistinctOn, because that engages some planner behaviors we probably
> don't want, but I'm also not sure we can get away with just ignoring the
> difference. As an example, in allpaths.c there are assorted assumptions
> that having a distinctClause but !hasDistinctOn means all output columns
> of a subquery are listed in the distinctClause.

Looking through allpaths.c for usages of distinctClause, they're all either interested in distinctClause being non-NIL, which we'll have no effect on, or (in one case) only care about DISTINCT ON. Then there's:

    /*
     * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
     * time: all its output columns must be used in the distinctClause.
     */
    if (subquery->distinctClause && !subquery->hasDistinctOn)
        return;

I don't think removing this fast-path would currently have any benefit, as we're not resetting the ressortgroupref of the targetlist items whose distinctClause items we've removed. If we did that, then you might think that the following code in remove_unused_subquery_outputs() might be able to remove the targetlist item for a useless distinct item we've removed:

    /*
     * If it has a sortgroupref number, it's used in some sort/group
     * clause so we'd better not remove it. Also, don't remove any
     * resjunk columns, since their reason for being has nothing to do
     * with anybody reading the subquery's output. (It's likely that
     * resjunk columns in a sub-SELECT would always have ressortgroupref
     * set, but even if they don't, it seems imprudent to remove them.)
     */
    if (tle->ressortgroupref || tle->resjunk)
        continue;

but it can't, because we've not removed them yet. remove_unused_subquery_outputs() is called before subquery_planner() in set_subquery_pathlist(), so remove_useless_distinct_columns() won't even have been called by the time we get to the above code. So it looks like zeroing the ressortgrouprefs of the TargetEntry items belonging to the removed distinct items would be wasted effort. I also can't imagine why we'd ever try to remove unused outputs from a subquery that's already been planned, since almost all of the benefits come from allowing the planner more flexibility to do things like join removal, so I don't think there's any danger of this code becoming broken later.

I also looked at the other places in the planner which look at distinctClause. Many just test distinctClause == NIL, which is not affected here, since we'll always leave at least 1 item in that List. Another place that looks interesting is query_is_distinct_for() in analyzejoins.c. However, the subquery is being analyzed there long before it's getting planned, so all the distinctClause items will still be fully intact at that time.

> 2. There's a comment in planner.c to the effect that
>
>     * When we have DISTINCT ON, we must sort by the more rigorous of
>     * DISTINCT and ORDER BY, else it won't have the desired behavior.
>     * Also, if we do have to do an explicit sort, we might as well use
>     * the more rigorous ordering to avoid a second sort later.  (Note
>     * that the parser will have ensured that one clause is a prefix of
>     * the other.)
>
> Removing random elements of the distinctClause will break its
> correspondence with the sortClause, with probably bad results.
>
> I do not remember for sure at the moment, but it may be that this
> correspondence is only important for the case of DISTINCT ON, in which
> case we could dodge the problem by not applying the optimization unless
> it's plain DISTINCT. That doesn't help us with point 1 though.

Good point, although we could probably just apply the same optimization to the ORDER BY to sidestep that. I've left a note in that area as something to think about for another day, although I think the refactoring done in this patch would make it a pretty easy thing to fix.

> BTW, my dictionary says it's "dependent" not "dependant".

You're right, regardless of whether that dictionary says Webster's or Oxford on the front. Fixed.

I've attached an updated patch.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

This is a review of the patch to remove useless DISTINCT columns from the DISTINCT clause:
https://www.postgresql.org/message-id/CAKJS1f8UMJ137sRuVSnEMDDpa57Q71JuLZi4yLCFMekNYVYqaQ%40mail.gmail.com

Contents & Purpose
==================

This patch removes any additional columns in the DISTINCT clause when the table's primary key columns are entirely present in the DISTINCT clause. This optimization works because the PK columns functionally determine the other columns in the relation. The patch includes regression test cases.

Initial Run
===========

The patch applies cleanly to HEAD. All tests which are run as part of the `installcheck` target pass correctly (all existing tests pass, and all new tests pass with the patch applied and fail without it). All TAP tests pass. All tests which are run as part of the `installcheck-world` target pass except one of the Bloom contrib module tests in `contrib/bloom/sql/bloom.sql`, `CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3);`, which is currently also failing (crashing) for me on master, so it is unrelated to the patch. The test cases seem to cover the new behavior.

Functionality
=============

Given that this optimization is safe for the GROUP BY clause (you can remove functionally dependent attributes from the clause there as well), the logic does seem to follow for DISTINCT. It seems semantically correct to do this. As for the implementation, the author seems to have reasonably addressed the concerns expressed over the distinction between DISTINCT and DISTINCT ON. As far as I can see, this should be fine.

Code Review
===========

For a small performance hit but an improvement in readability, the length check can be moved from the individual GROUP BY and DISTINCT clause checks into the helper function:

    if (list_length(parse->distinctClause) < 2)
        return;

and

    if (list_length(parse->groupClause) < 2)
        return;

can be moved into `remove_functionally_dependent_clauses`.

Comments/Documentation
======================

The main helper function that is added, `remove_functionally_dependent_clauses`, uses one style of comment -- with the name of the function and then the rest of the description indented -- which is found in some places in the code:

    /*
     * remove_functionally_dependent_clauses
     *    Processes clauselist and removes any items which are deemed to be
     *    functionally dependent on other clauselist items.
     *
     * If any item from the list can be removed, then a new list is built which
     * does not contain the removed items. If no item can be removed then the
     * original list is returned.
     */

while other helper functions in the same file use a different style -- all lines flush to the side with a space. I'm not sure which is preferred:

    /*
     * limit_needed - do we actually need a Limit plan node?
     *
     * If we have constant-zero OFFSET and constant-null LIMIT, we can skip adding
     * a Limit node. This is worth checking for because "OFFSET 0" is a common
     * locution for an optimization fence. (Because other places in the planner
     ...

It looks like the non-indented ones are from older commits (2013 vs 2016).

The list length change is totally optional, but I'll leave it as Waiting On Author in case the author wants to make that change.

Overall, LGTM.

The new status of this patch is: Waiting on Author
On 21 March 2018 at 16:29, Melanie Plageman <melanieplageman@gmail.com> wrote:
> For a small performance hit but an improvement in readability, the length check
> can be moved from the individual group by and distinct clause checks into the
> helper function
>
>     if (list_length(parse->distinctClause) < 2)
>         return;
>
> and
>
>     if (list_length(parse->groupClause) < 2)
>         return;
>
> can be moved into `remove_functionally_dependent_clauses`

I remember thinking about this when writing the patch, and I think I ended up doing the check earlier because I thought that having 1 GROUP BY clause item would be the most common case, so it seemed like a good idea to abort as early as possible in that case. That's most likely not the case for DISTINCT. I imagine it probably does not make that much difference anyway, so I've moved it into the remove_functionally_dependent_clauses() function, as mentioned.

> The main helper function that is added `remove_functionally_dependent_clauses`
> uses one style of comment--with the name of the function and then the rest of
> the description indented which is found some places in the code,
>
> while other helper functions in the same file use a different style--all lines
> flush to the side with a space. I'm not sure which is preferred

You might have a point, but remove_useless_groupby_columns already does it this way, so I don't think changing that is a good idea.

> The new status of this patch is: Waiting on Author

Thanks for reviewing this. I've attached an updated patch. I'll set it back to waiting for review again.
--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Melanie Plageman wrote:
> Contents & Purpose
> ==================
>
> This patch removes any additional columns in the DISTINCT clause when the
> table's primary key columns are entirely present in the DISTINCT clause. This
> optimization works because the PK columns functionally determine the other
> columns in the relation. The patch includes regression test cases.

Would it be very difficult to extend that to "if any unique constraints are contained in the DISTINCT clause"?

Yours,
Laurenz Albe
On 22 March 2018 at 21:16, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Would it be very difficult to extend that to "if any unique constraints are
> contained in the DISTINCT clause"?

Unfortunately, it's quite a bit more work to make that happen. It's not just unique constraints that are required to make this work; we also need to pay attention to NOT NULL constraints, as we're unable to prove functional dependency when the keys allow NULLs, since there may be any number of rows containing NULL values.

The problem is that in order to properly invalidate cached plans we must record the constraint OIDs which the plan depends on. We can do that for PK and UNIQUE constraints; unfortunately, we can't do it for NOT NULL constraints as, at the moment, these are just properties of pg_attribute. For this to be made to work we'd need to store those in pg_constraint too, which is more work than I'm going to do for this patch.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
David Rowley wrote:
> > Would it be very difficult to extend that to "if any unique constraints are
> > contained in the DISTINCT clause"?
>
> Unfortunately, it's quite a bit more work to make that happen. It's
> not just unique constraints that are required to make this work. We
> also need to pay attention to NOT NULL constraints too, as we're
> unable to prove functional dependency when the keys have NULLs, as there
> may be any number of rows containing NULL values.
>
> The problem is that in order to properly invalidate cached plans we
> must record the constraint OIDs which the plan depends on. We can do
> that for PK and UNIQUE constraints, unfortunately, we can't do it for
> NOT NULL constraints as, at the moment, these are just properties of
> pg_attribute. For this to be made to work we'd need to store those in
> pg_constraint too, which is more work than I'm going to do for this
> patch.

That makes sense; thanks for explaining.

Yours,
Laurenz Albe
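The NULL caveat in the exchange above can be sketched as a small predicate. This is an editor's illustrative Python model with invented names, not planner code: a plain UNIQUE constraint only yields a usable functional dependency when every one of its columns is both present in the clause and known NOT NULL.

```python
def determined_by_unique_key(clause_cols, unique_constraints, not_null_cols):
    # A UNIQUE constraint proves the clause columns determine the whole row
    # only when every constraint column is in the clause AND is NOT NULL:
    # a nullable unique column may hold NULL in any number of rows, so
    # duplicates are still possible.
    cols, nn = set(clause_cols), set(not_null_cols)
    return any(set(u) <= cols and set(u) <= nn for u in unique_constraints)

print(determined_by_unique_key(["a", "b"], [["a"]], ["a"]))  # True
print(determined_by_unique_key(["a", "b"], [["a"]], []))     # False: "a" is nullable
```

A primary key sidesteps the second condition entirely, because PK columns are NOT NULL by definition, which is why the patch restricts itself to PKs.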
David, first of all, I'm delighted to see this change. I've been intending to give you feedback on it for a while, but I've been flat out with work, so I'm sorry. Any improvement in our DISTINCT optimization would be helpful.

Note, though, that DISTINCT optimization can include the surrounding query block context, because sometimes the DISTINCT is required, sometimes it is prohibited, and sometimes it is optional. When DISTINCT is optional, you can produce distinct rows if it is efficient and convenient to do so, as you might from an index scan having leading keys that match the DISTINCT, but you don't have to. Consider if you have a SELECT DISTINCT under a UNION that does a distinct on the same grouping keys. In that case, the UNION will do the distinct operation anyway, so the SELECT DISTINCT can become a SELECT. Please search for and read the old-but-good paper entitled "Extensible/Rule Based Query Optimization in Starburst" by Pirahesh et al.

Distinctness can also be preserved across joins, so if you have a 'snowflake query' type join, where all the joins are to a unique key, then the distinctness of the other side of the join is preserved. For example, a SELECT DISTINCT * FROM fact_table ... that joins from each column in its compound primary key to a unique key of another (dimension) table would remain distinct, and so you could drop the DISTINCT from the query. This is a relatively weak area of PostgreSQL at the moment, so IMHO we really need this.

Regarding constraint OIDs and invalidation when a NOT NULL column is modified to remove this constraint: that would be accomplished with an ALTER TABLE <t> ALTER COLUMN <c> DROP NOT NULL. Why wouldn't / can't the ALTER TABLE cause an invalidation of cached plans that depend on table t?
On Thu, Mar 22, 2018 at 5:20 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> The problem is that in order to properly invalidate cached plans we
> must record the constraint OIDs which the plan depends on. We can do
> that for PK and UNIQUE constraints, unfortunately, we can't do it for
> NOT NULL constraints as, at the moment, these are just properties of
> pg_attribute. For this to be made to work we'd need to store those in
> pg_constraint too, which is more work than I'm going to do for this patch.
I was just wondering, since get_primary_key_attnos opens pg_constraint and just skips attributes with other constraint types than primary, what would be the reason we wouldn't just also open pg_attribute and check for the non-nullness of the non-primary key unique attributes?
Couldn't we add a function which opens both pg_attribute and pg_constraint to get non-null unique attributes?
I suppose that would have a performance impact.
And, I did notice the FIXME in the comment before check_functional_grouping which seems to make the argument that there are other use cases for wanting the non-nullness represented in pg_constraint.
* Currently we only check to see if the rel has a primary key that is a
* subset of the grouping_columns. We could also use plain unique constraints
* if all their columns are known not null, but there's a problem: we need
* to be able to represent the not-null-ness as part of the constraints added
* to *constraintDeps. FIXME whenever not-null constraints get represented
* in pg_constraint.
*/
Melanie Plageman
On 22 March 2018 at 18:58, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 21 March 2018 at 16:29, Melanie Plageman <melanieplageman@gmail.com> wrote:
>> The new status of this patch is: Waiting on Author
>
> Thanks for reviewing this. I've attached an updated patch. I'll set
> back to waiting for review again.

I've attached another version of the patch. The only change is a fix for a thinko in a comment which claimed it may be possible to do something with DISTINCT ON clauses; on further thought, I don't think that's possible, so I've updated the comment to reflect my new thoughts.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 24 March 2018 at 05:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
> I was just wondering, since get_primary_key_attnos opens pg_constraint and
> just skips attributes with other constraint types than primary, what would
> be the reason we wouldn't just also open pg_attribute and check for the
> non-nullness of the non-primary key unique attributes?
> Couldn't we add a function which opens both pg_attribute and pg_constraint
> to get non-null unique attributes?
> I suppose that would have a performance impact.
> And, I did notice the FIXME in the comment before check_functional_grouping
> which seems to make the argument that there are other use cases for wanting
> the non-nullness represented in pg_constraint.

It's certainly possible to do more here. I'm uncertain what needs to be done in regards to cached plan invalidation, but we'd certainly need to ensure cached plans are invalidated whenever attnotnull is changed. There would be no need to look at the pg_attribute table directly; a syscache function would just need to be added, named something like get_attnotnull(), which would be akin to get_atttypmod().

I'm personally not planning on doing anything in that regard for this patch. We have 1 week to go before the PG11 feature freeze. My goal for this patch was to apply the same optimization to DISTINCT as I did for GROUP BY a while back, which is exactly what the patch does. I agree that more would be nice, but unfortunately, time is finite.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 24 March 2018 at 01:42, Jim Finnerty <jfinnert@amazon.com> wrote:
> Distinctness can also be preserved across joins, so if you have a 'snowflake
> query' type join, where all the joins are to a unique key, then the
> distinctness of the other side of the join is preserved. For example, a
> SELECT DISTINCT * FROM fact_table ... that joins from each column in its
> compound primary key to a unique key of another (dimension) table would
> remain distinct, and so you could drop the DISTINCT from the query.

I'm aware. It is something I'm interested in, but it would require several orders of magnitude more work than what I've done for this patch. You may have noticed the other work I did a while back to detect whether joins cause row duplication or not, so it's certainly something I've thought about.

If Amazon would like to sponsor work in this area then please see [1]. It certainly would be great to see that happen.

[1] https://wiki.postgresql.org/wiki/How_to_sponsor_a_feature

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
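The snowflake case discussed above can be modelled in the same toy style (an editor's illustration only, not planner code): when every fact row joins to at most one dimension row, here modelled by a dict whose keys stand in for a unique join key, the fact table's PK still uniquely identifies each output row, so a DISTINCT over the join output would be redundant.

```python
fact = [(1, 10), (2, 20), (3, 10)]   # (k, dim_id); k is the fact table's PK
dim = {10: "red", 20: "blue"}        # dim_id is unique, like a dimension PK

# Each fact row finds at most one dim row, so the join duplicates nothing.
joined = [(k, dim[d]) for (k, d) in fact]
ks = [k for (k, _) in joined]
print(len(ks) == len(set(ks)))  # True: PK values remain unique after the join
```

Contrast this with a join to a non-unique key, where a single fact row can fan out into several output rows and the DISTINCT must stay.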
Hi David,
The updated patch looks good to me.
On 6 April 2018 at 03:36, Melanie Plageman <melanieplageman@gmail.com> wrote: > The updated patch looks good to me. > I've changed the status to "ready for committer" Great. Thanks for reviewing this! -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > It's certainly possible to do more here. I'm uncertain what needs to > be done in regards to cached plan invalidation, but we'd certainly > need to ensure cached plans are invalidated whenever the attnotnull is > changed. They already are; any change at all in a pg_class or pg_attribute row results in a relcache inval for the associated table, cf CacheInvalidateHeapTuple. As far as the invalidation machinery is concerned, changing attnotnull is indistinguishable from changing atttypid, which I'm sure you'll agree must force plan regeneration. So this line of thought leads to the conclusion that we were too conservative in not allowing unique constraints to be used along with actual pkeys in remove_useless_groupby_columns. We do have the additional requirement of checking attnotnull, but I think we can assume that a plan inval will occur if that changes. In fact, now that I think about it, I wonder why we're insisting on tracking the constraint OID either. Don't index drops force relcache invals regardless? If not, how do we ensure that an indexscan plan relying on that index gets redone? Part of this probably comes from looking at the parser's check_ungrouped_columns(), but that is operating under different constraints, since it is generating what might be a long-lived parse tree, eg something that gets stored as a view. We do need to be able to register an actual dependency on the uniqueness constraint in that situation. But plans get tossed on the basis of relcache invals, so I think they don't need it. Anyway, that's clearly a future improvement rather than something I'd insist on this patch tackling immediately. My gripe about this as it stands is mainly that it seems like it's going to do a lot of catalog-lookup work twice over, at least in cases that have both DISTINCT and GROUP BY --- or is that too small a set to worry about? regards, tom lane
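The future improvement Tom sketches here — allowing unique constraints alongside actual pkeys, provided the columns also pass an attnotnull check — can be expressed abstractly. Below is an illustrative Python sketch (sets of attnos stand in for the planner's bitmapsets; none of these names come from the actual code):

```python
def removable_attnos(clause_attnos, unique_constraints, notnull_attnos):
    """Attnos droppable from one relation's grouping/distinct columns:
    a unique constraint works like a PK here only if its columns are a
    proper subset of the clause columns AND are all marked NOT NULL
    (a unique index over nullable columns doesn't imply functional
    dependence, since NULL values compare as distinct)."""
    for ucols in unique_constraints:
        if ucols < clause_attnos and ucols <= notnull_attnos:
            return clause_attnos - ucols
    return set()

# Unique constraint on {1, 2}, both NOT NULL: column 3 becomes removable.
assert removable_attnos({1, 2, 3}, [{1, 2}], {1, 2}) == {3}
# Same constraint but column 2 is nullable: nothing is removable.
assert removable_attnos({1, 2, 3}, [{1, 2}], {1}) == set()
```

As Tom notes, the remaining question is not the check itself but ensuring plan invalidation when attnotnull changes — which relcache invalidation already provides.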
I wrote: > My gripe about > this as it stands is mainly that it seems like it's going to do > a lot of catalog-lookup work twice over, at least in cases that > have both DISTINCT and GROUP BY --- or is that too small a set to > worry about? I convinced myself that that was a silly objection, and started looking through the patch with intent to commit. However, re-reading the thread re-awakened my concern about whether deleting items from the distinctClause is actually safe. You're right that remove_unused_subquery_outputs isn't really in trouble with this; it might fail to remove some things it could remove, but that seems OK, and at best deserving of a comment. However, that doesn't mean we're out of the woods. See also preprocess_groupclause, particularly this comment: * Note: we need no comparable processing of the distinctClause because * the parser already enforced that that matches ORDER BY. as well as the interesting assumptions in create_distinct_paths about what it means to compare the lengths of the pathkey lists for these clauses. Once I noticed that, I became convinced that this patch actually breaks planning of sort/unique-based DISTINCT: if we have a pkey a,b, but the ORDER BY list is a,c,b, then we will sort by a,c,b but the Unique node will be told it only needs to unique-ify on the first two sort columns. That led me to try this test case, and it blew up even better than I expected: regression=# create table t1 (a int,b int, c int, d int, primary key(a,b)); CREATE TABLE regression=# explain verbose select distinct * from t1 order by a,c,b; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The reason is we're hitting this assert, a bit further down in create_distinct_paths: /* Assert checks that parser didn't mess up... */ Assert(pathkeys_contained_in(root->distinct_pathkeys, needed_pathkeys)); I don't think that that makes this patch unsalvageable. 
The way forward probably involves remembering that we removed some distinctClause items (so that we know the correspondence to the sortClause no longer holds) and then working harder in create_distinct_paths when that's the case. However, that's not something that's going to happen on the last day of the commitfest. So I'm going to mark this Returned With Feedback and encourage you to return to the matter in v12. In the meantime, attached is the version of the patch that I was about to commit before getting cold feet. It has some cosmetic improvements over yours, notably comment additions in allpaths.c. regards, tom lane diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 65a34a2..345ed17 100644 *** a/src/backend/optimizer/path/allpaths.c --- b/src/backend/optimizer/path/allpaths.c *************** remove_unused_subquery_outputs(Query *su *** 3306,3311 **** --- 3306,3315 ---- /* * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our * time: all its output columns must be used in the distinctClause. + * (Note: the latter is not necessarily true anymore, because planner.c + * might have found some of the DISTINCT columns to be redundant and + * dropped them. But they'd still have sortgroupref markings, so unless + * we improve the heuristic below, we would not recognize them as unused.) */ if (subquery->distinctClause && !subquery->hasDistinctOn) return; *************** remove_unused_subquery_outputs(Query *su *** 3348,3358 **** /* * If it has a sortgroupref number, it's used in some sort/group ! * clause so we'd better not remove it. Also, don't remove any ! * resjunk columns, since their reason for being has nothing to do ! * with anybody reading the subquery's output. (It's likely that ! * resjunk columns in a sub-SELECT would always have ressortgroupref ! * set, but even if they don't, it seems imprudent to remove them.) 
*/ if (tle->ressortgroupref || tle->resjunk) continue; --- 3352,3365 ---- /* * If it has a sortgroupref number, it's used in some sort/group ! * clause so we'd better not remove it. (This is a conservative ! * heuristic, since it might not actually be used by any surviving ! * sort/group clause; but we don't bother to expend the cycles needed ! * for a more accurate test.) Also, don't remove any resjunk columns, ! * since their reason for being has nothing to do with anybody reading ! * the subquery's output. (It's likely that resjunk columns in a ! * sub-SELECT would always have ressortgroupref set, but even if they ! * don't, it seems imprudent to remove them.) */ if (tle->ressortgroupref || tle->resjunk) continue; diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 008492b..a9ccfed 100644 *** a/src/backend/optimizer/plan/planner.c --- b/src/backend/optimizer/plan/planner.c *************** static double preprocess_limit(PlannerIn *** 125,130 **** --- 125,133 ---- int64 *offset_est, int64 *count_est); static bool limit_needed(Query *parse); static void remove_useless_groupby_columns(PlannerInfo *root); + static void remove_useless_distinct_columns(PlannerInfo *root); + static List *remove_dependent_grouping_clauses(PlannerInfo *root, + List *clauselist); static List *preprocess_groupclause(PlannerInfo *root, List *force); static List *extract_rollup_sets(List *groupingSets); static List *reorder_grouping_sets(List *groupingSets, List *sortclause); *************** subquery_planner(PlannerGlobal *glob, Qu *** 965,970 **** --- 968,976 ---- /* Remove any redundant GROUP BY columns */ remove_useless_groupby_columns(root); + /* Likewise for redundant DISTINCT columns */ + remove_useless_distinct_columns(root); + /* * If we have any outer joins, try to reduce them to plain inner joins. 
* This step is most easily done after we've done expression *************** static void *** 2941,2967 **** remove_useless_groupby_columns(PlannerInfo *root) { Query *parse = root->parse; - Bitmapset **groupbyattnos; - Bitmapset **surplusvars; - ListCell *lc; - int relid; - - /* No chance to do anything if there are less than two GROUP BY items */ - if (list_length(parse->groupClause) < 2) - return; /* Don't fiddle with the GROUP BY clause if the query has grouping sets */ if (parse->groupingSets) return; /* ! * Scan the GROUP BY clause to find GROUP BY items that are simple Vars. ! * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k ! * that are GROUP BY items. */ ! groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) * ! (list_length(parse->rtable) + 1)); ! foreach(lc, parse->groupClause) { SortGroupClause *sgc = lfirst_node(SortGroupClause, lc); TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList); --- 2947,3022 ---- remove_useless_groupby_columns(PlannerInfo *root) { Query *parse = root->parse; /* Don't fiddle with the GROUP BY clause if the query has grouping sets */ if (parse->groupingSets) return; + parse->groupClause = + remove_dependent_grouping_clauses(root, parse->groupClause); + } + + /* + * remove_useless_distinct_columns + * Like remove_useless_groupby_columns, but for the DISTINCT clause + * + * If we have both a multi-member GROUP BY clause and a multi-member DISTINCT + * clause, this will do a lot of the same catalog lookup work that + * remove_useless_groupby_columns already did. That seems like an unlikely + * case, so for now we don't worry about it, but eventually it might be good + * to refactor to avoid that. + */ + static void + remove_useless_distinct_columns(PlannerInfo *root) + { + Query *parse = root->parse; + /* ! * Don't try to remove anything from a DISTINCT ON clause. For this case, ! * the distinctClauses are closely entwined with the ORDER BY clause, so ! * we'd better not meddle with them. 
You might think that we can just ! * apply the same optimizations to the ORDER BY too, but we can't since ! * removing an item there could affect the order of the query results. */ ! if (parse->hasDistinctOn) ! return; ! ! parse->distinctClause = ! remove_dependent_grouping_clauses(root, parse->distinctClause); ! } ! ! /* ! * remove_dependent_grouping_clauses ! * Process clauselist (a list of SortGroupClause) and remove any items ! * that can be proven to be functionally dependent on other items. ! * We will have the same grouping semantics without them. ! * ! * If any item from the list can be removed, then a new list is built which ! * does not contain the removed items. If nothing can be removed then the ! * original list is returned. ! */ ! static List * ! remove_dependent_grouping_clauses(PlannerInfo *root, ! List *clauselist) ! { ! Query *parse = root->parse; ! Bitmapset **clauseattnos; ! Bitmapset **surplusvars; ! ListCell *lc; ! int relid; ! ! /* No chance of removing anything if there are fewer than two items */ ! if (list_length(clauselist) < 2) ! return clauselist; ! ! /* ! * Scan the clauselist to find items that are simple Vars. Fill ! * clauseattnos[k] with a bitmapset of the column attnos of RTE k that ! * appear in the clauselist. ! */ ! clauseattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) * ! (list_length(parse->rtable) + 1)); ! foreach(lc, clauselist) { SortGroupClause *sgc = lfirst_node(SortGroupClause, lc); TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList); *************** remove_useless_groupby_columns(PlannerIn *** 2971,2979 **** * Ignore non-Vars and Vars from other query levels. * * XXX in principle, stable expressions containing Vars could also be ! * removed, if all the Vars are functionally dependent on other GROUP ! * BY items. But it's not clear that such cases occur often enough to ! * be worth troubling over. 
*/ if (!IsA(var, Var) || var->varlevelsup > 0) --- 3026,3034 ---- * Ignore non-Vars and Vars from other query levels. * * XXX in principle, stable expressions containing Vars could also be ! * removed, if all the Vars are functionally dependent on other items ! * in the clauselist. But it's not clear that such cases occur often ! * enough to be worth troubling over. */ if (!IsA(var, Var) || var->varlevelsup > 0) *************** remove_useless_groupby_columns(PlannerIn *** 2982,2996 **** /* OK, remember we have this Var */ relid = var->varno; Assert(relid <= list_length(parse->rtable)); ! groupbyattnos[relid] = bms_add_member(groupbyattnos[relid], ! var->varattno - FirstLowInvalidHeapAttributeNumber); } /* * Consider each relation and see if it is possible to remove some of its ! * Vars from GROUP BY. For simplicity and speed, we do the actual removal ! * in a separate pass. Here, we just fill surplusvars[k] with a bitmapset ! * of the column attnos of RTE k that are removable GROUP BY items. */ surplusvars = NULL; /* don't allocate array unless required */ relid = 0; --- 3037,3055 ---- /* OK, remember we have this Var */ relid = var->varno; Assert(relid <= list_length(parse->rtable)); ! clauseattnos[relid] = bms_add_member(clauseattnos[relid], ! var->varattno - FirstLowInvalidHeapAttributeNumber); } /* * Consider each relation and see if it is possible to remove some of its ! * Vars from the clauselist. We can do so if they are functionally ! * dependent on other Vars from the same relation that are also in the ! * clauselist (independently of any other relations that are mentioned). ! * ! * For simplicity and speed, we do the actual removal in a separate pass. ! * Here, we just fill surplusvars[k] with a bitmapset of the column attnos ! * of RTE k that are removable clauselist items. 
*/ surplusvars = NULL; /* don't allocate array unless required */ relid = 0; *************** remove_useless_groupby_columns(PlannerIn *** 3007,3014 **** if (rte->rtekind != RTE_RELATION) continue; ! /* Nothing to do unless this rel has multiple Vars in GROUP BY */ ! relattnos = groupbyattnos[relid]; if (bms_membership(relattnos) != BMS_MULTIPLE) continue; --- 3066,3073 ---- if (rte->rtekind != RTE_RELATION) continue; ! /* Nothing to do unless this rel has multiple Vars in clauselist */ ! relattnos = clauseattnos[relid]; if (bms_membership(relattnos) != BMS_MULTIPLE) continue; *************** remove_useless_groupby_columns(PlannerIn *** 3022,3028 **** /* * If the primary key is a proper subset of relattnos then we have ! * some items in the GROUP BY that can be removed. */ if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1) { --- 3081,3087 ---- /* * If the primary key is a proper subset of relattnos then we have ! * some items in the clauselist that can be removed. */ if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1) { *************** remove_useless_groupby_columns(PlannerIn *** 3044,3058 **** } /* ! * If we found any surplus Vars, build a new GROUP BY clause without them. * (Note: this may leave some TLEs with unreferenced ressortgroupref * markings, but that's harmless.) */ if (surplusvars != NULL) { ! List *new_groupby = NIL; ! foreach(lc, parse->groupClause) { SortGroupClause *sgc = lfirst_node(SortGroupClause, lc); TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList); --- 3103,3117 ---- } /* ! * If we found any surplus Vars, build a new clause list without them. * (Note: this may leave some TLEs with unreferenced ressortgroupref * markings, but that's harmless.) */ if (surplusvars != NULL) { ! List *new_clauselist = NIL; ! 
foreach(lc, clauselist) { SortGroupClause *sgc = lfirst_node(SortGroupClause, lc); TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList); *************** remove_useless_groupby_columns(PlannerIn *** 3066,3076 **** var->varlevelsup > 0 || !bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber, surplusvars[var->varno])) ! new_groupby = lappend(new_groupby, sgc); } ! parse->groupClause = new_groupby; } } /* --- 3125,3138 ---- var->varlevelsup > 0 || !bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber, surplusvars[var->varno])) ! new_clauselist = lappend(new_clauselist, sgc); } ! return new_clauselist; } + + /* nothing to change, just return the old list */ + return clauselist; } /* diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index f85e913..2976c4b 100644 *** a/src/test/regress/expected/aggregates.out --- b/src/test/regress/expected/aggregates.out *************** explain (costs off) select * from t3 gro *** 1017,1022 **** --- 1017,1081 ---- -> Seq Scan on t3 (3 rows) + -- + -- Test removal of redundant DISTINCT columns + -- + -- Non-primary-key columns can be removed from DISTINCT clause + explain (costs off) select distinct a,b,c,d from t1; + QUERY PLAN + ---------------------- + HashAggregate + Group Key: a, b + -> Seq Scan on t1 + (3 rows) + + -- No removal can happen if the complete PK is not present in DISTINCT clause + explain (costs off) select distinct a,c,d from t1; + QUERY PLAN + ---------------------- + HashAggregate + Group Key: a, c, d + -> Seq Scan on t1 + (3 rows) + + -- Test removal across multiple relations + explain (costs off) select distinct t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t2.z + from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y; + QUERY PLAN + ------------------------------------------------------ + HashAggregate + Group Key: t1.a, t1.b, t2.x, t2.y + -> Hash Join + Hash Cond: ((t2.x = t1.a) AND (t2.y = t1.b)) + -> Seq Scan on t2 + -> Hash + -> Seq Scan on t1 
+ (7 rows) + + -- Test case where t1 can be optimized but not t2 + explain (costs off) select distinct t1.a,t1.b,t1.c,t1.d,t2.x,t2.z + from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y; + QUERY PLAN + ------------------------------------------------------ + HashAggregate + Group Key: t1.a, t1.b, t2.x, t2.z + -> Hash Join + Hash Cond: ((t2.x = t1.a) AND (t2.y = t1.b)) + -> Seq Scan on t2 + -> Hash + -> Seq Scan on t1 + (7 rows) + + -- Ensure we don't remove DISTINCT ON items + explain (costs off) select distinct on (a,b,c) d from t1 order by a,b,c,d; + QUERY PLAN + ------------------------------ + Unique + -> Sort + Sort Key: a, b, c, d + -> Seq Scan on t1 + (4 rows) + drop table t1; drop table t2; drop table t3; diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out index 84c6e9b..fd03559 100644 *** a/src/test/regress/expected/join.out --- b/src/test/regress/expected/join.out *************** select d.* from d left join (select * fr *** 4126,4147 **** -> Seq Scan on d (8 rows) ! -- similarly, but keying off a DISTINCT clause explain (costs off) select d.* from d left join (select distinct * from b) s on d.a = s.id; ! QUERY PLAN ! -------------------------------------- ! Merge Right Join ! Merge Cond: (b.id = d.a) ! -> Unique ! -> Sort ! Sort Key: b.id, b.c_id ! -> Seq Scan on b ! -> Sort ! Sort Key: d.a ! -> Seq Scan on d ! (9 rows) -- check join removal works when uniqueness of the join condition is enforced -- by a UNION --- 4126,4147 ---- -> Seq Scan on d (8 rows) ! -- similarly, but keying off a DISTINCT clause (again, removal of b.c_id ! -- from the DISTINCT step happens too late for join removal) explain (costs off) select d.* from d left join (select distinct * from b) s on d.a = s.id; ! QUERY PLAN ! --------------------------------------- ! Hash Left Join ! Hash Cond: (d.a = s.id) ! -> Seq Scan on d ! -> Hash ! -> Subquery Scan on s ! -> HashAggregate ! Group Key: b.id ! -> Seq Scan on b ! 
(8 rows) -- check join removal works when uniqueness of the join condition is enforced -- by a UNION diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index 506d044..1ee26ca 100644 *** a/src/test/regress/sql/aggregates.sql --- b/src/test/regress/sql/aggregates.sql *************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z; *** 362,367 **** --- 362,387 ---- -- Cannot optimize when PK is deferrable explain (costs off) select * from t3 group by a,b,c; + -- + -- Test removal of redundant DISTINCT columns + -- + -- Non-primary-key columns can be removed from DISTINCT clause + explain (costs off) select distinct a,b,c,d from t1; + + -- No removal can happen if the complete PK is not present in DISTINCT clause + explain (costs off) select distinct a,c,d from t1; + + -- Test removal across multiple relations + explain (costs off) select distinct t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t2.z + from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y; + + -- Test case where t1 can be optimized but not t2 + explain (costs off) select distinct t1.a,t1.b,t1.c,t1.d,t2.x,t2.z + from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y; + + -- Ensure we don't remove DISTINCT ON items + explain (costs off) select distinct on (a,b,c) d from t1 order by a,b,c,d; + drop table t1; drop table t2; drop table t3; diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql index b1e05a3..31fd397 100644 *** a/src/test/regress/sql/join.sql --- b/src/test/regress/sql/join.sql *************** explain (costs off) *** 1360,1366 **** select d.* from d left join (select * from b group by b.id, b.c_id) s on d.a = s.id; ! -- similarly, but keying off a DISTINCT clause explain (costs off) select d.* from d left join (select distinct * from b) s on d.a = s.id; --- 1360,1367 ---- select d.* from d left join (select * from b group by b.id, b.c_id) s on d.a = s.id; ! -- similarly, but keying off a DISTINCT clause (again, removal of b.c_id ! 
-- from the DISTINCT step happens too late for join removal) explain (costs off) select d.* from d left join (select distinct * from b) s on d.a = s.id;
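For readers skimming the diff, the heart of remove_dependent_grouping_clauses can be paraphrased with plain Python sets. This is a deliberate simplification — the real code works on per-RTE bitmapsets offset by FirstLowInvalidHeapAttributeNumber, skips non-Vars and upper-level Vars, and gets the PK attnos from the catalog — and the relids and attnos below are made up:

```python
# Each clause item is modeled as (relid, attno) for a simple Var;
# pk[relid] is the set of primary-key attnos for that relation.
def remove_dependent_items(clauselist, pk):
    if len(clauselist) < 2:
        return clauselist                    # nothing can be removed

    # Collect, per relation, the attnos that appear in the clause list.
    clauseattnos = {}
    for relid, attno in clauselist:
        clauseattnos.setdefault(relid, set()).add(attno)

    # A relation's non-PK columns are surplus when its whole PK is a
    # *proper* subset of that relation's columns in the clause list.
    surplus = {}
    for relid, attnos in clauseattnos.items():
        pkattnos = pk.get(relid)
        if pkattnos and pkattnos < attnos:
            surplus[relid] = attnos - pkattnos

    if not surplus:
        return clauselist                    # return the original list untouched
    return [(r, a) for (r, a) in clauselist
            if a not in surplus.get(r, set())]

# t1 has PK (a=1, b=2): DISTINCT a, b, c, d collapses to DISTINCT a, b.
pk = {1: {1, 2}}
assert remove_dependent_items([(1, 1), (1, 2), (1, 3), (1, 4)], pk) == [(1, 1), (1, 2)]
```

The proper-subset test mirrors the BMS_SUBSET1 check in the patch: if the PK equals the clause columns exactly, there is nothing left over to remove.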
On 8 April 2018 at 08:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > regression=# create table t1 (a int,b int, c int, d int, primary key(a,b)); > CREATE TABLE > regression=# explain verbose select distinct * from t1 order by a,c,b; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. Ouch! > The reason is we're hitting this assert, a bit further down in > create_distinct_paths: > > /* Assert checks that parser didn't mess up... */ > Assert(pathkeys_contained_in(root->distinct_pathkeys, > needed_pathkeys)); > > > I don't think that that makes this patch unsalvageable. The way forward > probably involves remembering that we removed some distinctClause items > (so that we know the correspondence to the sortClause no longer holds) > and then working harder in create_distinct_paths when that's the case. > > However, that's not something that's going to happen on the last day > of the commitfest. So I'm going to mark this Returned With Feedback > and encourage you to return to the matter in v12. > > In the meantime, attached is the version of the patch that I was about to > commit before getting cold feet. It has some cosmetic improvements > over yours, notably comment additions in allpaths.c. Thanks a lot for spending time on this. I'll no doubt have some time over this coming winter to see if it can be fixed and re-submitted in the PG12 cycle. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
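The failure mode Tom found is easy to reproduce with ordinary data, independent of PostgreSQL. A small Python sketch (hypothetical rows for a table t1(a, b, c) with primary key (a, b)) shows why unique-ifying on only the leading sort columns goes wrong once b has been removed from the DISTINCT clause but the ORDER BY is a, c, b:

```python
# All rows are distinct (they differ on the PK (a, b)), so DISTINCT *
# must return every row.
rows = [(1, 1, 5), (1, 2, 5), (2, 1, 7)]     # (a, b, c)

sort_cols = [0, 2, 1]                        # ORDER BY a, c, b (tuple indices)
ordered = sorted(rows, key=lambda r: tuple(r[i] for i in sort_cols))

def unique_on_leading_sort_cols(sorted_rows, sort_cols, n):
    """Mimic a Unique node told to compare only the first n sort columns."""
    out, last_key = [], None
    for r in sorted_rows:
        key = tuple(r[i] for i in sort_cols[:n])
        if key != last_key:
            out.append(r)
            last_key = key
    return out

# Buggy plan: unique-ify on just (a, c), because b was dropped from DISTINCT.
wrong = unique_on_leading_sort_cols(ordered, sort_cols, 2)
assert wrong == [(1, 1, 5), (2, 1, 7)]   # (1, 2, 5) wrongly merged away
assert len(set(rows)) == 3               # correct DISTINCT keeps all three rows
```

This is the mismatch behind the Assert in create_distinct_paths: after removal, the distinct pathkeys are no longer guaranteed to be a prefix of the sort pathkeys.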
distinct_optimization_v6.patch <http://www.postgresql-archive.org/file/t348990/distinct_optimization_v6.patch>

Here is an update to this thread, for potential inclusion in v12. I couldn't get the most recent 'v5' patch to apply cleanly, so I recreated a v6 patch on PG10.5 by hand, and made a few changes and improvements:

- (a) If there is only one relation and the PK is present in the SELECT list, then the distinctClause can be removed. Thus, a query of the form SELECT DISTINCT pk, * FROM t; recognizes that the body of the SELECT is already distinct on pk, and doesn't do an unnecessary DISTINCT operation.
- (b) If there is only one relation, the PK is present, and there are no aggregates, then the groupByClause can similarly be removed.
- (c) If DISTINCT ON is specified, but no ORDER BY is specified, then it acts like a regular DISTINCT.
- (d) If the distinct clause is modified, this fact is recorded in a new bool in the Query struct, so that create_distinct_paths no longer Asserts that the needed pathkeys are contained in the distinct_pathkeys.
- (e) Since the new bool is in the Query struct, copyfuncs/equalfuncs/outfuncs/readfuncs support is also provided.

This will cause a problem unless the pg_rewrite catalog is regenerated. The pg_rewrite catalog contains a serialized representation of the Query node in its ev_action column. If there is a way to recreate the contents of the pg_rewrite relation without bumping the catversion, can someone please explain how? If not, then this change is incomplete and would require a new catalog version (catversion.h) too.

Additional work on this patch would be desirable. It should check for unique + not null, in addition to just the pk constraint. The DISTINCT could be eliminated in cases with multiple relations if all the joins are 1:1, although that would arguably belong in a different patch.

/Jim

p.s.
the v6 patch works for the problem case that Tom Lane reported with the v5 patch -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
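Improvement (a) above is a whole-clause removal test rather than per-column pruning. A hedged Python sketch of the decision (the function and its inputs are illustrative, not taken from the v6 patch, and the naive single-relation test shown here is the part that gets questioned later in the thread):

```python
def distinct_is_noop(n_base_rels, select_attnos, pkattnos):
    """True when a plain DISTINCT over the select list cannot remove rows:
    with a single base relation whose entire primary key appears among the
    selected columns, the rows are already distinct.  (Counting base
    relations correctly is the hard part in the real planner.)"""
    return n_base_rels == 1 and bool(pkattnos) and pkattnos <= select_attnos

# SELECT DISTINCT pk, * FROM t;  -- PK columns {1, 2} all in the select list
assert distinct_is_noop(1, {1, 2, 3, 4}, {1, 2})
# PK column 2 missing from the select list: DISTINCT may still remove rows.
assert not distinct_is_noop(1, {1, 3, 4}, {1, 2})
```

Note that here the subset need not be proper: if the select list is exactly the PK, the DISTINCT is still a no-op, unlike the per-column case where only leftover columns can be pruned.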
On 23 August 2018 at 08:11, Jim Finnerty <jfinnert@amazon.com> wrote: > Here is an update to this thread, for potential inclusion in v12. I > couldn't get the most recent 'v5' patch to apply cleanly, so I recreated a > v6 patch on PG10.5 by hand, and made a few changes and improvements: Well, the patch I wrote was never intended for v10. Why did you go to the effort of doing that when you're proposing the patch for v12? It seems my v5 patch and Tom's v6 version apply fine to today's master without any rejected hunks. > The pg_rewrite catalog contains a serialized representation of the Query > node in its ev_action column. If there is a way to recreate the contents of > the pg_rewrite relation without bumping the catversion, can someone please > explain how? If not, then this change is incomplete and would require a new > catalog version (catversion.h) too. If you're proposing for v12, why do you care about that? catversion changes are commonplace during major version development. Or are you proposing this gets committed to PostgreSQL 10? ... That's not going to happen... Plus, it does not seem well aligned with the intent of this thread to discuss ways to do that either. > Additional work on this patch would be desirable. It should check for > unique + not null, in addition to just the pk constraint. The DISTINCT > could be eliminated in cases with multiple relations if all the joins are > 1:1, although that would arguably belong in a different patch. Certainly. Patches are welcome, but I think I've mentioned to you previously that I'm not planning that for this patch. > p.s. the v6 patch works for the problem case that Tom Lane reported with the > v5 patch That would be less ambiguous if there weren't now two v6 versions of the patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I was thinking about this last night and I realized that putting the new hasModifiedDistinct flag on the PlannerInfo struct eliminates the need to deal with the serialization issues, and makes it simpler. Here's a new patch (v7) that puts the bool on the PlannerInfo struct, and adds a couple of tests. re: why did you apply the patch on v10? I'm developing on a v10.5 codebase at the moment, though this will change soon. If the v7 patch doesn't apply cleanly on later versions, please let me know and I'll fix it. re: if you're proposing the patch for v12, why do you care about catversion? Only because it would be a problem to test the patch on 10.5 with a catversion change that wouldn't come until v12. With the v7 patch this issue is moot because it no longer requires a catversion change. cheers, /Jim distinct_opt_v7.patch <http://www.postgresql-archive.org/file/t348990/distinct_opt_v7.patch> -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
On 2018-Aug-23, Jim Finnerty wrote: > distinct_opt_v7.patch > <http://www.postgresql-archive.org/file/t348990/distinct_opt_v7.patch> Are you posting this via postgresql-archive.org? Please don't do that. These posts of yours are mangled by postgresql-archive and do not contain the file you're attaching. Use direct posting by email to pgsql-hackers@lists.postgresql.org instead. Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you Álvaro. Here's the distinct_opt_v7 patch.
Attachment
On 24 August 2018 at 03:29, Finnerty, Jim <jfinnert@amazon.com> wrote: > Thank you Álvaro. Here's the distinct_opt_v7 patch. Determining if there's just 1 base relation by checking the length of the rtable is not a good way to do it. If you want to see why, try creating a primary key on a partitioned table. My personal opinion of only being able to completely remove the DISTINCT when there's a single item in the rtable (or a single base table) is that it's just too poor to bother with. I think such a solution is best left to another patch, and it should be much more complete and be able to track the unique properties through joins. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Greetings, * David Rowley (david.rowley@2ndquadrant.com) wrote: > My personal opinion of only being able to completely remove the > DISTINCT when there's a single item in the rtable (or a single base > table) is that it's just too poor to bother with. I think such a > solution is best left to another patch and it should be much more > complete and be able to track the unique properties through joins. Doesn't that move the goalposts for this pretty far off? Is there some reason that doing this for the single-item case will make it difficult to implement the additional improvements you're suggesting later? Given the cost of a DISTINCT and that we know pretty easily if one exists or not, trying to eliminate it, even in the single-item case, seems pretty worthwhile to me. If it makes it difficult to improve on this later then I could see pushing back on it, but this patch doesn't seem like its doing that. Thanks! Stephen
Attachment
On 24 August 2018 at 11:34, Stephen Frost <sfrost@snowman.net> wrote: > Greetings, > > * David Rowley (david.rowley@2ndquadrant.com) wrote: >> My personal opinion of only being able to completely remove the >> DISTINCT when there's a single item in the rtable (or a single base >> table) is that it's just too poor to bother with. I think such a >> solution is best left to another patch and it should be much more >> complete and be able to track the unique properties through joins. > > Doesn't that move the goalposts for this pretty far off? Is there > some reason that doing this for the single-item case will make it > difficult to implement the additional improvements you're suggesting > later? The DISTINCT clause removal would likely need to wait until around where the distinct paths are created and all the join processing has been done. If the optimisation is deemed worthwhile, then maybe that would be the place to put this code along with a comment that explains how it can be improved to support the removal with multiple relations. > Given the cost of a DISTINCT and that we know pretty easily if one > exists or not, trying to eliminate it, even in the single-item case, > seems pretty worthwhile to me. If it makes it difficult to improve on > this later then I could see pushing back on it, but this patch doesn't > seem like its doing that. If people truly think it's worthwhile, then maybe it is, but if it's being done because it can be done then perhaps it's best not to bother. I was just thinking of the bug reports we'd get when people see it does not work with multiple relations. There's already no shortage of bug reports that are not bugs. I also wanted to keep this patch just doing the simple case that we already apply to GROUP BY, but control over the patch may well be out of my hands now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Greetings,

* Jim Finnerty (jfinnert@amazon.com) wrote:
> I was thinking about this last night and I realized that putting the new hasModifiedDistinct flag on the PlannerInfo struct eliminates the need to deal with the serialization issues, and makes it simpler.
>
> Here's a new patch (v7) that puts the bool on the PlannerInfo struct, and adds a couple of tests.
>
> re: why did you apply the patch on v10?
>
> I'm developing on a v10.5 codebase at the moment, though this will change soon. If the v7 patch doesn't apply cleanly on later versions, please let me know and I'll fix it.

Just going back through this thread to check that I didn't miss anything, I saw this and felt it deserved a comment: please don't post feature patches like this against existing released versions and then ask everyone else to see if they work against current HEAD or not. We're all for having new contributors of features, but those contributions should be against the current HEAD.

> re: if you're proposing the patch for v12, why do you care about catversion?
>
> Only because it would be a problem to test the patch on 10.5 with a catversion change that wouldn't come until v12. With the v7 patch this issue is moot because it no longer requires a catversion change.

This would be an example of concerns that really shouldn't be getting raised on this list for new features. The project has a very clear policy regarding features vs. bug-fixes and I really don't think it's necessary or desirable to have patches being floated with subsequent threads here which go directly against that policy.

Thanks!

Stephen
Greetings,

* David Rowley (david.rowley@2ndquadrant.com) wrote:
> On 24 August 2018 at 11:34, Stephen Frost <sfrost@snowman.net> wrote:
> > * David Rowley (david.rowley@2ndquadrant.com) wrote:
> >> My personal opinion of only being able to completely remove the DISTINCT when there's a single item in the rtable (or a single base table) is that it's just too poor to bother with. I think such a solution is best left to another patch and it should be much more complete and be able to track the unique properties through joins.
> >
> > Doesn't that move the goalposts for this pretty far off? Is there some reason that doing this for the single-item case will make it difficult to implement the additional improvements you're suggesting later?
>
> The DISTINCT clause removal would likely need to wait until around where the distinct paths are created and all the join processing has been done.
>
> If the optimisation is deemed worthwhile, then maybe that would be the place to put this code along with a comment that explains how it can be improved to support the removal with multiple relations.

Hm, so you're suggesting that this isn't the right place for this optimization to be implemented, even now, with the single-relation caveat? I do agree that including comments about how it could/should be improved would be good for future developers.

> > Given the cost of a DISTINCT and that we know pretty easily if one exists or not, trying to eliminate it, even in the single-item case, seems pretty worthwhile to me. If it makes it difficult to improve on this later then I could see pushing back on it, but this patch doesn't seem like it's doing that.
>
> If people truly think it's worthwhile, then maybe it is, but if it's being done because it can be done then perhaps it's best not to bother. I was just thinking of the bug reports we'd get when people see it does not work with multiple relations.
> There's already no shortage of bug reports that are not bugs. I also wanted to keep this patch just doing the simple case that we already apply to GROUP BY, but control over the patch may well be out of my hands now.

I'm not really sure I'm following along this part of the discussion. As long as it isn't making the code particularly ugly, violating the separation of various components, or making later hacking in this area particularly worse, then the question primarily comes down to whether the code is of quality and whether the performance optimization is worth the cycles spent checking for the opportunity, relative to the cost of the operation itself, all of which can be pretty objectively assessed and provide the answer to whether it's worthwhile or not.

Given the cost of a Unique node, and that we will only check for this optimization when we are going to have to introduce one, it certainly seems like it's worthwhile (again, proviso on code quality, et al, above, but I don't think that's what you were arguing a concern about here).

In short, I don't buy the concern about having to deal with "bug reports that aren't bug reports" or that they should be a barrier to feature work. We already see complaints pretty regularly about optimizations we don't have, and this would at least be a step in the right direction for dealing with these kinds of queries, and that does seem better than doing nothing, imv.

Thanks!

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * David Rowley (david.rowley@2ndquadrant.com) wrote:
>> On 24 August 2018 at 11:34, Stephen Frost <sfrost@snowman.net> wrote:
>>> * David Rowley (david.rowley@2ndquadrant.com) wrote:
>>>> My personal opinion of only being able to completely remove the DISTINCT when there's a single item in the rtable (or a single base table) is that it's just too poor to bother with.

> Hm, so you're suggesting that this isn't the right place for this optimization to be implemented, even now, with the single-relation caveat?

There is no case where planner optimizations should depend on the length of the rtable. Full stop.

It could make sense to optimize if there is just one baserel in the join tree --- although even that is best checked only after join removal.

As an example of the difference, such an optimization should be able to optimize "select * from view" if the view contains just one base table. The rtable will list both the view and the base table, but the view is only hanging around for permissions-checking purposes; it should not affect the planner's behavior.

I've not read the patch, but David's reaction makes it sound like its processing is done too early. There are right places and wrong places to do most everything in the planner, and I do not wish to accept a patch that does something in the wrong place.

regards, tom lane
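[Editor's note: to make the rtable-vs-baserel distinction concrete, here is a hypothetical sketch of the view case; the names `t` and `v` are invented for illustration and do not appear in the patch.]

```sql
-- A view over a single base table. For a query against the view, the
-- planner's range table (rtable) lists two entries (the view and the
-- base table), but only one baserel appears in the join tree once the
-- view has been expanded; the view entry remains solely for
-- permissions checking.
CREATE TABLE t (k int PRIMARY KEY, col text);
CREATE VIEW v AS SELECT k, col FROM t;

-- A test on the length of the rtable would wrongly skip this query,
-- even though only one base relation is involved and the DISTINCT
-- covers its entire primary key:
SELECT DISTINCT k, col FROM v ORDER BY k;
```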
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * David Rowley (david.rowley@2ndquadrant.com) wrote:
> >> On 24 August 2018 at 11:34, Stephen Frost <sfrost@snowman.net> wrote:
> >>> * David Rowley (david.rowley@2ndquadrant.com) wrote:
> >>>> My personal opinion of only being able to completely remove the DISTINCT when there's a single item in the rtable (or a single base table) is that it's just too poor to bother with.
>
> > Hm, so you're suggesting that this isn't the right place for this optimization to be implemented, even now, with the single-relation caveat?
>
> There is no case where planner optimizations should depend on the length of the rtable. Full stop.
>
> It could make sense to optimize if there is just one baserel in the join tree --- although even that is best checked only after join removal.

Hm, that's certainly a fair point.

> As an example of the difference, such an optimization should be able to optimize "select * from view" if the view contains just one base table. The rtable will list both the view and the base table, but the view is only hanging around for permissions-checking purposes; it should not affect the planner's behavior.

This is happening at the same time as some optimizations around GROUP BY, so either there's something different about what's happening there and I didn't appreciate it, or does that optimization suffer from a similar issue?

> I've not read the patch, but David's reaction makes it sound like its processing is done too early. There are right places and wrong places to do most everything in the planner, and I do not wish to accept a patch that does something in the wrong place.

Right, I definitely agree with you there. This seemed like a reasonable place given the similar optimization (at least in appearance to me) being done there for the GROUP BY case.
I'm happy to admit that I haven't looked at it in very much depth (hence my question to David) and I'm not an expert in this area, but I did want to bring up that the general idea and the relative trade-offs at least sounded reasonable.

I'll also note that I didn't see these concerns raised earlier in the thread when I re-read your remarks on it, so I'm a bit concerned that perhaps either this isn't an actual concern to be realized or perhaps it was missed previously.

Thanks!

Stephen
On 24 August 2018 at 14:12, Stephen Frost <sfrost@snowman.net> wrote:
> This is happening at the same time as some optimizations around GROUP BY, so either there's something different about what's happening there and I didn't appreciate it, or does that optimization suffer from a similar issue?

There are two separate optimisations in Jim's version of the patch:

1) Attempting to remove useless DISTINCT clause items in cases where a true subset of the clauses makes up the entire table's primary key. This can be applied no matter how many joins are in the query. If a table's PK is a,b then the result is not going to be any more distinct if we distinctify on a,b,c. Distinctifying on a,b is the most we'll ever need. We already apply this optimisation to the GROUP BY clause.

2) Attempting to remove the entire DISTINCT clause if the query is on a single base table and the DISTINCT clause contains the entire table's PK. Here there's no need to uniquify the results.

I started this thread for #1 and Jim would like to add #2 as an additional optimisation. I feel that #2 is different enough from #1 that they should be considered independently of each other. The area where #2 should be implemented is far away at the other end of planning from where #1 is being done. I believe that two separate patches are the best way forward here, as it does not really make sense to reject #1 because there are concerns with #2, or vice versa.

If Jim would like to propose #2, then I think a separate thread is a better option than tagging along on this one. This might also reduce confusion. I'd also rather not turn this thread into people discussing/reviewing Jim's work that he plans to implement in some fork taken from Postgres 10.5. I don't really think even this mailing list is the right place for that, let alone this thread.

> > I've not read the patch, but David's reaction makes it sound like its processing is done too early.
> > There are right places and wrong places to do most everything in the planner, and I do not wish to accept a patch that does something in the wrong place.
>
> Right, I definitely agree with you there. This seemed like a reasonable place given the similar optimization (at least in appearance to me) being done there for the GROUP BY case. I'm happy to admit that I haven't looked at it in very much depth (hence my question to David) and I'm not an expert in this area, but I did want to bring up that the general idea and the relative trade-offs at least sounded reasonable.

You might be confusing #1 and #2. My concern is with #2. The existing GROUP BY clause optimisation is almost identical to #1. I just wanted to also apply it to the DISTINCT clause.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
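[Editor's note: the two optimisations David distinguishes can be sketched with a small example; the table `ab` and its columns are invented for illustration.]

```sql
CREATE TABLE ab (a int, b int, c text, PRIMARY KEY (a, b));

-- #1: c is functionally dependent on the primary key (a, b), so
-- "DISTINCT a, b, c" can be reduced to "DISTINCT a, b". This holds no
-- matter how many other relations the query joins to.
SELECT DISTINCT a, b, c FROM ab;

-- #2: with a single base table and the entire primary key in the
-- DISTINCT list, every row is already unique, so the DISTINCT (and
-- the Sort/Unique or HashAggregate step it would otherwise cost)
-- could be removed outright.
SELECT DISTINCT a, b FROM ab;
```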
I feel strongly that eliminating the entire DISTINCT or GROUP BY clause (when there are no aggs) is an important optimization, especially when the incremental cost to test for it is so tiny. I'm happy to submit that as a separate thread. My goal here was to move the original proposal along and contribute a little something back to the community in the process.

DISTINCT optimization is currently quite poor compared to the leading commercial RDBMS alternatives, and doing unnecessary DISTINCT in the single-table case is an example of that. There are other missing DISTINCT optimizations.

I'll explore a proper way to test that it's in the single-relation case, and will post a separate thread for the 'remove unnecessary DISTINCT' optimization.

Cheers,

/Jim

On 8/23/18, 11:12 PM, "David Rowley" <david.rowley@2ndquadrant.com> wrote:

    You might be confusing #1 and #2. My concern is with #2. The existing GROUP BY clause optimisation is almost identical to #1. I just wanted to also apply it to the DISTINCT clause.

    --
    David Rowley http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
On 25 August 2018 at 04:05, Finnerty, Jim <jfinnert@amazon.com> wrote: > I'll explore a proper way to test that it's in the single-relation case, and will post a separate thread for the 'removeunnecessary DISTINCT' optimization. See: has_multiple_baserels() in allpaths.c -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services