Re: [HACKERS] distinct + order by - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [HACKERS] distinct + order by |
Date | |
Msg-id | 19087.910541797@sss.pgh.pa.us Whole thread Raw |
In response to | distinct + order by (t-ishii@sra.co.jp (Tatsuo Ishii)) |
Responses |
Re: [HACKERS] distinct + order by
|
List | pgsql-hackers |
t-ishii@sra.co.jp (Tatsuo Ishii) writes: > 6.4 allows to use ORDER BY with a column that does not appear > in a target list. This is great. However, following result seems > strange to me. Sometimes DISTINCT does not remove duplication. > Is this normal behavior? I think there's something funny going on, but it's not clear that this kind of query is sensible anyway. It's easy to see what's happening. DISTINCT relies on sorting by the "distinct" column first, so that a pass of a uniq(1)-like duplicate detector can get rid of the duplicates: tree=> explain select distinct i from t; NOTICE: QUERY PLAN: Unique (cost=545.65 size=0 width=0) -> Sort (cost=545.65 size=0 width=0) -> Seq Scan on t (cost=545.65 size=11262width=4) EXPLAIN Now, if we add an ORDER BY: tree=> explain select distinct i from t order by j; NOTICE: QUERY PLAN: Unique (cost=545.65 size=0 width=0) -> Sort (cost=545.65 size=0 width=0) -> Seq Scan on t (cost=545.65 size=11262width=12) EXPLAIN Oops. The sort is evidently being done on the order-by column, thus there's no guarantee that the duplicate i values will be brought together. Still, I'd expect the Unique pass to get rid of duplicates that did manage to be adjacent. Your example shows that that's not happening. So Unique has some unexpected dependency on how the prior sort pass is done. (Maybe it assumes that the sort pass is on *exactly* the columns that were requested to be distinct, and re-uses the sort's comparison results somehow?) It seems clear that the query tree builder and the Unique executor have different assumptions about how this is supposed to work. That probably will yield bugs, so it needs to be looked at. But I'm not convinced that the example you give is a sensible query at all. If you have done a SELECT DISTINCT on column A, what does it mean to specify that the results are sorted by column B? A single row of the result might represent many table rows with different values of column B --- how should the system know which of those B values to use for sorting the result? My guess is that if you look into the SQL standard, this combination is forbidden. Thus the *real* bug is that Postgres isn't rejecting it. I'd still recommend that someone take a close look at how this example is behaving now, because even if it should be kicked out as a user error, there might be related legal cases that are also having trouble. If we did want to make this example behave in a rational way, then probably the right implementation is something like * sort by i,j* distinct-filter on i only, being careful to keep first row in each set of duplicates* sort by j This would ensure that the final sort by j uses, for each distinct i, the lowest of the j-values associated with that i. This is a totally arbitrary decision, but at least it will give reproducible results. regards, tom lane
pgsql-hackers by date: