Thread: Seemingly inconsistent ORDER BY behavior
Consider the following SQL:
---------------------------------------------------
D. Richard Hipp
drh@sqlite.org
---------------------------------------------------
CREATE TABLE t1(m VARCHAR(4));
INSERT INTO t1 VALUES('az');
INSERT INTO t1 VALUES('by');
INSERT INTO t1 VALUES('cx');
SELECT '1', substr(m,2) AS m
FROM t1
ORDER BY m;
SELECT '2', substr(m,2) AS m
FROM t1
ORDER BY lower(m);
---------------------------------------------------
Using PostgreSQL 9.1.5, the first query returns x-y-z while the second returns z-y-x. Is this "correct"? It certainly is surprising to me.
I'm asking because the same question has come up in SQLite and whenever there is a language dispute in SQLite, our first instinct is to find out what PostgreSQL does and try to do the same thing. SQLite version 3.7.15 was behaving the same way as PostgreSQL 9.1.5 (unbeknownst to us at the time). Then a bug was written about the inconsistent behavior of ORDER BY. We fixed that bug so that the latest SQLite answers x-y-z in both cases. Now someone is complaining that the "fix" was really a 'break". Is it? Or is there an equivalent bug in PostgreSQL?
There are, of course, many ways to resolve the ambiguity (such as using a unique label for the result column, or by saying "t1.m" instead of just "m" when you mean the column of the table). But that is not really the point here. The question is, how should symbolic names in the ORDER BY clause be resolved? Should column names in the source table take precedence over result column name, or should it be the other way around?
Any insights are appreciated. Please advise if a different mailing list would be more appropriate for this question.
-- INSERT INTO t1 VALUES('az');
INSERT INTO t1 VALUES('by');
INSERT INTO t1 VALUES('cx');
SELECT '1', substr(m,2) AS m
FROM t1
ORDER BY m;
SELECT '2', substr(m,2) AS m
FROM t1
ORDER BY lower(m);
---------------------------------------------------
Using PostgreSQL 9.1.5, the first query returns x-y-z while the second returns z-y-x. Is this "correct"? It certainly is surprising to me.
I'm asking because the same question has come up in SQLite and whenever there is a language dispute in SQLite, our first instinct is to find out what PostgreSQL does and try to do the same thing. SQLite version 3.7.15 was behaving the same way as PostgreSQL 9.1.5 (unbeknownst to us at the time). Then a bug was written about the inconsistent behavior of ORDER BY. We fixed that bug so that the latest SQLite answers x-y-z in both cases. Now someone is complaining that the "fix" was really a 'break". Is it? Or is there an equivalent bug in PostgreSQL?
There are, of course, many ways to resolve the ambiguity (such as using a unique label for the result column, or by saying "t1.m" instead of just "m" when you mean the column of the table). But that is not really the point here. The question is, how should symbolic names in the ORDER BY clause be resolved? Should column names in the source table take precedence over result column name, or should it be the other way around?
Any insights are appreciated. Please advise if a different mailing list would be more appropriate for this question.

D. Richard Hipp
drh@sqlite.org
On Wed, Aug 14, 2013 at 12:01 PM, Richard Hipp <drh@sqlite.org> wrote: > > Consider the following SQL: > > --------------------------------------------------- > CREATE TABLE t1(m VARCHAR(4)); > INSERT INTO t1 VALUES('az'); > INSERT INTO t1 VALUES('by'); > INSERT INTO t1 VALUES('cx'); > > SELECT '1', substr(m,2) AS m > FROM t1 > ORDER BY m; > > SELECT '2', substr(m,2) AS m > FROM t1 > ORDER BY lower(m); > --------------------------------------------------- > > Using PostgreSQL 9.1.5, the first query returns x-y-z while the second returns z-y-x. Is this "correct"? It certainlyis surprising to me. > > I'm asking because the same question has come up in SQLite and whenever there is a language dispute in SQLite, our firstinstinct is to find out what PostgreSQL does and try to do the same thing. SQLite version 3.7.15 was behaving the sameway as PostgreSQL 9.1.5 (unbeknownst to us at the time). Then a bug was written about the inconsistent behavior of ORDERBY. We fixed that bug so that the latest SQLite answers x-y-z in both cases. Now someone is complaining that the "fix"was really a 'break". Is it? Or is there an equivalent bug in PostgreSQL? > > There are, of course, many ways to resolve the ambiguity (such as using a unique label for the result column, or by saying"t1.m" instead of just "m" when you mean the column of the table). But that is not really the point here. The questionis, how should symbolic names in the ORDER BY clause be resolved? Should column names in the source table take precedenceover result column name, or should it be the other way around? > > Any insights are appreciated. Please advise if a different mailing list would be more appropriate for this question. My guess without testing is that order by lower(m) is not what you think it is. substr(m,2) as m is bad form. Always use a new and unique alias, like m1. How does this work: SELECT '2', substr(m,2) AS m1 FROM t1 ORDER BY lower(m1);
On Wed, Aug 14, 2013 at 2:28 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
substr(m,2) as m
is bad form. Always use a new and unique alias, like m1. How does this work:
SELECT '2', substr(m,2) AS m1
FROM t1
ORDER BY lower(m1);
Tnx. I think everybody agrees that "substr(m,2) as m" is bad form. And all the database engines get the same consistent answer when you avoid the bad form and use "substr(m,2) as m1" instead. The question is, what should the database engine do when the programmer disregards sounds advice and uses the bad form anyhow?
--
D. Richard Hipp
drh@sqlite.org
On Wed, Aug 14, 2013 at 12:31 PM, Richard Hipp <drh@sqlite.org> wrote: > > > On Wed, Aug 14, 2013 at 2:28 PM, Scott Marlowe <scott.marlowe@gmail.com> > wrote: >> >> On Wed, Aug 14, 2013 at 12:01 PM, Richard Hipp <drh@sqlite.org> wrote: >> >> substr(m,2) as m >> >> is bad form. Always use a new and unique alias, like m1. How does this >> work: >> >> SELECT '2', substr(m,2) AS m1 >> FROM t1 >> ORDER BY lower(m1); > > > Tnx. I think everybody agrees that "substr(m,2) as m" is bad form. And all > the database engines get the same consistent answer when you avoid the bad > form and use "substr(m,2) as m1" instead. The question is, what should the > database engine do when the programmer disregards sounds advice and uses the > bad form anyhow? My guess is that either the SQL spec says it's system determined OR that the way postgres does it is right. And I'm leaning towards the second. Someone with the spec hand would have to look it up.
Richard Hipp <drh@sqlite.org> writes: > There are, of course, many ways to resolve the ambiguity (such as using a > unique label for the result column, or by saying "t1.m" instead of just "m" > when you mean the column of the table). But that is not really the point > here. The question is, how should symbolic names in the ORDER BY clause be > resolved? Should column names in the source table take precedence over > result column name, or should it be the other way around? Our interpretation is that a bare column name ("ORDER BY foo") is resolved first as an output-column label, or failing that as an input-column name. However, as soon as you embed a name in an expression, it will be treated *only* as an input column name. The SQL standard is not a lot of help here. In SQL92, the only allowed forms of ORDER BY arguments were an output column name or an output column number. SQL99 and later dropped that definition (acknowledging that they were being incompatible) and substituted some fairly impenetrable verbiage that seems to boil down to allowing input column names that can be within expressions. At least that's how we've chosen to read it. Our current behavior is a compromise that tries to support both editions of the spec. regards, tom lane
On Wed, Aug 14, 2013 at 2:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Our interpretation is that a bare column name ("ORDER BY foo") is resolved
first as an output-column label, or failing that as an input-column name.
However, as soon as you embed a name in an expression, it will be treated
*only* as an input column name.
The SQL standard is not a lot of help here. In SQL92, the only allowed
forms of ORDER BY arguments were an output column name or an output column
number. SQL99 and later dropped that definition (acknowledging that they
were being incompatible) and substituted some fairly impenetrable verbiage
that seems to boil down to allowing input column names that can be within
expressions. At least that's how we've chosen to read it. Our current
behavior is a compromise that tries to support both editions of the spec.
Thanks for the explanation, Tom.
Just to be clear, you intend that a COLLATE clause in the ORDER BY is treated as an expression, right? So that the two queries in the following SQL output rows in the opposite order:
------------------------
CREATE TABLE t1(m VARCHAR(4));
INSERT INTO t1 VALUES('az');
INSERT INTO t1 VALUES('by');
INSERT INTO t1 VALUES('cx');
SELECT '1', substr(m,2) AS m FROM t1 ORDER BY m;
SELECT '2', substr(m,2) AS m FROM t1 ORDER BY m COLLATE "POSIX";
------------------------
Just to be clear, you intend that a COLLATE clause in the ORDER BY is treated as an expression, right? So that the two queries in the following SQL output rows in the opposite order:
------------------------
CREATE TABLE t1(m VARCHAR(4));
INSERT INTO t1 VALUES('az');
INSERT INTO t1 VALUES('by');
INSERT INTO t1 VALUES('cx');
SELECT '1', substr(m,2) AS m FROM t1 ORDER BY m;
SELECT '2', substr(m,2) AS m FROM t1 ORDER BY m COLLATE "POSIX";
------------------------
If that is not correct, please let me know because I am about to change SQLite to work exactly as PostgreSQL does.
--
D. Richard Hipp
drh@sqlite.org
On Wed, 2013-08-14 at 14:01 -0400, Richard Hipp wrote: > CREATE TABLE t1(m VARCHAR(4)); > INSERT INTO t1 VALUES('az'); > INSERT INTO t1 VALUES('by'); > INSERT INTO t1 VALUES('cx'); > > SELECT '1', substr(m,2) AS m > FROM t1 > ORDER BY m; > > SELECT '2', substr(m,2) AS m > FROM t1 > ORDER BY lower(m); You cannot cast your ORDER BY column value. Instead:- SELECT '2', LOWER(substr(m,2)) AS m FROM t1 ORDER BY m; will have the desired effect.
Richard Hipp <drh@sqlite.org> writes: > Just to be clear, you intend that a COLLATE clause in the ORDER BY is > treated as an expression, right? Yeah, that is the current behavior, although I think that it probably fell out rather than being consciously thought about. Arguably it's the wrong thing, because it looks to me like SQL92 allowed select ... order by 1 collate "something" which our current code will interpret as ordering by a constant (and then complain that you can't collate an int...). Don't know if it's worth changing --- we have our own backwards-compatibility concerns to think about, and there haven't been any field complaints about this. I doubt there's any pressure at this point to become more compatible with SQL92, as opposed to more compatible with later editions. regards, tom lane
On Wed, Aug 14, 2013 at 2:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Our interpretation is that a bare column name ("ORDER BY foo") is resolved
first as an output-column label, or failing that as an input-column name.
However, as soon as you embed a name in an expression, it will be treated
*only* as an input column name.
The SQL standard is not a lot of help here. In SQL92, the only allowed
forms of ORDER BY arguments were an output column name or an output column
number. SQL99 and later dropped that definition (acknowledging that they
were being incompatible) and substituted some fairly impenetrable verbiage
that seems to boil down to allowing input column names that can be within
expressions. At least that's how we've chosen to read it. Our current
behavior is a compromise that tries to support both editions of the spec.
Asking as a comparative know-nothing who would like to be more informed, is there something wrong with the notion of throwing an error that m in the ORDER BY clause is ambiguous here? As near as I can tell, it really is ambiguous as long as both input or output columns are accepted, so either way is essentially a total guess about what the user wants. It seems to me that throwing an error would be the most intuitive and clearly defined way of handling this case.
On Fri, Aug 16, 2013 at 8:37 PM, BladeOfLight16 <bladeoflight16@gmail.com> wrote: > On Wed, Aug 14, 2013 at 2:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Our interpretation is that a bare column name ("ORDER BY foo") is resolved >> first as an output-column label, or failing that as an input-column name. >> However, as soon as you embed a name in an expression, it will be treated >> *only* as an input column name. >> >> The SQL standard is not a lot of help here. In SQL92, the only allowed >> forms of ORDER BY arguments were an output column name or an output column >> number. SQL99 and later dropped that definition (acknowledging that they >> were being incompatible) and substituted some fairly impenetrable verbiage >> that seems to boil down to allowing input column names that can be within >> expressions. At least that's how we've chosen to read it. Our current >> behavior is a compromise that tries to support both editions of the spec. > > > Asking as a comparative know-nothing who would like to be more informed, is > there something wrong with the notion of throwing an error that m in the > ORDER BY clause is ambiguous here? As near as I can tell, it really is > ambiguous as long as both input or output columns are accepted, so either > way is essentially a total guess about what the user wants. It seems to me > that throwing an error would be the most intuitive and clearly defined way > of handling this case. Well it's not likely that the current behaviour will be changed since there are likely apps that rely on it working (sort of) the way it is. A warning or notice might make sense then. -- To understand recursion, one must first understand recursion.