Thread: Fixing row comparison semantics

Fixing row comparison semantics

From

Tom Lane

Date:

23 December 2005, 16:18:23

I've gotten interested again in the issue of row comparisons, eg(a, b, c) >= (1, 2, 3)
We've discussed this before, the most comprehensive thread being
http://archives.postgresql.org/pgsql-performance/2004-07/msg00188.php
but nothing's gotten done.  Unless someone's already working on this
I think I will take it up.

I said in that thread that I'd like to see the implementation handle
nonstandard operators, for example(a, b) ~<~ ('foo', 'bar')
ought to match the sorting semantics of an index using text_pattern_ops.
This means that we need some way of figuring out the semantics to
associate with the row comparison.  I think the individual pairwise
comparison operators could be looked up in the normal way based on the
datatypes of the left and right inputs, but the row comparison itself
has to understand whether it's doing "=", "<", etc in order to apply the
operators in the right way.  We can't just rely on the operator name
if we are to support nonstandard operators.

I think the right way to deal with it is to look up the pairwise
operators in pg_amop to see if they are members of btree index
opclasses.  This means that the row-wise construct would fail for any
operators that are not in opclasses, but that doesn't seem like a
serious problem in practice.  (Note: we can handle <> by seeing if the
operator has a negator that is btree equality.  All this logic exists
already in predtest.c.)

The tricky part is what if the operators are found in more than one
opclass --- for example, suppose someone has installed a reverse-sort
opclass?  I suggest the following rules:

1. Determine which interpretations (btree strategy numbers) exist for
each pairwise operator.  There must be at least one interpretation that
is common to all the operators, else fail (for instance, it doesn't help
if we can identify one operator as "<" and another as ">").

2. If there is more than one common interpretation, prefer the one that
uses the largest number of default opclasses.  If there's a tie, we
could either reject the construct as ambiguous, or select one of the
possibilities arbitrarily ... any thoughts about that?

3. A given operator could have the selected interpretation in more than
one opclass.  Prefer the default opclass if any; otherwise, again we
have the choice of rejecting or making an arbitrary choice.

Notice that I'm assuming we need to identify a specific opclass for each
pairwise operator.  This is not strictly necessary for the = and <>
cases, because there you just need the given operator; but it is
necessary for the < <= > >= cases, where you must have a notion of
equality as well as the particular inequality operator.  It may be worth
stopping once we've identified the common interpretation if it's = or <>.

Comments?
        regards, tom lane

Re: Fixing row comparison semantics

From

Christopher Kings-Lynne

Date:

24 December 2005, 05:25:38

> I've gotten interested again in the issue of row comparisons, eg
>     (a, b, c) >= (1, 2, 3)
> We've discussed this before, the most comprehensive thread being
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00188.php
> but nothing's gotten done.  Unless someone's already working on this
> I think I will take it up.

Can someone explain to me how:

(a, b) < (1, 2)

is different to

a < 1 and b < 2

?

Chris

Re: Fixing row comparison semantics

From

Martijn van Oosterhout

Date:

24 December 2005, 05:55:52

On Fri, Dec 23, 2005 at 03:18:21PM -0500, Tom Lane wrote:
> I've gotten interested again in the issue of row comparisons, eg
>     (a, b, c) >= (1, 2, 3)
> We've discussed this before, the most comprehensive thread being
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00188.php
> but nothing's gotten done.  Unless someone's already working on this
> I think I will take it up.

<snip>

Since this is related to the COLLATE stuff I'm working on I'd like to
make a few comments.

> 1. Determine which interpretations (btree strategy numbers) exist for
> each pairwise operator.  There must be at least one interpretation that
> is common to all the operators, else fail (for instance, it doesn't help
> if we can identify one operator as "<" and another as ">").

One thing my COLLATE patch does is distinguish between collations and
operator classes. So the reverse operator class issue disappears
because it's just a collation and doesn't need a operator class
(although it won't break anything, see below).

> 2. If there is more than one common interpretation, prefer the one that
> uses the largest number of default opclasses.  If there's a tie, we
> could either reject the construct as ambiguous, or select one of the
> possibilities arbitrarily ... any thoughts about that?

In standard SQL, each node in a query has a collation. Columns use the
collation they were given when the table was created, constants use the
default for the type. It's a little more complicated than that, see the
standard for details.

Anyway, a collation identifies a btree operator class so this problem
solves itself. For each pair of values you are comparing, determine the
collation and look up the operator class to ensure you're using the
same strategy type. There are minor details relating to reverse
collations but they're minor.

The only problem reverse operator classes bring here is that the system
won't realise it and thus won't know that the index is usable. Unless
the user specifies the collation as part of the query.

> 3. A given operator could have the selected interpretation in more than
> one opclass.  Prefer the default opclass if any; otherwise, again we
> have the choice of rejecting or making an arbitrary choice.

If there's a problem, bail. The standard allows you to specify the
collation on a per node basis so any ambiguities can be resolved by the
user.

So something like:

(a COLLATE hungarian, b COLLATE posix, c COLLATE ignorecase) >= ('x','y','z')

Would know exactly what to do (and if you could use an index)...

Now, since COLLATE support is still in progress, I'm not sure how much
any of this helps you. I'm up to modifying the scankeys but it's hard
when you jave to keep rgrepping the tree to work out what is called
from where...

For other people reading this thread, the reason why it can't be
decomposed into (a>=1 AND b>=2 AND c>=3) is because the standard treats
the row as a unit, checking left to right, so:

(4,0,0) < (5,0,0)
(1,2,3) > (0,7,8)

So it needs a new node type and needs to know which index to use.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Fixing row comparison semantics

From

Christopher Kings-Lynne

Date:

24 December 2005, 06:06:01

> Now, since COLLATE support is still in progress, I'm not sure how much
> any of this helps you. I'm up to modifying the scankeys but it's hard
> when you jave to keep rgrepping the tree to work out what is called
> from where...

src/tools/make_ctags is your friend...

Chris

Re: Fixing row comparison semantics

From

Martijn van Oosterhout

Date:

24 December 2005, 06:46:39

On Sat, Dec 24, 2005 at 06:05:58PM +0800, Christopher Kings-Lynne wrote:
> >Now, since COLLATE support is still in progress, I'm not sure how much
> >any of this helps you. I'm up to modifying the scankeys but it's hard
> >when you jave to keep rgrepping the tree to work out what is called
> >from where...
>
> src/tools/make_ctags is your friend...

That just shows you where a symbol is defined, not where it's called
from. When you change the parameters of a function, you need to make
sure you found all the places that used it...

IOW, it's good for going down the call tree, but not up it.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Fixing row comparison semantics

From

Peter Eisentraut

Date:

24 December 2005, 07:25:58

Am Samstag, 24. Dezember 2005 11:46 schrieb Martijn van Oosterhout:
> That just shows you where a symbol is defined, not where it's called
> from.

I've never used ctags, but etags certainly do what you ask for.

Re: Fixing row comparison semantics

From

Martijn van Oosterhout

Date:

24 December 2005, 07:50:32

On Sat, Dec 24, 2005 at 12:25:41PM +0100, Peter Eisentraut wrote:
> Am Samstag, 24. Dezember 2005 11:46 schrieb Martijn van Oosterhout:
> > That just shows you where a symbol is defined, not where it's called
> > from.
>
> I've never used ctags, but etags certainly do what you ask for.

Really? I've searched the emacs documentation but don't see it. M-.
finds the definition of a tag, but how do you find usages of a tag?

Thanks in advance,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Fixing row comparison semantics

From

Tom Lane

Date:

24 December 2005, 10:38:22

Martijn van Oosterhout <kleptog@svana.org> writes:
> One thing my COLLATE patch does is distinguish between collations and
> operator classes. So the reverse operator class issue disappears
> because it's just a collation and doesn't need a operator class
> (although it won't break anything, see below).

Are you suggesting that COLLATE will impose comparison semantics on
all datatypes including non-string types?  If so, I'd be interested
to know what you have in mind.  If not, claiming that it makes the
issue go away is nonsensical.
        regards, tom lane

Re: Fixing row comparison semantics

From

Tom Lane

Date:

24 December 2005, 10:45:38

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> Now, since COLLATE support is still in progress, I'm not sure how much
>> any of this helps you. I'm up to modifying the scankeys but it's hard
>> when you jave to keep rgrepping the tree to work out what is called
>> from where...

> src/tools/make_ctags is your friend...

If grep -r is too slow for you, there's a package called glimpse that
I've used for years.  It builds a full-text index of any specified
collection of files, and then does grep-like searches nearly
instantaneously.  The output format is the same as grep so you can
teach emacs to visit all the hits.
        regards, tom lane

Re: Fixing row comparison semantics

From

Tom Lane

Date:

24 December 2005, 10:53:11

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> Can someone explain to me how:
> (a, b) < (1, 2)
> is different to
> a < 1 and b < 2

Right at the moment our code interprets it that way, but this behavior
is wrong per spec.  It should be an ordered column-by-column comparison,
so that the equivalent simple expression is
(a < 1) OR (a = 1 AND b < 2)
        regards, tom lane

Re: Fixing row comparison semantics

From

Alvaro Herrera

Date:

24 December 2005, 20:56:06

Martijn van Oosterhout wrote:

> > src/tools/make_ctags is your friend...
> 
> That just shows you where a symbol is defined, not where it's called
> from. When you change the parameters of a function, you need to make
> sure you found all the places that used it...
> 
> IOW, it's good for going down the call tree, but not up it.

I use cscope for that.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Fixing row comparison semantics

From

Martijn van Oosterhout

Date:

25 December 2005, 09:10:15

On Sat, Dec 24, 2005 at 09:38:23AM -0500, Tom Lane wrote:
> Are you suggesting that COLLATE will impose comparison semantics on
> all datatypes including non-string types?  If so, I'd be interested
> to know what you have in mind.  If not, claiming that it makes the
> issue go away is nonsensical.

Well, yes, on all data types. It needs to be done for string types and
it would be nice for user-defined data types, so you may as well do it
for all types. It avoids adding special cases, which is a good thing,
IMHO.

Every data type has at least two collations, ascending and descending.
So instead of all the current stuff with reverse operator classes,
you'll just be able to declare your index as:

CREATE INDEX blah ON foo (a, b COLLATE DESC);

And it'll be able to be used for queries using ORDER BY a, b DESC.

String data types are just the obvious example of types that have many
different collations and they do have the most possibilities. But I
think that user-defined collations would be a powerful idea. All they
need to do is create a btree operator class that describes the basic
order and then they can use this as a collation anywhere they like.

I hope you are not thinking of restricting collations to just string
types, because the special cases would be dreadful. Doing it this way
just means that most places dealing with order only need to worry about
the collation eg pathkeys and not the implementation.

Technically speaking, NULLS FIRST/LAST are also a form of collation but
I'm not going to touch those until I can at least replicate current
functionality, and they are not relevent for row comparisons anyway.
Collations to operator classes are a many-to-one relationship. I can
see situations where you would have 20 collations using a single
operator class.

Locale specific ordering is really just a subset of collation. At the
moment it just uses the xlocale support present in glibc/MacOS X/Win32
but my hope is that it will be pluggable in the sense that you'll be
able to say:

CREATE LOCALE hungarian AS 'hu_HU' USING glibc;
CREATE LOCALE serbian_us AS 'sr_Latn_YU_REVISED@currency=USD' USING icu;

(The latter being: Serbian (Latin, Yugoslavia, Revised Orthography,
Currency=US Dollar. Example taken from ICU website).

Then you can use these in column declarations and have them
automatically use that locale for comparisons. It isn't as hard as it
looks but it does touch a lot of different places in the backend.

If you want technical details I can do that too (the summary on
pg-patches a while ago is now wildly out of date). Currently I'm trying
to get up to speed on pathkeys and indexes before the tree drifts too
far...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Fixing row comparison semantics

From

Bruce Momjian

Date:

26 December 2005, 07:54:52

Alvaro Herrera wrote:
> Martijn van Oosterhout wrote:
> 
> > > src/tools/make_ctags is your friend...
> > 
> > That just shows you where a symbol is defined, not where it's called
> > from. When you change the parameters of a function, you need to make
> > sure you found all the places that used it...
> > 
> > IOW, it's good for going down the call tree, but not up it.
> 
> I use cscope for that.

There is also id-utils, which is mentioned in our developer FAQ.  That's
what I use, but it is strictly command-line or 'edit all files with X',
while I think glimpse and cscope are more like applications, though they
probably have command-line interfaces too.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Fixing row comparison semantics

From

Bruce Momjian

Date:

26 December 2005, 07:57:52

Tom Lane wrote:
> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> > Can someone explain to me how:
> > (a, b) < (1, 2)
> > is different to
> > a < 1 and b < 2
> 
> Right at the moment our code interprets it that way, but this behavior
> is wrong per spec.  It should be an ordered column-by-column comparison,
> so that the equivalent simple expression is
> 
>     (a < 1) OR (a = 1 AND b < 2)

TODO updated:
* %Make row-wise comparisons work per SQL spec  Right now, '(a, b) < (1, 2)' is processed as 'a < 1 and b < 2', but
theSQL standard requires it to be processed as a column-by-column  comparison, so the proper comparison is '(a < 1) OR
(a= 1 AND b < 2)'
 


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: Fixing row comparison semantics

From

"Pavel Stehule"

Date:

26 December 2005, 08:29:21

>
>TODO updated:
>
>    * %Make row-wise comparisons work per SQL spec
>
>      Right now, '(a, b) < (1, 2)' is processed as 'a < 1 and b < 2', but
>      the SQL standard requires it to be processed as a column-by-column
>      comparison, so the proper comparison is '(a < 1) OR (a = 1 AND b < 2)'
>
>

Can we save current behave (with small modification) with other operator, 
like <*

(1,1) <* (1,2) = true
(1,2) <* (2,1) is NULL
(2,3) <* (1,2) = false

it's usefull for multicriterial optimalisation

Regards
Pavel Stehule

_________________________________________________________________
Chcete sdilet sve obrazky a hudbu s prateli? http://messenger.msn.cz/

Re: Fixing row comparison semantics

From

Martijn van Oosterhout

Date:

26 December 2005, 08:40:51

On Mon, Dec 26, 2005 at 01:29:19PM +0100, Pavel Stehule wrote:
> Can we save current behave (with small modification) with other operator,
> like <*
>
> (1,1) <* (1,2) = true
> (1,2) <* (2,1) is NULL
> (2,3) <* (1,2) = false
>
> it's usefull for multicriterial optimalisation

That's strange. That's not just doing less-than, but also
distinguishing between equal-to and greater-than. So at the very least
you've have to choose an operator like <=>.

Seems to me you should just define your own operator on an array type
and use that. I don't think the above could use an index scan for
speeding it up so there's no point trying to treat it specially.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Fixing row comparison semantics

From

Tom Lane

Date:

26 December 2005, 11:04:59

Martijn van Oosterhout <kleptog@svana.org> writes:
> If you want technical details I can do that too (the summary on
> pg-patches a while ago is now wildly out of date). Currently I'm trying
> to get up to speed on pathkeys and indexes before the tree drifts too
> far...

I've given this advice before to other people: trying to develop a large
patch "in hiding" is doomed to failure.  And the sort of patch you are
talking about isn't large ... it's massive.  Combine that with the fact
that you don't even seem to have gotten any pghackers buy-in yet on what
you are doing, and you are setting yourself up to have the patch
rejected, in the unlikely scenario that it's ever completed.  My bet is
that you by yourself will be unable to complete it, simply because the
tree will drift under you faster than you can respond.  (Case in point:
my current project on row-wise comparisons is going to affect ScanKeys.
I'm not sure how yet, but in designing that I won't be considering what
impact it might have on you, because I have no idea what you might be
trying to do in that area.)

I would recommend posting some fairly detailed design discussions
concerning what you see as the new semantics, API, and catalog
representation for operators and operator classes.  If you haven't
got buy-in at that level from the hackers list, it's premature to be
writing any code at all.

I would further recommend that you ask for help rather than trying to
complete the project by yourself.
        regards, tom lane

Re: Fixing row comparison semantics

From

Tom Lane

Date:

26 December 2005, 12:10:31

"Pavel Stehule" <pavel.stehule@hotmail.com> writes:
>> Right now, '(a, b) < (1, 2)' is processed as 'a < 1 and b < 2', but
>> the SQL standard requires it to be processed as a column-by-column
>> comparison, so the proper comparison is '(a < 1) OR (a = 1 AND b < 2)'

> Can we save current behave (with small modification) with other operator, 
> like <*

Huh?  The only "current behavior" with other operators is failure:

regression=# select (1,1) <* (1,2);
ERROR:  operator <* is not supported for row expressions

In any case, you can get the equivalent of the current behavior by
writing out1 <* 1 AND 1 <* 2
so I don't see any strong need to support non-SQL-spec behaviors here.
        regards, tom lane

Re: Fixing row comparison semantics

From

"Pavel Stehule"

Date:

26 December 2005, 15:51:32

>
>Huh?  The only "current behavior" with other operators is failure:

you didn't understand me. I know so operator <* isn't supported now.
I prefere SQL spec behave too. But what I wont:

a <* b   ~ ai <= bi and one ai < bi => true ; if one ai > bi => NULL; else 
false

but this behave is from some views really chaotic. This comparation is used 
in operation research, but propably is far to ideas ANSI SQL. It was  only 
idea.

>
>regression=# select (1,1) <* (1,2);
>ERROR:  operator <* is not supported for row expressions
>
>In any case, you can get the equivalent of the current behavior by
>writing out
>    1 <* 1 AND 1 <* 2
>so I don't see any strong need to support non-SQL-spec behaviors here.
>
>            regards, tom lane

_________________________________________________________________
Najdete si svou lasku a nove pratele na Match.com. http://www.msn.cz/

Fixing row comparison semantics

From

Gregory Maxwell

Date:

26 December 2005, 16:12:51

On 12/26/05, Pavel Stehule <pavel.stehule@hotmail.com> wrote:
> (1,1) <* (1,2) = true
> (1,2) <* (2,1) is NULL
> (2,3) <* (1,2) = false
>
> it's usefull for multicriterial optimalisation

This is indeed a sane and useful function which should be adopted by
the SQL standard.. in postgresql this would easily enough be
implemented as a user function so I'm not sure we need special support
for it.

The idea is that in a multidimension comparison you can only sometimes
say when one tuple is strictly less than (or greater than) another
because differing dimensions are incomparable.  So, like his example,
we can not say if (1,2) is lesser or greater than (2,1) because saying
so would require some priority of the dimensions which may not be
known or may not exist, it is only clear that they are not equal..

Re: Fixing row comparison semantics

From

Martijn van Oosterhout

Date:

26 December 2005, 17:48:03

Greetings to all,

On Mon, Dec 26, 2005 at 10:04:59AM -0500, Tom Lane wrote:
> I would recommend posting some fairly detailed design discussions
> concerning what you see as the new semantics, API, and catalog
> representation for operators and operator classes.  If you haven't
> got buy-in at that level from the hackers list, it's premature to be
> writing any code at all.

Well, the problem works the other way around too. Unless you're
familiar with postgres internals you can't talk sensebly about the
changes required. Until two days ago I wasn't sure that my
representation was sufficient for everything I wanted it to do.

Anyway, the details posted before to -patches [1] and to -hackers [2]
produced much discussion about locales but didn't consider the actual
patch itself. I thought they were quite detailed and got exactly one
useful reaction: don't confuse collations and locales. Good advice and
the patch is better for it but I figured I needed to come up with
something more substantial for people to take notice.

Also, don't be confused about the time interval, there was zero
development on it from September until a few days ago, the changes
relative to then are not that large.

Now I can (almost) initdb again and have no more catalog changes I was
thinking of posting another patch with summary and see how far it gets
this time. BTW, my stuff won't change how ScanKeys work, they'll
probably just provide another way to initialise them.

Have a nice day,

[1] http://archives.postgresql.org/pgsql-patches/2005-09/msg00022.php
[2] http://archives.postgresql.org/pgsql-hackers/2005-09/msg00110.php

PS. The web archive for pg-patches seems to be lagging about 4 days
at the moment.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Fixing row comparison semantics

From

Bruno Wolff III

Date:

28 December 2005, 15:22:33

On Mon, Dec 26, 2005 at 15:12:48 -0500, Gregory Maxwell <gmaxwell@gmail.com> wrote:
> On 12/26/05, Pavel Stehule <pavel.stehule@hotmail.com> wrote:
> > (1,1) <* (1,2) = true
> > (1,2) <* (2,1) is NULL
> > (2,3) <* (1,2) = false
> >
> > it's usefull for multicriterial optimalisation
> 
> This is indeed a sane and useful function which should be adopted by
> the SQL standard.. in postgresql this would easily enough be
> implemented as a user function so I'm not sure we need special support
> for it.
> 
> The idea is that in a multidimension comparison you can only sometimes
> say when one tuple is strictly less than (or greater than) another
> because differing dimensions are incomparable.  So, like his example,
> we can not say if (1,2) is lesser or greater than (2,1) because saying
> so would require some priority of the dimensions which may not be
> known or may not exist, it is only clear that they are not equal..

That's normally called a partial order. From CS classes, I don't remember
ever seeing < and > combined that way. Normally you just looked at whether
or not < was true or whether or not > was true.
Is it common in practice for people to want the answer to both of these at
once? Also if you are really dealing with <= and >= (as your example above
seems to imply), then there are 4 possible states (the new one being a = b)
and you can't represent that with just 3 possible results.