Thread: Fixing row comparison semantics
I've gotten interested again in the issue of row comparisons, eg(a, b, c) >= (1, 2, 3) We've discussed this before, the most comprehensive thread being http://archives.postgresql.org/pgsql-performance/2004-07/msg00188.php but nothing's gotten done. Unless someone's already working on this I think I will take it up. I said in that thread that I'd like to see the implementation handle nonstandard operators, for example(a, b) ~<~ ('foo', 'bar') ought to match the sorting semantics of an index using text_pattern_ops. This means that we need some way of figuring out the semantics to associate with the row comparison. I think the individual pairwise comparison operators could be looked up in the normal way based on the datatypes of the left and right inputs, but the row comparison itself has to understand whether it's doing "=", "<", etc in order to apply the operators in the right way. We can't just rely on the operator name if we are to support nonstandard operators. I think the right way to deal with it is to look up the pairwise operators in pg_amop to see if they are members of btree index opclasses. This means that the row-wise construct would fail for any operators that are not in opclasses, but that doesn't seem like a serious problem in practice. (Note: we can handle <> by seeing if the operator has a negator that is btree equality. All this logic exists already in predtest.c.) The tricky part is what if the operators are found in more than one opclass --- for example, suppose someone has installed a reverse-sort opclass? I suggest the following rules: 1. Determine which interpretations (btree strategy numbers) exist for each pairwise operator. There must be at least one interpretation that is common to all the operators, else fail (for instance, it doesn't help if we can identify one operator as "<" and another as ">"). 2. If there is more than one common interpretation, prefer the one that uses the largest number of default opclasses. If there's a tie, we could either reject the construct as ambiguous, or select one of the possibilities arbitrarily ... any thoughts about that? 3. A given operator could have the selected interpretation in more than one opclass. Prefer the default opclass if any; otherwise, again we have the choice of rejecting or making an arbitrary choice. Notice that I'm assuming we need to identify a specific opclass for each pairwise operator. This is not strictly necessary for the = and <> cases, because there you just need the given operator; but it is necessary for the < <= > >= cases, where you must have a notion of equality as well as the particular inequality operator. It may be worth stopping once we've identified the common interpretation if it's = or <>. Comments? regards, tom lane
> I've gotten interested again in the issue of row comparisons, eg > (a, b, c) >= (1, 2, 3) > We've discussed this before, the most comprehensive thread being > http://archives.postgresql.org/pgsql-performance/2004-07/msg00188.php > but nothing's gotten done. Unless someone's already working on this > I think I will take it up. Can someone explain to me how: (a, b) < (1, 2) is different to a < 1 and b < 2 ? Chris
On Fri, Dec 23, 2005 at 03:18:21PM -0500, Tom Lane wrote: > I've gotten interested again in the issue of row comparisons, eg > (a, b, c) >= (1, 2, 3) > We've discussed this before, the most comprehensive thread being > http://archives.postgresql.org/pgsql-performance/2004-07/msg00188.php > but nothing's gotten done. Unless someone's already working on this > I think I will take it up. <snip> Since this is related to the COLLATE stuff I'm working on I'd like to make a few comments. > 1. Determine which interpretations (btree strategy numbers) exist for > each pairwise operator. There must be at least one interpretation that > is common to all the operators, else fail (for instance, it doesn't help > if we can identify one operator as "<" and another as ">"). One thing my COLLATE patch does is distinguish between collations and operator classes. So the reverse operator class issue disappears because it's just a collation and doesn't need a operator class (although it won't break anything, see below). > 2. If there is more than one common interpretation, prefer the one that > uses the largest number of default opclasses. If there's a tie, we > could either reject the construct as ambiguous, or select one of the > possibilities arbitrarily ... any thoughts about that? In standard SQL, each node in a query has a collation. Columns use the collation they were given when the table was created, constants use the default for the type. It's a little more complicated than that, see the standard for details. Anyway, a collation identifies a btree operator class so this problem solves itself. For each pair of values you are comparing, determine the collation and look up the operator class to ensure you're using the same strategy type. There are minor details relating to reverse collations but they're minor. The only problem reverse operator classes bring here is that the system won't realise it and thus won't know that the index is usable. Unless the user specifies the collation as part of the query. > 3. A given operator could have the selected interpretation in more than > one opclass. Prefer the default opclass if any; otherwise, again we > have the choice of rejecting or making an arbitrary choice. If there's a problem, bail. The standard allows you to specify the collation on a per node basis so any ambiguities can be resolved by the user. So something like: (a COLLATE hungarian, b COLLATE posix, c COLLATE ignorecase) >= ('x','y','z') Would know exactly what to do (and if you could use an index)... Now, since COLLATE support is still in progress, I'm not sure how much any of this helps you. I'm up to modifying the scankeys but it's hard when you jave to keep rgrepping the tree to work out what is called from where... For other people reading this thread, the reason why it can't be decomposed into (a>=1 AND b>=2 AND c>=3) is because the standard treats the row as a unit, checking left to right, so: (4,0,0) < (5,0,0) (1,2,3) > (0,7,8) So it needs a new node type and needs to know which index to use. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
> Now, since COLLATE support is still in progress, I'm not sure how much > any of this helps you. I'm up to modifying the scankeys but it's hard > when you jave to keep rgrepping the tree to work out what is called > from where... src/tools/make_ctags is your friend... Chris
On Sat, Dec 24, 2005 at 06:05:58PM +0800, Christopher Kings-Lynne wrote: > >Now, since COLLATE support is still in progress, I'm not sure how much > >any of this helps you. I'm up to modifying the scankeys but it's hard > >when you jave to keep rgrepping the tree to work out what is called > >from where... > > src/tools/make_ctags is your friend... That just shows you where a symbol is defined, not where it's called from. When you change the parameters of a function, you need to make sure you found all the places that used it... IOW, it's good for going down the call tree, but not up it. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Am Samstag, 24. Dezember 2005 11:46 schrieb Martijn van Oosterhout: > That just shows you where a symbol is defined, not where it's called > from. I've never used ctags, but etags certainly do what you ask for.
On Sat, Dec 24, 2005 at 12:25:41PM +0100, Peter Eisentraut wrote: > Am Samstag, 24. Dezember 2005 11:46 schrieb Martijn van Oosterhout: > > That just shows you where a symbol is defined, not where it's called > > from. > > I've never used ctags, but etags certainly do what you ask for. Really? I've searched the emacs documentation but don't see it. M-. finds the definition of a tag, but how do you find usages of a tag? Thanks in advance, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes: > One thing my COLLATE patch does is distinguish between collations and > operator classes. So the reverse operator class issue disappears > because it's just a collation and doesn't need a operator class > (although it won't break anything, see below). Are you suggesting that COLLATE will impose comparison semantics on all datatypes including non-string types? If so, I'd be interested to know what you have in mind. If not, claiming that it makes the issue go away is nonsensical. regards, tom lane
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: >> Now, since COLLATE support is still in progress, I'm not sure how much >> any of this helps you. I'm up to modifying the scankeys but it's hard >> when you jave to keep rgrepping the tree to work out what is called >> from where... > src/tools/make_ctags is your friend... If grep -r is too slow for you, there's a package called glimpse that I've used for years. It builds a full-text index of any specified collection of files, and then does grep-like searches nearly instantaneously. The output format is the same as grep so you can teach emacs to visit all the hits. regards, tom lane
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > Can someone explain to me how: > (a, b) < (1, 2) > is different to > a < 1 and b < 2 Right at the moment our code interprets it that way, but this behavior is wrong per spec. It should be an ordered column-by-column comparison, so that the equivalent simple expression is (a < 1) OR (a = 1 AND b < 2) regards, tom lane
Martijn van Oosterhout wrote: > > src/tools/make_ctags is your friend... > > That just shows you where a symbol is defined, not where it's called > from. When you change the parameters of a function, you need to make > sure you found all the places that used it... > > IOW, it's good for going down the call tree, but not up it. I use cscope for that. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Sat, Dec 24, 2005 at 09:38:23AM -0500, Tom Lane wrote: > Are you suggesting that COLLATE will impose comparison semantics on > all datatypes including non-string types? If so, I'd be interested > to know what you have in mind. If not, claiming that it makes the > issue go away is nonsensical. Well, yes, on all data types. It needs to be done for string types and it would be nice for user-defined data types, so you may as well do it for all types. It avoids adding special cases, which is a good thing, IMHO. Every data type has at least two collations, ascending and descending. So instead of all the current stuff with reverse operator classes, you'll just be able to declare your index as: CREATE INDEX blah ON foo (a, b COLLATE DESC); And it'll be able to be used for queries using ORDER BY a, b DESC. String data types are just the obvious example of types that have many different collations and they do have the most possibilities. But I think that user-defined collations would be a powerful idea. All they need to do is create a btree operator class that describes the basic order and then they can use this as a collation anywhere they like. I hope you are not thinking of restricting collations to just string types, because the special cases would be dreadful. Doing it this way just means that most places dealing with order only need to worry about the collation eg pathkeys and not the implementation. Technically speaking, NULLS FIRST/LAST are also a form of collation but I'm not going to touch those until I can at least replicate current functionality, and they are not relevent for row comparisons anyway. Collations to operator classes are a many-to-one relationship. I can see situations where you would have 20 collations using a single operator class. Locale specific ordering is really just a subset of collation. At the moment it just uses the xlocale support present in glibc/MacOS X/Win32 but my hope is that it will be pluggable in the sense that you'll be able to say: CREATE LOCALE hungarian AS 'hu_HU' USING glibc; CREATE LOCALE serbian_us AS 'sr_Latn_YU_REVISED@currency=USD' USING icu; (The latter being: Serbian (Latin, Yugoslavia, Revised Orthography, Currency=US Dollar. Example taken from ICU website). Then you can use these in column declarations and have them automatically use that locale for comparisons. It isn't as hard as it looks but it does touch a lot of different places in the backend. If you want technical details I can do that too (the summary on pg-patches a while ago is now wildly out of date). Currently I'm trying to get up to speed on pathkeys and indexes before the tree drifts too far... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Alvaro Herrera wrote: > Martijn van Oosterhout wrote: > > > > src/tools/make_ctags is your friend... > > > > That just shows you where a symbol is defined, not where it's called > > from. When you change the parameters of a function, you need to make > > sure you found all the places that used it... > > > > IOW, it's good for going down the call tree, but not up it. > > I use cscope for that. There is also id-utils, which is mentioned in our developer FAQ. That's what I use, but it is strictly command-line or 'edit all files with X', while I think glimpse and cscope are more like applications, though they probably have command-line interfaces too. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane wrote: > Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > > Can someone explain to me how: > > (a, b) < (1, 2) > > is different to > > a < 1 and b < 2 > > Right at the moment our code interprets it that way, but this behavior > is wrong per spec. It should be an ordered column-by-column comparison, > so that the equivalent simple expression is > > (a < 1) OR (a = 1 AND b < 2) TODO updated: * %Make row-wise comparisons work per SQL spec Right now, '(a, b) < (1, 2)' is processed as 'a < 1 and b < 2', but theSQL standard requires it to be processed as a column-by-column comparison, so the proper comparison is '(a < 1) OR (a= 1 AND b < 2)' -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
> >TODO updated: > > * %Make row-wise comparisons work per SQL spec > > Right now, '(a, b) < (1, 2)' is processed as 'a < 1 and b < 2', but > the SQL standard requires it to be processed as a column-by-column > comparison, so the proper comparison is '(a < 1) OR (a = 1 AND b < 2)' > > Can we save current behave (with small modification) with other operator, like <* (1,1) <* (1,2) = true (1,2) <* (2,1) is NULL (2,3) <* (1,2) = false it's usefull for multicriterial optimalisation Regards Pavel Stehule _________________________________________________________________ Chcete sdilet sve obrazky a hudbu s prateli? http://messenger.msn.cz/
On Mon, Dec 26, 2005 at 01:29:19PM +0100, Pavel Stehule wrote: > Can we save current behave (with small modification) with other operator, > like <* > > (1,1) <* (1,2) = true > (1,2) <* (2,1) is NULL > (2,3) <* (1,2) = false > > it's usefull for multicriterial optimalisation That's strange. That's not just doing less-than, but also distinguishing between equal-to and greater-than. So at the very least you've have to choose an operator like <=>. Seems to me you should just define your own operator on an array type and use that. I don't think the above could use an index scan for speeding it up so there's no point trying to treat it specially. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes: > If you want technical details I can do that too (the summary on > pg-patches a while ago is now wildly out of date). Currently I'm trying > to get up to speed on pathkeys and indexes before the tree drifts too > far... I've given this advice before to other people: trying to develop a large patch "in hiding" is doomed to failure. And the sort of patch you are talking about isn't large ... it's massive. Combine that with the fact that you don't even seem to have gotten any pghackers buy-in yet on what you are doing, and you are setting yourself up to have the patch rejected, in the unlikely scenario that it's ever completed. My bet is that you by yourself will be unable to complete it, simply because the tree will drift under you faster than you can respond. (Case in point: my current project on row-wise comparisons is going to affect ScanKeys. I'm not sure how yet, but in designing that I won't be considering what impact it might have on you, because I have no idea what you might be trying to do in that area.) I would recommend posting some fairly detailed design discussions concerning what you see as the new semantics, API, and catalog representation for operators and operator classes. If you haven't got buy-in at that level from the hackers list, it's premature to be writing any code at all. I would further recommend that you ask for help rather than trying to complete the project by yourself. regards, tom lane
"Pavel Stehule" <pavel.stehule@hotmail.com> writes: >> Right now, '(a, b) < (1, 2)' is processed as 'a < 1 and b < 2', but >> the SQL standard requires it to be processed as a column-by-column >> comparison, so the proper comparison is '(a < 1) OR (a = 1 AND b < 2)' > Can we save current behave (with small modification) with other operator, > like <* Huh? The only "current behavior" with other operators is failure: regression=# select (1,1) <* (1,2); ERROR: operator <* is not supported for row expressions In any case, you can get the equivalent of the current behavior by writing out1 <* 1 AND 1 <* 2 so I don't see any strong need to support non-SQL-spec behaviors here. regards, tom lane
> >Huh? The only "current behavior" with other operators is failure: you didn't understand me. I know so operator <* isn't supported now. I prefere SQL spec behave too. But what I wont: a <* b ~ ai <= bi and one ai < bi => true ; if one ai > bi => NULL; else false but this behave is from some views really chaotic. This comparation is used in operation research, but propably is far to ideas ANSI SQL. It was only idea. > >regression=# select (1,1) <* (1,2); >ERROR: operator <* is not supported for row expressions > >In any case, you can get the equivalent of the current behavior by >writing out > 1 <* 1 AND 1 <* 2 >so I don't see any strong need to support non-SQL-spec behaviors here. > > regards, tom lane _________________________________________________________________ Najdete si svou lasku a nove pratele na Match.com. http://www.msn.cz/
On 12/26/05, Pavel Stehule <pavel.stehule@hotmail.com> wrote: > (1,1) <* (1,2) = true > (1,2) <* (2,1) is NULL > (2,3) <* (1,2) = false > > it's usefull for multicriterial optimalisation This is indeed a sane and useful function which should be adopted by the SQL standard.. in postgresql this would easily enough be implemented as a user function so I'm not sure we need special support for it. The idea is that in a multidimension comparison you can only sometimes say when one tuple is strictly less than (or greater than) another because differing dimensions are incomparable. So, like his example, we can not say if (1,2) is lesser or greater than (2,1) because saying so would require some priority of the dimensions which may not be known or may not exist, it is only clear that they are not equal..
Greetings to all, On Mon, Dec 26, 2005 at 10:04:59AM -0500, Tom Lane wrote: > I would recommend posting some fairly detailed design discussions > concerning what you see as the new semantics, API, and catalog > representation for operators and operator classes. If you haven't > got buy-in at that level from the hackers list, it's premature to be > writing any code at all. Well, the problem works the other way around too. Unless you're familiar with postgres internals you can't talk sensebly about the changes required. Until two days ago I wasn't sure that my representation was sufficient for everything I wanted it to do. Anyway, the details posted before to -patches [1] and to -hackers [2] produced much discussion about locales but didn't consider the actual patch itself. I thought they were quite detailed and got exactly one useful reaction: don't confuse collations and locales. Good advice and the patch is better for it but I figured I needed to come up with something more substantial for people to take notice. Also, don't be confused about the time interval, there was zero development on it from September until a few days ago, the changes relative to then are not that large. Now I can (almost) initdb again and have no more catalog changes I was thinking of posting another patch with summary and see how far it gets this time. BTW, my stuff won't change how ScanKeys work, they'll probably just provide another way to initialise them. Have a nice day, [1] http://archives.postgresql.org/pgsql-patches/2005-09/msg00022.php [2] http://archives.postgresql.org/pgsql-hackers/2005-09/msg00110.php PS. The web archive for pg-patches seems to be lagging about 4 days at the moment. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
On Mon, Dec 26, 2005 at 15:12:48 -0500, Gregory Maxwell <gmaxwell@gmail.com> wrote: > On 12/26/05, Pavel Stehule <pavel.stehule@hotmail.com> wrote: > > (1,1) <* (1,2) = true > > (1,2) <* (2,1) is NULL > > (2,3) <* (1,2) = false > > > > it's usefull for multicriterial optimalisation > > This is indeed a sane and useful function which should be adopted by > the SQL standard.. in postgresql this would easily enough be > implemented as a user function so I'm not sure we need special support > for it. > > The idea is that in a multidimension comparison you can only sometimes > say when one tuple is strictly less than (or greater than) another > because differing dimensions are incomparable. So, like his example, > we can not say if (1,2) is lesser or greater than (2,1) because saying > so would require some priority of the dimensions which may not be > known or may not exist, it is only clear that they are not equal.. That's normally called a partial order. From CS classes, I don't remember ever seeing < and > combined that way. Normally you just looked at whether or not < was true or whether or not > was true. Is it common in practice for people to want the answer to both of these at once? Also if you are really dealing with <= and >= (as your example above seems to imply), then there are 4 possible states (the new one being a = b) and you can't represent that with just 3 possible results.