Re: sortsupport for text - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: sortsupport for text |
Date | |
Msg-id | CAEYLb_WJMs2oK6rPNjBckQ-tg6mQO4pc_dwTMvfqFZQNGytxGA@mail.gmail.com Whole thread Raw |
In response to | Re: sortsupport for text (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: sortsupport for text
Re: sortsupport for text Re: sortsupport for text |
List | pgsql-hackers |
On 17 June 2012 21:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Sure, and in general we only expect that "=" operators mean equivalency; > a concrete example is float8 "=", which on IEEE-spec machines will say > that zero and minus zero are equal. Right; the spec says that, and we punt to the spec. No one sensible thinks that minus zero and zero floats are not equal, by virtue of the fact that if you compare them in C using ==, it evaluates to true. Equivalency as I use the term can't really be talked about without talking about sorting or less than operators. > The trick for hashing such datatypes is to be able to guarantee that > "equal" values hash to the same hash code, which is typically possible > as long as you know the equality rules well enough. We could possibly > do that for text with pure-strcoll equality if we knew all the details > of what strcoll would consider "equal", but we do not. I see. I am tentatively suggesting that we don't change the definition of equal, but allow that equivalent text values may not be equal. > See also citext for an example of a datatype where we can manage to > treat distinct textual values as equal and still hash them. I'm not talking about textually distinct values being equal/equivalent. That doesn't quite capture it (although I suppose a corollary of what I am trying to express is that equivalent though non-equal values should invariably have different textual representations in Postgres). I think a good example that illustrates the difference between equivalency and equality is the German ß character (esszet), which is regarded as equivalent to 'ss' (two 's' characters). The following German words are sorted correctly as required by de_DE.UTF-8: Arg Ärgerlich Arm Assistant Aßlar Assoziation So if you take the word "Aßlar" here - that is equivalent to "Asslar", and so strcoll("Aßlar", "Asslar") will return 0 if you have the right LC_COLLATE (if you tried this out for yourself and found that I was actually lying through my teeth, pretend I said Hungarian instead of German and "some really obscure character" rather than ß). It seems a bit unsatisfactory to me that a unique constraint will happily accept these two equivalent strings because they're not bitwise identical, when the collation was supposed to have that normalisation baked in. At the same time, I find it intuitively obvious that "Aßlar" == "Asslar" is false, and at the very least I think that very few people would not grant that it is a not unreasonable state of affairs for both of those two things to hold, at least on the face of it (presuming that I wasn't actually lying about the particulars of this, which I was, since this is apparently confined to less popular locales and I'm too lazy to research the exact details). Now, I do realise that there is what might appear to be a tension in what I'm saying; it is precisely the fact that we can traverse a btree index using comparators that allows a btree to satisfy an equality condition (or an equivalency condition; however you choose to characterise whatever it is that the '=' operator does). To restate the problem: The '=' operator implies equivalency and not equality. Or does it? Look at this code from nbtutils.c's _bt_checkkeys() function: test = FunctionCall2Coll(&key->sk_func, key->sk_collation, datum, key->sk_argument); if (!DatumGetBool(test)) { /* * Tuple fails this qual. If it's a required qual for the current * scan direction, then we can conclude no further tuples will * pass, either. * * Note: becausewe stop the scan as soon as any required equality * qual fails, it is critical that equality quals be usedfor the * initial positioning in _bt_first() when they are available. See * comments in _bt_first(). */ ***SNIP*** The qual is verified for the index tuple itself (the test return value lets us know if it matches) - the equality operator is actually called, and is actually re-verified via texteq(). So what appears to happen is that the btree code finds equivalent tuples, and then, knowing that all pairs of equal tuples are equivalent (but not necessarily the inverse) checks that it actually has a tuple that's equal/satisfies the qual. Makes sense, and by this standard I'd judge that '=' was actually an equality operator that sometimes took advantage of equivalency purely as an implementation detail, but I'm not familiar enough with that part of the code to have any degree of confidence that I haven't made a leap here that I shouldn't have. ISTM if '=' was really a mere equivalency operator, we'd only every check (a < b && b < a) in the btree code. It occurs to me that we might also have a new equivalency operator (=== or something) so that we actually could answer questions like "show me the tuples with the word Aßlar in some non-unique column - both spellings will do". Maybe that's a bit off the wall, I'm not sure. Simple question: if you were to just remove the strcmp tie-breaker for strcoll() in varstr_cmp(), but not touch anything else, would Postgres exhibit objectively incorrect behaviour? Incidentally, I think it's interesting that the SQL-exposed interface to all of this merely presents the user with a boolean: postgres=# select '4'::text < '4'::text;?column? ----------f (1 row) This is pretty much what C++ does; operator< and friends conventionally return a bool, not an int. This might have something to do with the need to make a break from the qsort() convention of returning an int that may indicate equality. So the formal definition of equivalency there may be that two equivalent values will always be either less than each other or not less than each other - you have no business expecting them to come out of a sort in any particular order relative to each other (except with a stable sort). Which is not the same as equal or "textually distinct but equal", which I think is how I'd classify your IEEE 754 example. We can decree that equivalency implies equality, or make all this internal (which, perversely, I suppose the C++ committee people cannot). So, I may have lost sight of why I starting on about equivalency, which is that it sure would be nice if we could use strxfrm to prepare strings for sorting, which looks to be a fairly significant win. Does anyone have any simpler ideas, assuming this one is no good? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
pgsql-hackers by date: