Thread: machine-dependent hash_any vs the regression tests
So the proposed changes in hash_any make its hash values different
between big-endian and little-endian machines (at least for string keys;
for keys that are really arrays of int, I think the changes will
unify the behavior).  This means that the hash_seq_search traversal
order for an internal hash table changes, and it turns out this breaks
at least two regression tests: portals and dblink.  The portals test
is easy to fix by adding a couple of ORDER BYs, but the problem with
dblink is here:

  SELECT dblink_get_connections();
   dblink_get_connections
  ------------------------
!  {dtest1,dtest2,dtest3}
  (1 row)

  SELECT dblink_is_busy('dtest1');
--- 714,720 ----
  SELECT dblink_get_connections();
   dblink_get_connections
  ------------------------
!  {dtest1,dtest3,dtest2}
  (1 row)

  SELECT dblink_is_busy('dtest1');

and right offhand I can't think of a simple way to force those array
elements into a consistent order.

No doubt that can be worked around, but does anyone wish to argue that
this whole thing is a bad path to be headed down?  We're not going to
gain a *whole* lot of speedup from the word-wide-hashing change, and
so maybe this type of headache isn't worth the trouble.

			regards, tom lane
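[Editorial note: the platform dependence described above can be illustrated outside Postgres.  This is a minimal Python sketch, not the actual hash_any mixing function; it just shows that combining the same string key word-at-a-time gives different results when the words are read little-endian versus big-endian, which is exactly what happens when a word-wide hash reads native words on different machines.]

```python
import struct

def word_hash(data: bytes, word_fmt: str) -> int:
    """Toy word-at-a-time combiner (illustrative only, NOT hash_any).

    word_fmt is '<I' (little-endian words) or '>I' (big-endian words).
    A real word-wide hash reads native machine words, which is where
    the big-endian vs little-endian divergence comes from.
    """
    # Pad to a multiple of 4 bytes so every word unpacks cleanly.
    padded = data + b'\x00' * (-len(data) % 4)
    h = 0
    for (word,) in struct.iter_unpack(word_fmt, padded):
        h = (h * 31 + word) & 0xFFFFFFFF  # toy 32-bit mixing step
    return h

key = b'dtest1'
# The same key hashes differently depending on word byte order:
print(hex(word_hash(key, '<I')), hex(word_hash(key, '>I')))
```

For an int-array key the story reverses: each word of the buffer *is* the integer's native representation, so reading native words recovers the same values on either architecture, which is why the change should unify behavior for those keys.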
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> No doubt that can be worked around, but does anyone wish to argue that
> this whole thing is a bad path to be headed down?  We're not going to
> gain a *whole* lot of speedup from the word-wide-hashing change, and
> so maybe this type of headache isn't worth the trouble.

I have to admit to some hesitation about it.  But as you point out,
regarding arrays of integers, most of the things being hashed are
themselves platform dependent and have different hashes.  So I can't
really come up with any good reason to try to keep hashes consistent
across platforms.

I suppose for strings it means you can't use hashtext() in user-space
code if you ever think you might switch database server architectures.
You're probably better off using crc32 (for which we don't provide a
function :( ) for user code anyways.

Why do we have this hash function anyways?  Is hashany faster than a
decent crc32 implementation?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark <stark@enterprisedb.com> writes:

> Why do we have this hash function anyways?  Is hashany faster than a
> decent crc32 implementation?

Yes, significantly.  Times to hash 32K bytes 100000 times on a Xeon
EM64T:

	hash_crc(32K):      11.388755 s
	hash_any_old(32K):   4.401945 s
	hash_any(32K):       3.862427 s

hash_crc is our src/include/utils/pg_crc.h code, hash_any_old is
current CVS HEAD, hash_any is the word-wide version.  For just 8 bytes
(100M repetitions):

	hash_crc(8 bytes):     2.587647 s
	hash_any_old(8 bytes): 1.581826 s
	hash_any(8 bytes):     1.294480 s

so in both setup and per-byte terms CRC is more expensive.

But the bigger problem is that CRC isn't necessarily designed to have
the properties we need, in particular that all bits of the hash are
about equally random.  It's designed to attack other problems.

			regards, tom lane
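[Editorial note: the timings above reduce to per-byte and per-call rates.  A quick computation using only the numbers quoted in the message makes the gap explicit: roughly 3x per byte on long buffers and roughly 2x per call on 8-byte keys.]

```python
# Timings quoted above: seconds for 100000 reps of 32K bytes,
# and for 100M reps of 8 bytes.
timings = {
    'hash_crc':     {'32K': 11.388755, '8B': 2.587647},
    'hash_any_old': {'32K': 4.401945,  '8B': 1.581826},
    'hash_any':     {'32K': 3.862427,  '8B': 1.294480},
}
BYTES_32K = 32 * 1024 * 100_000   # total bytes hashed in the 32K test
CALLS_8B = 100_000_000            # total calls in the 8-byte test

for name, t in timings.items():
    per_byte_ns = t['32K'] / BYTES_32K * 1e9   # steady-state cost
    per_call_ns = t['8B'] / CALLS_8B * 1e9     # setup-dominated cost
    print(f"{name:13s} {per_byte_ns:.2f} ns/byte   {per_call_ns:.1f} ns/call (8B)")
```

This yields about 3.48 ns/byte for hash_crc versus 1.18 ns/byte for word-wide hash_any, and 25.9 ns versus 12.9 ns per 8-byte call, consistent with "in both setup and per-byte terms CRC is more expensive."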
On Sat, Apr 05, 2008 at 05:57:35PM -0400, Tom Lane wrote:
> So the proposed changes in hash_any make its hash values different
> between big-endian and little-endian machines (at least for string keys;
> for keys that are really arrays of int, I think the changes will
> unify the behavior).  This means that the hash_seq_search traversal
> order for an internal hash table changes, and it turns out this breaks
> at least two regression tests: portals and dblink.  The portals test
> is easy to fix by adding a couple of ORDER BYs, but the problem with
> dblink is here:
>
>   SELECT dblink_get_connections();
>    dblink_get_connections
>   ------------------------
> !  {dtest1,dtest2,dtest3}
>   (1 row)
>
>   SELECT dblink_is_busy('dtest1');
> --- 714,720 ----
>   SELECT dblink_get_connections();
>    dblink_get_connections
>   ------------------------
> !  {dtest1,dtest3,dtest2}
>   (1 row)
>
>   SELECT dblink_is_busy('dtest1');
>
> and right offhand I can't think of a simple way to force those array
> elements into a consistent order.
>
> No doubt that can be worked around, but does anyone wish to argue that
> this whole thing is a bad path to be headed down?  We're not going to
> gain a *whole* lot of speedup from the word-wide-hashing change, and
> so maybe this type of headache isn't worth the trouble.
>
> 			regards, tom lane

It may be just me, but it is a little bit surprising that the order of
a sequential search traversal should matter.  It smacks of the row
ordering being non-deterministic without specifying an "order by".  As
long as all of the values are returned, it would make sense not to
have the regression tests depend on that ordering.  It will make it
easier to evaluate new hash functions if they do not break the
regression tests in such an unintuitive way -- my two cents.

Regards,
Ken Marshall
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

>   SELECT dblink_get_connections();
>    dblink_get_connections
>   ------------------------
> !  {dtest1,dtest3,dtest2}
>   (1 row)
>
>   SELECT dblink_is_busy('dtest1');
>
> and right offhand I can't think of a simple way to force those array
> elements into a consistent order.

You could do something like:

postgres=# select (information_schema._pg_expandarray('{1,5,3}'::int[])).x order by x;
 x
---
 1
 3
 5
(3 rows)

That would be a whole lot less unappetising if the function wasn't an
internal function that someone might want to change without breaking
random regression tests.  It seems likely the day that happens will be
the day that we provide an SQL standard UNNEST anyways though.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's
  PostgreSQL training!
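[Editorial note: the workaround amounts to imposing a total order on the elements before comparing expected output.  The same normalization can be sketched outside SQL; `normalize_array_literal` below is a hypothetical helper for illustration, assuming a simple one-dimensional array literal with unquoted elements, not anything in the patch or in dblink.]

```python
def normalize_array_literal(lit: str) -> str:
    """Sort the elements of a simple one-dimensional Postgres array
    literal such as '{dtest1,dtest3,dtest2}', so a comparison no
    longer depends on hash_seq_search traversal order.
    Hypothetical helper; assumes unquoted, comma-separated elements."""
    items = lit.strip('{}').split(',')
    return '{' + ','.join(sorted(items)) + '}'

# Both traversal orders normalize to the same expected value:
print(normalize_array_literal('{dtest1,dtest3,dtest2}'))  # -> {dtest1,dtest2,dtest3}
print(normalize_array_literal('{dtest1,dtest2,dtest3}'))  # -> {dtest1,dtest2,dtest3}
```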