Thread: machine-dependent hash_any vs the regression tests
So the proposed changes in hash_any make its hash values different
between big-endian and little-endian machines (at least for string keys;
for keys that are really arrays of int, I think the changes will
unify the behavior).  This means that the hash_seq_search traversal
order for an internal hash table changes, and it turns out this breaks
at least two regression tests: portals and dblink.  The portals test
is easy to fix by adding a couple of ORDER BYs, but the problem with
dblink is here:

  SELECT dblink_get_connections();
   dblink_get_connections
  ------------------------
!  {dtest1,dtest2,dtest3}
  (1 row)

  SELECT dblink_is_busy('dtest1');
--- 714,720 ----
  SELECT dblink_get_connections();
   dblink_get_connections
  ------------------------
!  {dtest1,dtest3,dtest2}
  (1 row)

  SELECT dblink_is_busy('dtest1');

and right offhand I can't think of a simple way to force those array
elements into a consistent order.

No doubt that can be worked around, but does anyone wish to argue that
this whole thing is a bad path to be headed down?  We're not going to
gain a *whole* lot of speedup from the word-wide-hashing change, and
so maybe this type of headache isn't worth the trouble.

			regards, tom lane
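[Editorial note: the platform dependence described above can be illustrated outside Postgres.  This is a minimal Python sketch, not the actual hash_any mixing function; it just shows that combining the same string key word-at-a-time gives different results when the words are read little-endian versus big-endian, which is exactly what happens when a word-wide hash reads native words on different machines.]

```python
import struct

def word_hash(data: bytes, word_fmt: str) -> int:
    """Toy word-at-a-time combiner (illustrative only, NOT hash_any).

    word_fmt is '<I' (little-endian words) or '>I' (big-endian words).
    A real word-wide hash reads native machine words, which is where
    the big-endian vs little-endian divergence comes from.
    """
    # Pad to a multiple of 4 bytes so every word unpacks cleanly.
    padded = data + b'\x00' * (-len(data) % 4)
    h = 0
    for (word,) in struct.iter_unpack(word_fmt, padded):
        h = (h * 31 + word) & 0xFFFFFFFF  # toy 32-bit mixing step
    return h

key = b'dtest1'
# The same key hashes differently depending on word byte order:
print(hex(word_hash(key, '<I')), hex(word_hash(key, '>I')))
```

For an int-array key the story reverses: each word of the buffer *is* the integer's native representation, so reading native words recovers the same values on either architecture, which is why the change should unify behavior for those keys.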
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> No doubt that can be worked around, but does anyone wish to argue that
> this whole thing is a bad path to be headed down?  We're not going to
> gain a *whole* lot of speedup from the word-wide-hashing change, and
> so maybe this type of headache isn't worth the trouble.

I have to admit to some hesitation about it.  But as you point out,
regarding arrays of integers, most of the things being hashed are
themselves platform dependent and have different hashes.  So I can't
really come up with any good reason to try to keep hashes consistent
across platforms.

I suppose for strings it means you can't use hashtext() in user-space
code if you ever think you might switch database server architectures.
You're probably better off using crc32 (for which we don't provide a
function :( ) for user code anyways.

Why do we have this hash function anyways?  Is hashany faster than a
decent crc32 implementation?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark <stark@enterprisedb.com> writes:

> Why do we have this hash function anyways?  Is hashany faster than a
> decent crc32 implementation?

Yes, significantly.  Times to hash 32K bytes 100000 times on a Xeon
EM64T:

	hash_crc(32K):      11.388755 s
	hash_any_old(32K):   4.401945 s
	hash_any(32K):       3.862427 s

hash_crc is our src/include/utils/pg_crc.h code, hash_any_old is
current CVS HEAD, hash_any is the word-wide version.  For just 8 bytes
(100M repetitions):

	hash_crc(8 bytes):     2.587647 s
	hash_any_old(8 bytes): 1.581826 s
	hash_any(8 bytes):     1.294480 s

so in both setup and per-byte terms CRC is more expensive.

But the bigger problem is that CRC isn't necessarily designed to have
the properties we need, in particular that all bits of the hash are
about equally random.  It's designed to attack other problems.

			regards, tom lane
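[Editorial note: the timings above reduce to per-byte and per-call rates.  A quick computation using only the numbers quoted in the message makes the gap explicit: roughly 3x per byte on long buffers and roughly 2x per call on 8-byte keys.]

```python
# Timings quoted above: seconds for 100000 reps of 32K bytes,
# and for 100M reps of 8 bytes.
timings = {
    'hash_crc':     {'32K': 11.388755, '8B': 2.587647},
    'hash_any_old': {'32K': 4.401945,  '8B': 1.581826},
    'hash_any':     {'32K': 3.862427,  '8B': 1.294480},
}
BYTES_32K = 32 * 1024 * 100_000   # total bytes hashed in the 32K test
CALLS_8B = 100_000_000            # total calls in the 8-byte test

for name, t in timings.items():
    per_byte_ns = t['32K'] / BYTES_32K * 1e9   # steady-state cost
    per_call_ns = t['8B'] / CALLS_8B * 1e9     # setup-dominated cost
    print(f"{name:13s} {per_byte_ns:.2f} ns/byte   {per_call_ns:.1f} ns/call (8B)")
```

This yields about 3.48 ns/byte for hash_crc versus 1.18 ns/byte for word-wide hash_any, and 25.9 ns versus 12.9 ns per 8-byte call, consistent with "in both setup and per-byte terms CRC is more expensive."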
On Sat, Apr 05, 2008 at 05:57:35PM -0400, Tom Lane wrote:
> So the proposed changes in hash_any make its hash values different
> between big-endian and little-endian machines (at least for string keys;
> for keys that are really arrays of int, I think the changes will
> unify the behavior).  This means that the hash_seq_search traversal
> order for an internal hash table changes, and it turns out this breaks
> at least two regression tests: portals and dblink.  The portals test
> is easy to fix by adding a couple of ORDER BYs, but the problem with
> dblink is here:
>
>   SELECT dblink_get_connections();
>    dblink_get_connections
>   ------------------------
> !  {dtest1,dtest2,dtest3}
>   (1 row)
>
>   SELECT dblink_is_busy('dtest1');
> --- 714,720 ----
>   SELECT dblink_get_connections();
>    dblink_get_connections
>   ------------------------
> !  {dtest1,dtest3,dtest2}
>   (1 row)
>
>   SELECT dblink_is_busy('dtest1');
>
> and right offhand I can't think of a simple way to force those array
> elements into a consistent order.
>
> No doubt that can be worked around, but does anyone wish to argue that
> this whole thing is a bad path to be headed down?  We're not going to
> gain a *whole* lot of speedup from the word-wide-hashing change, and
> so maybe this type of headache isn't worth the trouble.
>
> 			regards, tom lane

It may be just me, but it is a little bit surprising that the order of
a sequential search traversal should matter.  It smacks of the row
ordering being non-deterministic without specifying an "order by".  As
long as all of the values are returned, it would make sense not to
have the regression tests depend on that ordering.  It will make it
easier to evaluate new hash functions if they do not break the
regression tests in such an unintuitive way -- my two cents.

Regards,
Ken Marshall
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

>   SELECT dblink_get_connections();
>    dblink_get_connections
>   ------------------------
> !  {dtest1,dtest3,dtest2}
>   (1 row)
>
>   SELECT dblink_is_busy('dtest1');
>
> and right offhand I can't think of a simple way to force those array
> elements into a consistent order.

You could do something like:

postgres=# select (information_schema._pg_expandarray('{1,5,3}'::int[])).x order by x;
 x
---
 1
 3
 5
(3 rows)

That would be a whole lot less unappetising if the function wasn't an
internal function that someone might want to change without breaking
random regression tests.  It seems likely the day that happens will be
the day that we provide an SQL standard UNNEST anyways though.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's
  PostgreSQL training!
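[Editorial note: the workaround amounts to imposing a total order on the elements before comparing expected output.  The same normalization can be sketched outside SQL; `normalize_array_literal` below is a hypothetical helper for illustration, assuming a simple one-dimensional array literal with unquoted elements, not anything in the patch or in dblink.]

```python
def normalize_array_literal(lit: str) -> str:
    """Sort the elements of a simple one-dimensional Postgres array
    literal such as '{dtest1,dtest3,dtest2}', so a comparison no
    longer depends on hash_seq_search traversal order.
    Hypothetical helper; assumes unquoted, comma-separated elements."""
    items = lit.strip('{}').split(',')
    return '{' + ','.join(sorted(items)) + '}'

# Both traversal orders normalize to the same expected value:
print(normalize_array_literal('{dtest1,dtest3,dtest2}'))  # -> {dtest1,dtest2,dtest3}
print(normalize_array_literal('{dtest1,dtest2,dtest3}'))  # -> {dtest1,dtest2,dtest3}
```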