Thread: Saner interval hash function

Saner interval hash function

From

Tom Lane

Date:

03 April 2009, 18:46:09

The present implementation of interval_hash() is very carefully designed
and coded ... to meet the wrong specification :-(.  What it should
be doing is producing equal hashcodes for values that interval_eq()
considers equal.  The error is exhibited in the recent bug report #4748.

Attached is a prototype fix for this, which I have verified fixes #4748.
The only problem with applying it is that it changes the contents of
hash indexes on interval columns.  This is no problem for HEAD, seeing
that we've already whacked around hash indexes in far more fundamental
ways than that since 8.3.  However, since interval hashing is broken
in this way in every active branch, we really ought to back-patch.

I don't think there's a whole lot of choice in the matter.  We have to
patch this, and put in the next release notes "if you have any hash
indexes on interval columns, REINDEX them after updating".  Does anyone
see it differently, or have some brilliant idea for another solution?

            regards, tom lane

Index: timestamp.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/timestamp.c,v
retrieving revision 1.197
diff -c -r1.197 timestamp.c
*** timestamp.c    15 Mar 2009 20:31:19 -0000    1.197
--- timestamp.c    3 Apr 2009 21:34:51 -0000
***************
*** 2047,2052 ****
--- 2047,2053 ----
      TimeOffset    span1,
                  span2;

+     /* If you change this, see also interval_hash() */
      span1 = interval1->time;
      span2 = interval2->time;

***************
*** 2131,2158 ****
  Datum
  interval_hash(PG_FUNCTION_ARGS)
  {
!     Interval   *key = PG_GETARG_INTERVAL_P(0);
      uint32        thash;
-     uint32        mhash;

      /*
!      * To avoid any problems with padding bytes in the struct, we figure the
!      * field hashes separately and XOR them.  This also provides a convenient
!      * framework for dealing with the fact that the time field might be either
!      * double or int64.
       */
  #ifdef HAVE_INT64_TIMESTAMP
      thash = DatumGetUInt32(DirectFunctionCall1(hashint8,
!                                                Int64GetDatumFast(key->time)));
  #else
      thash = DatumGetUInt32(DirectFunctionCall1(hashfloat8,
!                                              Float8GetDatumFast(key->time)));
  #endif
!     thash ^= DatumGetUInt32(hash_uint32(key->day));
!     /* Shift so "k days" and "k months" don't hash to the same thing */
!     mhash = DatumGetUInt32(hash_uint32(key->month));
!     thash ^= mhash << 24;
!     thash ^= mhash >> 8;
      PG_RETURN_UINT32(thash);
  }

--- 2132,2162 ----
  Datum
  interval_hash(PG_FUNCTION_ARGS)
  {
!     Interval   *interval = PG_GETARG_INTERVAL_P(0);
!     TimeOffset    span;
      uint32        thash;

      /*
!      * We must produce equal hashvals for values that interval_cmp_internal()
!      * considers equal.  So, compute the net span the same way it does,
!      * and then hash that, using either int64 or float8 hashing.
       */
+     span = interval->time;
+
  #ifdef HAVE_INT64_TIMESTAMP
+     span += interval->month * INT64CONST(30) * USECS_PER_DAY;
+     span += interval->day * INT64CONST(24) * USECS_PER_HOUR;
+
      thash = DatumGetUInt32(DirectFunctionCall1(hashint8,
!                                                Int64GetDatumFast(span)));
  #else
+     span += interval->month * ((double) DAYS_PER_MONTH * SECS_PER_DAY);
+     span += interval->day * ((double) HOURS_PER_DAY * SECS_PER_HOUR);
+
      thash = DatumGetUInt32(DirectFunctionCall1(hashfloat8,
!                                                Float8GetDatumFast(span)));
  #endif
!
      PG_RETURN_UINT32(thash);
  }

Re: Saner interval hash function

From

Jaime Casanova

Date:

03 April 2009, 19:41:18

On Fri, Apr 3, 2009 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I don't think there's a whole lot of choice in the matter.  We have to
> patch this, and put in the next release notes "if you have any hash
> indexes on interval columns, REINDEX them after updating".  Does anyone
> see it differently, or have some brilliant idea for another solution?
>

no better idea... but i don't think is really an issue: in all active
branches hash indexes are not recommended (at least the docs says
there is no evidence that they will perform better than a btree and
establish that they are not crash-safe) so, if really there are some
in use and in an interval column (a very low combination i think) they
should be used to execute REINDEX anyway

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

Re: Saner interval hash function

From

Greg Stark

Date:

04 April 2009, 11:12:59

On Fri, Apr 3, 2009 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The present implementation of interval_hash() is very carefully designed
> and coded ... to meet the wrong specification :-(.  What it should
> be doing is producing equal hashcodes for values that interval_eq()
> considers equal.  The error is exhibited in the recent bug report #4748.

It would be nice if we had a way to generate a lot of similar values
for a every data type. Then we could have a regression test which
checks for each data type that the hash function matches the equality
operator -- and for that matter that the various inequality operators
are also consistent.

I'm not sure how to generate values though. For a lot of data types it
would be hard to generate values densely enough to trigger any bugs.

--
greg

Re: Saner interval hash function

From

Tom Lane

Date:

04 April 2009, 12:16:35

Greg Stark <stark@enterprisedb.com> writes:
> On Fri, Apr 3, 2009 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The present implementation of interval_hash() is very carefully designed
>> and coded ... to meet the wrong specification :-(. What it should
>> be doing is producing equal hashcodes for values that interval_eq()
>> considers equal. The error is exhibited in the recent bug report #4748.

> It would be nice if we had a way to generate a lot of similar values
> for a every data type. Then we could have a regression test which
> checks for each data type that the hash function matches the equality
> operator -- and for that matter that the various inequality operators
> are also consistent.

> I'm not sure how to generate values though. For a lot of data types it
> would be hard to generate values densely enough to trigger any bugs.

Yeah.  I did add a regression test for the specific case of '30 days'
vs '1 month', which we know is a pain point for this particular data
type.  Generating values at random doesn't seem like it's really likely
to teach us much though.
        regards, tom lane

Re: Saner interval hash function

From

Greg Stark

Date:

04 April 2009, 12:43:06

On Sat, Apr 4, 2009 at 4:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah.  I did add a regression test for the specific case of '30 days'
> vs '1 month', which we know is a pain point for this particular data
> type.  Generating values at random doesn't seem like it's really likely
> to teach us much though.

Yeah, I was going to suggest random values but came to the same conclusion.

I think each datatype would have to provide some list of "interesting"
values. But given that list it would be possible to do the same
regression tests for every data type.

I suppose it would require a plpgsql function using PERFORM to
implement though. ick.

--
greg