Thread: Saner interval hash function
The present implementation of interval_hash() is very carefully designed and coded ... to meet the wrong specification :-(. What it should be doing is producing equal hashcodes for values that interval_eq() considers equal. The error is exhibited in the recent bug report #4748. Attached is a prototype fix for this, which I have verified fixes #4748. The only problem with applying it is that it changes the contents of hash indexes on interval columns. This is no problem for HEAD, seeing that we've already whacked around hash indexes in far more fundamental ways than that since 8.3. However, since interval hashing is broken in this way in every active branch, we really ought to back-patch. I don't think there's a whole lot of choice in the matter. We have to patch this, and put in the next release notes "if you have any hash indexes on interval columns, REINDEX them after updating". Does anyone see it differently, or have some brilliant idea for another solution? regards, tom lane Index: timestamp.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/timestamp.c,v retrieving revision 1.197 diff -c -r1.197 timestamp.c *** timestamp.c 15 Mar 2009 20:31:19 -0000 1.197 --- timestamp.c 3 Apr 2009 21:34:51 -0000 *************** *** 2047,2052 **** --- 2047,2053 ---- TimeOffset span1, span2; + /* If you change this, see also interval_hash() */ span1 = interval1->time; span2 = interval2->time; *************** *** 2131,2158 **** Datum interval_hash(PG_FUNCTION_ARGS) { ! Interval *key = PG_GETARG_INTERVAL_P(0); uint32 thash; - uint32 mhash; /* ! * To avoid any problems with padding bytes in the struct, we figure the ! * field hashes separately and XOR them. This also provides a convenient ! * framework for dealing with the fact that the time field might be either ! * double or int64. */ #ifdef HAVE_INT64_TIMESTAMP thash = DatumGetUInt32(DirectFunctionCall1(hashint8, ! Int64GetDatumFast(key->time))); #else thash = DatumGetUInt32(DirectFunctionCall1(hashfloat8, ! Float8GetDatumFast(key->time))); #endif ! thash ^= DatumGetUInt32(hash_uint32(key->day)); ! /* Shift so "k days" and "k months" don't hash to the same thing */ ! mhash = DatumGetUInt32(hash_uint32(key->month)); ! thash ^= mhash << 24; ! thash ^= mhash >> 8; PG_RETURN_UINT32(thash); } --- 2132,2162 ---- Datum interval_hash(PG_FUNCTION_ARGS) { ! Interval *interval = PG_GETARG_INTERVAL_P(0); ! TimeOffset span; uint32 thash; /* ! * We must produce equal hashvals for values that interval_cmp_internal() ! * considers equal. So, compute the net span the same way it does, ! * and then hash that, using either int64 or float8 hashing. */ + span = interval->time; + #ifdef HAVE_INT64_TIMESTAMP + span += interval->month * INT64CONST(30) * USECS_PER_DAY; + span += interval->day * INT64CONST(24) * USECS_PER_HOUR; + thash = DatumGetUInt32(DirectFunctionCall1(hashint8, ! Int64GetDatumFast(span))); #else + span += interval->month * ((double) DAYS_PER_MONTH * SECS_PER_DAY); + span += interval->day * ((double) HOURS_PER_DAY * SECS_PER_HOUR); + thash = DatumGetUInt32(DirectFunctionCall1(hashfloat8, ! Float8GetDatumFast(span))); #endif ! PG_RETURN_UINT32(thash); }
On Fri, Apr 3, 2009 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I don't think there's a whole lot of choice in the matter. We have to > patch this, and put in the next release notes "if you have any hash > indexes on interval columns, REINDEX them after updating". Does anyone > see it differently, or have some brilliant idea for another solution? > no better idea... but i don't think is really an issue: in all active branches hash indexes are not recommended (at least the docs says there is no evidence that they will perform better than a btree and establish that they are not crash-safe) so, if really there are some in use and in an interval column (a very low combination i think) they should be used to execute REINDEX anyway -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Fri, Apr 3, 2009 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The present implementation of interval_hash() is very carefully designed > and coded ... to meet the wrong specification :-(. What it should > be doing is producing equal hashcodes for values that interval_eq() > considers equal. The error is exhibited in the recent bug report #4748. It would be nice if we had a way to generate a lot of similar values for a every data type. Then we could have a regression test which checks for each data type that the hash function matches the equality operator -- and for that matter that the various inequality operators are also consistent. I'm not sure how to generate values though. For a lot of data types it would be hard to generate values densely enough to trigger any bugs. -- greg
Greg Stark <stark@enterprisedb.com> writes: > On Fri, Apr 3, 2009 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The present implementation of interval_hash() is very carefully designed >> and coded ... to meet the wrong specification :-(. What it should >> be doing is producing equal hashcodes for values that interval_eq() >> considers equal. The error is exhibited in the recent bug report #4748. > It would be nice if we had a way to generate a lot of similar values > for a every data type. Then we could have a regression test which > checks for each data type that the hash function matches the equality > operator -- and for that matter that the various inequality operators > are also consistent. > I'm not sure how to generate values though. For a lot of data types it > would be hard to generate values densely enough to trigger any bugs. Yeah. I did add a regression test for the specific case of '30 days' vs '1 month', which we know is a pain point for this particular data type. Generating values at random doesn't seem like it's really likely to teach us much though. regards, tom lane
On Sat, Apr 4, 2009 at 4:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah. I did add a regression test for the specific case of '30 days' > vs '1 month', which we know is a pain point for this particular data > type. Generating values at random doesn't seem like it's really likely > to teach us much though. Yeah, I was going to suggest random values but came to the same conclusion. I think each datatype would have to provide some list of "interesting" values. But given that list it would be possible to do the same regression tests for every data type. I suppose it would require a plpgsql function using PERFORM to implement though. ick. -- greg