Re: Change GUC hashtable to use simplehash? - Mailing list pgsql-hackers

From:           Andres Freund
Subject:        Re: Change GUC hashtable to use simplehash?
Date:           2023-11-22
Msg-id:         20231122223432.lywt4yz2bn7tlp27@awork3.anarazel.de
In response to: Re: Change GUC hashtable to use simplehash? (Tom Lane <tgl@sss.pgh.pa.us>)
List:           pgsql-hackers
Hi,

On 2023-11-22 16:27:56 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-11-22 15:56:21 -0500, Tom Lane wrote:
> >> GUC names are just about always short, though, so I'm not sure you've
> >> made your point?
>
> > With short I meant <= 6 characters (32 / 5 = 6.x). After that you're
> > overwriting bits that you previously set, without dispersing the
> > "overwritten" bits throughout the hash state.
>
> I'm less than convinced about the "overwrite" part:
>
> +        /* Merge into hash ... not very bright, but it needn't be */
> +        result = pg_rotate_left32(result, 5);
> +        result ^= (uint32) ch;
>
> Rotating a 32-bit value 5 bits at a time doesn't result in successive
> characters lining up exactly, and even once they do, XOR is not
> "overwrite".

I didn't know what word to use, hence the air quotes. Yes, xor doesn't
just overwrite with the bits on the right-hand side, but it affects the
data only on a per-bit basis, which can easily be cancelled out.

My understanding of writing hash functions is that every input bit mixed
in should have a ~50% chance of causing each output bit to flip. The
proposed function obviously doesn't get there. It's worth noting that the
limited range of the input values means that there's a lot of bias toward
some bits being set ('a' to 'z' all start with 0b011).

> I'm pretty dubious that we need something better than this.

Well, we know that the current attempt at a dedicated hash function for
this does result in substantial amounts of conflicts. And such cases are
hard to understand when you hit them, so I think it's better to avoid
exposing ourselves to such dangers without a distinct need. And I don't
really see the need here to take the risk, even if we are somewhat
confident it's fine.

If, which I mildly doubt, we can't afford to call murmurhash32 for every
character, we could just call it once for every 32/5 input characters.
Or we could load up to 8 characters into a 64-bit integer and call
murmurhash64.
Something roughly like:

  uint64 result = 0;

  while (*name)
  {
      uint64 value = 0;

      for (int i = 0; i < 8 && *name; i++)
      {
          char ch = *name++;

          value = (value << 8) | (uint8) ch;
      }

      result = hash_combine64(result, murmurhash64(value));
  }

The hash_combine64 use isn't quite right either; we should carry the full
accumulator state of a proper hash function between chunks, but that
seems very unlikely to matter here.

The fact that string_hash() is slow due to the strlen(), which causes us
to process the input twice and which is optimized to also handle very
long strings that string_hash() typically doesn't encounter, seems
problematic far beyond this case. We use string_hash() in a *lot* of
places, and that strlen() does regularly show up in profiles. We should
fix that.

The various hash functions being external functions shows up in a bunch
of profiles too. It's particularly ridiculous for cases like tag_hash(),
where the caller typically knows the length, but calls a routine in a
different translation unit, which obviously can't be optimized for a
specific length.

I think we ought to adjust our APIs around this:

1) The accumulator state of the hash functions should be exposed, so one
   can accumulate values into the hash state, without reducing the
   internal state to a single 32/64 bit variable.

2) For callers that know the length of the data, we should provide a
   static inline hash function, rather than an external function call.
   This should include special-cased inline functions for adding 32/64
   bits of data to the hash state.

   Perhaps with a bit of logic to *not* use the inline version if the
   hashed input is long (and thus the call overhead doesn't matter).
   Something like

     if (__builtin_constant_p(len) && len < 128)
         /* call inline implementation */
     else
         /* call out-of-line implementation, not worth the code bloat */

We know that hash functions should have the split into
init/process-data*/finish steps, as e.g. evidenced by pg_crc.h /
pg_crc32c.h.
With something like that, you could write a function that lowercases
characters inline without incurring unnecessary overhead:

  hash32_state hs;

  hash32_init(&hs);

  while (*name)
  {
      char ch = *name++;

      /* crappy lowercasing, good enough for this situation */
      ch |= 0x20;

      hash32_process_byte(&hs, ch);
  }

  return hash32_finish(&hs);

Perhaps with some additional optimization for processing the input string
in 32/64 bit quantities.

Greetings,

Andres Freund