Thread: store narrow values in hash indexes?
Currently, hash indexes always store the hash code in the index, but not the actual Datum. It's recently been noted that this can make a hash index smaller than the corresponding btree index would be if the column is wide. However, if the index is being built on a fixed-width column with a typlen <= sizeof(Datum), we could store the original value in the hash index rather than the hash code without using any more space. That would complicate the code, but I bet it would be faster: we wouldn't need to set xs_recheck, we could rule out hash collisions without visiting the heap, and we could support index-only scans in such cases. Another thought is that hash codes are 32 bits, but a Datum is 64 bits wide on most current platforms. So we're wasting 4 bytes per index tuple storing nothing. If we generated 64-bit hash codes we could store as many bits of it as a Datum will hold and reduce hash collisions. Alternatively, we could try to stick some other useful information in those bytes, like an abbreviated abbreviated key. Not sure if these are good ideas. They're just ideas. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Another thought is that hash codes are 32 bits, but a Datum is 64 bits > wide on most current platforms. So we're wasting 4 bytes per index > tuple storing nothing. Datum is not a concept that exists on-disk. What's stored is the 32-bit hash value. You're right that we waste space if the platform's MAXALIGN is 8, but that's the fault of the alignment requirement not the index definition. > If we generated 64-bit hash codes we could > store as many bits of it as a Datum will hold and reduce hash > collisions. I think there is considerable merit in trying to move to 64-bit hash codes (at least for data types that are more than 4 bytes to begin with), but that's largely in the hope of reducing hash collisions in very large indexes, not because we'd avoid wasting alignment pad space. If we do support that, I think we would do it regardless of the platform MAXALIGN. regards, tom lane
On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Currently, hash indexes always store the hash code in the index, but > not the actual Datum. It's recently been noted that this can make a > hash index smaller than the corresponding btree index would be if the > column is wide. However, if the index is being built on a fixed-width > column with a typlen <= sizeof(Datum), we could store the original > value in the hash index rather than the hash code without using any > more space. That would complicate the code, but I bet it would be > faster: we wouldn't need to set xs_recheck, we could rule out hash > collisions without visiting the heap, and we could support index-only > scans in such cases. > What exactly you mean by Datum? Is it for datatypes that fits into 64 bits like integer. I think if we are able to support index only scans for hash indexes for some data types, that will be a huge plus. Surely, there is some benefit without index only scans as well, which is we can avoid recheck, but not sure if that alone can give us any big performance boost. As, you say, it might lead to some complication in code, but I think it is worth trying. Won't it add some requirements for pg_upgrade as well? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Sep 24, 2016 at 10:33:01AM +0530, Amit Kapila wrote: > On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > Currently, hash indexes always store the hash code in the index, but > > not the actual Datum. It's recently been noted that this can make a > > hash index smaller than the corresponding btree index would be if the > > column is wide. However, if the index is being built on a fixed-width > > column with a typlen <= sizeof(Datum), we could store the original > > value in the hash index rather than the hash code without using any > > more space. That would complicate the code, but I bet it would be > > faster: we wouldn't need to set xs_recheck, we could rule out hash > > collisions without visiting the heap, and we could support index-only > > scans in such cases. > > > > What exactly you mean by Datum? Is it for datatypes that fits into 64 > bits like integer. I think if we are able to support index only scans > for hash indexes for some data types, that will be a huge plus. > Surely, there is some benefit without index only scans as well, which > is we can avoid recheck, but not sure if that alone can give us any > big performance boost. As, you say, it might lead to some > complication in code, but I think it is worth trying. > > Won't it add some requirements for pg_upgrade as well? Yes, pg_upgrade will mark the indexes as invalid and supply a script to reindex them. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Sat, Sep 24, 2016 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Currently, hash indexes always store the hash code in the index, but >> not the actual Datum. It's recently been noted that this can make a >> hash index smaller than the corresponding btree index would be if the >> column is wide. However, if the index is being built on a fixed-width >> column with a typlen <= sizeof(Datum), we could store the original >> value in the hash index rather than the hash code without using any >> more space. That would complicate the code, but I bet it would be >> faster: we wouldn't need to set xs_recheck, we could rule out hash >> collisions without visiting the heap, and we could support index-only >> scans in such cases. > > What exactly you mean by Datum? Is it for datatypes that fits into 64 > bits like integer. Yeah, I mean whatever is small enough to fit into the space currently being used to store the hashcode, along with any accompanying padding bytes that we can also use. > I think if we are able to support index only scans > for hash indexes for some data types, that will be a huge plus. > Surely, there is some benefit without index only scans as well, which > is we can avoid recheck, but not sure if that alone can give us any > big performance boost. As, you say, it might lead to some > complication in code, but I think it is worth trying. Yeah, the recheck is probably not that expensive if we have to retrieve the heap page anyway. > Won't it add some requirements for pg_upgrade as well? I have nothing to add to what Bruce already said. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company