Thread: store narrow values in hash indexes?

store narrow values in hash indexes?

From

Robert Haas

Date:

23 September 2016, 19:32:54

Currently, hash indexes always store the hash code in the index, but
not the actual Datum.  It's recently been noted that this can make a
hash index smaller than the corresponding btree index would be if the
column is wide.  However, if the index is being built on a fixed-width
column with a typlen <= sizeof(Datum), we could store the original
value in the hash index rather than the hash code without using any
more space.  That would complicate the code, but I bet it would be
faster: we wouldn't need to set xs_recheck, we could rule out hash
collisions without visiting the heap, and we could support index-only
scans in such cases.

Another thought is that hash codes are 32 bits, but a Datum is 64 bits
wide on most current platforms.  So we're wasting 4 bytes per index
tuple storing nothing.  If we generated 64-bit hash codes we could
store as many bits of it as a Datum will hold and reduce hash
collisions.  Alternatively, we could try to stick some other useful
information in those bytes, like an abbreviated abbreviated key.

Not sure if these are good ideas.  They're just ideas.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: store narrow values in hash indexes?

From

Tom Lane

Date:

23 September 2016, 20:16:29

Robert Haas <robertmhaas@gmail.com> writes:
> Another thought is that hash codes are 32 bits, but a Datum is 64 bits
> wide on most current platforms.  So we're wasting 4 bytes per index
> tuple storing nothing.

Datum is not a concept that exists on-disk.  What's stored is the 32-bit
hash value.  You're right that we waste space if the platform's MAXALIGN
is 8, but that's the fault of the alignment requirement not the index
definition.

> If we generated 64-bit hash codes we could
> store as many bits of it as a Datum will hold and reduce hash
> collisions.

I think there is considerable merit in trying to move to 64-bit hash
codes (at least for data types that are more than 4 bytes to begin with),
but that's largely in the hope of reducing hash collisions in very large
indexes, not because we'd avoid wasting alignment pad space.  If we do
support that, I think we would do it regardless of the platform MAXALIGN.
        regards, tom lane

Re: store narrow values in hash indexes?

From

Amit Kapila

Date:

24 September 2016, 05:03:12

On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Currently, hash indexes always store the hash code in the index, but
> not the actual Datum.  It's recently been noted that this can make a
> hash index smaller than the corresponding btree index would be if the
> column is wide.  However, if the index is being built on a fixed-width
> column with a typlen <= sizeof(Datum), we could store the original
> value in the hash index rather than the hash code without using any
> more space.  That would complicate the code, but I bet it would be
> faster: we wouldn't need to set xs_recheck, we could rule out hash
> collisions without visiting the heap, and we could support index-only
> scans in such cases.
>

What exactly you mean by Datum?  Is it for datatypes that fits into 64
bits like integer.  I think if we are able to support index only scans
for hash indexes for some data types, that will be a huge plus.
Surely, there is some benefit without index only scans as well, which
is we can avoid recheck, but not sure if that alone can give us any
big performance boost.  As, you say, it might lead to some
complication in code, but I think it is worth trying.

Won't it add some requirements for pg_upgrade as well?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: store narrow values in hash indexes?

From

Bruce Momjian

Date:

24 September 2016, 14:17:39

On Sat, Sep 24, 2016 at 10:33:01AM +0530, Amit Kapila wrote:
> On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Currently, hash indexes always store the hash code in the index, but
> > not the actual Datum.  It's recently been noted that this can make a
> > hash index smaller than the corresponding btree index would be if the
> > column is wide.  However, if the index is being built on a fixed-width
> > column with a typlen <= sizeof(Datum), we could store the original
> > value in the hash index rather than the hash code without using any
> > more space.  That would complicate the code, but I bet it would be
> > faster: we wouldn't need to set xs_recheck, we could rule out hash
> > collisions without visiting the heap, and we could support index-only
> > scans in such cases.
> >
> 
> What exactly you mean by Datum?  Is it for datatypes that fits into 64
> bits like integer.  I think if we are able to support index only scans
> for hash indexes for some data types, that will be a huge plus.
> Surely, there is some benefit without index only scans as well, which
> is we can avoid recheck, but not sure if that alone can give us any
> big performance boost.  As, you say, it might lead to some
> complication in code, but I think it is worth trying.
> 
> Won't it add some requirements for pg_upgrade as well?

Yes, pg_upgrade will mark the indexes as invalid and supply a script to
reindex them.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +

Re: store narrow values in hash indexes?

From

Robert Haas

Date:

26 September 2016, 01:52:03

On Sat, Sep 24, 2016 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Currently, hash indexes always store the hash code in the index, but
>> not the actual Datum.  It's recently been noted that this can make a
>> hash index smaller than the corresponding btree index would be if the
>> column is wide.  However, if the index is being built on a fixed-width
>> column with a typlen <= sizeof(Datum), we could store the original
>> value in the hash index rather than the hash code without using any
>> more space.  That would complicate the code, but I bet it would be
>> faster: we wouldn't need to set xs_recheck, we could rule out hash
>> collisions without visiting the heap, and we could support index-only
>> scans in such cases.
>
> What exactly you mean by Datum?  Is it for datatypes that fits into 64
> bits like integer.

Yeah, I mean whatever is small enough to fit into the space currently
being used to store the hashcode, along with any accompanying padding
bytes that we can also use.

> I think if we are able to support index only scans
> for hash indexes for some data types, that will be a huge plus.
> Surely, there is some benefit without index only scans as well, which
> is we can avoid recheck, but not sure if that alone can give us any
> big performance boost.  As, you say, it might lead to some
> complication in code, but I think it is worth trying.

Yeah, the recheck is probably not that expensive if we have to
retrieve the heap page anyway.

> Won't it add some requirements for pg_upgrade as well?

I have nothing to add to what Bruce already said.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company