Home > mailing lists

Re: Hash Indexes - Mailing list pgsql-hackers

From	Jesper Pedersen
Subject	Re: Hash Indexes
Date	September 14, 2016 19:13:15
Msg-id	a49b17a9-bf5c-822b-4a89-be7d9fdca35c@redhat.com Whole thread Raw
In response to	Re: Hash Indexes (Amit Kapila <amit.kapila16@gmail.com>)
Responses	Re: Hash Indexes
List	pgsql-hackers

Tree view

Hi,

On 09/14/2016 07:24 AM, Amit Kapila wrote:
> On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> On 09/13/2016 07:26 AM, Amit Kapila wrote:
>>>
>>> Attached, new version of patch which contains the fix for problem
>>> reported on write-ahead-log of hash index thread [1].
>>>
>>
>> I have been testing patch in various scenarios, and it has a positive
>> performance impact in some cases.
>>
>> This is especially seen in cases where the values of the indexed column are
>> unique - SELECTs can see a 40-60% benefit over a similar query using b-tree.
>>
>
> Here, I think it is better if we have the data comparing the situation
> of hash index with respect to HEAD as well.  What I mean to say is
> that you are claiming that after the hash index improvements SELECT
> workload is 40-60% better, but where do we stand as of HEAD?
>

The tests I have done are with a copy of a production database using the
same queries sent with a b-tree index for the primary key, and the same
with a hash index. Those are seeing a speed-up of the mentioned 40-60%
in execution time - some involve JOINs.

Largest of those tables is 390Mb with a CHAR() based primary key.

>> UPDATE also sees an improvement.
>>
>
> Can you explain this more?  Is it more compare to HEAD or more as
> compare to Btree?  Isn't this contradictory to what the test in below
> mail shows?
>

Same thing here - where the fields involving the hash index aren't updated.

>> In cases where the indexed column value isn't unique, it takes a long time
>> to build the index due to the overflow page creation.
>>
>> Also in cases where the index column is updated with a high number of
>> clients, ala
>>
>> -- ddl.sql --
>> CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
>> CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
>> CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
>> ANALYZE;
>>
>> -- test.sql --
>> \set id random(1,10)
>> \set val random(0,10)
>> BEGIN;
>> UPDATE test SET val = :val WHERE id = :id;
>> COMMIT;
>>
>> w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for
>> hash, and 10062 tps for b-tree).
>>
>
> Thanks for doing the tests.  Have you applied both concurrent index
> and cache the meta page patch for these tests?  So from above tests,
> we can say that after these set of patches read-only workloads will be
> significantly improved even better than btree in quite-a-few useful
> cases.

Agreed.

>  However when the indexed column is updated, there is still a
> large gap as compare to btree (what about the case when the indexed
> column is not updated in read-write transaction as in our pgbench
> read-write transactions, by any chance did you ran any such test?).

I have done a run to look at the concurrency / TPS aspect of the
implementation - to try something different than Mark's work on testing
the pgbench setup.

With definitions as above, with SELECT as

-- select.sql --
\set id random(1,10)
BEGIN;
SELECT * FROM test WHERE id = :id;
COMMIT;

and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed
w/o one.

[1] [2] [3] is new_hash - old_hash is the existing hash implementation
on master. btree is master too.

Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal.
Clients ran with -M prepared.

[1]
https://www.postgresql.org/message-id/CAA4eK1+ERbP+7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA@mail.gmail.com
[2]
https://www.postgresql.org/message-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1Wp5VyLvkf8Q@mail.gmail.com
[3]
https://www.postgresql.org/message-id/CAA4eK1JUYr_aB7BxFnSg5+JQhiwgkLKgAcFK9bfD4MLfFK6Oqw@mail.gmail.com

Don't know if you find this useful due to the small number of rows, but
let me know if there are other tests I can run, f.ex. bump the number of
rows.

> I
> think we need to focus on improving cases where index columns are
> updated, but it is better to do that work as a separate patch.
>

Ok.

Best regards,
  Jesper

Attachment

pgsql-hackers by date:

From: Andres Freund
Date: 14 September 2016, 18:50:20
Subject: Re: Logical Replication WIP

From: Petr Jelinek
Date: 14 September 2016, 19:18:11
Subject: Re: Logical Replication WIP

Re: Hash Indexes - Mailing list pgsql-hackers

Attachment

Previous

Next