Thread: UUIDs & Clustered Indexes
I'm trying to decide on which UUID generator to use for my Postgres database, and I've narrowed it down to gen_random & uuid_generate_v1mc.

There's a fascinating article that discusses performance implications between gen_random_uuid & uuid_generate_v1mc:
https://www.starkandwayne.com/blog/uuid-primary-keys-in-postgresql/

TL;DR, the article suggests: "Random produces very fragmented inserts that destroy tables. Use uuid_generate_v1mc() [instead].."

However, according to a message on this mailing list, Postgres doesn't have clustered indexes:
"But Postgres doesn't _have_ clustered indexes, so that article doesn't apply at all. The other authors appear to have missed this important point."
https://www.postgresql.org/message-id/56798352.7060902%40uchicago.edu

But, doing a quick check, it appears Postgres does indeed have a mechanism for a clustered index:
https://www.postgresql.org/docs/9.5/static/sql-cluster.html

So, does anyone know which is best? Or are the performance differences so minute they'd only matter in extremely unique circumstances?

Regards,
Luke Gordon

PS I attempted to reply to the above thread, but downloading the "raw" link resulted in a basic auth challenge. If I entered my mailing list credentials, I'd just get rechallenged with basic auth...
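For concreteness, the two generators under discussion come from different contrib extensions; a minimal sketch of how they'd be used (the table and column names here are only placeholders):

-- gen_random_uuid() ships with the pgcrypto extension and returns a fully
-- random (version 4) UUID.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- uuid_generate_v1mc() ships with the uuid-ossp extension and returns a
-- time-based (version 1) UUID using a random multicast MAC address.
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Two placeholder tables, one per generator, for comparing insert behaviour:
CREATE TABLE demo_random (
    id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    payload text
);

CREATE TABLE demo_v1mc (
    id      uuid PRIMARY KEY DEFAULT uuid_generate_v1mc(),
    payload text
);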

Luke Gordon <gordysc@gmail.com> writes:
> However, according to a message on this mailing list, Postgres doesn't have
> clustered indexes:
> "But Postgres doesn't _have_ clustered indexes, so that article doesn't
> apply at all. The other authors appear to have missed this important point."
> https://www.postgresql.org/message-id/56798352.7060902%40uchicago.edu
> But, doing a quick check, it appears Postgres does indeed have a mechanism
> for a clustered index:
> https://www.postgresql.org/docs/9.5/static/sql-cluster.html
CLUSTER just does a one-time sort to put the table into index order.
There is no mechanism that would cause subsequent insertions of new keys
to respect that ordering, so it's pretty much irrelevant to the argument
about whether new UUID keys need to be generated in some ordered fashion.
Do you actually *need* UUID keys, and if so why? A plain old bigint
column is smaller, cheaper to index, and the natural mechanism for
generating it (ie a sequence) will tend to preserve ordering for free.
regards, tom lane
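A minimal sketch of the plain-bigint alternative described above (table and column names are placeholders):

-- bigserial is shorthand for a bigint column fed by a sequence; successive
-- inserts get increasing ids, so new index entries land at the right-hand
-- edge of the btree rather than in random leaf pages.
CREATE TABLE accounts (
    id   bigserial PRIMARY KEY,
    name text NOT NULL
);

-- 8-byte keys instead of 16-byte uuids, and the id index stays densely packed.
INSERT INTO accounts (name) VALUES ('alice'), ('bob') RETURNING id;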
Luke Gordon <gordysc@gmail.com> writes:
> PS I attempted to reply to the above thread, but downloading the "raw" link
> resulted in a basic auth challenge. If I entered my mailing list
> credentials, I'd just get rechallenged with basic auth...

BTW, so far as that goes: it's just a weak anti-bot measure. If you try it
in Firefox you'll get an explanation:

    A username and password are being requested by https://www.postgresql.org.
    The site says: "Please authenticate with user archives and password antispam"

Unfortunately, we've found that most other browsers don't present that
message :-(

regards, tom lane
Luke:

On Tue, Aug 30, 2016 at 3:59 PM, Luke Gordon <gordysc@gmail.com> wrote:
> I'm trying to decide on which UUID generator to use for my Postgres
> database, and I've narrowed it down to gen_random & uuid_generate_v1mc.
>
> There's a fascinating article that discusses performance implications
> between gen_random_uuid & uuid_generate_v1mc:
> https://www.starkandwayne.com/blog/uuid-primary-keys-in-postgresql/
> TL;DR, the article suggests:
> "Random produces very fragmented inserts that destroy tables. Use
> uuid_generate_v1mc() [instead].."

He probably means destroy indexes, not tables; tables are not index-ordered
in Postgres (unlike clustered tables in some other products).

> However, according to a message on this mailing list, Postgres doesn't have
> clustered indexes:
> "But Postgres doesn't _have_ clustered indexes, so that article doesn't
> apply at all. The other authors appear to have missed this important point."
> https://www.postgresql.org/message-id/56798352.7060902%40uchicago.edu

No, he hasn't missed it entirely. If you are generating them to use as a kind
of primary key, like a serial, having them timestamp-based means they are
generated in ascending order, which means you append at the end of the table
and of the index, so both end up better filled and perform better. Especially
in the index: ordered insertions tend to be well optimized, being a common
case, and perform quite well. It's not going to be as advantageous as on a
clustered-table system, but it will help (though not that much).

> But, doing a quick check, it appears Postgres does indeed have a mechanism
> for a clustered index:
> https://www.postgresql.org/docs/9.5/static/sql-cluster.html
> So, does anyone know which is best? Or are the performance differences so
> minute they'd only matter in extremely unique circumstances?

CLUSTER just re-sorts the table and rebuilds the index, as has already been
pointed out in other messages. It's equivalent to: create temp table tt as
select * from the_table, truncate the_table, insert into the_table select *
from tt order by index_expression, drop table tt. It is nice to do for tables
that are normally ordered but have somehow lost it, like a log table with an
indexed insertion-timestamp column that gets updated a lot or has many old
records purged. As you would typically query such a table with a range
condition on the timestamp, a CLUSTER will help.

Francisco Olarte.
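Spelled out as SQL, that equivalence looks roughly like this (table, column, and index names are placeholders; the real CLUSTER command also rebuilds every index on the table and takes an ACCESS EXCLUSIVE lock while it runs):

-- One-time rewrite of the table in index order:
CLUSTER the_table USING the_table_created_at_idx;

-- Roughly the same effect done by hand:
CREATE TEMP TABLE tt AS SELECT * FROM the_table;
TRUNCATE the_table;
INSERT INTO the_table SELECT * FROM tt ORDER BY created_at;
DROP TABLE tt;

-- Either way it is a one-off sort; later inserts do not maintain the order.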
From: Tom Lane Sent: Tuesday, August 30, 2016 7:16 AM
Luke Gordon <gordysc@gmail.com> writes:
> However, according to a message on this mailing list, Postgres doesn't
> have clustered indexes:
> "But Postgres doesn't _have_ clustered indexes, so that article
> doesn't apply at all. The other authors appear to have missed this important point."
> https://www.postgresql.org/message-id/56798352.7060902%40uchicago.edu
> But, doing a quick check, it appears Postgres does indeed have a
> mechanism for a clustered index:
> https://www.postgresql.org/docs/9.5/static/sql-cluster.html
CLUSTER just does a one-time sort to put the table into index order.
There is no mechanism that would cause subsequent insertions of new keys to respect that ordering, so it's pretty much irrelevant to the argument about whether new UUID keys need to be generated in some ordered fashion.
Do you actually *need* UUID keys, and if so why? A plain old bigint column is smaller, cheaper to index, and the natural mechanism for generating it (ie a sequence) will tend to preserve ordering for free.
regards, tom lane
I agree with Tom for a “normal” application – I would always use bigints (bigserial) as a PK column. The app I’m currently working on is a high-security web app for which the app coders require guids for all identifiers flowing around the system. So in this scenario, I’m using BOTH bigserials as the PK and uuids as AKs in the core tables. I reference the bigints for all joins and (have to) use the uuids for the filters. It’s been working ok so far: lookup performance on a table with a few million rows, using the (indexed) uuid, is instantaneous. I’ll soon have 100 million+ rows loaded into a single table and will know a bit more.
The uuids are also design insurance for me in case I need to shard, since I’ll need/want that uniqueness across servers.
Mike Sofen
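The two-key pattern described above looks roughly like this (a sketch only; the names are invented, and the uuid default is left to whichever generator the application settles on):

CREATE TABLE core_entity (
    id        bigserial PRIMARY KEY,   -- surrogate key, used for all joins
    public_id uuid NOT NULL UNIQUE,    -- alternate key handed to the app, used in filters
    payload   text
);

-- Child tables reference the cheap 8-byte key...
CREATE TABLE core_entity_detail (
    id             bigserial PRIMARY KEY,
    core_entity_id bigint NOT NULL REFERENCES core_entity (id),
    detail         text
);

-- ...while requests arriving from the application filter on the uuid:
--   SELECT * FROM core_entity WHERE public_id = $1;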
On Tue, 30 Aug 2016 09:40:33 -0700, "Mike Sofen" <msofen@runbox.com> wrote:

>From: Tom Lane Sent: Tuesday, August 30, 2016 7:16 AM
>
>>Do you actually *need* UUID keys, and if so why? A plain old bigint column
>>is smaller, cheaper to index, and the natural mechanism for generating it
>>(ie a sequence) will tend to preserve ordering for free.
>
>I agree with Tom for a "normal" application - I would always use bigints
>(bigserial) as a PK column. The app I'm currently working on is a high
>security web app for which the app coders require guids for all identifiers
>flowing around the system. So in this scenario, I'm using BOTH bigserials
>as the PK and uuids as AKs in the core tables. I reference the bigints for
>all joins and (have to) use the uuids for the filters. It's been working ok
>so far, lookup performance on a table with a few million rows, using the
>uuid (indexed) is instantaneous. I'll soon have 100 million+ rows loaded
>into a single table and know a bit more.
>
>The uuids are also design insurance for me in case I need to shard, since
>I'll need/want that uniqueness across servers.

FYI: articles about sharding using bigint keys.

http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
http://rob.conery.io/2014/05/29/a-better-id-generator-for-postgresql/

George
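For reference, the scheme in the first link packs a timestamp, a shard id, and a per-shard sequence into a single bigint; a rough sketch of that kind of generator (the epoch, sequence name, and shard id below are only illustrative):

CREATE SEQUENCE global_id_seq;

CREATE OR REPLACE FUNCTION next_id(OUT result bigint) AS $$
DECLARE
    our_epoch  bigint := 1464825600000;   -- illustrative custom epoch, in milliseconds
    seq_id     bigint;
    now_millis bigint;
    shard_id   int    := 1;               -- must be unique per shard/database
BEGIN
    SELECT nextval('global_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    -- 41 bits of milliseconds since the epoch, 13 bits of shard id, 10 bits of sequence
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE plpgsql;

-- e.g.  CREATE TABLE events (id bigint PRIMARY KEY DEFAULT next_id(), ...);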
From: George Neuner Sent: Tuesday, August 30, 2016 5:54 PM
>Mike Sofen wrote:
>So in this scenario, I'm using
>BOTH bigserials as the PK and uuids as AKs in the core tables. I
>reference the bigints for all joins and (have to) use the uuids for the
>filters. It's been working ok so far, lookup performance on a table
>with a few million rows, using the uuid (indexed) is instantaneous.
>I'll soon have a 100 million+ rows loaded into a single table and know a bit more.
>
>The uuids are also design insurance for me in case I need to shard,
>since I'll need/want that uniqueness across servers.
FYI: articles about sharding using bigint keys.
http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
http://rob.conery.io/2014/05/29/a-better-id-generator-for-postgresql/
George
I remember reading these articles a long time ago, forgot about them...and appreciate the reminder!
I really liked the enhanced Instagram function from Rob Conery in the second link, but so far haven’t needed to deal with it. However, an upcoming project may require huge data storage – approaching hundreds of billions of rows, and I’m sticking with Postgres – so this will be a great way to test drive the function. And I may try my hand at a further enhancement, time permitting. Thanks for the links!
Mike