Re: [HACKERS] Maintaining cluster order on insert - Mailing list pgsql-patches

From Ron Mayer
Subject Re: [HACKERS] Maintaining cluster order on insert
Date
Msg-id 44DD13C6.2020609@cheapcomplexdevices.com
Whole thread Raw
In response to Re: [HACKERS] Maintaining cluster order on insert  (Heikki Linnakangas <heikki@enterprisedb.com>)
List pgsql-patches
Heikki Linnakangas wrote:
> Ron Mayer wrote:
>> In my case my biggest/slowest tables are clustered by zip-code (which
>> does a reasonable job at keeping counties/cities/etc on the
>> same pages too)....
>
> No deletes? If the tables grow over time, you probably would need to run
> CLUSTER every now and then to get the best performance, though the patch
> would alleviate that quite a lot.

Yup, pretty much no deletes; since it's a historical archive of
some government documents with address info.  Though I the
live system may periodically expunge data, say, 10+years old.

> Do you have a development environment where you could test what effect
> the patch would have? It would be interesting to have a real-world use
> case, since I don't have one myself at the moment.

I have a development environment, but it doesn't have the same
real-time-growing behavior, and only a small region of the country.
I suppose I could pre-load N-1 years and cluster it, and then
incrementally insert the last year of data to simulate the effect.

But sure, I'll attempt to try the patch; but don't really have any
good benchmarking environment to give any definitive results.  If
an anecdotal "this is how it feels to me" is useful, I can give one
of those.

>> > Your best bet might be to partition the table into two subtables, one
>> > with "stable" data and one with the fresh data.
>>
>> Hmm... that should work well for me too....
>
> Umm, if your inserts are uniformly distributed across the country, you
> wouldn't have a stable part, right?

Hmm.  Maybe.  I was thinking when archiving to the large table
an "order by" clause when inserting from the new partition to the
stable partition could at least make the big table "piecewise"
clustered so most records for a zip code fit in the same few disk
pages, even though those pages would still end up lying around
far apart on the disk.

I wonder what part of "CLUSTER" gives the most benefit - that
most records of a type fit on a few blocks;  or that those blocks
are next to each other so can be read sequentially?





pgsql-patches by date:

Previous
From: Jim Nasby
Date:
Subject: Re: [HACKERS] Forcing current WAL file to be archived
Next
From: David Fetter
Date:
Subject: Re: New variable server_version_num