Re: REPACK and naming - Mailing list pgsql-hackers

From David Rowley
Subject Re: REPACK and naming
Date
Msg-id CAApHDvoeW9ecNxbaaXX0QDUS3i3u+Q684C+qU80paa8qqPHzxA@mail.gmail.com
In response to Re: REPACK and naming  (Álvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: REPACK and naming
List pgsql-hackers
On Thu, 18 Sept 2025 at 03:03, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> So there two operations here.  One is
> REPACK tab USING INDEX idx
> which we currently call CLUSTER, and there is also
> REPACK TAB
> (no index specified) which we currently call VACUUM FULL.

I was just thinking about how much of a heap-ism cluster using an
index is. If we were to ever have an index organised table AM, what
would it mean to REPACK tab USING INDEX idx? Would that "secondary"
index then go away and the table would become that index? or would
both continue to exist and the secondary index would be surplus?

I do understand that heap is well ingrained in our code (still), but
at least things like system catalogue tables/columns can evolve over
time. e.g. pg_index.indisclustered I could imagine evolving (or
disappearing) if we had an IOT-AM. I do think locking in syntax is
going to be quite a bit more permanent and needs to be considered very
carefully. Something like REPACK tab ORDER BY col1; seems a bit more
future-proof. table_relation_copy_for_cluster() supports both using an
index to get presorted results and sorting by the index's key columns,
so it doesn't seem impossible that the ability to cluster a table
*specifically* by an index could go away at some point. I do have
concerns about locking us deeper into a syntax for that. But maybe
you've thought about all this already and I'm just not aware...

I'm also trying to keep something like a column store in mind here
where you might not have any indexes, and efficient filtering is done
via the pruning of "chunks", which works by each chunk recording the
min/max (or maybe a dictionary of) values it contains for the columns.
I imagine something like that would very much want the ability to have
something like REPACK tbl ORDER BY col; if you think about how
efficient run-length encoding would be for some orders and how
inefficient it could be for other orders.
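
To put rough numbers on that, here is a minimal sketch in Python. It is
not tied to any real table AM: the column data, the chunk size of 100
rows and all function names are made up for illustration. It compares
the same column in arbitrary order and in sorted order (what an ORDER BY
style rewrite would leave behind), counting run-length-encoded runs and
the chunks a min/max filter would still have to read.

```python
import random


def run_length_encode(values):
    """Collapse consecutive equal values into [value, count] runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def chunk_min_max(values, chunk_size):
    """Record the (min, max) of each fixed-size chunk of a column."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]


def chunks_to_scan(summaries, target):
    """Count the chunks whose [min, max] range could contain target."""
    return sum(1 for lo, hi in summaries if lo <= target <= hi)


# Hypothetical column: 10 distinct values, 100 rows each, shuffled.
rng = random.Random(0)
unordered = [v for v in range(10) for _ in range(100)]
rng.shuffle(unordered)
ordered = sorted(unordered)  # stand-in for the table after an ordered rewrite

rle_unordered = len(run_length_encode(unordered))
rle_ordered = len(run_length_encode(ordered))      # one run per distinct value
scan_unordered = chunks_to_scan(chunk_min_max(unordered, 100), 5)
scan_ordered = chunks_to_scan(chunk_min_max(ordered, 100), 5)

print("runs:   ordered =", rle_ordered, " unordered =", rle_unordered)
print("chunks: ordered =", scan_ordered, " unordered =", scan_unordered)
```

In sorted order the column collapses to one run per distinct value and a
lookup for a single value touches one chunk. In shuffled order the run
count approaches the row count and the min/max ranges of most chunks
overlap the value being searched for, so little can be pruned.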

Anyway, I'm not intentionally trying to make your job here any more
complex. I'm just trying to help make sure we don't end up with some
new syntax that also won't stand up to the test of time.

David


