Re: REPACK and naming - Mailing list pgsql-hackers

From	Antonin Houska
Subject	Re: REPACK and naming
Date	September 19 14:58:30
Msg-id	12212.1758283110@localhost
In response to	Re: REPACK and naming (Álvaro Herrera <alvherre@alvh.no-ip.org>)
Responses	Re: REPACK and naming Re: REPACK and naming
List	pgsql-hackers

Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> On 2025-Sep-19, David Rowley wrote:
>
> > I was just thinking about how much of a heap-ism cluster using an
> > index is. If we were to ever have an index organised table AM, what
> > would it mean to REPACK tab USING INDEX idx? Would that "secondary"
> > index then go away and the table would become that index? or would
> > both continue to exist and the secondary index would be surplus?
>
> So, there's already an implementation of an index-organized table in
> OrioleDB, as I understand, so maybe we can ask Alexander K. about this.
> I suspect it's fine to say that if you have a table for which it makes
> no sense to use REPACK USING INDEX, then we just throw an error in that
> case (but I suppose plain REPACK continues to work, and it just
> recreates/compacts the primary index and rebuilds all secondary indexes,
> just like VACUUM FULL would presumably do.)
>
> > I do understand that heap is well ingrained in our code (still), but
> > at least things like system catalogue tables/columns can evolve over
> > time. e.g. pg_index.indisclustered I could imagine evolving (or
> > disappearing) if we had an IOT-AM. I do think locking in syntax is
> > going to be quite a bit more permanent and needs to be considered very
> > carefully. Something like REPACK tab ORDER BY col1; seems a bit more
> > future proof.
>
> Oh, I think we can implement REPACK tab ORDER BY all right -- do note
> that the current syntax has mandatory USING INDEX keywords (unlike
> CLUSTER), so we can add that feature and others with no grammar
> problems.  In fact even for current heaps it might make sense to allow
> an ORDER BY clause for which there's no index.  I don't see us
> gratuitously removing the option of specifying just an index name (or
> indisclustered), though, because there are likely users that have been
> running that for years.
>
> > table_relation_copy_for_cluster() does support both use
> > of an Index to get presorted results and sorting by the index's key
> > columns, so it doesn't seem impossible that the ability to cluster a
> > table *specifically* by an index couldn't easily go away at some
> > point.
>
> Well, I hope you mean that clustering by an index would stop being the
> _only_ way, not that it would completely disappear as an option.
>
> > Locking us deeper into a syntax for that, I do have concerns for. But
> > maybe you've thought about all this already and I'm just not aware...
>
> At this point we're not *implementing* any of that, but it is possible
> to do so afterwards and we're not blocking that road.
>
> > I'm also trying to keep something like a column store in mind here
> > where you might not have any indexes, and efficient filtering is done
> > via the pruning of "chunks", which works by each chunk recording the
> > min/max (or maybe a dictionary of) values it contains for the columns.
> > I imagine something like that very much would want the ability to have
> > something like REPACK tbl ORDER BY col; if you think how efficient
> > run-length encoding would be for some orders and how inefficient it
> > could be for other orders.
>
> That makes sense, yes, and again, AFAICT it can easily be implemented on
> top of the current work.
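
The run-length encoding point quoted above can be illustrated with a toy
sketch. This is not PostgreSQL or column-store code; the column contents and
the helper name rle_runs are made up for illustration. It only counts how many
runs a run-length encoder would have to store for the same values in two
different physical orders.

```python
from itertools import groupby


def rle_runs(values):
    """Return the number of (value, count) runs a run-length encoder emits."""
    return sum(1 for _ in groupby(values))


# Hypothetical column: 1000 rows holding only 4 distinct values,
# stored so that the values alternate 0, 1, 2, 3, 0, 1, 2, 3, ...
interleaved = [i % 4 for i in range(1000)]

# The same rows after ordering by that column.
ordered = sorted(interleaved)

interleaved_runs = rle_runs(interleaved)  # every row starts a new run
ordered_runs = rle_runs(ordered)          # one run per distinct value

print(interleaved_runs, ordered_runs)
```

With the alternating order every row starts a new run, while the ordered copy
needs only one run per distinct value, which is the sense in which the choice
of order matters for such an encoding.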

Admittedly I haven't thought about a clause like ORDER BY yet, but I wonder if
it'd really be useful. My understanding is that the purpose of clustering is
to make index scans more efficient: with a clustered table, the heap tuples
pertaining to a given index tuple should be located on the same page, so the
heap access is not that random.
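
A toy sketch of that effect, under loudly simplified assumptions: a heap is
modelled as a plain list of keys, ROWS_PER_PAGE is an arbitrary made-up
constant, and pages_touched is a hypothetical helper, not anything from the
PostgreSQL source. It counts how many distinct heap pages an index range scan
would have to visit when the physical order follows the index and when it
does not.

```python
import random

ROWS_PER_PAGE = 100  # assumed number of heap tuples per page (illustrative)


def pages_touched(heap_order, lo, hi):
    """Count distinct heap pages holding keys with lo <= key < hi.

    heap_order[i] is the key stored at physical position i of the heap.
    """
    return len({pos // ROWS_PER_PAGE
                for pos, key in enumerate(heap_order)
                if lo <= key < hi})


keys = list(range(10_000))

# Clustered: physical order follows the index order.
clustered = sorted(keys)

# Unclustered: the same rows in a random physical order (fixed seed).
unclustered = keys[:]
random.Random(0).shuffle(unclustered)

clustered_pages = pages_touched(clustered, 2_000, 2_500)
unclustered_pages = pages_touched(unclustered, 2_000, 2_500)

print(clustered_pages, unclustered_pages)
```

In the clustered layout the 500 matching rows sit on 5 consecutive pages; in
the shuffled layout they are spread over most of the 100 pages, so the same
scan has to visit far more of the heap.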

If an IOT-AM table does not have anything like an index, I imagine it has some
kind of ordering information in the system catalog. Without that, the query
planner can hardly utilize the ordering. In such a case, REPACK should use the
catalog information on ordering rather than accept an arbitrary ORDER BY clause.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
