Re: REPACK and naming - Mailing list pgsql-hackers
From | Antonin Houska |
---|---|
Subject | Re: REPACK and naming |
Date | |
Msg-id | 12212.1758283110@localhost |
In response to | Re: REPACK and naming (Álvaro Herrera <alvherre@alvh.no-ip.org>) |
List | pgsql-hackers |
Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> On 2025-Sep-19, David Rowley wrote:
>
> > I was just thinking about how much of a heap-ism cluster using an
> > index is. If we were to ever have an index organised table AM, what
> > would it mean to REPACK tab USING INDEX idx? Would that "secondary"
> > index then go away and the table would become that index? or would
> > both continue to exist and the secondary index would be surplus?
>
> So, there's already an implementation of an index-organized table in
> OrioleDB, as I understand, so maybe we can ask Alexander K. about this.
> I suspect it's fine to say that if you have a table for which it makes
> no sense to use REPACK USING INDEX, then we just throw an error in that
> case (but I suppose plain REPACK continues to work, and it just
> recreates/compacts the primary index and rebuilds all secondary indexes,
> just like VACUUM FULL would presumably do.)
>
> > I do understand that heap is well ingrained in our code (still), but
> > at least things like system catalogue tables/columns can evolve over
> > time. e.g. pg_index.indisclustered I could imagine evolving (or
> > disappearing) if we had an IOT-AM. I do think locking in syntax is
> > going to be quite a bit more permanent and needs to be considered very
> > carefully. Something like REPACK tab ORDER BY col1; seems a bit more
> > future proof.
>
> Oh, I think we can implement REPACK tab ORDER BY all right -- do note
> that the current syntax has mandatory USING INDEX keywords (unlike
> CLUSTER), so we can add that feature and others with no grammar
> problems. In fact even for current heaps it might make sense to allow
> an ORDER BY clause for which there's no index. I don't see us
> gratuitously removing the option of specifying just an index name (or
> indisclustered), though, because there are likely users that have been
> running that for years.
>
> > table_relation_copy_for_cluster() does support both use
> > of an Index to get presorted results and sorting by the index's key
> > columns, so it doesn't seem impossible that the ability to cluster a
> > table *specifically* by an index couldn't easily go away at some
> > point.
>
> Well, I hope you mean that clustering by an index would stop being the
> _only_ way, not that it would completely disappear as an option.
>
> > Locking us deeper into a syntax for that, I do have concerns for. But
> > maybe you've thought about all this already and I'm just not aware...
>
> At this point we're not *implementing* any of that, but it is possible
> to do so afterwards and we're not blocking that road.
>
> > I'm also trying to keep something like a column store in mind here
> > where you might not have any indexes, and efficient filtering is done
> > via the pruning of "chunks", which works by each chunk recording the
> > min/max (or maybe a dictionary of) values it contains for the columns.
> > I imagine something like that very much would want the ability to have
> > something like REPACK tbl ORDER BY col; if you think how efficient
> > run-length encoding would be for some orders and how inefficient it
> > could be for other orders.
>
> That makes sense, yes, and again, AFAICT it can easily be implemented on
> top of the current work.

Admittedly I haven't thought about a clause like ORDER BY yet, but I
wonder if it'd really be useful. My understanding is that the purpose of
clustering is to make index scans more efficient: with a clustered table,
the heap tuples pertaining to a given index tuple should be located on
the same page, so the heap access is not that random.

If an IOT-AM table does not have anything like an index, I imagine it has
some kind of ordering information in the system catalog. Without that,
the query planner can hardly utilize the ordering. In such a case, REPACK
should use the catalog information on ordering rather than accept an
arbitrary ORDER BY clause.
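To make the locality argument above concrete, here is a minimal Python sketch. It is not PostgreSQL code: the page capacity, table size, key range, and the names `pages_touched` and `demo` are all assumptions made up for illustration. It only counts how many distinct heap pages a range scan over the index would have to visit when the physical order matches the index order, compared with a random physical order.

```python
import random

TUPLES_PER_PAGE = 100  # assumed page capacity, for illustration only


def pages_touched(heap_order, lo, hi):
    """Count distinct heap pages visited by an index range scan on [lo, hi).

    heap_order[i] is the key stored at physical position i in the heap,
    so position i lives on page i // TUPLES_PER_PAGE.
    """
    pages = set()
    for pos, key in enumerate(heap_order):
        if lo <= key < hi:
            pages.add(pos // TUPLES_PER_PAGE)
    return len(pages)


def demo(n_tuples=10_000, lo=2_000, hi=2_500, seed=0):
    """Return (pages for clustered heap, pages for shuffled heap)."""
    keys = list(range(n_tuples))

    clustered = keys[:]  # physical order matches index order

    shuffled = keys[:]   # physical order unrelated to index order
    random.Random(seed).shuffle(shuffled)

    return pages_touched(clustered, lo, hi), pages_touched(shuffled, lo, hi)


if __name__ == "__main__":
    clustered_pages, shuffled_pages = demo()
    print("clustered:", clustered_pages, "shuffled:", shuffled_pages)
```

With these assumed numbers, the 500 matching tuples of the clustered heap sit on 5 consecutive pages, while in the shuffled heap they are spread over most of the 100 pages. This says nothing about how a planner would learn of the ordering, which is the catalog question raised above.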
--
Antonin Houska
Web: https://www.cybertec-postgresql.com