Thread: Proposal: Global Index for PostgreSQL
Hi Dilip Kumar,

Thank you for working on this. I remember there was talk about a global index six years ago; you may want to check whether this mailing list thread has anything relevant: (https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com)

Thanks
On Fri, Jun 6, 2025 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
PostgreSQL’s table partitioning allows large tables to be broken into smaller, more manageable pieces for better performance. However, a key current limitation is the absence of global indexes, which restricts the use of partitioned tables, especially when you need unique constraints on columns that aren't part of the partition key. It's true that global indexes might seem to go against the core idea of partitioning. However, they become essential when business needs dictate partitioning data on one key while also enforcing uniqueness on a different column. Without global indexes, users simply can't leverage partitioning at all in such scenarios.
I've been developing a design and a POC for global indexes to address
this. I'd like to share my work, including the challenges we still
face and my proposed solutions.
Storage
=======
Since a global index covers tuples from multiple partitions, a TID alone is not sufficient to uniquely identify a tuple. To handle this, I have introduced a partition identifier, a 4-byte integer (we may argue over whether this should be variable-length to save space). I will describe the partition identifier in more detail later in this email, along with the reason for choosing it over a plain relation OID, which might seem more straightforward.

The partition identifier is stored in the index tuple right after the last key column. Storing it there makes a lot of things work in a very straightforward way, without any special handling for this column. For example, btree tuples are arranged in key order, and duplicate keys are arranged in TID order; for a global index we want duplicate keys to be arranged in partition identifier order, and only then, if the partition identifier is also equal, in TID order. By storing the identifier after the last key column and treating that field as part of the extended key, tuples are automatically arranged in (keys, partition identifier, TID) order, as desired. A sketch of this ordering follows below. Suffix truncation and deduplication will also be simpler with this design choice.
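To illustrate the resulting sort order, here is a small standalone sketch (not patch code; GlobalIndexKey and the comparator are made-up stand-ins, with a single integer standing in for the user keys and the TID):

/*
 * Minimal sketch of the comparison order a global-index btree would use
 * once the partition identifier is treated as an extended key column.
 */
#include <stdint.h>

typedef struct GlobalIndexKey
{
    int32_t     userkey;        /* stand-in for the user's key column(s) */
    int32_t     partid;         /* partition identifier, after last key */
    uint64_t    tid;            /* stand-in for ItemPointerData */
} GlobalIndexKey;

static int
global_index_compare(const GlobalIndexKey *a, const GlobalIndexKey *b)
{
    if (a->userkey != b->userkey)
        return (a->userkey < b->userkey) ? -1 : 1;
    /* duplicate keys: fall back to partition identifier order */
    if (a->partid != b->partid)
        return (a->partid < b->partid) ? -1 : 1;
    /* same partition: finally TID order, as for a regular btree */
    if (a->tid != b->tid)
        return (a->tid < b->tid) ? -1 : 1;
    return 0;
}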
Partition Identifier
==============
Before discussing anything about CREATE INDEX or ATTACH/DETACH PARTITION, I think we need to discuss the partition identifier itself. The main reason for not using the relation OID as the partition identifier is that doing so would force us to remove all tuples of a detached partition from the global index; otherwise it would be very hard to identify the tuples that are no longer valid and should not be scanned. For example, if the same partition is detached and then reattached, we could not distinguish the old tuples (which should be ignored) from the new ones. The same problem arises if the OID gets reassigned to some other partition after the old partition is dropped and the OID counter wraps around.
To solve this, we've introduced a partition identifier, a 4-byte
integer. We'll use a new catalog called pg_index_partitions to store a
mapping from (partition identifier, global index OID) to relation OID.
The mapping isn't a direct one from partition identifier to relation
ID because each partition will have a different partition identifier
for each global index it's associated with. This is crucial for
correctly handling global indexes in multi-level partition
hierarchies.
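For concreteness, a pg_index_partitions row would carry something like the struct below. This is only an illustration of the mapping; the real catalog will be declared through the usual CATALOG() machinery and its column names may differ:

/*
 * Hypothetical layout of a pg_index_partitions row, illustrating the
 * (indexoid, partid) -> reloid mapping described above.
 */
typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

typedef struct IndexPartitionMapping
{
    Oid     indexoid;   /* OID of the global index */
    int     partid;     /* partition identifier stored in index tuples */
    Oid     reloid;     /* leaf partition; invalidated once detached */
} IndexPartitionMapping;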
Consider this example:
- T is the top-level parent and has global index g.
- T1 and T2 are children of T. T2 is itself a partitioned table and also has global index g2.
- T22 is a child of T2.
Suppose we assigned a single partition identifier to T22, with a one-to-one mapping to its relation OID. What happens if we detach T2 from T? (We haven't delved into the detach process yet, but our current thought is to invalidate the partition identifier entry, so that during a scan, when we look up the partition identifier to find the corresponding reloid, we'd simply ignore it.)

However, this approach presents a problem. We can't simply invalidate the entry for T22 entirely: its partition identifier should be considered invalid when scanning global index 'g' (since T2 is detached from T), but it should still be valid when scanning global index 'g2' (which belongs to T2).
To address this, each leaf partition needs to be assigned a separate
partition identifier for every global index it's associated with.
While a separate partition identifier for each level in the partition
hierarchy could suffice, we've opted for the simpler approach of
assigning a distinct identifier per global index. We made this choice
assuming users will be judicious in creating global indexes within
their partition hierarchies, meaning we won't have to worry about
rapidly consuming a large number of partition IDs. Additionally, the partition identifier counter can be maintained per global index rather than globally, which keeps the counter from growing rapidly.
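To make the example concrete, here is a hypothetical set of pg_index_partitions entries for the hierarchy above, with identifiers chosen arbitrarily:

  indexoid | partid | reloid
  ---------+--------+--------
     g     |   1    |  T1
     g     |   2    |  T22    <- invalidated when T2 is detached from T
     g2    |   1    |  T22    <- still valid; g2 belongs to T2

Because T22 has a distinct (indexoid, partid) row per global index, detaching T2 invalidates only the row under g, leaving scans of g2 unaffected.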
Creating a global Index
=================
While creating an index, if the index is marked as GLOBAL, there are some differences compared to a partitioned index:
1) An additional internal partition identifier key column will be added.
2) A partition id will be allocated to each leaf partition existing under the top-level partition on which we are creating the global index, and the mapping will be stored in the pg_index_partitions table.
3) Unlike a partitioned index, the global index will not be created on each child; it will be created only on the partitioned relation on which the user chooses to create it.
4) The index build needs to be modified to scan through all the leaf partitions and create a single sort space for inserting into the global index (a sketch of this build loop follows after this list).
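Roughly, the build in point 4 would look like this; all helper names here are hypothetical stand-ins for the real table-scan, tuplesort, and btree-load machinery:

/*
 * Hypothetical sketch of a global-index build: one shared sort space
 * fed by every leaf partition, then a single btree load.
 */
typedef unsigned int Oid;

extern int  lookup_partition_id(Oid indexoid, Oid reloid);  /* pg_index_partitions */
extern void scan_partition_tuples(Oid reloid, int partid, void *sortstate);
extern void load_sorted_into_index(Oid indexoid, void *sortstate);

static void
build_global_index(Oid indexoid, Oid *leaf_partitions, int nparts,
                   void *sortstate)
{
    for (int i = 0; i < nparts; i++)
    {
        int partid = lookup_partition_id(indexoid, leaf_partitions[i]);

        /* each tuple is tagged with its partition identifier */
        scan_partition_tuples(leaf_partitions[i], partid, sortstate);
    }

    /* single sorted stream -> single btree load, as in a normal build */
    load_sorted_into_index(indexoid, sortstate);
}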
Attach Partition
=============
While attaching a partition, if this is a leaf partition, we need to
assign a new partition id with respect to each global index present on
its parent and ancestors and insert the mapping in the
pg_index_partitions table. If the partition being attached is itself
a partitioned table then this step has to be done for every leaf
partition present in the partition hierarchy under the partitioned
table being attached.
Currently we reindex the global indexes present on the parent and ancestors to which we are attaching the partition(s), but this can be optimized: for example, we may choose to selectively insert just the tuples of the partition being attached. In some cases that could be costly, though, e.g. if the attached partition has many tuples compared to the partitions that are already attached. So this could be a mixed approach, decided based on the number of tuples already in the index vs. the number of incoming tuples; for now I am listing this as an open item for suggestions.
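In pseudocode, the mapping work at attach time is then simply the following (hypothetical helper names; the real patch would do this through the catalog machinery):

/*
 * Hypothetical sketch of the mapping work done at ATTACH PARTITION time.
 * next_partition_id() is the per-global-index counter discussed above.
 */
typedef unsigned int Oid;

extern int  next_partition_id(Oid indexoid);
extern void insert_index_partition_mapping(Oid indexoid, int partid, Oid reloid);

static void
attach_assign_partition_ids(Oid *global_indexes, int nindexes,
                            Oid *new_leaves, int nleaves)
{
    /* every new leaf gets a fresh id for every global index above it */
    for (int i = 0; i < nindexes; i++)
        for (int j = 0; j < nleaves; j++)
            insert_index_partition_mapping(global_indexes[i],
                                           next_partition_id(global_indexes[i]),
                                           new_leaves[j]);
}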
Detach Partition
=============
When detaching or dropping a partition, we just need to invalidate the
corresponding mapping in pg_index_partitions. Specifically, for each
leaf partition being detached, we'll mark its reloid as invalid within
the (indexoid, partition id) entry. This ensures that during an index
scan, when the system attempts to convert a partition ID to its
reloid, it will recognize this entry as invalid. We'll also need to
consider a mechanism to clean up these invalidated entries at some
point.
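The detach side is the inverse: no index work at all, just an invalidation of the mapping (again with made-up helper names):

/*
 * Hypothetical sketch of DETACH: the (indexoid, partid) row stays behind,
 * but its reloid is cleared so that scans translating partid -> reloid
 * see an invalid entry and skip the matching index tuples.
 */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

extern void update_index_partition_reloid(Oid indexoid, int partid, Oid reloid);

static void
detach_invalidate(Oid indexoid, int partid)
{
    update_index_partition_reloid(indexoid, partid, InvalidOid);
}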
Global Index Scan
==============
In short, a global index scan will be similar to a normal index scan. The key difference is that we'll also need to track a heapOid alongside the heapTid within BTScanPosItem (see the sketch below). Although we store the partition ID
directly in the index tuple, we can readily convert it to a heapOid by
referencing the pg_index_partitions table. For performance, we'll
maintain a cache within the global index's Relcache entry; this cache
will simply store the mapping of partitions associated with that
specific global index.
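As a sketch, the scan-position entry might grow one field along these lines. The first three fields mirror the current nbtree struct; heapOid is the hypothetical addition, filled in by translating the partition id found in the index tuple (through the relcache-level cache just mentioned). The typedefs are simplified stand-ins for the usual PostgreSQL headers:

/* simplified stand-ins for storage/itemptr.h, storage/off.h, etc. */
typedef unsigned int   Oid;
typedef unsigned short OffsetNumber;
typedef unsigned short LocationIndex;

typedef struct ItemPointerData
{
    unsigned short ip_blkid_hi;     /* high half of block number */
    unsigned short ip_blkid_lo;     /* low half of block number */
    unsigned short ip_posid;        /* offset within the block */
} ItemPointerData;

typedef struct BTScanPosItem        /* what we remember about each match */
{
    ItemPointerData heapTid;        /* TID of referenced heap item */
    OffsetNumber    indexOffset;    /* index item's location within page */
    LocationIndex   tupleOffset;    /* IndexTuple's offset in workspace */
    Oid             heapOid;        /* NEW: partition the heapTid points into */
} BTScanPosItem;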
Furthermore, we must account for new index access paths when global
indexes are present. Currently, set_append_rel_size() generates index
restriction information for each append relation. Now, this
restriction information must also be generated for the partitioned
table itself. Additionally, within set_append_rel_pathlist(), we'll
need to invoke create_index_paths() for the partitioned table. This
ensures that we consider index paths for the partitioned table, which,
in the case of partitioned tables, will be global index scan paths.
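In terms of today's planner entry points, the change is roughly the following; rel_has_global_index() and build_per_child_paths() are hypothetical stand-ins, while create_index_paths() is the existing function the text refers to:

/*
 * Hypothetical sketch of the planner-side change.  The opaque types are
 * illustrative; in PostgreSQL the real change would live in
 * set_append_rel_size()/set_append_rel_pathlist().
 */
typedef struct PlannerInfo PlannerInfo;
typedef struct RelOptInfo  RelOptInfo;

extern int  rel_has_global_index(RelOptInfo *rel);              /* hypothetical */
extern void create_index_paths(PlannerInfo *root, RelOptInfo *rel);
extern void build_per_child_paths(PlannerInfo *root, RelOptInfo *rel);

static void
append_rel_pathlist_with_global_index(PlannerInfo *root, RelOptInfo *rel)
{
    /* existing behaviour: paths for each child partition */
    build_per_child_paths(root, rel);

    /*
     * New: also consider index paths on the partitioned rel itself; with
     * a global index present these become global-index scan paths.
     */
    if (rel_has_global_index(rel))
        create_index_paths(root, rel);
}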
Locking consideration
=================
Currently, when we perform DML operations on a top-level partitioned
table, we acquire a lock on the entire partition hierarchy beneath it,
which works well. However, if we're directly operating on a leaf
relation, say, inserting a tuple, we only lock that specific relation.
This isn't sufficient for global indexes.
Since a global index can reside on a parent table, inserting a tuple
into a leaf relation also necessitates an insertion into the global
index. This means we also need to lock the parent table on which the
global index is created.
However, even that isn't enough. We actually need to lock the entire
partition hierarchy under the parent table that has the global index.
Here's why: when you insert a tuple into any leaf partition, you'll
also insert it into the global index. If it's a unique index, a
conflict might arise with a key belonging to another partition. To
validate whether the tuple is live or not, we might need to access
data in other partitions.
Based on our chosen design, we identify all these necessary locks
during the planning phase. We then acquire these locks during planning
and store them within the planned statement. This way, if the
statement is prepared and executed later, these locks can be
re-acquired during AcquireExecutorLocks.
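For example, the existing find_all_inheritors() helper already locks a whole hierarchy top-down, so the lock acquisition could look roughly like this (parent_of_global_index() is a made-up lookup, and the typedefs stand in for the usual headers):

/*
 * Sketch: lock the entire partition hierarchy under the table that owns
 * the global index before a DML on any leaf.  find_all_inheritors() is
 * the existing helper that walks pg_inherits and locks each relation.
 */
typedef unsigned int Oid;
typedef int LOCKMODE;
typedef struct List List;

extern Oid   parent_of_global_index(Oid global_index_oid);  /* hypothetical */
extern List *find_all_inheritors(Oid parentrelId, LOCKMODE lockmode,
                                 List **numparents);

static void
lock_hierarchy_for_global_index(Oid global_index_oid, LOCKMODE lockmode)
{
    Oid parent = parent_of_global_index(global_index_oid);

    /* locks the parent and every descendant, top-down */
    (void) find_all_inheritors(parent, lockmode, NULL);
}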
Open Problems/Need suggestions
==========================
Here is a list of the top few problems that would be good to discuss sooner rather than later.
Vacuum for global indexes
----------------------------------
I think this has been discussed multiple times in the past whenever we
talked about the global index. The core issue is that, by default,
global indexes are vacuumed with each individual partition. This
becomes incredibly inefficient and costly when dealing with millions
of partitions, as it leads to repeated, expensive scans of a large
global index.
Ideally, global indexes should only be vacuumed once, after all
partitions have been processed. This is because a global index keeps
data from all partitions, meaning a single vacuum operation after all
partition vacuums would be sufficient and far more efficient. This
approach would significantly reduce the overhead associated with
maintaining global indexes in large, partitioned datasets.
Our testing highlights a significant performance bottleneck when
vacuuming with global indexes. In a scenario with 1,000 partitions and
a total of 50 million tuples, vacuuming without global indexes took a
mere 15 seconds. However, after introducing global indexes, the vacuum time skyrocketed to 45 minutes, a 180-fold increase! This dramatic
slowdown is expected, as the global index, containing data from all
partitions, is effectively vacuumed with each of the 1,000 individual
partition vacuums.
We've made a quick but significant optimization: we now vacuum each partition and its local indexes first, deferring the global index vacuum and the second pass of heap vacuuming. Instead, we keep the dead-TID store (the collected dead tuples) for each partition. Once all partitions are done, we vacuum the global index once and perform the second heap pass across all partitions. This change slashed our vacuuming time from 45 minutes down to just 40 seconds! It clearly shows we have plenty of room for optimization, and this challenge isn't a showstopper. I'll share details of other solutions I'm testing in an upcoming email, to keep this one concise. A sketch of the deferred scheme follows below.
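In outline, the optimized scheme looks like this; every helper below is a hypothetical stand-in for the corresponding lazy-vacuum step, and DeadTidStore stands in for the per-partition dead-TID storage:

/*
 * Hypothetical sketch of the deferred global-index vacuum: per-partition
 * heap pass 1 and local-index vacuum first, then one pass over the global
 * index for everyone, then per-partition heap pass 2.
 */
typedef unsigned int Oid;
typedef struct DeadTidStore DeadTidStore;

extern DeadTidStore *vacuum_heap_pass1(Oid reloid);            /* collect dead TIDs */
extern void vacuum_local_indexes(Oid reloid, DeadTidStore *dead);
extern void vacuum_global_index_once(Oid indexoid, DeadTidStore **all, int n);
extern void vacuum_heap_pass2(Oid reloid, DeadTidStore *dead); /* reclaim line ptrs */

static void
vacuum_partitioned_with_global_index(Oid indexoid, Oid *leaves, int nleaves,
                                     DeadTidStore **stores)
{
    /* Phase 1: per-partition heap scan + local indexes only */
    for (int i = 0; i < nleaves; i++)
    {
        stores[i] = vacuum_heap_pass1(leaves[i]);
        vacuum_local_indexes(leaves[i], stores[i]);
        /* global index and heap pass 2 deliberately deferred */
    }

    /* Phase 2: one scan of the global index covering all partitions */
    vacuum_global_index_once(indexoid, stores, nleaves);

    /* Phase 3: now the second heap pass can remove dead line pointers */
    for (int i = 0; i < nleaves; i++)
        vacuum_heap_pass2(leaves[i], stores[i]);
}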
Global Index rewrite
-------------------------
One particular class of things that needs careful consideration is table-rewriting operations. We have a few of those: CLUSTER, VACUUM FULL, the rewriting forms of ALTER TABLE, and so on. When such an operation occurs, all the TIDs potentially change, so tablecmds.c arranges to rebuild indexes; when there's a global index involved, we also need to rebuild the global indexes. As a possible alternative strategy, we could allocate a new partition ID and just leave the old index entries to be cleaned out eventually, but that seems to risk a lot of bloat. A key point here, however, is that we don't want to be rebuilding the same global indexes over and over: if someone does a table-rewriting operation across a whole partitioning hierarchy, we don't want to rebuild the same global indexes multiple times.
We also need to consider what happens when relations are truncated. Here, the two possible strategies are: (1) assign new partition IDs and leave the old index entries around, or (2) truncate the entire index and then reinsert entries for any untruncated partitions. As above, (1) risks bloat, but it also makes TRUNCATE fast. Of course, if the whole partitioning hierarchy is truncated all at once, then (2) is clearly better.
I'm aiming to submit the first WIP patch set before the July
commitfest. It won't have all the issues ironed out yet, but the main
design will be functional.
Thanks, Robert, for many of the key design ideas and the regular discussion throughout the design process. I'd also like to thank Joe, Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the independent issues with me offlist.
--
Regards,
Dilip Kumar
On Fri, Jun 6, 2025 at 1:01 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>
> Hi Dilip Kumar,
>
> Thank you for working on this. I remember there was talk about a global
> index six years ago; you may want to check whether this mailing list
> thread has anything relevant:
> (https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com)

Sure, thanks.

--
Regards,
Dilip Kumar
Google