From f8748aae4eceb8279ee37eae008437572edf8204 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Thu, 3 Dec 2020 16:50:10 -0800 Subject: [PATCH v11 2/2] Add bottom-up index deletion. Teach heapam and nbtree to eagerly delete duplicate tuples that represent old versions. This leaf page level process is triggered in response to a flood of versions on the page. Heuristics detect the problem at the leaf page level (including the recently added "index is logically unchanged by an UPDATE" executor hint). The immediate goal of bottom-up index deletion in nbtree is to avoid "unnecessary" page splits caused entirely by duplicates needed only for MVCC/versioning purposes. It naturally has an even more useful effect, though: it acts as a backstop against accumulating an excessive number of index tuple versions for any given _logical row_. Note that the relationship between this localized condition and the proportion of garbage tuples in the entire index is very loose, and can be very volatile. Bottom-up index deletion complements what we might now call "top-down index deletion": index vacuuming performed by VACUUM. It responds to the immediate local needs of queries, while leaving it up to autovacuum to perform infrequent clean sweeps of the index. Also extend deletion of LP_DEAD-marked index tuples by teaching it to delete extra index tuples (that are not LP_DEAD-marked) in passing. This increases the number of index tuples deleted significantly in many cases (and typically doesn't necessitate that the deletion process visits extra table blocks, unlike bottom-up deletion). Testing has shown that the enhanced LP_DEAD deletion process almost never fails to delete at least a few extra not-LP_DEAD-marked index tuples when the regression tests are run. In practice the enhanced deletion process can pick up a surprisingly large number of "extra" index tuples; this can even significantly exceed the number of LP_DEAD-marked tuples. The previous table interface (the table_compute_xid_horizon_for_tuples() function) has been replaced with a new interface that supports all of our new requirements. At least some of the capabilities added to nbtree by this commit could be extended to other index AMs without too much trouble. It would be fairly straightforward to add support for including "extra" TIDs when deleting LP_DEAD-marked index tuples to both the hash AM and GiST AM. That's left as work for a future commit. This commit extends nbtree's _bt_delitems_delete() function to support granular TID deletion in posting list tuples. Bottom-up index deletion and the enhanced LP_DEAD deletion process both support deleting individual TIDs from posting list tuples. This enhancement avoids possible performance issues caused by posting list tuples having only one single LP_DEAD bit. In practice there is a fairly good chance that we'll pick up a subset of the posting list TIDs in passing, so it may well not matter that it still isn't possible to individually mark each posting list TID LP_DEAD. Bump XLOG_PAGE_MAGIC because xl_btree_delete changed. No bump in BTREE_VERSION, since there are no changes to the on-disk representation of nbtree indexes. Indexes built on PostgreSQL 12 or PostgreSQL 13 will automatically benefit from the optimization (i.e. no reindexing required) following a pg_upgrade. 
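The executor hint mentioned above fires when an UPDATE leaves every key column of the target index unchanged. A minimal sketch of such a check, assuming the old and new key column values have already been extracted into arrays (the function name and parameters here are hypothetical, not the patch's actual executor code):

    static bool
    index_unchanged_by_update_sketch(Relation indexRel,
                                     Datum *oldValues, bool *oldIsnull,
                                     Datum *newValues, bool *newIsnull)
    {
        int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);

        for (int i = 0; i < nkeyatts; i++)
        {
            Form_pg_attribute att = TupleDescAttr(RelationGetDescr(indexRel), i);

            /* A NULLness change is a logical change */
            if (oldIsnull[i] != newIsnull[i])
                return false;
            /* Byte-wise datum comparison; conservative but cheap */
            if (!oldIsnull[i] &&
                !datumIsEqual(oldValues[i], newValues[i],
                              att->attbyval, att->attlen))
                return false;
        }

        return true;            /* index is logically unchanged */
    }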
Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas Reviewed-By: Victor Yegorov Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com --- src/include/access/heapam.h | 5 +- src/include/access/nbtree.h | 23 +- src/include/access/nbtxlog.h | 101 +++-- src/include/access/tableam.h | 143 +++++- src/backend/access/common/reloptions.c | 10 + src/backend/access/heap/heapam.c | 546 +++++++++++++++++++++-- src/backend/access/heap/heapam_handler.c | 2 +- src/backend/access/index/genam.c | 41 +- src/backend/access/nbtree/README | 133 +++++- src/backend/access/nbtree/nbtdedup.c | 320 ++++++++++++- src/backend/access/nbtree/nbtinsert.c | 357 +++++++++++++-- src/backend/access/nbtree/nbtpage.c | 494 ++++++++++++++------ src/backend/access/nbtree/nbtree.c | 2 +- src/backend/access/nbtree/nbtsort.c | 1 - src/backend/access/nbtree/nbtutils.c | 4 +- src/backend/access/nbtree/nbtxlog.c | 94 ++-- src/backend/access/table/tableam.c | 6 +- src/backend/access/table/tableamapi.c | 2 +- src/bin/psql/tab-complete.c | 4 +- doc/src/sgml/btree.sgml | 109 ++++- doc/src/sgml/ref/create_index.sgml | 16 + 21 files changed, 2055 insertions(+), 358 deletions(-) diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 54b2eb7378..a0d55c9165 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -166,9 +166,8 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid); extern void simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup); -extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel, - ItemPointerData *items, - int nitems); +extern TransactionId heap_compute_delete_for_tuples(Relation rel, + TM_IndexDeleteOp *delstate); /* in heap/pruneheap.c */ struct GlobalVisState; diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 3b60e696eb..b80fc47806 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -17,6 +17,7 @@ #include "access/amapi.h" #include "access/itup.h" #include "access/sdir.h" +#include "access/tableam.h" #include "access/xlogreader.h" #include "catalog/pg_am_d.h" #include "catalog/pg_index.h" @@ -766,8 +767,9 @@ typedef struct BTDedupStateData typedef BTDedupStateData *BTDedupState; /* - * BTVacuumPostingData is state that represents how to VACUUM a posting list - * tuple when some (though not all) of its TIDs are to be deleted. + * BTVacuumPostingData is state that represents how to VACUUM (or delete) a + * posting list tuple when some (though not all) of its TIDs are to be + * deleted. * * Convention is that itup field is the original posting list tuple on input, * and palloc()'d final tuple used to overwrite existing tuple on output. @@ -963,6 +965,7 @@ typedef struct BTOptions /* fraction of newly inserted tuples prior to trigger index cleanup */ float8 vacuum_cleanup_index_scale_factor; bool deduplicate_items; /* Try to deduplicate items? */ + bool delete_items; /* Bottom-up delete items? */ } BTOptions; #define BTGetFillFactor(relation) \ @@ -978,6 +981,11 @@ typedef struct BTOptions relation->rd_rel->relam == BTREE_AM_OID), \ ((relation)->rd_options ? \ ((BTOptions *) (relation)->rd_options)->deduplicate_items : true)) +#define BTGetDeleteItems(relation) \ + (AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \ + relation->rd_rel->relam == BTREE_AM_OID), \ + ((relation)->rd_options ? \ + ((BTOptions *) (relation)->rd_options)->delete_items : true)) /* * Constant definition for progress reporting. 
Phase numbers must match @@ -1031,6 +1039,8 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan); extern void _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem, Size newitemsz, bool checkingunique); +extern bool _bt_bottomup_pass(Relation rel, Buffer buf, Relation heapRel, + Size newitemsz); extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base, OffsetNumber baseoff); extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup); @@ -1045,7 +1055,8 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting, * prototypes for functions in nbtinsert.c */ extern bool _bt_doinsert(Relation rel, IndexTuple itup, - IndexUniqueCheck checkUnique, Relation heapRel); + IndexUniqueCheck checkUnique, bool indexUnchanged, + Relation heapRel); extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack); extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child); @@ -1083,9 +1094,9 @@ extern bool _bt_page_recyclable(Page page); extern void _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable, int ndeletable, BTVacuumPosting *updatable, int nupdatable); -extern void _bt_delitems_delete(Relation rel, Buffer buf, - OffsetNumber *deletable, int ndeletable, - Relation heapRel); +extern void _bt_delitems_delete_check(Relation rel, Buffer buf, + Relation heapRel, + TM_IndexDeleteOp *delstate); extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact); diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h index 5c014bdc66..db1eb11042 100644 --- a/src/include/access/nbtxlog.h +++ b/src/include/access/nbtxlog.h @@ -176,24 +176,6 @@ typedef struct xl_btree_dedup #define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nintervals) + sizeof(uint16)) -/* - * This is what we need to know about delete of individual leaf index tuples. - * The WAL record can represent deletion of any number of index tuples on a - * single index page when *not* executed by VACUUM. Deletion of a subset of - * the TIDs within a posting list tuple is not supported. - * - * Backup Blk 0: index page - */ -typedef struct xl_btree_delete -{ - TransactionId latestRemovedXid; - uint32 ndeleted; - - /* DELETED TARGET OFFSET NUMBERS FOLLOW */ -} xl_btree_delete; - -#define SizeOfBtreeDelete (offsetof(xl_btree_delete, ndeleted) + sizeof(uint32)) - /* * This is what we need to know about page reuse within btree. This record * only exists to generate a conflict point for Hot Standby. @@ -211,9 +193,61 @@ typedef struct xl_btree_reuse_page #define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page)) /* - * This is what we need to know about which TIDs to remove from an individual - * posting list tuple during vacuuming. An array of these may appear at the - * end of xl_btree_vacuum records. + * xl_btree_vacuum and xl_btree_delete records describe deletion of index + * tuples on a leaf page. The former variant is used by VACUUM, while the + * latter variant is used by the ad-hoc deletions that sometimes take place + * when btinsert() is called. + * + * The records are very similar. The only difference is that xl_btree_delete + * has to include a latestRemovedXid field to generate recovery conflicts. + * (VACUUM operations can just rely on earlier conflicts generated during + * pruning of the table whose TIDs the to-be-deleted index tuples point to. + * There are also small differences between each REDO routine that we don't go + * into here.) 
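Concretely, this layout means a REDO routine can locate each trailing array with plain pointer arithmetic. A hedged sketch against the xl_btree_delete struct defined just below (variable names are illustrative, and record is assumed to be the XLogReaderState; the patch's btree_xlog_delete() does the equivalent):

    xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
    char       *ptr = (char *) xlrec + SizeOfBtreeDelete;

    /* ndeleted page offset numbers come first */
    OffsetNumber *deletedoffsets = (OffsetNumber *) ptr;
    /* then nupdated page offset numbers for updated posting list tuples */
    OffsetNumber *updatedoffsets = deletedoffsets + xlrec->ndeleted;
    /* finally the xl_btree_update metadata, each entry followed by its
     * 0-based posting list TID offsets */
    xl_btree_update *updates = (xl_btree_update *) (updatedoffsets +
                                                    xlrec->nupdated);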
+ * + * xl_btree_vacuum and xl_btree_delete both represent deletion of any number + * of index tuples on a single leaf page using page offset numbers. Both also + * support "updates" of index tuples, which is how deletes of a subset of TIDs + * contained in an existing posting list tuple are implemented. + * + * Updated posting list tuples are represented using xl_btree_update metadata. + * The REDO routines each use the xl_btree_update entries (plus each + * corresponding original index tuple from the target leaf page) to generate + * the final updated tuple. + * + * Updates are only used when there will be some remaining TIDs left by the + * REDO routine. Otherwise the posting list tuple just gets deleted outright. + */ +typedef struct xl_btree_vacuum +{ + uint16 ndeleted; + uint16 nupdated; + + /* DELETED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */ +} xl_btree_vacuum; + +#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16)) + +typedef struct xl_btree_delete +{ + TransactionId latestRemovedXid; + uint16 ndeleted; + uint16 nupdated; + + /* DELETED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ + /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */ +} xl_btree_delete; + +#define SizeOfBtreeDelete (offsetof(xl_btree_delete, nupdated) + sizeof(uint16)) + +/* + * The offsets that appear in xl_btree_update metadata are offsets into the + * original posting list from tuple, not page offset numbers. These are + * 0-based. The page offset number for the original posting list tuple comes + * from main xl_btree_delete/xl_btree_vacuum record. */ typedef struct xl_btree_update { @@ -224,31 +258,6 @@ typedef struct xl_btree_update #define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16)) -/* - * This is what we need to know about a VACUUM of a leaf page. The WAL record - * can represent deletion of any number of index tuples on a single index page - * when executed by VACUUM. It can also support "updates" of index tuples, - * which is how deletes of a subset of TIDs contained in an existing posting - * list tuple are implemented. (Updates are only used when there will be some - * remaining TIDs once VACUUM finishes; otherwise the posting list tuple can - * just be deleted). - * - * Updated posting list tuples are represented using xl_btree_update metadata. - * The REDO routine uses each xl_btree_update (plus its corresponding original - * index tuple from the target leaf page) to generate the final updated tuple. - */ -typedef struct xl_btree_vacuum -{ - uint16 ndeleted; - uint16 nupdated; - - /* DELETED TARGET OFFSET NUMBERS FOLLOW */ - /* UPDATED TARGET OFFSET NUMBERS FOLLOW */ - /* UPDATED TUPLES METADATA ARRAY FOLLOWS */ -} xl_btree_vacuum; - -#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16)) - /* * This is what we need to know about marking an empty subtree for deletion. * The target identifies the tuple removed from the parent page (note that we diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 387eb34a61..2d2228f4aa 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -128,6 +128,123 @@ typedef struct TM_FailureData bool traversed; } TM_FailureData; +/* + * State representing call to table_compute_delete_for_tuples(), which checks + * TID status with tableam for index deletion purposes. 
+ * + * Index AM caller provides a TM_IndexDeleteOp, which points to two palloc()'d + * arrays. Each array has one entry per TID that the tableam is asked to + * consider (typically these are all of the TIDs from a single index page, so + * there could be hundreds or even thousands of entries in the arrays). + * ndeltids tracks the current number of entries, and is set by the index AM + * initially. + * + * The two arrays are conceptually one single variable-sized array. Two + * arrays/structs are used for performance reasons. (We really need to keep + * the TM_IndexDelete struct small so that the tableam can do an initial sort + * by TID as quickly as possible.) + * + * Regular deletion callers and bottom-up deletion callers: + * + * Most index AM callers specify bottomup = false, and include only known-dead + * deltids. These known-dead entries are marked deleteitup = true directly + * (typically these are TIDs from LP_DEAD-marked index tuples). Callers that + * only call table_compute_delete_for_tuples() to get a latestRemovedXid + * transaction ID can take this simple approach, and don't need to do anything + * with the final array (the call can even be skipped entirely with an + * unlogged index). This usage is all that the previous interface (the old + * compute_xid_horizon_for_tuples() routine) ever supported in prior versions + * (those before PostgreSQL 14). + * + * Callers that specify bottomup = true are "bottom-up index deletion" + * callers. The considerations here are somewhat more subtle. Most of the + * complexity of the current interface exists to support bottom-up deletion. + * + * The final contents of the array are always interesting to bottom-up + * callers, because they need to consult it to determine which index tuples + * are actually safe to delete. Entries for a bottom-up caller must always be + * initially marked deleteitup = false, leaving it up to the tableam to + * determine which entries are safely deletable. Which index tuples to get + * deltids entries from in the first place is up to the index AM, but it's + * expected that index AMs take TIDs from version churn index tuple + * duplicates. The index AM should cast a wide net (for example by including + * all TIDs on a given index page), and leave it up to the tableam to worry + * about the cost of checking transaction status information. See also: notes + * below about "promising" tuples. + * + * Some regular deletion callers (!bottomup callers) may also have to look at + * the final deltids array to decide exactly what to delete. This happens + * with !bottomup callers that speculatively ask the tableam to check extra + * TIDs in passing, somewhat like the bottom-up deletion case (though the + * tableam does not access whole table blocks speculatively here). The table + * blocks are blocks that the tableam is expected to visit anyway, so there is + * no reason not to be open to the possibility of finding extra deletable + * entries. This works by having the !bottomup caller include deltids that + * are initially marked deleteitup = false (extra entries that go along with + * the usual known dead-to-all entries). It may be possible for the caller to + * ultimately delete these extra TIDs, but if it isn't then it's no great + * loss. + * + * The convention is that the index AM caller takes extra TIDs whose block + * number happens to match that of any single included deltid that is already + * known dead-to-all. This makes it cheap to check in passing, at least with + * heap-style tableams.
(A tableam that wants to opt out can simply ignore + * entries marked deleteitup = false. In general the tableam is entitled to + * do nothing at all with any deleteitup = false deltid, based on + * tableam-local performance considerations.) + * + * The index AM can keep track of which index tuple relates to which deltid by + * setting idxoffnum (and/or relying on each entry being uniquely identifiable + * using tid). That's how callers that care about the final deltids array + * (both bottom-up callers and regular deletion callers that include "extra" + * deletions) can relocate the required deltid for each index tuple. Such a + * scheme is necessary because a table_compute_delete_for_tuples() + * implementation can change the sort order of deltids, and can even reduce + * the number of deltids by modifying ndeltids. Bottom-up callers may even + * find that ndeltids is set to 0 for them (which means that they cannot + * proceed with any deletions). + * + * Bottom-up deletion, index tuple space savings, and promising tuples: + * + * The index AM requests a target amount of free space by setting + * bottomupfreespace. It must also specify the amount of space freed by each + * deltid by setting tupsize (this includes line pointer overhead). This + * information enables intelligent management of costs within the tableam. + * The tableam drives the progress of bottom-up index deletion, and ramps up + * as needed. All !bottomup callers set these to zero, since there isn't any + * question about which table blocks will be visited. + * + * The index AM also gives the tableam strong hints about where to look by + * marking some entries as "promising". The index AM does this with duplicate + * index tuples that are strongly suspected to be old versions left behind by + * UPDATEs that did not logically change any indexed values. The index AM may + * find it helpful to only mark TIDs/entries as promising when they're thought + * to have been affected by such an UPDATE in the recent past. Again, this + * isn't useful to !bottomup callers. + */ +typedef struct TM_IndexDelete +{ + ItemPointerData tid; /* table TID from index tuple */ + int16 id; /* Offset into TM_IndexStatus array */ +} TM_IndexDelete; + +typedef struct TM_IndexStatus +{ + OffsetNumber idxoffnum; /* Index AM page offset number */ + int16 tupsize; /* Space freed in index if tuple deleted */ + bool ispromising; /* Duplicate in index? */ + bool deleteitup; /* Known dead-to-all? */ +} TM_IndexStatus; + +typedef struct TM_IndexDeleteOp +{ + bool bottomup; /* Bottom-up deletion/opportunistic?
*/ + int bottomupfreespace; /* Space target for tableam */ + + /* Mutable state follows */ + int ndeltids; /* Number of deltids/status for op */ + TM_IndexDelete *deltids; + TM_IndexStatus *status; +} TM_IndexDeleteOp; + /* "options" flag bits for table_tuple_insert */ /* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */ #define TABLE_INSERT_SKIP_FSM 0x0002 @@ -342,10 +459,9 @@ typedef struct TableAmRoutine TupleTableSlot *slot, Snapshot snapshot); - /* see table_compute_xid_horizon_for_tuples() */ - TransactionId (*compute_xid_horizon_for_tuples) (Relation rel, - ItemPointerData *items, - int nitems); + /* see table_compute_delete_for_tuples() */ + TransactionId (*compute_delete_for_tuples) (Relation rel, + TM_IndexDeleteOp *delstate); /* ------------------------------------------------------------------------ @@ -1122,16 +1238,21 @@ table_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot, } /* - * Compute the newest xid among the tuples pointed to by items. This is used - * to compute what snapshots to conflict with when replaying WAL records for - * page-level index vacuums. + * Compute which index tuples are safe to delete, and the newest xid among the + * tuples that caller finds it is able to delete. + * + * Sets deletable tuples in entries from caller's TM_IndexDeleteOp state that + * are found to point to dead-to-all tuples in the table. See the + * TM_IndexDeleteOp struct for full details. + * + * Returns a latestRemovedXid transaction ID that index AM must use to + * generate a recovery conflict when required. This is the newest xid among + * the tuples pointed to by deltids TIDs that caller can delete. */ static inline TransactionId -table_compute_xid_horizon_for_tuples(Relation rel, - ItemPointerData *items, - int nitems) +table_compute_delete_for_tuples(Relation rel, TM_IndexDeleteOp *delstate) { - return rel->rd_tableam->compute_xid_horizon_for_tuples(rel, items, nitems); + return rel->rd_tableam->compute_delete_for_tuples(rel, delstate); } diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 8ccc228a8c..95e29345de 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -168,6 +168,16 @@ static relopt_bool boolRelOpts[] = }, true }, + { + { + "delete_items", + "Enables \"bottom-up index deletion\" feature for this btree index", + RELOPT_KIND_BTREE, + ShareUpdateExclusiveLock /* since it applies only to later + * inserts */ + }, + true + }, /* list terminator */ {{NULL}} }; diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index a9583f3103..fa827d0ce4 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -55,6 +55,7 @@ #include "miscadmin.h" #include "pgstat.h" #include "port/atomics.h" +#include "port/pg_bitutils.h" #include "storage/bufmgr.h" #include "storage/freespace.h" #include "storage/lmgr.h" @@ -102,6 +103,8 @@ static void MultiXactIdWait(MultiXactId multi, MultiXactStatus status, uint16 in int *remaining); static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status, uint16 infomask, Relation rel, int *remaining); +static void heap_delete_sort(TM_IndexDeleteOp *delstate); +static int heap_delete_bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate); static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup); static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_changed, bool *copy); @@ -166,18 +169,33 @@ static const struct #ifdef 
USE_PREFETCH /* - * heap_compute_xid_horizon_for_tuples and xid_horizon_prefetch_buffer use - * this structure to coordinate prefetching activity. + * heap_compute_delete_for_tuples and compute_delete_prefetch_buffer use this + * structure to coordinate prefetching activity */ typedef struct { BlockNumber cur_hblkno; int next_item; - int nitems; - ItemPointerData *tids; -} XidHorizonPrefetchState; + int ndeltids; + TM_IndexDelete *deltids; +} TidPrefetchState; #endif +/* Bottom-up index deletion limits */ +#define BOTTOMUP_FAVORABLE_STRIDE 3 +#define BOTTOMUP_MAX_HEAP_BLOCKS 6 + +/* + * heap_compute_delete_for_tuples uses this structure to determine which heap + * pages to visit, and in what order for bottom-up index deletion check + */ +typedef struct IndexDeleteCounts +{ + int16 npromisingtids; + int16 ntids; + int16 ideltids; +} IndexDeleteCounts; + /* * This table maps tuple lock strength values for each particular * MultiXactStatus value. @@ -6936,28 +6954,32 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple, #ifdef USE_PREFETCH /* - * Helper function for heap_compute_xid_horizon_for_tuples. Issue prefetch + * Helper function for heap_compute_delete_for_tuples. Issue prefetch * requests for the number of buffers indicated by prefetch_count. The * prefetch_state keeps track of all the buffers that we can prefetch and * which ones have already been prefetched; each call to this function picks * up where the previous call left off. + * + * Note: we expect the deltids array to be sorted in an order that groups TIDs + * by heap block, with all TIDs for each block appearing together in exactly + * one group. */ static void -xid_horizon_prefetch_buffer(Relation rel, - XidHorizonPrefetchState *prefetch_state, - int prefetch_count) +compute_delete_prefetch_buffer(Relation rel, + TidPrefetchState *prefetch_state, + int prefetch_count) { BlockNumber cur_hblkno = prefetch_state->cur_hblkno; int count = 0; int i; - int nitems = prefetch_state->nitems; - ItemPointerData *tids = prefetch_state->tids; + int ndeltids = prefetch_state->ndeltids; + TM_IndexDelete *deltids = prefetch_state->deltids; for (i = prefetch_state->next_item; - i < nitems && count < prefetch_count; + i < ndeltids && count < prefetch_count; i++) { - ItemPointer htid = &tids[i]; + ItemPointer htid = &deltids[i].tid; if (cur_hblkno == InvalidBlockNumber || ItemPointerGetBlockNumber(htid) != cur_hblkno) @@ -6978,49 +7000,68 @@ xid_horizon_prefetch_buffer(Relation rel, #endif /* - * Get the latestRemovedXid from the heap pages pointed at by the index - * tuples being deleted. + * Determine which heap tuples from a list of TIDs provided by index AM caller + * are dead. It is safe to delete index tuples that point to these dead heap + * tuples. Callers can mark deltids entries as deletable, or leave it to us + * to determine if they're deletable (though bottom-up deletion callers are + * not allowed to mark deltids entries as already dead-to-all). * - * We used to do this during recovery rather than on the primary, but that - * approach now appears inferior. It meant that the primary could generate - * a lot of work for the standby without any back-pressure to slow down the - * primary, and it required the standby to have reached consistency, whereas - * we want to have correct information available even before that point. + * Bottom-up index deletion callers ask us to perform the same steps, but are + * much more uncertain about the likelihood of success. 
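For orientation, here is a hedged sketch of how an index AM caller that cares about the final array consumes the results once this function returns (delstate is assumed to have been filled in beforehand; the patch's _bt_delitems_delete_check() is the real consumer, and this fragment is illustrative only):

    OffsetNumber deletable[MaxIndexTuplesPerPage];
    int          ndeletable = 0;

    for (int i = 0; i < delstate.ndeltids; i++)
    {
        TM_IndexStatus *dstatus = delstate.status + delstate.deltids[i].id;

        /* Only entries the tableam marked deleteitup = true may be deleted */
        if (dstatus->deleteitup)
            deletable[ndeletable++] = dstatus->idxoffnum;
    }
    /*
     * A real caller must also merge entries that share an idxoffnum (several
     * posting list TIDs can map to one index tuple), and may only delete a
     * posting list tuple outright when every one of its TIDs is deletable.
     */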
We'll have to keep + * the costs and benefits in balance for these callers to manage this + * uncertainty. Many heap blocks that are pointed to by deltids entries will + * never be visited on each bottom-up call here. * - * It's possible for this to generate a fair amount of I/O, since we may be - * deleting hundreds of tuples from a single index block. To amortize that - * cost to some degree, this uses prefetching and combines repeat accesses to - * the same block. + * Returns the latestRemovedXid from the heap tuples pointed to by index + * tuples whose deltids entries are marked safe to delete. */ TransactionId -heap_compute_xid_horizon_for_tuples(Relation rel, - ItemPointerData *tids, - int nitems) +heap_compute_delete_for_tuples(Relation rel, TM_IndexDeleteOp *delstate) { TransactionId latestRemovedXid = InvalidTransactionId; BlockNumber hblkno; Buffer buf = InvalidBuffer; Page hpage; #ifdef USE_PREFETCH - XidHorizonPrefetchState prefetch_state; + TidPrefetchState prefetch_state; int prefetch_distance; #endif + SnapshotData SnapshotNonVacuumable; + TM_IndexDelete *deltids = delstate->deltids; + TM_IndexStatus *status = delstate->status; + int bottomupfreespace = delstate->bottomupfreespace; + int finalndeltids = 0, + nblocksaccessed = 0; + /* Bottom-up deletion state */ + int nblocksfavorable = 0, + spacefreed = 0, + spacefreedlast = 0; + bool bottomup_final_hpage = false; + + InitNonVacuumableSnapshot(SnapshotNonVacuumable, GlobalVisTestFor(rel)); + + /* Sort caller's deltids array by TID for further processing */ + heap_delete_sort(delstate); /* - * Sort to avoid repeated lookups for the same page, and to make it more - * likely to access items in an efficient order. In particular, this - * ensures that if there are multiple pointers to the same page, they all - * get processed looking up and locking the page just once. + * Bottom-up case: Resort deltids array in an order attuned to where the + * greatest number of promising TIDs are to be found, and determine how + * many blocks from the start of sorted array should be considered + * favorable. + * + * Note: This will usually shrink deltids array, capping the number of + * heap blocks accessed to BOTTOMUP_MAX_HEAP_BLOCKS. This helps to avoid + * unnecessary bottom-up case prefetching. */ - qsort((void *) tids, nitems, sizeof(ItemPointerData), - (int (*) (const void *, const void *)) ItemPointerCompare); + if (delstate->bottomup) + nblocksfavorable = heap_delete_bottomup_sort_and_shrink(delstate); #ifdef USE_PREFETCH /* Initialize prefetch state. */ prefetch_state.cur_hblkno = InvalidBlockNumber; prefetch_state.next_item = 0; - prefetch_state.nitems = nitems; - prefetch_state.tids = tids; + prefetch_state.ndeltids = delstate->ndeltids; + prefetch_state.deltids = deltids; /* * Compute the prefetch distance that we will attempt to maintain. @@ -7035,26 +7076,93 @@ heap_compute_xid_horizon_for_tuples(Relation rel, prefetch_distance = get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace); + /* Cap initial prefetch distance for bottom-up deletion caller */ + if (delstate->bottomup) + { + Assert(nblocksfavorable >= 1); + prefetch_distance = Min(prefetch_distance, nblocksfavorable); + } + /* Start prefetching. 
*/ - xid_horizon_prefetch_buffer(rel, &prefetch_state, prefetch_distance); + compute_delete_prefetch_buffer(rel, &prefetch_state, prefetch_distance); #endif - /* Iterate over all tids, and check their horizon */ + /* Iterate over deltids, determine which to delete, check their horizon */ hblkno = InvalidBlockNumber; hpage = NULL; - for (int i = 0; i < nitems; i++) + Assert(delstate->ndeltids > 0); + for (int i = 0; i < delstate->ndeltids; i++) { - ItemPointer htid = &tids[i]; + ItemPointer htid = &deltids[i].tid; + TM_IndexStatus *dstatus = status + deltids[i].id; ItemId hitemid; OffsetNumber hoffnum; /* - * Read heap buffer, but avoid refetching if it's the same block as - * required for the last tid. + * Read heap buffer, and perform required extra steps each time a new + * heap block is encountered. Avoid refetching if it's the same heap + * block as the one from the last htid. */ if (hblkno == InvalidBlockNumber || ItemPointerGetBlockNumber(htid) != hblkno) { + /* + * Consider giving up early for bottom-up index deletion caller + * first. (Only prefetch next-next heap block afterwards, when it + * becomes clear that we're at least going to access the next heap + * block in line.) + * + * Sometimes the first heap block frees so much space for + * bottom-up caller that the deletion process can end without + * accessing any more heap blocks. It is usually necessary to + * access 2 or 3 blocks per bottom-up deletion operation, though. + */ + if (delstate->bottomup) + { + /* + * We usually do a little "extra" work on the final heap page + * after caller's target free space is reached. We finish off + * the entire final heap page because it's cheap to do so. + * Check if it's time to stop now. + */ + if (bottomup_final_hpage) + break; + + /* + * Give up when we didn't enable our caller to free any + * additional space as a result of processing the heap page + * that we just finished with. Insist on making steady + * progress in all cases. Being patient is probably + * unhelpful. + */ + if (nblocksaccessed >= 1 && spacefreed == spacefreedlast) + break; + spacefreedlast = spacefreed; /* For next time */ + + /* + * We will access next heap page in line. Decay the target + * free space from bottom-up deletion caller here, so that we + * don't behave too aggressively when that's unlikely to be of + * much use. + * + * The number of favorable blocks (physically contiguous + * blocks) is tested as a gating condition that delays when we + * first apply a decay. They allow bottom-up deletion to hang + * on for a little longer when heap blocks only allow index AM + * caller to free relatively small (though still non-zero) + * amounts of free space. + * + * Handle steps for that now: decrement the number of + * favorable blocks (if any remain), or else decay target + * space (if we're out of favorable blocks). + */ + Assert(nblocksaccessed > 0 || nblocksfavorable > 0); + if (nblocksfavorable > 0) + nblocksfavorable--; + else + bottomupfreespace /= 2; + } + /* release old buffer */ if (BufferIsValid(buf)) { @@ -7065,6 +7173,9 @@ heap_compute_xid_horizon_for_tuples(Relation rel, hblkno = ItemPointerGetBlockNumber(htid); buf = ReadBuffer(rel, hblkno); + nblocksaccessed++; + Assert(!delstate->bottomup || + nblocksaccessed <= BOTTOMUP_MAX_HEAP_BLOCKS); #ifdef USE_PREFETCH @@ -7072,7 +7183,7 @@ heap_compute_xid_horizon_for_tuples(Relation rel, * To maintain the prefetch distance, prefetch one more page for * each page we read. 
*/ - xid_horizon_prefetch_buffer(rel, &prefetch_state, 1); + compute_delete_prefetch_buffer(rel, &prefetch_state, 1); #endif hpage = BufferGetPage(buf); @@ -7080,6 +7191,39 @@ heap_compute_xid_horizon_for_tuples(Relation rel, LockBuffer(buf, BUFFER_LOCK_SHARE); } + if (!dstatus->deleteitup) + { + ItemPointerData tmp; + bool all_dead, + found; + HeapTupleData heapTuple; + + tmp = *htid; /* Don't modify htid */ + all_dead = false; /* Check that whole HOT chain is vacuumable */ + found = heap_hot_search_buffer(&tmp, rel, buf, + &SnapshotNonVacuumable, &heapTuple, + &all_dead, true); + + if (found || !all_dead) + continue; + } + else + Assert(!delstate->bottomup); + + /* Caller can delete this TID from index */ + finalndeltids = i + 1; + dstatus->deleteitup = true; + spacefreed += dstatus->tupsize; + + if (delstate->bottomup && spacefreed >= bottomupfreespace) + { + /* + * Bottom-up deletion caller's free space target (or a decayed + * value based on caller's value) has been reached + */ + bottomup_final_hpage = true; + } + hoffnum = ItemPointerGetOffsetNumber(htid); hitemid = PageGetItemId(hpage, hoffnum); @@ -7126,6 +7270,9 @@ heap_compute_xid_horizon_for_tuples(Relation rel, ReleaseBuffer(buf); } + /* Make final array only include known-dead items */ + delstate->ndeltids = finalndeltids; + /* * If all heap tuples were LP_DEAD then we will be returning * InvalidTransactionId here, which avoids conflicts. This matches @@ -7137,6 +7284,319 @@ heap_compute_xid_horizon_for_tuples(Relation rel, return latestRemovedXid; } +/* + * Specialized inlineable comparison function for heap_delete_sort() + */ +static inline int +heap_delete_sort_cmp(TM_IndexDelete *deltid1, TM_IndexDelete *deltid2) +{ + ItemPointer tid1 = &deltid1->tid; + ItemPointer tid2 = &deltid2->tid; + + { + BlockNumber blk1 = ItemPointerGetBlockNumber(tid1); + BlockNumber blk2 = ItemPointerGetBlockNumber(tid2); + + if (blk1 != blk2) + return (blk1 < blk2) ? -1 : 1; + } + { + OffsetNumber pos1 = ItemPointerGetOffsetNumber(tid1); + OffsetNumber pos2 = ItemPointerGetOffsetNumber(tid2); + + if (pos1 != pos2) + return (pos1 < pos2) ? -1 : 1; + } + + pg_unreachable(); + + return 0; +} + +/* + * Sort deltids array from delstate by TID. This prepares it for further + * processing. + * + * This operation becomes a noticeable consumer of CPU cycles with some + * workloads. This is especially likely with bottom-up index deletion heavy + * workloads, especially when B-Tree deduplication is also used and we might + * well have over a thousand TIDs/deltids (even with default BLCKSZ). This + * justifies a specialized sort routine. + * + * We use shellsort because it's easy to specialize, compiles to relatively + * few instructions, and is adaptive to presorted inputs/subsets (which are + * typical here). The TM_IndexDelete struct is only 8 bytes, so swap + * operations are expected to be cheap here. + */ +static void +heap_delete_sort(TM_IndexDeleteOp *delstate) +{ + TM_IndexDelete *deltids = delstate->deltids; + int ndeltids = delstate->ndeltids; + int low = 0; + + /* + * Shellsort gap sequence (taken from Sedgewick-Incerpi paper). + * + * This implementation is fast with array sizes up to ~4500. This covers + * all supported BLCKSZ values. 
+ */ + const int gaps[9] = {1968, 861, 336, 112, 48, 21, 7, 3, 1}; + + /* Think carefully before changing anything here */ + StaticAssertStmt(sizeof(TM_IndexDelete) <= 8, + "element size exceeds 8 bytes"); + + for (int g = 0; g < lengthof(gaps); g++) + { + for (int hi = gaps[g], i = low + hi; i < ndeltids; i++) + { + TM_IndexDelete d = deltids[i]; + int j = i; + + while (j >= hi && heap_delete_sort_cmp(&deltids[j - hi], &d) >= 0) + { + deltids[j] = deltids[j - hi]; + j -= hi; + } + deltids[j] = d; + } + } +} + +/* + * Determine how many favorable blocks are among blocks we'll access (which + * should be in their final bottom-up deletion order when we're called). + * + * Favorable blocks are contiguous heap blocks, which are likely to have + * relatively many dead items. These blocks are cheaper to access together + * all at once. Having many favorable blocks is common with low cardinality + * index tuples, where heap locality will have a relatively large influence on + * which heap blocks we visit (and the order they're processed in). + * + * Being more aggressive with favorable blocks is slightly more expensive in + * the short term, but less expensive in the long term, when there are many + * closely related calls to heap_compute_delete_for_tuples(). + * + * Returns number of favorable blocks, starting from (and including) the first + * block in line for processing. See heap_compute_delete_for_tuples() for + * details on how the value is applied. + */ +static int +top_block_groups_favorable(IndexDeleteCounts *blockcounts, int nblockgroups, + TM_IndexDelete *deltids) +{ + int nblocksfavorable = 0; + BlockNumber lastblock = InvalidBlockNumber; + + for (int b = 0; b < nblockgroups; b++) + { + IndexDeleteCounts *blockgroup = blockcounts + b; + TM_IndexDelete *firstgroup = deltids + blockgroup->ideltids; + BlockNumber thisblock = ItemPointerGetBlockNumber(&firstgroup->tid); + + if (BlockNumberIsValid(lastblock) && + (thisblock < lastblock || + thisblock > lastblock + BOTTOMUP_FAVORABLE_STRIDE)) + break; + + nblocksfavorable++; + lastblock = Min(thisblock, + MaxBlockNumber - BOTTOMUP_FAVORABLE_STRIDE); + } + + /* + * We always indicate that there is at least 1 favorable block (the first + * in line to process). The first block must always be in sorted order + * because the ordering is relative to the first (or previous) block. + * (heap_compute_delete_for_tuples() is okay with this degenerate case + * because it is supposed to always visit the first heap page in line.) + */ + Assert(nblocksfavorable >= 1); + + return nblocksfavorable; +} + +/* + * qsort comparison function for heap_delete_bottomup_sort_and_shrink() + */ +static int +heap_delete_bottomup_sort_and_shrink_cmp(const void *arg1, const void *arg2) +{ + const IndexDeleteCounts *count1 = (const IndexDeleteCounts *) arg1; + const IndexDeleteCounts *count2 = (const IndexDeleteCounts *) arg2; + + /* Caller normalizes non-zero npromisingtids values into powers-of-two */ + Assert(count1->npromisingtids == 0 || + ((count1->npromisingtids - 1) & count1->npromisingtids) == 0); + Assert(count2->npromisingtids == 0 || + ((count2->npromisingtids - 1) & count2->npromisingtids) == 0); + + /* + * Most significant field is npromisingtids (which we invert the order of + * so as to sort in desc order) + */ + if (count1->npromisingtids > count2->npromisingtids) + return -1; + if (count1->npromisingtids < count2->npromisingtids) + return 1; + + /* + * Tiebreak: desc ntids sort order. + * + * We cannot expect power-of-two values for ntids fields. 
We should + * behave as if they were already rounded up for us instead. + */ + if (count1->ntids != count2->ntids) + { + uint32 ntids1 = pg_nextpower2_32((uint32) count1->ntids); + uint32 ntids2 = pg_nextpower2_32((uint32) count2->ntids); + + if (ntids1 > ntids2) + return -1; + if (ntids1 < ntids2) + return 1; + } + + /* + * Tiebreak: asc offset-into-deltids-for-block (offset to first TID for + * block in deltids array) order. + * + * This is equivalent to sorting in ascending heap block number order + * (among otherwise equal subsets of the array). This approach allows us + * to avoid accessing the out-of-line TID. (We rely on the assumption + * that the deltids array was sorted in ascending heap TID order when + * these offsets to the first TID from each heap block group were formed.) + */ + if (count1->ideltids > count2->ideltids) + return 1; + if (count1->ideltids < count2->ideltids) + return -1; + + pg_unreachable(); + + return 0; +} + +/* + * heap_compute_delete_for_tuples() helper function for bottom-up deletion + * callers. + * + * Sorts deltids array in the order needed for useful processing by bottom-up + * deletion. The array should already be sorted in TID order when we're + * called. The sort process groups heap TIDs from deltids into heap block + * number groupings. Earlier/more-promising groups/blocks are those that are + * known to have the most "promising" TIDs. + * + * Sets new size of deltids array (ndeltids) in state. deltids will only have + * TIDs from the BOTTOMUP_MAX_HEAP_BLOCKS most promising heap blocks when we + * return (which is usually far fewer than what we started with). + * + * Returns number of "favorable" blocks. + */ +static int +heap_delete_bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate) +{ + IndexDeleteCounts *blockcounts; + TM_IndexDelete *reordereddeltids; + BlockNumber curblock = InvalidBlockNumber; + int nblockgroups = 0; + int ncopied = 0; + int nblocksfavorable = 0; + + Assert(delstate->bottomup); + Assert(delstate->ndeltids > 0); + + /* Calculate per-heap-block count of TIDs */ + blockcounts = palloc(sizeof(IndexDeleteCounts) * delstate->ndeltids); + for (int i = 0; i < delstate->ndeltids; i++) + { + ItemPointer deltid = &delstate->deltids[i].tid; + TM_IndexStatus *dstatus = delstate->status + delstate->deltids[i].id; + bool ispromising = dstatus->ispromising; + + if (curblock != ItemPointerGetBlockNumber(deltid)) + { + /* New block group */ + nblockgroups++; + + Assert(curblock < ItemPointerGetBlockNumber(deltid) || + !BlockNumberIsValid(curblock)); + + curblock = ItemPointerGetBlockNumber(deltid); + blockcounts[nblockgroups - 1].ideltids = i; + blockcounts[nblockgroups - 1].ntids = 1; + blockcounts[nblockgroups - 1].npromisingtids = 0; + } + else + { + blockcounts[nblockgroups - 1].ntids++; + } + + if (ispromising) + blockcounts[nblockgroups - 1].npromisingtids++; + } + + /* + * We're about ready to sort block groups to determine the optimal order + * for visiting heap pages. But before we do, round the number of + * promising tuples for each block group up to the nearest power-of-two + * (unless there are zero promising tuples). + * + * This scheme usefully divides heap pages into buckets. Each bucket + * contains heap pages that are approximately equally promising, that we + * want to treat as exactly equivalent (at least initially). 
We should + * not let the most promising heap pages win or lose (get accessed or not + * accessed by bottom-up deletion) on the basis of _relatively_ small + * differences in the total number of promising tuples. + * + * Note that we effectively have the same power-of-two bucketing scheme + * with the ntids field (which is compared after npromisingtids). The + * only reason that we don't fix nhtids here is that the original values + * will be needed when copying the final TIDs from winning block groups + * back into caller's deltids array. + */ + for (int b = 0; b < nblockgroups; b++) + { + IndexDeleteCounts *blockgroup = blockcounts + b; + + if (blockgroup->npromisingtids != 0) + blockgroup->npromisingtids = + pg_nextpower2_32((uint32) blockgroup->npromisingtids); + } + + /* Sort groups and rearrange caller's deltids array */ + qsort(blockcounts, nblockgroups, sizeof(IndexDeleteCounts), + heap_delete_bottomup_sort_and_shrink_cmp); + reordereddeltids = palloc(delstate->ndeltids * sizeof(TM_IndexDelete)); + + nblockgroups = Min(BOTTOMUP_MAX_HEAP_BLOCKS, nblockgroups); + /* Determine number of favorable blocks at the start of array */ + nblocksfavorable = top_block_groups_favorable(blockcounts, nblockgroups, + delstate->deltids); + + for (int b = 0; b < nblockgroups; b++) + { + IndexDeleteCounts *blockgroup = blockcounts + b; + TM_IndexDelete *firstgroup = delstate->deltids + blockgroup->ideltids; + + memcpy(reordereddeltids + ncopied, firstgroup, + sizeof(TM_IndexDelete) * blockgroup->ntids); + ncopied += blockgroup->ntids; + } + + /* Copy final grouped and sorted TIDs back into start of caller's array */ + memcpy(delstate->deltids, reordereddeltids, + sizeof(TM_IndexDelete) * ncopied); + delstate->ndeltids = ncopied; + + /* be tidy */ + pfree(reordereddeltids); + pfree(blockcounts); + + return nblocksfavorable; +} + /* * Perform XLogInsert to register a heap cleanup info message. These * messages are sent once per VACUUM and are required because diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index c6f438de72..37c037b820 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -2563,7 +2563,7 @@ static const TableAmRoutine heapam_methods = { .tuple_get_latest_tid = heap_get_latest_tid, .tuple_tid_valid = heapam_tuple_tid_valid, .tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot, - .compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples, + .compute_delete_for_tuples = heap_compute_delete_for_tuples, .relation_set_new_filenode = heapam_relation_set_new_filenode, .relation_nontransactional_truncate = heapam_relation_nontransactional_truncate, diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index e3164e674a..fe4695563a 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -278,9 +278,16 @@ BuildIndexValueDescription(Relation indexRelation, * Get the latestRemovedXid from the table entries pointed at by the index * tuples being deleted. * - * Note: index access methods that don't consistently use the standard - * IndexTuple + heap TID item pointer representation will need to provide - * their own version of this function. + * This is a table_compute_delete_for_tuples() shim for index access methods + * that have simple needs. If there isn't a natural opportunity to ask + * tableam about "extra" TIDs, simply call here with dead-to-all tuple offset + * numbers. 
+ * + * It's okay to skip calling here with indexes/tables that don't need a + * latestRemovedXid value. + * + * Note: We assume that index access method uses standard IndexTuple + table + * TID item pointer representation. */ TransactionId index_compute_xid_horizon_for_tuples(Relation irel, @@ -289,12 +296,17 @@ index_compute_xid_horizon_for_tuples(Relation irel, OffsetNumber *itemnos, int nitems) { - ItemPointerData *ttids = - (ItemPointerData *) palloc(sizeof(ItemPointerData) * nitems); + TM_IndexDeleteOp delstate; TransactionId latestRemovedXid = InvalidTransactionId; Page ipage = BufferGetPage(ibuf); IndexTuple itup; + delstate.bottomup = false; /* Visit all LP_DEAD-related blocks */ + delstate.bottomupfreespace = 0; /* Visiting all table blocks anyway */ + delstate.ndeltids = 0; + delstate.deltids = palloc(nitems * sizeof(TM_IndexDelete)); + delstate.status = palloc(nitems * sizeof(TM_IndexStatus)); + /* identify what the index tuples about to be deleted point to */ for (int i = 0; i < nitems; i++) { @@ -303,14 +315,25 @@ index_compute_xid_horizon_for_tuples(Relation irel, iitemid = PageGetItemId(ipage, itemnos[i]); itup = (IndexTuple) PageGetItem(ipage, iitemid); - ItemPointerCopy(&itup->t_tid, &ttids[i]); + Assert(ItemIdIsDead(iitemid)); + + ItemPointerCopy(&itup->t_tid, &delstate.deltids[i].tid); + delstate.deltids[i].id = i; + delstate.status[i].idxoffnum = InvalidOffsetNumber; + delstate.status[i].tupsize = 0; /* irrelevant */ + delstate.status[i].ispromising = false; /* irrelevant */ + delstate.status[i].deleteitup = true; /* LP_DEAD-marked */ + delstate.ndeltids++; } /* determine the actual xid horizon */ - latestRemovedXid = - table_compute_xid_horizon_for_tuples(hrel, ttids, nitems); + latestRemovedXid = table_compute_delete_for_tuples(hrel, &delstate); - pfree(ttids); + /* assert tableam agrees that all items are deletable */ + Assert(delstate.ndeltids == nitems); + + pfree(delstate.deltids); + pfree(delstate.status); return latestRemovedXid; } diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 27f555177e..ebe4408378 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -419,8 +419,8 @@ without a backend's cached page also being detected as invalidated, but only when we happen to recycle a block that once again gets recycled as the rightmost leaf page. -On-the-Fly Deletion Of Index Tuples ------------------------------------ +On-the-Fly deletion of LP_DEAD-bit-set index tuples +--------------------------------------------------- If a process visits a heap tuple and finds that it's dead and removable (ie, dead to all open transactions, not only that process), then we can @@ -439,19 +439,26 @@ from the index immediately; since index scans only stop "between" pages, no scan can lose its place from such a deletion. We separate the steps because we allow LP_DEAD to be set with only a share lock (it's exactly like a hint bit for a heap tuple), but physically removing tuples requires -exclusive lock. In the current code we try to remove LP_DEAD tuples when -we are otherwise faced with having to split a page to do an insertion (and -hence have exclusive lock on it already). Deduplication can also prevent -a page split, but removing LP_DEAD tuples is the preferred approach. -(Note that posting list tuples can only have their LP_DEAD bit set when -every table TID within the posting list is known dead.) +exclusive lock. 
Also, delaying the deletion often allows us to pick up +extra index tuples that weren't initially safe for index scans to mark +LP_DEAD. Live index tuples that are close to LP_DEAD-marked tuples in +time and space are usually highly likely to become dead-to-all shortly. +This makes workloads that greatly benefit from the LP_DEAD optimization +resilient against intermittent disruption from long running transactions +that hold open an MVCC snapshot (compared to the behavior prior to +PostgreSQL 14, the version that taught the LP_DEAD deletion process to +check if nearby index tuples are safe to delete in passing). -This leaves the index in a state where it has no entry for a dead tuple -that still exists in the heap. This is not a problem for the current -implementation of VACUUM, but it could be a problem for anything that -explicitly tries to find index entries for dead tuples. (However, the -same situation is created by REINDEX, since it doesn't enter dead -tuples into the index.) +We only try to delete LP_DEAD tuples (and nearby tuples) when we are +otherwise faced with having to split a page to do an insertion (and hence +have exclusive lock on it already). Deduplication and bottom-up index +deletion can also prevent a page split, but removing LP_DEAD tuples is +always the preferred approach. (Note that posting list tuples can only +have their LP_DEAD bit set when every table TID within the posting list is +known dead. This isn't much of a problem because LP_DEAD deletion can +often still do granular deletion of TIDs from a posting list. This will +happen when the posting list tuple's TIDs point to a table block that some +LP_DEAD-marked index tuple happens to point to.) It's sufficient to have an exclusive lock on the index page, not a super-exclusive lock, to do deletion of LP_DEAD items. It might seem @@ -469,6 +476,87 @@ LSN of the page, and only act to set LP_DEAD bits when the LSN has not changed at all. (Avoiding dropping the pin entirely also makes it safe, of course.) +Bottom-Up deletion +------------------ + +We attempt to delete whatever duplicates happen to be present on the page +when the duplicates are suspected to be caused by version churn from +successive UPDATEs. This only happens when we receive an executor hint +indicating that optimizations like heapam's HOT have not worked out for +the index -- the incoming tuple must be a logically unchanged duplicate +which is needed for MVCC purposes, suggesting that that might well be the +dominant source of new index tuples on the leaf page in question. (Also, +bottom-up deletion is triggered within unique indexes in cases with +continual INSERT and DELETE related churn, since that is easy to detect +without any external hint.) + +On-the-fly deletion of LP_DEAD-bit-set items (which can include deletion +of other close by index tuples) will already have failed to prevent a page +split when a bottom-up deletion pass takes place (often because no LP_DEAD +bits were ever set on the page). The two mechanisms have closely related +implementations. The same WAL records are used for each operation, and +the same tableam infrastructure is used to determine what TIDs/tuples are +actually safe to delete. The implementations only differ in how they pick +TIDs to consider for deletion, and whether or not the tableam will give up +before accessing all table blocks (bottom-up deletion lives with the +uncertainty of its success by keeping the cost of failure low). Even +still, the two mechanisms are clearly distinct at the conceptual level. 
+ +Bottom-up index deletion is driven entirely by heuristics (whereas +on-the-fly deletion is guaranteed to delete at least those index tuples +that are already LP_DEAD marked). We have no certainty that we'll find +even one index tuple to delete. That's why we access as few tableam +blocks as possible, and only commit to accessing the next table block in +line when a positive outcome for the operation as a whole still looks +likely. This means that the tableam needs to have a fairly good idea of +how much space it has freed on the leaf page, to keep the costs and +benefits in balance per operation (and even across successive operations +affecting the same leaf page). + +Bottom-up index deletion can be thought of as a backstop mechanism against +unnecessary version-driven page splits. It is based in part on an idea +from generational garbage collection: the "generational hypothesis". This +is the empirical observation that "most objects die young". Within +nbtree, new index tuples often quickly appear in the same place, and then +quickly become garbage. There can be intense concentrations of garbage in +relatively few leaf pages (or there would be without the intervention of +bottom-up deletion). This occurs with workloads that consist of skewed +UPDATEs. There is little to lose and much to gain by spending a few +cycles to become reasonably sure that a page split is truly necessary +(when it seems like there is some chance of that) -- page splits are +expensive, and practically irreversible. + +We expect to find a reasonably large number of tuples that are safe to +delete within each bottom-up pass. If we don't then we won't need to +consider the question of bottom-up deletion for the same leaf page for +quite a while (usually because the page splits, which resolves the +situation, at least for a while). We expect to perform regular bottom-up +deletion operations against pages that are at constant risk of unnecessary +page splits caused only by version churn. When the mechanism works well +we'll constantly be "on the verge" of having version-churn-driven page +splits, but never actually have even one. + +Our duplicate heuristics work well despite being fairly simple. +Unnecessary page splits only occur when there are truly pathological +levels of version churn (in theory a small amount of version churn could +make a page split occur earlier than strictly necessary, but that's pretty +harmless). We don't have to understand the underlying workload; we only +have to understand the general nature of the pathology that we target. +Version churn is easy to spot when it is truly pathological. Affected +leaf pages are homogeneous. + +If version churn hasn't become a real problem then we don't actually want +to do anything about it anyway (we should be lazy about cleaning it up, at +least). All that really matters is that garbage does not become +concentrated in any one part of the key space (the number of physical +versions accessed by queries to read any given logical row should remain +low over time and across all parts of the key space). Remaining garbage +tuples can be thought of as "floating garbage" that VACUUM will eventually +get around to removing (VACUUM can be thought of as a top-down mechanism +that bottom-up garbage collection complements). The absolute number of +garbage tuples (and even the proportion of all index tuples that are +garbage) is generally much less important. 
+ WAL Considerations ------------------ @@ -767,9 +855,10 @@ into a single physical tuple with a posting list (a simple array of heap TIDs with the standard item pointer format). Deduplication is always applied lazily, at the point where it would otherwise be necessary to perform a page split. It occurs only when LP_DEAD items have been -removed, as our last line of defense against splitting a leaf page. We -can set the LP_DEAD bit with posting list tuples, though only when all -TIDs are known dead. +removed, as our last line of defense against splitting a leaf page +(bottom-up index deletion may be attempted first, as our second last line +of defense). We can set the LP_DEAD bit with posting list tuples, though +only when all TIDs are known dead. Our lazy approach to deduplication allows the page space accounting used during page splits to have absolutely minimal special case logic for @@ -826,6 +915,16 @@ delay a split that is probably inevitable anyway. This allows us to avoid the overhead of attempting to deduplicate with unique indexes that always have few or no duplicates. +Note: Avoiding "unnecessary" page splits driven by version churn is also +the goal of bottom-up index deletion, which was added to PostgreSQL 14. +Bottom-up index deletion is now the preferred way to deal with this +problem (with all kinds of indexes, though especially with unique +indexes). Still, deduplication can sometimes augment bottom-up index +deletion. When deletion cannot free tuples (due to an old snapshot +holding up cleanup), falling back on deduplication provides additional +capacity. Delaying the page split by deduplicating can allow a future +bottom-up deletion pass of the same page to succeed. + Posting list splits ------------------- diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c index 9e535124c4..19dd012043 100644 --- a/src/backend/access/nbtree/nbtdedup.c +++ b/src/backend/access/nbtree/nbtdedup.c @@ -19,6 +19,8 @@ #include "miscadmin.h" #include "utils/rel.h" +static void _bt_bottomup_finish_pending(Page page, TM_IndexDeleteOp *delstate, + BTDedupState state); static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state, OffsetNumber minoff, IndexTuple newitem); static void _bt_singleval_fillfactor(Page page, BTDedupState state, @@ -267,6 +269,161 @@ _bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem, pfree(state); } +/* + * Perform bottom-up index deletion pass. + * + * See if duplicate index tuples are eligible to be deleted by accessing + * visibility information from the tableam. Give up if we have to access more + * than a few tableam blocks. Caller tries to avoid "unnecessary" page splits + * (splits driven only by version churn) by calling here when it looks like + * that's about to happen. It's normal for there to be a lot of calls here + * for pages that are constantly at risk of an unnecessary split. + * + * Each failure to delete a duplicate/promising tuple here is a kind of + * learning experience. It results in caller falling back on splitting the + * page (or on a deduplication pass), discouraging future calls back here for + * the same key space range covered by a failed page (or at least discouraging + * processing the original duplicates in case where caller falls back on a + * successful deduplication pass). We converge on the most effective strategy + * for each page in the index over time. 
+ * + * Returns true on success, in which case caller can assume page split will be + * avoided for a reasonable amount of time. Returns false when caller should + * deduplicate the page (if possible at all). + * + * Note: occasionally a true return value does not actually indicate that any + * items could be deleted. It might just indicate that caller should not go + * on to perform a deduplication pass. Caller is not expected to care about + * the difference. + * + * Note: Caller should have already deleted all existing items with their + * LP_DEAD bits set. + */ +bool +_bt_bottomup_pass(Relation rel, Buffer buf, Relation heapRel, Size newitemsz) +{ + OffsetNumber offnum, + minoff, + maxoff; + Page page = BufferGetPage(buf); + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); + BTDedupState state; + TM_IndexDeleteOp delstate; + bool neverdedup; + int nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); + + /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */ + newitemsz += sizeof(ItemIdData); + + /* Initialize deduplication state */ + state = (BTDedupState) palloc(sizeof(BTDedupStateData)); + state->deduplicate = true; + state->nmaxitems = 0; + state->maxpostingsize = BLCKSZ; /* "posting list size" not a concern */ + state->base = NULL; + state->baseoff = InvalidOffsetNumber; + state->basetupsize = 0; + state->htids = palloc(state->maxpostingsize); + state->nhtids = 0; + state->nitems = 0; + state->phystupsize = 0; + state->nintervals = 0; + + /* + * Initialize tableam state that describes bottom-up index deletion + * operation. + * + * We will ask tableam to free 1/16 of BLCKSZ. We don't usually expect to + * have to free much space each call here in order to avoid page splits. + * We don't want to be too aggressive since in general the tableam will + * have to access more table blocks when we ask for more free space. In + * general we try to be conservative about what we ask for (though not too + * conservative), while leaving it up to the tableam to ramp up the number + * of tableam blocks accessed when conditions in the table structure + * happen to favor it. + * + * We expect to end up back here again and again for any leaf page that is + * more or less constantly at risk of unnecessary page splits -- in fact + * that's what happens when bottom-up deletion really helps. We must + * avoid thrashing when this becomes very frequent at the level of an + * individual page. Our free space target helps with that. It balances + * the costs and benefits over time and across related bottom-up deletion + * passes. 
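+	 *
+	 * (Illustrative arithmetic, assuming the default BLCKSZ of 8192: the
+	 * target below works out to Max(8192 / 16, newitemsz), i.e. at least
+	 * 512 bytes.  A successful pass must therefore free space for a number
+	 * of typical-sized index tuples, not merely for the one incoming
+	 * tuple.)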
+	 */
+	delstate.bottomup = true;	/* Only visit promising table blocks */
+	delstate.bottomupfreespace = Max(BLCKSZ / 16, newitemsz);
+
+	/* Now mutable state */
+	delstate.ndeltids = 0;
+	delstate.deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete));
+	delstate.status = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexStatus));
+
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (offnum == minoff)
+		{
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/* Tuple is equal; just added its TIDs to pending interval */
+		}
+		else
+		{
+			/* Finalize interval -- move its TIDs to delete state */
+			_bt_bottomup_finish_pending(page, &delstate, state);
+
+			/* itup starts new pending interval */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+	}
+	/* Finalize final interval -- move its TIDs to delete state */
+	_bt_bottomup_finish_pending(page, &delstate, state);
+
+	/*
+	 * The tableam uses its own heuristics.  They can influence the table
+	 * blocks that it visits, especially when promising tuples are not
+	 * concentrated in just a few table blocks.  This is why we don't give up
+	 * now in the event of having few (or even zero) promising tuples for the
+	 * tableam.
+	 *
+	 * When there are no duplicates on the page at all we tell our caller not
+	 * to attempt deduplication (by "reporting success").  Having zero
+	 * duplicates/promising tuples should be rare, but when it happens we
+	 * might as well save a few cycles.
+	 */
+	neverdedup = false;
+	if (state->nintervals == 0)
+		neverdedup = true;
+
+	/* Done with dedup state */
+	pfree(state->htids);
+	pfree(state);
+
+	/* Confirm which TIDs are dead-to-all, then physically delete */
+	_bt_delitems_delete_check(rel, buf, heapRel, &delstate);
+
+	/* Done with deletion state */
+	pfree(delstate.deltids);
+	pfree(delstate.status);
+
+	if (neverdedup)
+		return true;
+
+	/* Don't dedup when we won't end up back here any time soon anyway */
+	return PageGetExactFreeSpace(page) >= Max(BLCKSZ / 24, newitemsz);
+}
+
 /*
  * Create a new pending posting list tuple based on caller's base tuple.
  *
@@ -452,6 +609,165 @@ _bt_dedup_finish_pending(Page newpage, BTDedupState state)
 	return spacesaving;
 }
 
+/*
+ * Finalize interval during bottom-up index deletion.
+ *
+ * Adds TIDs to delstate for later processing.  Also determines which TIDs are
+ * to be marked promising, based on heuristics.
+ */
+static void
+_bt_bottomup_finish_pending(Page page, TM_IndexDeleteOp *delstate,
+							BTDedupState state)
+{
+	bool		dupinterval = (state->nitems > 1);
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+	/*
+	 * All TIDs from all tuples are at least recorded in state.  Tuples are
+	 * marked promising when they're duplicates (i.e. when they appear in an
+	 * interval with more than one item, as when we expect to create a new
+	 * posting list tuple in the deduplication case).
+	 *
+	 * It's easy to see what this means in the plain non-pivot tuple case:
+	 * TIDs from duplicate plain tuples are promising.  Posting list tuples
+	 * are more subtle.
+	 * We ought to do something with posting list tuples, though plain
+	 * tuples tend to be more promising targets.  (Plain tuples are the most
+	 * likely to be dead/deletable because they suggest version churn, and
+	 * they allow us to free more space when we actually succeed.)
+	 */
+	for (int i = 0; i < state->nitems; i++)
+	{
+		OffsetNumber offnum = state->baseoff + i;
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+		TM_IndexDelete *cdeltid;
+		TM_IndexStatus *dstatus;
+
+		cdeltid = &delstate->deltids[delstate->ndeltids];
+		dstatus = &delstate->status[delstate->ndeltids];
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Easy case: A plain non-pivot tuple's TID */
+			cdeltid->tid = itup->t_tid;
+			cdeltid->id = delstate->ndeltids;
+			dstatus->idxoffnum = offnum;
+			dstatus->ispromising = dupinterval;
+			dstatus->deleteitup = false;	/* for now */
+			dstatus->tupsize =
+				ItemIdGetLength(itemid) + sizeof(ItemIdData);
+			delstate->ndeltids++;
+		}
+		else
+		{
+			/*
+			 * Harder case: A posting list tuple's TIDs (multiple TIDs).
+			 *
+			 * Only a single TID from a posting list tuple may be promising,
+			 * and only when it appears in a duplicate tuple (just like plain
+			 * tuple case).  In general there is a good chance that the
+			 * posting list tuple relates to multiple logical rows, rather
+			 * than multiple versions of just one logical row.  (It can only
+			 * be the latter case when a previous bottom-up deletion pass
+			 * failed, necessitating a deduplication pass, which isn't all
+			 * that common.)
+			 *
+			 * There is a pretty good chance that at least one of the logical
+			 * rows from the posting list was updated, and so had a successor
+			 * version (about as good a chance as in the regular tuple case,
+			 * at least).  We should at least try to follow the regular tuple
+			 * case while making the conservative assumption that there can
+			 * only be one affected logical row per posting list tuple.  We
+			 * do that by picking one TID when it appears to be from the
+			 * predominant tableam block in the posting list (if any one
+			 * tableam block predominates).  The approach we take is to
+			 * choose either the first or last TID in the posting list (if we
+			 * pick one at all).  We go with whichever one is on the same
+			 * tableam block as the middle TID (and only the first TID when
+			 * both the first and last TIDs relate to the same tableam block
+			 * -- we could easily be too aggressive here).
+			 *
+			 * If it turns out that there are multiple old versions of a
+			 * single logical table row, we still have a pretty good chance
+			 * of being able to delete them this way.  We don't want to give
+			 * too strong a signal to the tableam.  But we should always try
+			 * to give some useful hints.  Even cases with considerable
+			 * uncertainty can consistently avoid an unnecessary page split,
+			 * in part because the tableam will have tricks of its own for
+			 * figuring out where to look in marginal cases.
+ */ + int nitem = BTreeTupleGetNPosting(itup); + bool firstpromise = false; + bool lastpromise = false; + + Assert(_bt_posting_valid(itup)); + + if (dupinterval) + { + /* Figure out if there really should be promising TIDs */ + BlockNumber minblocklist, + midblocklist, + maxblocklist; + ItemPointer mintid, + midtid, + maxtid; + + mintid = BTreeTupleGetHeapTID(itup); + midtid = BTreeTupleGetPostingN(itup, nitem / 2); + maxtid = BTreeTupleGetMaxHeapTID(itup); + minblocklist = ItemPointerGetBlockNumber(mintid); + midblocklist = ItemPointerGetBlockNumber(midtid); + maxblocklist = ItemPointerGetBlockNumber(maxtid); + + firstpromise = (minblocklist == midblocklist); + lastpromise = (!firstpromise && midblocklist == maxblocklist); + } + + /* No more than one TID from itup can be promising */ + Assert(!(firstpromise && lastpromise)); + + for (int p = 0; p < nitem; p++) + { + ItemPointer htid = BTreeTupleGetPostingN(itup, p); + + cdeltid->tid = *htid; + cdeltid->id = delstate->ndeltids; + dstatus->idxoffnum = offnum; + dstatus->ispromising = false; + + if ((firstpromise && p == 0) || + (lastpromise && p == nitem - 1)) + dstatus->ispromising = true; + + dstatus->deleteitup = false; /* for now */ + dstatus->tupsize = sizeof(ItemPointerData) + 1; + delstate->ndeltids++; + + cdeltid++; + dstatus++; + } + } + } + + if (dupinterval) + { + /* + * Maintain interval state for consistency with true deduplication + * case + */ + state->intervals[state->nintervals].nitems = state->nitems; + state->nintervals++; + } + + /* Reset state for next interval */ + state->nhtids = 0; + state->nitems = 0; + state->phystupsize = 0; +} + /* * Determine if page non-pivot tuples (data items) are all duplicates of the * same value -- if they are, deduplication's "single value" strategy should @@ -622,8 +938,8 @@ _bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids) * Generate a replacement tuple by "updating" a posting list tuple so that it * no longer has TIDs that need to be deleted. * - * Used by VACUUM. Caller's vacposting argument points to the existing - * posting list tuple to be updated. + * Used by both VACUUM and index deletion. Caller's vacposting argument + * points to the existing posting list tuple to be updated. * * On return, caller's vacposting argument will point to final "updated" * tuple, which will be palloc()'d in caller's memory context. 
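
The promising-TID heuristic for posting list tuples (see
_bt_bottomup_finish_pending above) can be distilled into a few lines.  The
sketch below is illustrative only: the function name is invented, and the
real code works inside the dedup-interval machinery rather than as a
standalone helper.

    /*
     * Sketch: choose at most one "promising" TID from a posting list
     * tuple, preferring whichever end of the list shares its table block
     * with the middle TID.  Returns a posting list position, or -1 when
     * no single table block predominates.
     */
    static int
    pick_promising_posting_tid(IndexTuple itup)
    {
        int         nitem = BTreeTupleGetNPosting(itup);
        BlockNumber minblock,
                    midblock,
                    maxblock;

        Assert(BTreeTupleIsPosting(itup));

        minblock = ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(itup));
        midblock = ItemPointerGetBlockNumber(BTreeTupleGetPostingN(itup, nitem / 2));
        maxblock = ItemPointerGetBlockNumber(BTreeTupleGetMaxHeapTID(itup));

        if (minblock == midblock)
            return 0;           /* first TID is promising */
        if (midblock == maxblock)
            return nitem - 1;   /* last TID is promising */
        return -1;              /* no predominant table block */
    }

Note how the sketch can never report both ends as promising, which is what
the Assert in the real code verifies.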
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index dde43b1415..49911e8c4d 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -17,7 +17,6 @@ #include "access/nbtree.h" #include "access/nbtxlog.h" -#include "access/tableam.h" #include "access/transam.h" #include "access/xloginsert.h" #include "miscadmin.h" @@ -37,6 +36,7 @@ static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate, static OffsetNumber _bt_findinsertloc(Relation rel, BTInsertState insertstate, bool checkingunique, + bool indexUnchanged, BTStack stack, Relation heapRel); static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack); @@ -61,7 +61,13 @@ static inline bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup, static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, BTInsertState insertstate, bool lpdeadonly, bool checkingunique, - bool uniquedup); + bool uniquedup, bool indexUnchanged); +static void _bt_lpdead_pass(Relation rel, Buffer buffer, Relation heapRel, + OffsetNumber *deletable, int ndeletable, + OffsetNumber minoff, OffsetNumber maxoff); +static BlockNumber *_bt_lpdead_blocks(Page page, OffsetNumber *deletable, + int ndeletable, int *nblocks); +static int _bt_lpdead_blocks_cmp(const void *arg1, const void *arg2); /* * _bt_doinsert() -- Handle insertion of a single index tuple in the tree. @@ -75,6 +81,11 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, * For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and * don't actually insert. * + * indexUnchanged executor hint indicates if itup is from an + * UPDATE that didn't logically change the indexed value, but + * must nevertheless have a new entry to point to a successor + * version. + * * The result value is only significant for UNIQUE_CHECK_PARTIAL: * it must be true if the entry is known unique, else false. * (In the current implementation we'll also return true after a @@ -83,7 +94,8 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, */ bool _bt_doinsert(Relation rel, IndexTuple itup, - IndexUniqueCheck checkUnique, Relation heapRel) + IndexUniqueCheck checkUnique, bool indexUnchanged, + Relation heapRel) { bool is_unique = false; BTInsertStateData insertstate; @@ -238,7 +250,7 @@ search: * checkingunique. */ newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique, - stack, heapRel); + indexUnchanged, stack, heapRel); _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack, itup, insertstate.itemsz, newitemoff, insertstate.postingoff, false); @@ -777,6 +789,17 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel, * room for the new tuple, this function moves right, trying to find a * legal page that does.) * + * If 'indexUnchanged' is true, this is for an UPDATE that didn't + * logically change the indexed value, but must nevertheless have a new + * entry to point to a successor version. This hint from the executor + * will influence our behavior when the page might have to be split and + * we must consider our options. Bottom-up index deletion can avoid + * pathological version-driven page splits, but we only want to go to the + * trouble of trying it when we already have moderate confidence that + * it's appropriate. The hint should not significantly affect our + * behavior over time unless practically all inserts on to the leaf page + * get the hint. 
+ * * On exit, insertstate buffer contains the chosen insertion page, and * the offset within that page is returned. If _bt_findinsertloc needed * to move right, the lock and pin on the original page are released, and @@ -793,6 +816,7 @@ static OffsetNumber _bt_findinsertloc(Relation rel, BTInsertState insertstate, bool checkingunique, + bool indexUnchanged, BTStack stack, Relation heapRel) { @@ -817,7 +841,7 @@ _bt_findinsertloc(Relation rel, if (itup_key->heapkeyspace) { /* Keep track of whether checkingunique duplicate seen */ - bool uniquedup = false; + bool uniquedup = indexUnchanged; /* * If we're inserting into a unique index, we may have to walk right @@ -881,7 +905,8 @@ _bt_findinsertloc(Relation rel, */ if (PageGetFreeSpace(page) < insertstate->itemsz) _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false, - checkingunique, uniquedup); + checkingunique, uniquedup, + indexUnchanged); } else { @@ -923,7 +948,8 @@ _bt_findinsertloc(Relation rel, { /* Erase LP_DEAD items (won't deduplicate) */ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true, - checkingunique, false); + checkingunique, false, + indexUnchanged); if (PageGetFreeSpace(page) >= insertstate->itemsz) break; /* OK, now we have enough space */ @@ -977,7 +1003,7 @@ _bt_findinsertloc(Relation rel, * This can only erase LP_DEAD items (it won't deduplicate). */ _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true, - checkingunique, false); + checkingunique, false, indexUnchanged); /* * Do new binary search. New insert location cannot overlap with any @@ -2609,15 +2635,24 @@ _bt_pgaddtup(Page page, * _bt_delete_or_dedup_one_page - Try to avoid a leaf page split by attempting * a variety of operations. * - * There are two operations performed here: deleting items already marked - * LP_DEAD, and deduplication. If both operations fail to free enough space - * for the incoming item then caller will go on to split the page. We always - * attempt our preferred strategy (which is to delete items whose LP_DEAD bit - * are set) first. If that doesn't work out we move on to deduplication. + * There are three operations performed here: deleting items already marked + * LP_DEAD, bottom-up index deletion, and deduplication. If all three + * operations fail to free enough space for the incoming item then caller will + * go on to split the page. We always attempt our preferred strategy (which + * is to delete items whose LP_DEAD bit are set) first. If that doesn't work + * out we consider alternatives. Most calls here will not exhaustively + * attempt all three operations. Deduplication and bottom-up index deletion + * are relatively expensive operations, so we try to pick one or the other up + * front (whichever one seems better for this specific page). * - * Caller's checkingunique and uniquedup arguments help us decide if we should - * perform deduplication, which is primarily useful with low cardinality data, - * but can sometimes absorb version churn. + * Caller's checkingunique, uniquedup, and indexUnchanged arguments help us + * decide which alternative strategy we should attempt (or attempt first). + * Deduplication is primarily useful with low cardinality data. Bottom-up + * index deletion is a backstop against version churn caused by repeated + * UPDATE statements where affected indexes don't receive logical changes + * (because an optimization like heapam's HOT cannot be applied in the + * tableam). But useful interplay between both techniques over time is + * sometimes possible. 
* * Callers that only want us to look for/delete LP_DEAD items can ask for that * directly by passing true 'lpdeadonly' argument. @@ -2639,11 +2674,12 @@ static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, BTInsertState insertstate, bool lpdeadonly, bool checkingunique, - bool uniquedup) + bool uniquedup, bool indexUnchanged) { OffsetNumber deletable[MaxIndexTuplesPerPage]; int ndeletable = 0; OffsetNumber offnum, + minoff, maxoff; Buffer buffer = insertstate->buf; BTScanInsert itup_key = insertstate->itup_key; @@ -2657,8 +2693,9 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, * Scan over all items to see which ones need to be deleted according to * LP_DEAD flags. */ + minoff = P_FIRSTDATAKEY(opaque); maxoff = PageGetMaxOffsetNumber(page); - for (offnum = P_FIRSTDATAKEY(opaque); + for (offnum = minoff; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { @@ -2670,7 +2707,8 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, if (ndeletable > 0) { - _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel); + _bt_lpdead_pass(rel, buffer, heapRel, deletable, ndeletable, + minoff, maxoff); insertstate->bounds_valid = false; /* Return when a page split has already been avoided */ @@ -2689,18 +2727,19 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, * return at this point (or when we go on the try either or both of our * other strategies and they also fail). We do not bother expending a * separate write to clear it, however. Caller will definitely clear it - * when it goes on to split the page (plus deduplication knows to clear - * the flag when it actually modifies the page). + * when it goes on to split the page (note also that deduplication process + * knows to clear the flag when it actually modifies the page). */ if (lpdeadonly) return; /* * We can get called in the checkingunique case when there is no reason to - * believe that there are any duplicates on the page; we should at least - * still check for LP_DEAD items. If that didn't work out, give up and - * let caller split the page. Deduplication cannot be justified given - * there is no reason to think that there are duplicates. + * believe that there are any duplicates on the page; we just needed to + * check for LP_DEAD items. When we were called under these circumstances + * and get this far, LP_DEAD item deletion didn't work out, and so we give + * up and let caller split the page. (A bottom-up pass or deduplication + * pass are also unlikely to work out.) */ if (checkingunique && !uniquedup) return; @@ -2708,6 +2747,22 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel, /* Assume bounds about to be invalidated (this is almost certain now) */ insertstate->bounds_valid = false; + /* + * Perform bottom-up index deletion pass when executor hint indicated that + * incoming item is logically unchanged, or for a unique index that is + * known to have physical duplicates for some other reason. (There is a + * large overlap between these two cases for a unique index. It's worth + * having both triggering conditions in order to apply the optimization in + * the event of successive related INSERT and DELETE statements.) + * + * We'll go on to do a deduplication pass when a bottom-up pass fails to + * delete an acceptable amount of free space (a significant fraction of + * the page, or space for the new item, whichever is greater). 
+	 */
+	if (BTGetDeleteItems(rel) && (indexUnchanged || uniquedup) &&
+		_bt_bottomup_pass(rel, buffer, heapRel, insertstate->itemsz))
+		return;
+
 	/*
 	 * Perform deduplication pass, though only when it is enabled for the
 	 * index and known to be safe (it must be an allequalimage index).
@@ -2716,3 +2771,255 @@ _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
 		_bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
 					   insertstate->itemsz, checkingunique);
 }
+
+/*
+ * _bt_lpdead_pass - Try to avoid a leaf page split by deleting LP_DEAD-set
+ * index tuples, as well as any other nearby tuples that are convenient to
+ * delete in passing.
+ *
+ * The tableam can inexpensively check extra index tuples whose TIDs happen to
+ * point to the same table blocks as TIDs from LP_DEAD-marked tuples.  This
+ * routine is responsible for gathering TIDs from LP_DEAD-marked index tuples
+ * (which are surely deletable) alongside index tuples with same-block TIDs
+ * (which are totally speculative) for processing by tableam.  Physical
+ * deletion of the final known-safe TIDs from the leaf page takes place at the
+ * end.
+ *
+ * In practice it is often possible to delete at least a few extra tuples here
+ * for indexUnchanged callers.  This will happen when LP_DEAD bit setting was
+ * temporarily disrupted by some transaction that held open an MVCC snapshot
+ * for a relatively long time; we can pick up newer version-duplicate index
+ * tuples that couldn't have their LP_DEAD bits set by UPDATEs, provided
+ * they're on the same tableam block as earlier versions that were marked (and
+ * provided the snapshot is no longer held open by now).  We don't try to be
+ * clever, though.  We simply focus on extra tuples that are practically free
+ * to check in passing.  Even so, the number of extra index tuples that turn
+ * out to be deletable often greatly exceeds the number of LP_DEAD-marked
+ * index tuples.
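+ *
+ * (Concrete illustration: if some LP_DEAD-marked tuple on the page has a
+ * TID pointing to table block 7, then every other tuple on the page whose
+ * TID also points to block 7 gets passed to the tableam, which can check
+ * those extra TIDs while it visits block 7 anyway.)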
+ */ +static void +_bt_lpdead_pass(Relation rel, Buffer buffer, Relation heapRel, + OffsetNumber *deletable, int ndeletable, + OffsetNumber minoff, OffsetNumber maxoff) +{ + Page page = BufferGetPage(buffer); + TM_IndexDeleteOp delstate; + BlockNumber *blocks; + int nblocks; + OffsetNumber offnum; + + blocks = _bt_lpdead_blocks(page, deletable, ndeletable, &nblocks); + + /* + * Initialize tableam state that describes index deletion operation + */ + delstate.bottomup = false; /* Visit all LP_DEAD-related blocks */ + delstate.bottomupfreespace = 0; /* Visiting all table blocks anyway */ + + /* Now mutable state */ + delstate.ndeltids = 0; + delstate.deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete)); + delstate.status = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexStatus)); + + for (offnum = minoff; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemid = PageGetItemId(page, offnum); + IndexTuple itup = (IndexTuple) PageGetItem(page, itemid); + TM_IndexDelete *cdeltid; + TM_IndexStatus *dstatus; + BlockNumber tidblock; + BlockNumber *match; + + cdeltid = &delstate.deltids[delstate.ndeltids]; + dstatus = &delstate.status[delstate.ndeltids]; + + if (!BTreeTupleIsPosting(itup)) + { + /* Plain non-pivot tuple's TID */ + tidblock = ItemPointerGetBlockNumber(&itup->t_tid); + + match = (BlockNumber *) bsearch(&tidblock, blocks, nblocks, + sizeof(BlockNumber), + _bt_lpdead_blocks_cmp); + + if (!match) + { + Assert(!ItemIdIsDead(itemid)); + continue; + } + + /* + * TID has heap block among those pointed to by LP_DEAD-bit set + * tuples on leaf page + */ + cdeltid->tid = itup->t_tid; + cdeltid->id = delstate.ndeltids; + dstatus->idxoffnum = offnum; + dstatus->ispromising = false; /* irrelevant */ + dstatus->deleteitup = ItemIdIsDead(itemid); + dstatus->tupsize = 0; /* irrelevant */ + delstate.ndeltids++; + } + else + { + int nitem = BTreeTupleGetNPosting(itup); + + for (int p = 0; p < nitem; p++) + { + ItemPointer htid = BTreeTupleGetPostingN(itup, p); + + tidblock = ItemPointerGetBlockNumber(htid); + + match = (BlockNumber *) bsearch(&tidblock, blocks, nblocks, + sizeof(BlockNumber), + _bt_lpdead_blocks_cmp); + + if (!match) + { + Assert(!ItemIdIsDead(itemid)); + continue; + } + + /* + * TID has heap block among those pointed to by LP_DEAD-bit + * set tuples on leaf page + */ + cdeltid->tid = *htid; + cdeltid->id = delstate.ndeltids; + dstatus->idxoffnum = offnum; + dstatus->ispromising = false; /* irrelevant */ + dstatus->deleteitup = ItemIdIsDead(itemid); + dstatus->tupsize = 0; /* irrelevant */ + delstate.ndeltids++; + + cdeltid++; + dstatus++; + } + } + } + + Assert(delstate.ndeltids >= ndeletable); + + /* Physically delete LP_DEAD tuples (plus extra dead-to-all TIDs) */ + _bt_delitems_delete_check(rel, buffer, heapRel, &delstate); + + /* be tidy */ + pfree(blocks); + pfree(delstate.deltids); + pfree(delstate.status); +} + +/* + * _bt_lpdead_blocks() -- Build a list of LP_DEAD related table blocks + * + * Build a list of those blocks pointed to by index tuples that caller found + * had their LP_DEAD bits set. Used by _bt_lpdead_pass to delete extra nearby + * tuples that are convenient to delete in passing. 
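+ *
+ * (Worked example: LP_DEAD items with TIDs (42,1), (42,7) and (43,2) yield
+ * the sorted, deduplicated block array {42, 43}; _bt_lpdead_pass then
+ * considers every tuple on the leaf page whose TID points to table block
+ * 42 or 43.)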
+ */ +static BlockNumber * +_bt_lpdead_blocks(Page page, OffsetNumber *deletable, int ndeletable, + int *nblocks) +{ + int spacenhtids; + int nhtids; + ItemPointer htids; + BlockNumber *blocks; + BlockNumber lastblock = InvalidBlockNumber; + + /* Array will grow iff there are posting list tuples to consider */ + spacenhtids = ndeletable; + nhtids = 0; + htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids); + for (int i = 0; i < ndeletable; i++) + { + ItemId itemid; + IndexTuple itup; + + itemid = PageGetItemId(page, deletable[i]); + itup = (IndexTuple) PageGetItem(page, itemid); + + Assert(ItemIdIsDead(itemid)); + Assert(!BTreeTupleIsPivot(itup)); + + if (!BTreeTupleIsPosting(itup)) + { + if (nhtids + 1 > spacenhtids) + { + spacenhtids *= 2; + htids = (ItemPointer) + repalloc(htids, sizeof(ItemPointerData) * spacenhtids); + } + + Assert(ItemPointerIsValid(&itup->t_tid)); + ItemPointerCopy(&itup->t_tid, &htids[nhtids]); + nhtids++; + } + else + { + int nposting = BTreeTupleGetNPosting(itup); + + if (nhtids + nposting > spacenhtids) + { + spacenhtids = Max(spacenhtids * 2, nhtids + nposting); + htids = (ItemPointer) + repalloc(htids, sizeof(ItemPointerData) * spacenhtids); + } + + for (int j = 0; j < nposting; j++) + { + ItemPointer htid = BTreeTupleGetPostingN(itup, j); + + Assert(ItemPointerIsValid(htid)); + ItemPointerCopy(htid, &htids[nhtids]); + nhtids++; + } + } + } + + Assert(nhtids >= ndeletable); + + qsort((void *) htids, nhtids, sizeof(ItemPointerData), + (int (*) (const void *, const void *)) ItemPointerCompare); + + blocks = palloc(sizeof(BlockNumber) * nhtids); + *nblocks = 0; + + for (int i = 0; i < nhtids; i++) + { + ItemPointer tid = htids + i; + BlockNumber tidblock = ItemPointerGetBlockNumber(tid); + + if (tidblock == lastblock) + continue; + + lastblock = tidblock; + blocks[*nblocks] = tidblock; + (*nblocks)++; + } + + pfree(htids); + + return blocks; +} + +/* + * _bt_lpdead_blocks_cmp() -- BlockNumber comparator + * + * Used by _bt_lpdead_pass to search through its array of table blocks known + * to be pointed to by TIDs from LP_DEAD-marked index tuples. + */ +static int +_bt_lpdead_blocks_cmp(const void *arg1, const void *arg2) +{ + BlockNumber b1 = *((BlockNumber *) arg1); + BlockNumber b2 = *((BlockNumber *) arg2); + + if (b1 < b2) + return -1; + else if (b1 > b2) + return 1; + + return 0; +} diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 793434c026..d6189cafed 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -38,8 +38,14 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf); static void _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid); -static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page, - OffsetNumber *deletable, int ndeletable); +static void _bt_delitems_delete(Relation rel, Buffer buf, + TransactionId latestRemovedXid, + OffsetNumber *deletable, int ndeletable, + BTVacuumPosting *updatable, int nupdatable, + Relation heapRel); +static char *_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable, + OffsetNumber *updatedoffsets, + Size *updatedbuflen, bool needswal); static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack); static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, @@ -1110,15 +1116,15 @@ _bt_page_recyclable(Page page) * sorted in ascending order. 
* * Routine deals with deleting TIDs when some (but not all) of the heap TIDs - * in an existing posting list item are to be removed by VACUUM. This works - * by updating/overwriting an existing item with caller's new version of the - * item (a version that lacks the TIDs that are to be deleted). + * in an existing posting list item are to be removed. This works by + * updating/overwriting an existing item with caller's new version of the item + * (a version that lacks the TIDs that are to be deleted). * * We record VACUUMs and b-tree deletes differently in WAL. Deletes must - * generate their own latestRemovedXid by accessing the heap directly, whereas - * VACUUMs rely on the initial heap scan taking care of it indirectly. Also, - * only VACUUM can perform granular deletes of individual TIDs in posting list - * tuples. + * generate their own latestRemovedXid by accessing the table directly, + * whereas VACUUMs rely on the initial heap scan taking care of it indirectly. + * Also, we remove the VACUUM cycle ID from pages, which b-tree deletes don't + * do. */ void _bt_delitems_vacuum(Relation rel, Buffer buf, @@ -1127,7 +1133,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, { Page page = BufferGetPage(buf); BTPageOpaque opaque; - Size itemsz; + bool needswal = RelationNeedsWAL(rel); char *updatedbuf = NULL; Size updatedbuflen = 0; OffsetNumber updatedoffsets[MaxIndexTuplesPerPage]; @@ -1135,45 +1141,11 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, /* Shouldn't be called unless there's something to do */ Assert(ndeletable > 0 || nupdatable > 0); - for (int i = 0; i < nupdatable; i++) - { - /* Replace work area IndexTuple with updated version */ - _bt_update_posting(updatable[i]); - - /* Maintain array of updatable page offsets for WAL record */ - updatedoffsets[i] = updatable[i]->updatedoffset; - } - - /* XLOG stuff -- allocate and fill buffer before critical section */ - if (nupdatable > 0 && RelationNeedsWAL(rel)) - { - Size offset = 0; - - for (int i = 0; i < nupdatable; i++) - { - BTVacuumPosting vacposting = updatable[i]; - - itemsz = SizeOfBtreeUpdate + - vacposting->ndeletedtids * sizeof(uint16); - updatedbuflen += itemsz; - } - - updatedbuf = palloc(updatedbuflen); - for (int i = 0; i < nupdatable; i++) - { - BTVacuumPosting vacposting = updatable[i]; - xl_btree_update update; - - update.ndeletedtids = vacposting->ndeletedtids; - memcpy(updatedbuf + offset, &update.ndeletedtids, - SizeOfBtreeUpdate); - offset += SizeOfBtreeUpdate; - - itemsz = update.ndeletedtids * sizeof(uint16); - memcpy(updatedbuf + offset, vacposting->deletetids, itemsz); - offset += itemsz; - } - } + /* Generate new version of posting lists without deleted TIDs */ + if (nupdatable > 0) + updatedbuf = _bt_delitems_update(updatable, nupdatable, + updatedoffsets, &updatedbuflen, + needswal); /* No ereport(ERROR) until changes are logged */ START_CRIT_SECTION(); @@ -1194,6 +1166,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, { OffsetNumber updatedoffset = updatedoffsets[i]; IndexTuple itup; + Size itemsz; itup = updatable[i]->itup; itemsz = MAXALIGN(IndexTupleSize(itup)); @@ -1227,7 +1200,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, MarkBufferDirty(buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (needswal) { XLogRecPtr recptr; xl_btree_vacuum xlrec_vacuum; @@ -1260,7 +1233,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, /* can't leak memory here */ if (updatedbuf != NULL) pfree(updatedbuf); - /* free tuples generated by calling _bt_update_posting() */ + /* free tuples allocated within 
_bt_delitems_update() */
 	for (int i = 0; i < nupdatable; i++)
 		pfree(updatable[i]->itup);
 }
 
@@ -1269,36 +1242,70 @@
  * Delete item(s) from a btree leaf page during single-page cleanup.
  *
  * This routine assumes that the caller has pinned and write locked the
- * buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * buffer.  Also, the given deletable and updatable arrays *must* be sorted in
+ * ascending order.
+ *
+ * Routine deals with deleting TIDs when some (but not all) of the heap TIDs
+ * in an existing posting list item are to be removed.  This works by
+ * updating/overwriting an existing item with caller's new version of the item
+ * (a version that lacks the TIDs that are to be deleted).
  *
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but it needs to generate its own latestRemovedXid by accessing
- * the heap.  This is used by the REDO routine to generate recovery conflicts.
- * Also, it doesn't handle posting list tuples unless the entire tuple can be
- * deleted as a whole (since there is only one LP_DEAD bit per line pointer).
+ * the page, but it needs its own latestRemovedXid from caller (caller gets
+ * this from the tableam).  This is used by the REDO routine to generate
+ * recovery conflicts.  The other difference is that _bt_delitems_vacuum will
+ * clear the page's VACUUM cycle ID.  We must never do that.
  */
-void
-_bt_delitems_delete(Relation rel, Buffer buf,
+static void
+_bt_delitems_delete(Relation rel, Buffer buf, TransactionId latestRemovedXid,
 					OffsetNumber *deletable, int ndeletable,
+					BTVacuumPosting *updatable, int nupdatable,
 					Relation heapRel)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
-	TransactionId latestRemovedXid = InvalidTransactionId;
+	bool		needswal = RelationNeedsWAL(rel);
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen = 0;
+	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
 
-	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
-		latestRemovedXid =
-			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
+	/* Generate new versions of posting lists without deleted TIDs */
+	if (nupdatable > 0)
+		updatedbuf = _bt_delitems_update(updatable, nupdatable,
+										 updatedoffsets, &updatedbuflen,
+										 needswal);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
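+	 *
+	 * (For example, if a simple delete of offset 5 ran first, a posting
+	 * list tuple at offset 7 would shift down to offset 6 when
+	 * PageIndexMultiDelete() compacted the line pointer array, leaving
+	 * its entry in the updatedoffsets array stale.)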
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber updatedoffset = updatedoffsets[i];
+		IndexTuple	itup;
+		Size		itemsz;
+
+		itup = updatable[i]->itup;
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+		if (!PageIndexTupleOverwrite(page, updatedoffset, (Item) itup,
+									 itemsz))
+			elog(PANIC, "failed to update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * Unlike _bt_delitems_vacuum, we *must not* clear the vacuum cycle ID,
@@ -1318,25 +1325,29 @@
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (needswal)
 	{
 		XLogRecPtr	recptr;
 		xl_btree_delete xlrec_delete;
 
 		xlrec_delete.latestRemovedXid = latestRemovedXid;
 		xlrec_delete.ndeleted = ndeletable;
+		xlrec_delete.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_delete, SizeOfBtreeDelete);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatedoffsets,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE);
 
@@ -1344,83 +1355,304 @@
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
+	/* free tuples allocated within _bt_delitems_update() */
+	for (int i = 0; i < nupdatable; i++)
+		pfree(updatable[i]->itup);
 }
 
 /*
- * Get the latestRemovedXid from the table entries pointed to by the non-pivot
- * tuples being deleted.
+ * Set up state needed to delete TIDs from posting list tuples via "updating"
+ * the tuple.  Performs steps common to both _bt_delitems_vacuum and
+ * _bt_delitems_delete.  These steps must take place before each function's
+ * critical section begins.
  *
- * This is a specialized version of index_compute_xid_horizon_for_tuples().
- * It's needed because btree tuples don't always store table TID using the
- * standard index tuple header field.
+ * updatable and nupdatable are inputs, though note that we will use
+ * _bt_update_posting() to replace the original itup with a pointer to a final
+ * version in palloc()'d memory.  Caller should free the tuples when it's
+ * done.
+ *
+ * The first nupdatable entries from updatedoffsets are set to the page offset
+ * number for posting list tuples that caller updates.  This is mostly useful
+ * because caller may need to WAL-log the page offsets (though we always do
+ * this for caller out of convenience).
+ *
+ * Returns a buffer consisting of an array of xl_btree_update structs that
+ * describe the steps we perform here for caller (though only when needswal is
+ * true).  Also sets *updatedbuflen to the final size of the buffer.  This
+ * buffer is used by caller when WAL logging is required.
*/ -static TransactionId -_bt_xid_horizon(Relation rel, Relation heapRel, Page page, - OffsetNumber *deletable, int ndeletable) +static char * +_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable, + OffsetNumber *updatedoffsets, Size *updatedbuflen, + bool needswal) { - TransactionId latestRemovedXid = InvalidTransactionId; - int spacenhtids; - int nhtids; - ItemPointer htids; + char *updatedbuf = NULL; + Size buflen = 0; - /* Array will grow iff there are posting list tuples to consider */ - spacenhtids = ndeletable; - nhtids = 0; - htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids); - for (int i = 0; i < ndeletable; i++) + /* Shouldn't be called unless there's something to do */ + Assert(nupdatable > 0); + + for (int i = 0; i < nupdatable; i++) { - ItemId itemid; - IndexTuple itup; + BTVacuumPosting vacposting = updatable[i]; + Size itemsz; - itemid = PageGetItemId(page, deletable[i]); - itup = (IndexTuple) PageGetItem(page, itemid); + /* Replace work area IndexTuple with updated version */ + _bt_update_posting(vacposting); - Assert(ItemIdIsDead(itemid)); - Assert(!BTreeTupleIsPivot(itup)); + /* Keep track of size of xl_btree_update for updatedbuf in passing */ + itemsz = SizeOfBtreeUpdate + vacposting->ndeletedtids * sizeof(uint16); + buflen += itemsz; - if (!BTreeTupleIsPosting(itup)) + /* Build updatedoffsets buffer in passing */ + updatedoffsets[i] = vacposting->updatedoffset; + } + + /* XLOG stuff */ + if (needswal) + { + Size offset = 0; + + /* Allocate, set final size for caller */ + updatedbuf = palloc(buflen); + *updatedbuflen = buflen; + for (int i = 0; i < nupdatable; i++) { - if (nhtids + 1 > spacenhtids) - { - spacenhtids *= 2; - htids = (ItemPointer) - repalloc(htids, sizeof(ItemPointerData) * spacenhtids); - } + BTVacuumPosting vacposting = updatable[i]; + Size itemsz; + xl_btree_update update; - Assert(ItemPointerIsValid(&itup->t_tid)); - ItemPointerCopy(&itup->t_tid, &htids[nhtids]); - nhtids++; - } - else - { - int nposting = BTreeTupleGetNPosting(itup); + update.ndeletedtids = vacposting->ndeletedtids; + memcpy(updatedbuf + offset, &update.ndeletedtids, + SizeOfBtreeUpdate); + offset += SizeOfBtreeUpdate; - if (nhtids + nposting > spacenhtids) - { - spacenhtids = Max(spacenhtids * 2, nhtids + nposting); - htids = (ItemPointer) - repalloc(htids, sizeof(ItemPointerData) * spacenhtids); - } - - for (int j = 0; j < nposting; j++) - { - ItemPointer htid = BTreeTupleGetPostingN(itup, j); - - Assert(ItemPointerIsValid(htid)); - ItemPointerCopy(htid, &htids[nhtids]); - nhtids++; - } + itemsz = update.ndeletedtids * sizeof(uint16); + memcpy(updatedbuf + offset, vacposting->deletetids, itemsz); + offset += itemsz; } } - Assert(nhtids >= ndeletable); + return updatedbuf; +} - latestRemovedXid = - table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids); +/* + * Comparator used by _bt_delitems_delete_check() to restore deltids array + * back to its original leaf-page-wise sort order + */ +static int +_bt_delitems_cmp(const void *a, const void *b) +{ + TM_IndexDelete *indexdelete1 = (TM_IndexDelete *) a; + TM_IndexDelete *indexdelete2 = (TM_IndexDelete *) b; - pfree(htids); + if (indexdelete1->id > indexdelete2->id) + return 1; + if (indexdelete1->id < indexdelete2->id) + return -1; - return latestRemovedXid; + Assert(false); + + return 0; +} + +/* + * Try to delete item(s) from a btree leaf page during single-page cleanup. + * + * nbtree interface to table_compute_delete_for_tuples(). 
Deletes a subset of + * index tuples from caller's deltids array: those that are found dead-to-all + * in the table. Only this subset of TIDs is safe to delete from leaf page. + * + * Bottom-up index deletion caller provides all the TIDs from the leaf page, + * without expecting that tableam will check most of them. This makes sense + * because the tableam has considerable discretion about which table blocks it + * checks in bottom-up deletion case. + * + * LP_DEAD tuple deletion caller goes through here too, while asking for + * !bottomup processing from tableam. It only includes TIDs that point to + * blocks from index tuples actually marked LP_DEAD (it includes only these + * tids in delstate.deltids). This ensures that tableam will visit all table + * blocks pointed to by LP_DEAD index tuples, without visiting extra table + * blocks. This approach allows us to delete some extra index tuples that + * happen to be dead-to-all (and happen to not have already had their LP_DEAD + * bit set in passing). The extra TID checks for LP_DEAD caller are cheap + * because tableam needs to generate a latestRemovedXid value anyway, which + * necessitates visiting all table blocks. (Actually, it's only strictly + * necessary to get a latestRemovedXid with logged indexes. LP_DEAD caller + * nevertheless gets us to try to delete extra TIDs in all cases, to be + * consistent.) + * + * Note: We rely on the assumption that the delstate.deltids array is sorted + * on its id field, which is a proxy for the original leaf-page-wise order of + * index tuples. Caller must gather items in delstate in the natural way: + * through appending each TID that we consider in leaf-page-wise order. + */ +void +_bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel, + TM_IndexDeleteOp *delstate) +{ + Page page = BufferGetPage(buf); + TransactionId latestRemovedXid; + OffsetNumber postingidxoffnum; + int ndeletable, + nupdatable; + OffsetNumber deletable[MaxIndexTuplesPerPage]; + BTVacuumPosting updatable[MaxIndexTuplesPerPage]; + + /* Use tableam interface to determine which tuples to delete first */ + latestRemovedXid = table_compute_delete_for_tuples(heapRel, delstate); + + if (delstate->ndeltids == 0) + { + /* The tableam has nothing (must be a bottom-up caller) */ + Assert(delstate->bottomup); + return; + } + + /* Don't need to WAL-log latestRemovedXid in all cases */ + if (!XLogStandbyInfoActive() || !RelationNeedsWAL(rel)) + latestRemovedXid = InvalidTransactionId; + + /* + * Construct a leaf-page-wise description of what _bt_delitems_delete() + * needs to do to physically delete index tuples from the page. + * + * Must sort deltids array first. It must match the order expected by + * loop: leaf-page-wise order. (Note that the array is likely to be much + * smaller now, at least in the bottom-up deletion case.) 
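+	 *
+	 * (The sort works because entries were appended with id 0, 1, 2, ...
+	 * in leaf-page-wise order; sorting the surviving entries by id
+	 * therefore restores that order, no matter how the tableam reordered
+	 * or shrank the array.)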
+	 */
+	qsort(delstate->deltids, delstate->ndeltids, sizeof(TM_IndexDelete),
+		  _bt_delitems_cmp);
+	postingidxoffnum = InvalidOffsetNumber;
+	ndeletable = 0;
+	nupdatable = 0;
+	for (int i = 0; i < delstate->ndeltids; i++)
+	{
+		TM_IndexStatus *dstatus = delstate->status + delstate->deltids[i].id;
+		OffsetNumber idxoffnum = dstatus->idxoffnum;
+		ItemId		itemid = PageGetItemId(page, idxoffnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+		int			tidi,
+					nitem;
+		BTVacuumPosting vacposting;
+
+		if (idxoffnum == postingidxoffnum)
+		{
+			/*
+			 * This deltid entry is a TID from a posting list tuple that has
+			 * already been completely processed (since we process all of a
+			 * posting list's TIDs together, just once)
+			 */
+			Assert(BTreeTupleIsPosting(itup));
+			continue;
+		}
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Plain non-pivot tuple */
+			Assert(ItemPointerEquals(&itup->t_tid, &delstate->deltids[i].tid));
+			if (dstatus->deleteitup)
+				deletable[ndeletable++] = idxoffnum;
+			continue;
+		}
+
+		/*
+		 * Posting list tuple.  Process all of its TIDs at once.
+		 *
+		 * tidi is a local iterator for the deltids array.  We're going to
+		 * peek at later entries in the deltids array here.  Remember to skip
+		 * over the itup-related entries that we peek at here later on.  We
+		 * should not do anything more with them when we get back to the top
+		 * of the outermost deltids loop (we should just skip them).
+		 *
+		 * Innermost loop exploits the fact that both itup's TIDs and the
+		 * entries from the array (whose TIDs came from itup) are in ascending
+		 * TID order.  We avoid unnecessary TID comparisons by starting each
+		 * execution of the innermost loop at the point where the previous
+		 * execution (for the previous TID from itup) left off.
+		 */
+		postingidxoffnum = idxoffnum;	/* Remember: process itup once only */
+		tidi = i;				/* Initialize for itup's first TID */
+		vacposting = NULL;		/* Describes what to do with itup */
+		nitem = BTreeTupleGetNPosting(itup);
+		for (int j = 0; j < nitem; j++)
+		{
+			ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+			int			cmp = -1;
+
+			for (; tidi < delstate->ndeltids; tidi++)
+			{
+				TM_IndexDelete *tcdeltid = &delstate->deltids[tidi];
+				TM_IndexStatus *tdstatus = (delstate->status + tcdeltid->id);
+
+				/* Stop when we get to first entry beyond itup's entries */
+				Assert(tdstatus->idxoffnum >= idxoffnum);
+				if (tdstatus->idxoffnum != idxoffnum)
+					break;
+
+				/* Skip any non-deletable entries for itup */
+				if (!tdstatus->deleteitup)
+					continue;
+
+				/* Have we found matching deletable entry for htid? */
+				cmp = ItemPointerCompare(htid, &tcdeltid->tid);
+
+				/* Keep going until equal or greater tid from array located */
+				if (cmp <= 0)
+					break;
+			}
+
+			/* Final check on htid: must match a deletable array entry */
+			if (cmp != 0)
+				continue;
+
+			if (vacposting == NULL)
+			{
+				/*
+				 * First deletable TID for itup found.  Start maintaining
+				 * metadata describing which TIDs to delete from itup.
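+				 *
+				 * (Worked example: with nitem == 4 and posting list
+				 * positions 1 and 3 found deletable, deletetids ends up
+				 * as {1, 3} with ndeletedtids == 2; since 2 < 4 the
+				 * tuple is updated in place below, not deleted outright.)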
+ */ + vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + + nitem * sizeof(uint16)); + vacposting->itup = itup; + vacposting->updatedoffset = idxoffnum; + vacposting->ndeletedtids = 0; + } + + /* htid will be deleted from itup */ + vacposting->deletetids[vacposting->ndeletedtids++] = j; + } + + if (vacposting == NULL) + { + /* No TIDs to delete from itup -- do nothing */ + } + else if (vacposting->ndeletedtids == nitem) + { + /* Straight delete of itup (to delete all TIDs) */ + deletable[ndeletable++] = idxoffnum; + /* Turns out we won't need granular information */ + pfree(vacposting); + } + else + { + /* Delete some but not all TIDs from itup */ + Assert(vacposting->ndeletedtids > 0 && + vacposting->ndeletedtids < nitem); + updatable[nupdatable++] = vacposting; + } + } + + /* Physically delete the dead-to-all TIDs we've located */ + _bt_delitems_delete(rel, buf, latestRemovedXid, deletable, ndeletable, + updatable, nupdatable, heapRel); + + /* be tidy */ + for (int i = 0; i < nupdatable; i++) + pfree(updatable[i]); } /* diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index d6c8ad5d27..0d7f5199e5 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -209,7 +209,7 @@ btinsert(Relation rel, Datum *values, bool *isnull, itup = index_form_tuple(RelationGetDescr(rel), values, isnull); itup->t_tid = *ht_ctid; - result = _bt_doinsert(rel, itup, checkUnique, heapRel); + result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel); pfree(itup); diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 8730de25ed..d5d90cf696 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -49,7 +49,6 @@ #include "access/parallel.h" #include "access/relscan.h" #include "access/table.h" -#include "access/tableam.h" #include "access/xact.h" #include "access/xlog.h" #include "access/xloginsert.h" diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index 2f5f14e527..831cc28eac 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -2108,7 +2108,9 @@ btoptions(Datum reloptions, bool validate) {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL, offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}, {"deduplicate_items", RELOPT_TYPE_BOOL, - offsetof(BTOptions, deduplicate_items)} + offsetof(BTOptions, deduplicate_items)}, + {"delete_items", RELOPT_TYPE_BOOL, + offsetof(BTOptions, delete_items)} }; diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 5135b800af..3e7289fe49 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -556,6 +556,47 @@ btree_xlog_dedup(XLogReaderState *record) UnlockReleaseBuffer(buf); } +static void +btree_xlog_updates(Page page, OffsetNumber *updatedoffsets, + xl_btree_update *updates, int nupdated) +{ + BTVacuumPosting vacposting; + IndexTuple origtuple; + ItemId itemid; + Size itemsz; + + for (int i = 0; i < nupdated; i++) + { + itemid = PageGetItemId(page, updatedoffsets[i]); + origtuple = (IndexTuple) PageGetItem(page, itemid); + + vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + + updates->ndeletedtids * sizeof(uint16)); + vacposting->updatedoffset = updatedoffsets[i]; + vacposting->itup = origtuple; + vacposting->ndeletedtids = updates->ndeletedtids; + memcpy(vacposting->deletetids, + (char *) updates + SizeOfBtreeUpdate, + 
updates->ndeletedtids * sizeof(uint16)); + + _bt_update_posting(vacposting); + + /* Overwrite updated version of tuple */ + itemsz = MAXALIGN(IndexTupleSize(vacposting->itup)); + if (!PageIndexTupleOverwrite(page, updatedoffsets[i], + (Item) vacposting->itup, itemsz)) + elog(PANIC, "failed to update partially dead item"); + + pfree(vacposting->itup); + pfree(vacposting); + + /* advance to next xl_btree_update from array */ + updates = (xl_btree_update *) + ((char *) updates + SizeOfBtreeUpdate + + updates->ndeletedtids * sizeof(uint16)); + } +} + static void btree_xlog_vacuum(XLogReaderState *record) { @@ -589,41 +630,7 @@ btree_xlog_vacuum(XLogReaderState *record) xlrec->nupdated * sizeof(OffsetNumber)); - for (int i = 0; i < xlrec->nupdated; i++) - { - BTVacuumPosting vacposting; - IndexTuple origtuple; - ItemId itemid; - Size itemsz; - - itemid = PageGetItemId(page, updatedoffsets[i]); - origtuple = (IndexTuple) PageGetItem(page, itemid); - - vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) + - updates->ndeletedtids * sizeof(uint16)); - vacposting->updatedoffset = updatedoffsets[i]; - vacposting->itup = origtuple; - vacposting->ndeletedtids = updates->ndeletedtids; - memcpy(vacposting->deletetids, - (char *) updates + SizeOfBtreeUpdate, - updates->ndeletedtids * sizeof(uint16)); - - _bt_update_posting(vacposting); - - /* Overwrite updated version of tuple */ - itemsz = MAXALIGN(IndexTupleSize(vacposting->itup)); - if (!PageIndexTupleOverwrite(page, updatedoffsets[i], - (Item) vacposting->itup, itemsz)) - elog(PANIC, "failed to update partially dead item"); - - pfree(vacposting->itup); - pfree(vacposting); - - /* advance to next xl_btree_update from array */ - updates = (xl_btree_update *) - ((char *) updates + SizeOfBtreeUpdate + - updates->ndeletedtids * sizeof(uint16)); - } + btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated); } if (xlrec->ndeleted > 0) @@ -675,7 +682,22 @@ btree_xlog_delete(XLogReaderState *record) page = (Page) BufferGetPage(buffer); - PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted); + if (xlrec->nupdated > 0) + { + OffsetNumber *updatedoffsets; + xl_btree_update *updates; + + updatedoffsets = (OffsetNumber *) + (ptr + xlrec->ndeleted * sizeof(OffsetNumber)); + updates = (xl_btree_update *) ((char *) updatedoffsets + + xlrec->nupdated * + sizeof(OffsetNumber)); + + btree_xlog_updates(page, updatedoffsets, updates, xlrec->nupdated); + } + + if (xlrec->ndeleted > 0) + PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted); /* Mark the page as not containing any LP_DEAD items */ opaque = (BTPageOpaque) PageGetSpecialPointer(page); diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c index 6438c45716..d778557909 100644 --- a/src/backend/access/table/tableam.c +++ b/src/backend/access/table/tableam.c @@ -207,9 +207,9 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan) /* * To perform that check simply start an index scan, create the necessary * slot, do the heap lookup, and shut everything down again. This could be - * optimized, but is unlikely to matter from a performance POV. If there - * frequently are live index pointers also matching a unique index key, the - * CPU overhead of this routine is unlikely to matter. + * optimized, but is unlikely to matter from a performance POV. Note that + * table_compute_delete_for_tuples() is optimized in this way, since it is + * designed as a batch operation. 
  *
  * Note that *tid may be modified when we return true if the AM supports
  * storing multiple row versions reachable via a single index entry (like
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 58de0743ba..181fa8f2f8 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -66,7 +66,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->tuple_tid_valid != NULL);
 	Assert(routine->tuple_get_latest_tid != NULL);
 	Assert(routine->tuple_satisfies_snapshot != NULL);
-	Assert(routine->compute_xid_horizon_for_tuples != NULL);
+	Assert(routine->compute_delete_for_tuples != NULL);
 
 	Assert(routine->tuple_insert != NULL);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 3a43c09bf6..1aca85ce86 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1765,14 +1765,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor", "deduplicate_items",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplicate_items", "delete_items",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =", "deduplicate_items =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplicate_items =", "delete_items =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index bb395e6a85..9e4abf40d2 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -629,6 +629,86 @@ options(relopts local_relopts *) returns
+
+ Bottom-up index deletion
+
+  B-Tree indexes are not directly aware that under MVCC, there might
+  be multiple extant versions of the same logical table row; to an
+  index, each tuple is an independent object that needs its own index
+  entry.  Version churn tuples may sometimes
+  accumulate and adversely affect query latency and throughput.  This
+  typically occurs with UPDATE-heavy workloads
+  where most individual updates cannot apply the
+  HOT optimization.  Changing the value of only
+  one column covered by one index during an UPDATE
+  always necessitates a new set of index tuples
+  — one for each and every index on the
+  table.  Note in particular that this includes indexes that were not
+  logically modified by the UPDATE.
+  All indexes will need a successor physical index tuple that points
+  to the latest version in the table.  Each new tuple within each
+  index will generally need to coexist with the original
+  updated tuple for a short period of time (typically
+  until some time after the UPDATE transaction
+  commits).  This process produces the majority of all garbage index
+  tuples in some scenarios.
+
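+  As a hypothetical illustration (the table and index names below are
+  invented for this example, and are not part of the patch), consider
+  an UPDATE that logically changes just one indexed
+  column:
+
+CREATE TABLE orders (
+    order_id bigint PRIMARY KEY,
+    customer integer,
+    status   text
+);
+CREATE INDEX orders_customer_idx ON orders (customer);
+CREATE INDEX orders_status_idx ON orders (status);
+
+-- "status" is covered by orders_status_idx, so this UPDATE cannot use
+-- HOT: all three indexes (orders_pkey, orders_customer_idx, and
+-- orders_status_idx) receive a new physical index tuple pointing to
+-- the new row version, even though only orders_status_idx was
+-- logically changed by it.
+UPDATE orders SET status = 'shipped' WHERE order_id = 42;
+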
+  Bottom-up index deletion targets this
+  particular variety of index tuple garbage.  It effectively enforces
+  a soft limit on how many versions there can be in each index for
+  any given logical row.  It is generally very effective provided
+  there are no long-lived snapshots that hold back cleanup.
+  Bottom-up index deletion complements the top-down
+  index cleanup performed by VACUUM.  It targets
+  leaf pages that are disproportionately affected by the accumulation
+  of garbage index tuples, while leaving it up to
+  VACUUM to perform infrequent clean sweeps of all
+  indexes.  A bottom-up deletion pass takes place when a leaf page
+  does not have enough free space to fit an incoming tuple, but
+  only when the incoming tuple originates from an
+  UPDATE that did not logically change any of the
+  columns covered by the index in question.
+
+  The deletion process must closely cooperate with the table access
+  method.  Despite the lack of convenient access to
+  authoritative information about how index
+  tuples represent versions or are related to each other, it is
+  possible for the B-Tree implementation to target garbage index
+  tuples using relatively simple heuristics.  These heuristics decide
+  which table blocks to visit based on where dead tuples seem most
+  likely to be concentrated.  Some number of table blocks must be
+  accessed to get the required authoritative information, but it
+  isn't necessary to access very many table blocks each time.  Also,
+  each table block accessed must actually enable the implementation
+  to delete at least one additional index tuple.  The whole process
+  ends as soon as a single table block access fails to yield any
+  index tuple deletions.
+
+  The delete_items storage parameter can be used
+  to disable bottom-up index deletion within individual indexes.
+  Disabling bottom-up index deletion isn't usually helpful.
+
+
+  It's also possible for index tuple deletion to take place
+  following opportunistic setting of LP_DEAD
+  status bits.  This avoids a relatively expensive bottom-up
+  deletion pass, which must access table blocks directly.
+
+  LP_DEAD status bits are set when index
+  scans happen to notice in passing that an index tuple is dead to
+  every possible MVCC snapshot (not just their own).
+  LP_DEAD-set tuples are already known to be safe
+  to delete, so it isn't necessary to access the table blocks
+  directly.
+
+
 
  Deduplication
@@ -702,25 +782,16 @@ options(relopts local_relopts *) returns
  deduplication isn't usually helpful.
 
-  B-Tree indexes are not directly aware that under MVCC, there might
-  be multiple extant versions of the same logical table row; to an
-  index, each tuple is an independent object that needs its own index
-  entry.  Version duplicates may sometimes accumulate
-  and adversely affect query latency and throughput.  This typically
-  occurs with UPDATE-heavy workloads where most
-  individual updates cannot apply the HOT
-  optimization (often because at least one indexed column gets
-  modified, necessitating a new set of index tuple versions —
-  one new tuple for each and every index).  In
-  effect, B-Tree deduplication ameliorates index bloat caused by
-  version churn.  Note that even the tuples from a unique index are
-  not necessarily physically unique when stored
-  on disk due to version churn.  The deduplication optimization is
-  selectively applied within unique indexes.  It targets those pages
-  that appear to have version duplicates.  The high level goal is to
-  give VACUUM more time to run before an
-  unnecessary page split caused by version churn can
-  take place.
+  It is sometimes possible for unique indexes (as well as unique
+  constraints) to use deduplication.  This allows leaf pages to
+  temporarily absorb extra version churn duplicates.
+  Deduplication in unique indexes augments bottom-up index deletion,
+  especially in cases where a long-running transaction holds a
+  snapshot that blocks garbage collection.  The goal is to buy time
+  for the bottom-up index deletion strategy to become effective
+  again.  Delaying page splits until a single long-running
+  transaction naturally goes away can allow a bottom-up deletion pass
+  to succeed where an earlier deletion pass failed.
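+
+  As a hypothetical illustration (the orders table is
+  invented for this example and is not part of the patch), the
+  following session holds a snapshot that prevents any deletion pass
+  from removing row versions the snapshot can still see:
+
+BEGIN ISOLATION LEVEL REPEATABLE READ;
+SELECT count(*) FROM orders;  -- acquires the snapshot that is then held
+-- While this transaction stays open, index tuples pointing to row
+-- versions visible to its snapshot are not yet dead to all snapshots,
+-- so neither bottom-up deletion nor LP_DEAD-based deletion can remove
+-- them; deduplication buys time by absorbing the duplicates instead.
+COMMIT;  -- once the snapshot is released, cleanup can succeed again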
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 2054d5d943..97c894f13f 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -435,6 +435,22 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ]
+
+   delete_items (boolean)
+
+    delete_items storage parameter
+
+
+    Controls usage of the B-tree bottom-up index deletion technique
+    described in .  Set to
+    ON or OFF to enable or
+    disable the optimization.  The default is ON.
+
+
 
   vacuum_cleanup_index_scale_factor (floating point)
-- 
2.25.1
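
A minimal usage sketch for the delete_items storage parameter
documented above (the table and index names are invented for the
example; as the documentation notes, disabling the optimization isn't
usually helpful):

-- The reloption defaults to on; it can be disabled at index creation:
CREATE INDEX orders_status_idx ON orders (status)
    WITH (delete_items = off);

-- It can also be changed or reset on an existing index, matching the
-- new psql tab completions for ALTER INDEX ... SET|RESET:
ALTER INDEX orders_status_idx SET (delete_items = on);
ALTER INDEX orders_status_idx RESET (delete_items);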