Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From: John Naylor
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CANWCAZZcPgb05bRw1W7TQdisK6bo32gTvOL0YRztwh1ghq=QZg@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses:
  Re: [PoC] Improve dead tuple storage for lazy vacuum
  Re: [PoC] Improve dead tuple storage for lazy vacuum
List: pgsql-hackers
On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I found the following comment and wanted to discuss:
>
> // this might be better as "iterate over nodes", plus a callback to
> RT_DUMP_NODE,
> // which should really only concern itself with single nodes
> RT_SCOPE void
> RT_DUMP(RT_RADIX_TREE *tree)
>
> If it means we need to somehow use the iteration functions also for
> dumping the whole tree, it would probably need to refactor the
> iteration codes so that the RT_DUMP() can use them while dumping
> visited nodes. But we need to be careful of not adding overheads to
> the iteration performance.

Yeah, some months ago I thought a callback interface would make some
things easier. I don't think we need that at the moment (possibly
never), so that comment can just be removed. As far as these debug
functions go, I only found the stats and dumping a single node useful,
FWIW.

I've attached v47, which is v46 plus some fixes for the radix tree.

0004 - moves everything for "delete" to the end -- gradually, other
things will be grouped together in a sensible order.

0005 - trivial

0006 - shrink nodes -- still needs testing, but nothing crashes yet.
This shows some renaming might be good: previously we had
RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've
added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply
not copying the slot to be deleted (rough sketch further down). This
means that when growing, it would be clearer to call the former
RT_COPY_ARRAYS_FOR_INSERT, since that reserves a new slot for the
caller in the new node, but the caller must do the insert itself. Note
that there are some practical restrictions/best practices on whether
shrinking should happen after deletion or vice versa. Hopefully that's
clear, but let me know if the description can be improved. Also, it
doesn't yet shrink from size class 32 to 16, but it could with a bit
of work.

0007 - trivial, but could use a better comment. I also need to make
sure stats reporting works (which may also need some cleanup work).

0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago
whether DSA could just free all our allocated segments without
throwing away the DSA, and that's still a good question.

0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think
this name is better than RT_UPDATE_ITER_STACK, so maybe we should go
back to that). The assert doesn't fire, so I guess it does what it's
supposed to? For me, the iteration logic is still the most confusing
piece of the whole radix tree. Maybe that could be helped with some
better variable names, but I wonder if it needs more invasive work. I
confess I don't have better ideas for how it could work differently.

0010 - some fixes for number-of-children accounting in node256

0011 - long-overdue pgindent of radixtree.h, without trying to fix
things up afterwards. Feel free to throw it out and redo it if this
interferes with ongoing work.

The rest are from your v46. The bench doesn't work for tid store
anymore, so I squashed "disable bench for CI" until we get back to
that.

Some more review comments (note: patch numbers are for v47, but I
changed nothing from v46 in this area):

0013:

+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.

Recently outdated. The variable-length values seem to work, so let's
make everything match.

+#define MAX_TUPLES_PER_PAGE  MaxOffsetNumber

Maybe we don't need this macro anymore? The name no longer fits, in
any case.
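(Circling back to 0006 for a moment: here's a rough sketch of the
copy-and-skip idea, purely for illustration -- the function name,
parameters, and node layout are made-up simplifications, not the patch
code:)

	/*
	 * Illustration only: copy the chunk and child arrays into the
	 * smaller node, deleting one slot by simply not copying it.
	 */
	static void
	copy_arrays_and_delete(uint8 *dst_chunks, void **dst_children,
						   const uint8 *src_chunks, void *const *src_children,
						   int count, int deletepos)
	{
		for (int i = 0, j = 0; i < count; i++)
		{
			if (i == deletepos)
				continue;		/* skipping the slot is the deletion */
			dst_chunks[j] = src_chunks[i];
			dst_children[j] = src_children[i];
			j++;
		}
	}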
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+						int num_offsets)
+{
+	char	buf[MaxBlocktableEntrySize];
+	BlocktableEntry *page = (BlocktableEntry *) buf;

I'm not sure this is safe with alignment. Maybe rather than plain
"char", it needs to be a union with BlocktableEntry, or something
(sketched below).

+static inline BlocktableEntry *
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
+{
+	if (TidStoreIsShared(iter->ts))
+		return shared_rt_iterate_next(iter->tree_iter.shared, key);
+
+	return local_rt_iterate_next(iter->tree_iter.local, key);
+}

In the old encoding scheme this function did something important, but
now it's a useless wrapper with one caller.

+	/*
+	 * In the shared case, TidStoreControl and radix_tree are backed by the
+	 * same DSA area and rt_memory_usage() returns the value including both.
+	 * So we don't need to add the size of TidStoreControl separately.
+	 */
+	if (TidStoreIsShared(ts))
+		return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+	return sizeof(TidStore) + sizeof(TidStoreControl) +
+		local_rt_memory_usage(ts->tree.local);

I don't see the point in including these tiny structs, since we will
always have blown past the limit by a number of kilobytes (at least,
often megabytes or more) by the time it happens.

+	iter->output.max_offset = 64;

Maybe this needs a comment that it's just a starting size, not
anything particular.

+	iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset);

+		/* Make sure there is enough space to add offsets */
+		if (result->num_offsets + bmw_popcount(w) > result->max_offset)
+		{
+			result->max_offset *= 2;
+			result->offsets = repalloc(result->offsets,
+									   sizeof(OffsetNumber) * result->max_offset);
+		}

popcount()-ing for every array element in every value is expensive --
let's just add sizeof(bitmapword). It's not that wasteful, but then
the initial max will need to be 128.

About separation of responsibilities for locking: the only place where
the tid store is currently not locked is tree iteration, which is a
strange exception. Also, we've recently made RT_FIND return a pointer,
so the caller must somehow hold a share lock, but I think we haven't
exposed to callers the ability to do that, and we rely on the tid
store lock instead. We have a mix of tree locking and tid store
locking, and we will need to consider carefully how to make this more
clear, maintainable, and understandable.

0015:

"XXX: some regression test fails since this commit changes the minimum
m_w_m to 2048 from 1024. This was necessary for the previous memory"

This shouldn't fail anymore if the "one-place" clamp was in a patch
before this. If so, let's take out that GUC change and worry about
min/max size separately. If it still fails, I'd like to know why.

- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- *							  vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the TID store.

What I was getting at earlier is that the first line here doesn't
really need to change; we can just s/array/store/ ?

-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
-					  int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+					  OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+					  Buffer vmbuffer)

"buffer" should still come after "blkno", so that line doesn't need to
change.
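(On the alignment question in 0013 above: one possible shape, just a
sketch on top of the quoted code, assuming the existing BlocktableEntry
and MaxBlocktableEntrySize definitions:)

	/*
	 * Sketch: the union forces the buffer to be suitably aligned for
	 * BlocktableEntry, which a bare char array on the stack doesn't
	 * guarantee.
	 */
	union
	{
		BlocktableEntry force_align;
		char		buf[MaxBlocktableEntrySize];
	}			data;
	BlocktableEntry *page = (BlocktableEntry *) data.buf;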
$ git diff master -- src/backend/access/heap/ | grep has_lpdead_items
-	bool		has_lpdead_items;	/* includes existing LP_DEAD items */
-	 * pruning and freezing. all_visible implies !has_lpdead_items, but don't
-	Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
-	if (prunestate.has_lpdead_items)
-	else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
-	if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
-	prunestate->has_lpdead_items = false;
-	prunestate->has_lpdead_items = true;

In a green field, it'd be fine to replace these with an expression of
"num_offsets", but it adds a bit of noise for reviewers and the git
log. Is it really necessary?

-		deadoffsets[lpdead_items++] = offnum;
+		prunestate->deadoffsets[prunestate->num_offsets++] = offnum;

I'm also not quite sure why "deadoffsets" and "lpdead_items" got moved
to the PruneState. The latter was renamed in a way that makes more
sense, but I don't see why the churn is necessary.

@@ -1875,28 +1882,9 @@ lazy_scan_prune(LVRelState *vacrel,
 	}
 #endif

-	/*
-	 * Now save details of the LP_DEAD items from the page in vacrel
-	 */
-	if (lpdead_items > 0)
+	if (prunestate->num_offsets > 0)
 	{
-		VacDeadItems *dead_items = vacrel->dead_items;
-		ItemPointerData tmp;
-
 		vacrel->lpdead_item_pages++;
-		prunestate->has_lpdead_items = true;
-
-		ItemPointerSetBlockNumber(&tmp, blkno);
-
-		for (int i = 0; i < lpdead_items; i++)
-		{
-			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
-			dead_items->items[dead_items->num_items++] = tmp;
-		}
-
-		Assert(dead_items->num_items <= dead_items->max_items);
-		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 dead_items->num_items);

I don't understand why this block got removed while nothing new is
adding anything to the tid store.

@@ -1087,7 +1088,16 @@ lazy_scan_heap(LVRelState *vacrel)
			 * with prunestate-driven visibility map and FSM steps (just like
			 * the two-pass strategy).
			 */
-			Assert(dead_items->num_items == 0);
+			Assert(TidStoreNumTids(dead_items) == 0);
+		}
+		else if (prunestate.num_offsets > 0)
+		{
+			/* Save details of the LP_DEAD items from the page in dead_items */
+			TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+									prunestate.num_offsets);
+
+			pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+										 TidStoreMemoryUsage(dead_items));

I guess it was added here, 800 lines away? If so, why?

About progress reporting: I want to make sure no one is going to miss
counting "num_dead_tuples". It's no longer relevant for the number of
index scans we need to do, but do admins still have a use for it?
Something to think about later (a trivial sketch at the end of this
mail).

0017:

+	/*
+	 * max_bytes is forced to be at least 64kB, the current minimum valid
+	 * value for the work_mem GUC.
+	 */
+	max_bytes = Max(64 * 1024L, max_bytes);

If this still needs to be here, I still don't understand why.
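(Going back to the progress reporting question: if admins do still
want "num_dead_tuples", keeping it updated looks trivial --
hypothetically, using the TidStoreNumTids() seen in the hunk above:)

	/* Hypothetical: report the tid count alongside the byte count */
	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
								 TidStoreNumTids(dead_items));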