Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date | |
Msg-id | CAD21AoDQw3DTQkq_VU_1B5qytXwtDRjrvB5EOHrfAuBZanah0w@mail.gmail.com Whole thread Raw |
In response to | Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>) |
Responses |
Re: [PoC] Improve dead tuple storage for lazy vacuum
|
List | pgsql-hackers |
On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john.naylor@enterprisedb.com> wrote: > > On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I justwant to share some progress on making the radix tree usable not only for that, but hopefully a wider range of applications,while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no iteration)and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09 ). > > > > > > > Thank you for making progress on this. I agree with these directions > > overall. I have some comments and questions: > > Glad to hear it and thanks for looking! > > > > * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) -- currentlynode32's two size classes work fine for growing, but the code should be simplified before extending to other cases.) > > > > Within the size class, we just alloc a new node of lower size class > > and do memcpy(). I guess it will be almost same as what we do for > > growing. > > Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward. I'mthinking the hard-coded numbers like this: > > .fanout = 3, > .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC), > > ...may be better with a #defined symbol that can also be used elsewhere. FWIW, exposing these definitions would be good in terms of testing too since we can use them in regression tests. > > > I don't think > > shrinking class-3 to class-1 makes sense. > > Agreed. The smallest kind should just be freed when empty. > > > > Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves" basedon the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level childslots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256 leaf',since zero is a valid value. Aside from that, this was a small change. > > > > Yes, but it also means that we use pointer-sized value anyway even if > > the value size is less than that, which wastes the memory, no? > > At a low-level, that makes sense, but I've found an interesting global effect showing the opposite: _less_ memory, whichmay compensate: > > psql -c "select * from bench_search_random_nodes(1*1000*1000)" > num_keys = 992660 > > (using a low enough number that the experimental change n125->n63 doesn't affect anything) > height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025 > > v31: > mem_allocated | load_ms | search_ms > ---------------+---------+----------- > 47800768 | 253 | 134 > > (unreleased code "similar" to v33, but among other things restores the separate "extend down" function) > mem_allocated | load_ms | search_ms > ---------------+---------+----------- > 42926048 | 221 | 127 > > I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are embeddedhere) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that additionalsize classes have both runtime costs as well as benefits. We need to have a balance.) Interesting. The result would probably vary if we change the slab block sizes. I'd like to experiment if the code is available. > > So, I'm inclined to think the only reason to prefer "multi-value leaves" is if 1) the value type is _bigger_ than a pointer2) there is no convenient abbreviation (like tid bitmaps have) and 3) the use case really needs to avoid another memoryaccess. Under those circumstances, though, the new code plus lazy expansion etc might suit and be easier to maintain. Indeed. > > > > What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster (untested)because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to extendthis idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go there,and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time infowill require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in thenode3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes (nostruct padding). > > > > Sounds good. > > The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now thinkinga much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values" canuse the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit fora tag "value or pointer-to-leaf" would have a much smaller code footprint. Do you mean we can make sure that the value doesn't set the lowest bit? Or is it an optimization for TIDStore? > In addition, without a new bitmap, the smallest node can actually be up to a node5 with no struct padding, with a node2as a subclass. (Those numbers coincidentally were also one scenario in the paper, when calculating worst-case memoryusage). That's worth considering. Agreed. FWIW please let me know if there are some experiments I can help with. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: