Home > mailing lists

Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From	Masahiko Sawada
Subject	Re: [PoC] Improve dead tuple storage for lazy vacuum
Date	June 13, 2023 05:46:57
Msg-id	CAD21AoDQw3DTQkq_VU_1B5qytXwtDRjrvB5EOHrfAuBZanah0w@mail.gmail.com Whole thread
In response to	Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>)
Responses	Re: [PoC] Improve dead tuple storage for lazy vacuum
List	pgsql-hackers

Tree view

On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > Sometime in the not-too-distant future, I will start a new thread focusing on bitmap heap scan, but for now, I
justwant to share some progress on making the radix tree usable not only for that, but hopefully a wider range of
applications,while making the code simpler and the binary smaller. The attached patches are incomplete (e.g. no
iteration)and quite a bit messy, so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 + 0007-09 ). 
> > >
> >
> > Thank you for making progress on this. I agree with these directions
> > overall. I have some comments and questions:
>
> Glad to hear it and thanks for looking!
>
> > > * Other nodes will follow in due time, but only after I figure out how to do it nicely (ideas welcome!) --
currentlynode32's two size classes work fine for growing, but the code should be simplified before extending to other
cases.)
> >
> > Within the size class, we just alloc a new node of lower size class
> > and do memcpy(). I guess it will be almost same as what we do for
> > growing.
>
> Oh, the memcpy part is great, very simple. I mean the (compile-time) "class info" table lookups are a bit awkward.
I'mthinking the hard-coded numbers like this: 
>
> .fanout = 3,
> .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
>
> ...may be better with a #defined symbol that can also be used elsewhere.

FWIW, exposing these definitions would be good in terms of testing too
since we can use them in regression tests.

>
> > I don't think
> > shrinking class-3 to class-1 makes sense.
>
> Agreed. The smallest kind should just be freed when empty.
>
> > > Limited support for "combined pointer-value slots". At compile-time, choose either that or "single-value leaves"
basedon the size of the value type template parameter. Values that are pointer-sized or less can fit in the last-level
childslots of nominal "inner nodes" without duplicated leaf-node code. Node256 now must act like the previous 'node256
leaf',since zero is a valid value. Aside from that, this was a small change. 
> >
> > Yes, but it also means that we use pointer-sized value anyway even if
> > the value size is less than that, which wastes the memory, no?
>
> At a low-level, that makes sense, but I've found an interesting global effect showing the opposite: _less_ memory,
whichmay compensate: 
>
> psql -c "select * from bench_search_random_nodes(1*1000*1000)"
> num_keys = 992660
>
> (using a low enough number that the experimental change n125->n63 doesn't affect anything)
> height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025
>
> v31:
>  mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
>       47800768 |     253 |       134
>
> (unreleased code "similar" to v33, but among other things restores the separate "extend down" function)
>  mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
>       42926048 |     221 |       127
>
> I'd need to make sure, but apparently just going from 6 non-empty memory contexts to 3 (remember all values are
embeddedhere) reduces memory fragmentation significantly in this test. (That should also serve as a demonstration that
additionalsize classes have both runtime costs as well as benefits. We need to have a balance.) 

Interesting. The result would probably vary if we change the slab
block sizes. I'd like to experiment if the code is available.

>
> So, I'm inclined to think the only reason to prefer "multi-value leaves" is if 1) the value type is _bigger_ than a
pointer2) there is no convenient abbreviation (like tid bitmaps have) and 3) the use case really needs to avoid another
memoryaccess. Under those circumstances, though, the new code plus lazy expansion etc might suit and be easier to
maintain.

Indeed.

>
> > > What I've shared here could work (in principal, since it uses uint64 values) for tidstore, possibly faster
(untested)because of better code density, but as mentioned I want to shoot for higher. For tidbitmap.c, I want to
extendthis idea and branch at run-time on a per-value basis, so that a page-table entry that fits in a pointer can go
there,and if not, it'll be a full leaf. (This technique enables more flexibility in lossifying pages as well.) Run-time
infowill require e.g. an additional bit per slot. Since the node header is now 3 bytes, we can spare one more byte in
thenode3 case. In addition, we can and should also bump it back up to node4, still keeping the metadata within 8 bytes
(nostruct padding). 
> >
> > Sounds good.
>
> The additional bit per slot would require per-node logic and additional branches, which is not great. I'm now
thinkinga much easier way to get there is to give up (at least for now) on promising that "run-time embeddable values"
canuse the full pointer-size (unlike value types found embeddable at compile-time). Reserving the lowest pointer bit
fora tag "value or pointer-to-leaf" would have a much smaller code footprint. 

Do you mean we can make sure that the value doesn't set the lowest
bit? Or is it an optimization for TIDStore?

> In addition, without a new bitmap, the smallest node can actually be up to a node5 with no struct padding, with a
node2as a subclass. (Those numbers coincidentally were also one scenario in the paper, when calculating worst-case
memoryusage). That's worth considering. 

Agreed.

FWIW please let me know if there are some experiments I can help with.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

pgsql-hackers by date:

From: Amit Kapila
Date: 13 June 2023, 05:15:28
Subject: Re: Support logical replication of DDLs

From: Michael Paquier
Date: 13 June 2023, 05:50:30
Subject: Re: Improve logging when using Huge Pages

Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

Previous

Next