RE: Logical Replica ReorderBuffer Size Accounting Issues - Mailing list pgsql-bugs
From: Wei Wang (Fujitsu)
Subject: RE: Logical Replica ReorderBuffer Size Accounting Issues
Date:
Msg-id: OSZPR01MB6278C3FCBCE47A42CCF05DE99E409@OSZPR01MB6278.jpnprd01.prod.outlook.com
In response to: Re: Logical Replica ReorderBuffer Size Accounting Issues (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: Re: Logical Replica ReorderBuffer Size Accounting Issues
List: pgsql-bugs
On Thu, May 9, 2023 at 22:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, May 9, 2023 at 6:06 PM Wei Wang (Fujitsu) wrote:
> > > I think there are two separate issues. One is a pure memory accounting
> > > issue: since the reorderbuffer accounts the memory usage by
> > > calculating actual tuple size etc. it includes neither the chunk
> > > header size nor fragmentation within blocks. So I can understand why
> > > the output of MemoryContextStats(rb->context) could be two or three
> > > times higher than logical_decoding_work_mem and doesn't match rb->size
> > > in some cases.
> > >
> > > However it cannot explain the original issue that the memory usage
> > > (reported by MemoryContextStats(rb->context)) reached 5GB in spite of
> > > logical_decoding_work_mem being 256MB, which seems like a memory leak
> > > bug or something that ignores the memory limit.
> >
> > Yes, I agree that the chunk header size or fragmentation within blocks may
> > cause the allocated space to be larger than the accounted space. However,
> > since these spaces are very small (please refer to [1] and [2]), I also
> > don't think this is the cause of the original issue in this thread.
> >
> > I think that the cause of the original issue in this thread is the
> > implementation of the generational allocator.
> > Please consider the following user scenario:
> > The parallel execution of different transactions led to very fragmented
> > and mixed-up WAL records for those transactions. Later, when walsender
> > serially decodes the WAL, different transactions' chunks are stored on a
> > single block in rb->tup_context. However, when a transaction ends, the
> > chunks related to this transaction on the block are marked as free
> > instead of being actually released. The block is only released when all
> > chunks in the block are free; in other words, the block is only released
> > when all transactions occupying the block have ended. As a result, the
> > chunks allocated by already-ended transactions are not released on many
> > blocks for a long time, and this issue occurred. I think this also
> > explains why parallel execution is more likely to trigger this issue
> > compared to serial execution of transactions.
> > Please also refer to the analysis details of the code in [3].
>
> After some investigation, I don't think the implementation of the
> generation allocator is problematic, but I agree that your scenario is
> likely to explain the original issue. Especially, the output of
> MemoryContextStats() shows:
>
> Tuples: 4311744512 total in 514 blocks (12858943 chunks);
> 6771224 free (12855411 chunks); 4304973288 used
>
> First, since the total memory allocation was 4311744512 bytes in 514
> blocks, we can see there were no special blocks in the context (8MB *
> 514 = 4311744512 bytes). Second, it shows that most chunks were free
> (12855411 of 12858943 chunks) but most memory was used (4304973288 of
> 4311744512 bytes), which means that there were some in-use chunks at the
> tail of each block, i.e. most blocks were fragmented. I've attached
> another test to reproduce this behavior. In this test, the memory usage
> reaches up to almost 4GB.
>
> One idea to deal with this issue is to choose the block sizes carefully
> while measuring the performance, as the comment shows:
>
>     /*
>      * XXX the allocation sizes used below pre-date generation context's block
>      * growing code. These values should likely be benchmarked and set to
>      * more suitable values.
>      */
>     buffer->tup_context = GenerationContextCreate(new_ctx,
>                                                   "Tuples",
>                                                   SLAB_LARGE_BLOCK_SIZE,
>                                                   SLAB_LARGE_BLOCK_SIZE,
>                                                   SLAB_LARGE_BLOCK_SIZE);
>
> For example, if I use SLAB_DEFAULT_BLOCK_SIZE, 8kB, the maximum memory
> usage was about 17MB in the test.

Thanks for your idea.

I did some tests as you suggested. I think the modification mentioned above
can work around this issue in the test 002_rb_memory_2.pl on [1] (to reach
the size of large transactions, I set logical_decoding_work_mem to 1MB).
But the test repreduce.sh on [2] still reproduces this issue. It seems that
this modification fixes a subset of use cases, but the issue still occurs
for other use cases.
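To be concrete, the modification I tested is essentially the call quoted
above with the three block sizes switched from SLAB_LARGE_BLOCK_SIZE (8MB)
to SLAB_DEFAULT_BLOCK_SIZE (8kB). The exact change is in the attached
tmp-modification.patch; the following is only a sketch of it:

    /* In ReorderBufferAllocate(), use 8kB generation blocks for tuple data. */
    buffer->tup_context = GenerationContextCreate(new_ctx,
                                                  "Tuples",
                                                  SLAB_DEFAULT_BLOCK_SIZE,
                                                  SLAB_DEFAULT_BLOCK_SIZE,
                                                  SLAB_DEFAULT_BLOCK_SIZE);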
I think that the size of a block may lead to differences in the number of
transactions whose changes are stored on the block. For example, before the
modification, a block could store some changes of 10 transactions, but after
the modification, a block may only store some changes of 3 transactions.
This means that once those three transactions are committed, the block will
actually be released. As a result, the probability of a block actually being
released is increased by the modification. Additionally, I think that the
parallelism of the test repreduce.sh is higher than that of the test
002_rb_memory_2.pl, which is also why this modification only fixed the issue
in the test 002_rb_memory_2.pl. Please let me know if I'm missing something.

Attached are the modification patch that I used (tmp-modification.patch),
as well as the two tests mentioned above.

[1] - https://www.postgresql.org/message-id/CAD21AoAa17DCruz4MuJ_5Q_-JOp5FmZGPLDa%3DM9d%2BQzzg8kiBw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/OS3PR01MB6275A7E5323601D59D18DB979EC29%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei