Thread: Re: Using per-transaction memory contexts for storing decoded tuples
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > We have several reports that logical decoding uses memory much more > than logical_decoding_work_mem[1][2][3]. For instance in one of the > reports[1], even though users set logical_decoding_work_mem to > '256MB', a walsender process was killed by OOM because of using more > than 4GB memory. > > As per the discussion in these threads so far, what happened was that > there was huge memory fragmentation in rb->tup_context. > rb->tup_context uses GenerationContext with 8MB memory blocks. We > cannot free memory blocks until all memory chunks in the block are > freed. If there is a long-running transaction making changes, its > changes could be spread across many memory blocks and we end up not > being able to free memory blocks unless the long-running transaction > is evicted or completed. Since we don't account fragmentation, block > header size, nor chunk header size into per-transaction memory usage > (i.e. txn->size), rb->size could be less than > logical_decoding_work_mem but the actual allocated memory in the > context is much higher than logical_decoding_work_mem. > It is not clear to me how the fragmentation happens. Is it because of some interleaving transactions which are even ended but the memory corresponding to them is not released? Can we try reducing the size of 8MB memory blocks? The comment atop allocation says: "XXX the allocation sizes used below pre-date generation context's block growing code. These values should likely be benchmarked and set to more suitable values.", so do we need some tuning here? > We can reproduce this issue with the attached patch > rb_excessive_memory_reproducer.patch (not intended to commit) that > adds a memory usage reporting and includes the test. After running the > tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory > usage report in the server logs like follows: > > LOG: reorderbuffer memory: logical_decoding_work_mem=268435456 > rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264 > > Which means that the logical decoding allocated 1GB memory in spite of > logical_decoding_work_mem being 256MB. > > One idea to deal with this problem is that we use > MemoryContextMemAllocated() as the reorderbuffer's memory usage > instead of txn->total_size. That is, we evict transactions until the > value returned by MemoryContextMemAllocated() gets lower than > logical_decoding_work_mem. However, it could end up evicting a lot of > (possibly all) transactions because the transaction whose decoded > tuples data are spread across memory context blocks could be evicted > after all other larger transactions are evicted (note that we evict > transactions from largest to smallest). Therefore, if we want to do > that, we would need to change the eviction algorithm to for example > LSN order instead of transaction size order. Furthermore, > reorderbuffer's changes that are counted in txn->size (and therefore > rb->size too) are stored in different memory contexts depending on the > kinds. For example, decoded tuples are stored in rb->context, > ReorderBufferChange are stored in rb->change_context, and snapshot > data are stored in builder->context. So we would need to sort out > which data is stored into which memory context and which memory > context should be accounted for in the reorderbuffer's memory usage. > Which could be a large amount of work. > > Another idea is to have memory context for storing decoded tuples per > transaction. 
That way, we can ensure that all memory blocks of the > context are freed when the transaction is evicted or completed. I > think that this idea would be simpler and worth considering, so I > attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the > decoded tuple data is created individually when the corresponding WAL > record is decoded but is released collectively (e.g., at > transaction eviction), the bump memory context would fit > this case best. One exception is that we immediately free the decoded > tuple data if the transaction is already aborted. However, in this > case, I think it is possible to skip the WAL decoding in the first > place. > > One thing we need to consider is that the number of transaction > entries and memory contexts in the reorderbuffer could be very high > since it has entries for both top-level transactions and > sub-transactions. To deal with that, the patch changes things so that decoded > tuples of a subtransaction are stored in its top-level transaction's > tuple context. > Won't that impact the calculation for ReorderBufferLargestTXN(), which can select either a subtransaction or a top-level xact? > Since we always evict sub-transactions and their top-level > transaction at the same time, I think it would not be a problem. > Checking for performance degradation due to using many memory contexts > would have to be done. > The generation context was introduced in commit a4ccc1ce, which claims that it significantly reduced logical decoding's memory usage and improved its performance. Won't changing it require us to validate all the cases that led us to use the generation context in the first place? -- With Regards, Amit Kapila.
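To illustrate the shape of the per-transaction tuple context idea discussed above, here is a rough sketch. It is not code from use_tup_context_per_ctx_v1.patch; the tup_context field and the helper function are hypothetical names, and it assumes the existing BumpContextCreate() API and rbtxn_* helpers.

/*
 * Rough sketch only: the tup_context field and this helper are hypothetical.
 * Each top-level transaction lazily gets its own bump context for decoded
 * tuple data; subtransactions share their top-level transaction's context.
 */
static MemoryContext
ReorderBufferGetTupleContext(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    ReorderBufferTXN *toptxn = rbtxn_is_subtxn(txn) ? rbtxn_get_toptxn(txn) : txn;

    if (toptxn->tup_context == NULL)
        toptxn->tup_context = BumpContextCreate(rb->context,
                                                "txn tuples",
                                                SLAB_DEFAULT_BLOCK_SIZE,
                                                SLAB_DEFAULT_BLOCK_SIZE,
                                                SLAB_LARGE_BLOCK_SIZE);
    return toptxn->tup_context;
}

/*
 * When the top-level transaction is evicted, streamed, or finished, the whole
 * context would be dropped in one call, returning every block at once:
 *
 *     MemoryContextDelete(toptxn->tup_context);
 */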
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > We have several reports that logical decoding uses memory much more > > than logical_decoding_work_mem[1][2][3]. For instance in one of the > > reports[1], even though users set logical_decoding_work_mem to > > '256MB', a walsender process was killed by OOM because of using more > > than 4GB memory. > > > > As per the discussion in these threads so far, what happened was that > > there was huge memory fragmentation in rb->tup_context. > > rb->tup_context uses GenerationContext with 8MB memory blocks. We > > cannot free memory blocks until all memory chunks in the block are > > freed. If there is a long-running transaction making changes, its > > changes could be spread across many memory blocks and we end up not > > being able to free memory blocks unless the long-running transaction > > is evicted or completed. Since we don't account fragmentation, block > > header size, nor chunk header size into per-transaction memory usage > > (i.e. txn->size), rb->size could be less than > > logical_decoding_work_mem but the actual allocated memory in the > > context is much higher than logical_decoding_work_mem. > > > > It is not clear to me how the fragmentation happens. Is it because of > some interleaving transactions which are even ended but the memory > corresponding to them is not released? In a generation context, we can free a memory block only when all memory chunks there are freed. Therefore, individual tuple buffers are already pfree()'ed but the underlying memory blocks are not freed. > Can we try reducing the size of > 8MB memory blocks? The comment atop allocation says: "XXX the > allocation sizes used below pre-date generation context's block > growing code. These values should likely be benchmarked and set to > more suitable values.", so do we need some tuning here? Reducing the size of the 8MB memory block would be one solution and could be better as it could be back-patchable. It would mitigate the problem but would not resolve it. I agree to try reducing it and do some benchmark tests. If it reasonably makes the problem less likely to happen, it would be a good solution. > > > We can reproduce this issue with the attached patch > > rb_excessive_memory_reproducer.patch (not intended to commit) that > > adds a memory usage reporting and includes the test. After running the > > tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory > > usage report in the server logs like follows: > > > > LOG: reorderbuffer memory: logical_decoding_work_mem=268435456 > > rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264 > > > > Which means that the logical decoding allocated 1GB memory in spite of > > logical_decoding_work_mem being 256MB. > > > > One idea to deal with this problem is that we use > > MemoryContextMemAllocated() as the reorderbuffer's memory usage > > instead of txn->total_size. That is, we evict transactions until the > > value returned by MemoryContextMemAllocated() gets lower than > > logical_decoding_work_mem. However, it could end up evicting a lot of > > (possibly all) transactions because the transaction whose decoded > > tuples data are spread across memory context blocks could be evicted > > after all other larger transactions are evicted (note that we evict > > transactions from largest to smallest). 
Therefore, if we want to do > > that, we would need to change the eviction algorithm to for example > > LSN order instead of transaction size order. Furthermore, > > reorderbuffer's changes that are counted in txn->size (and therefore > > rb->size too) are stored in different memory contexts depending on the > > kinds. For example, decoded tuples are stored in rb->context, > > ReorderBufferChange are stored in rb->change_context, and snapshot > > data are stored in builder->context. So we would need to sort out > > which data is stored into which memory context and which memory > > context should be accounted for in the reorderbuffer's memory usage. > > Which could be a large amount of work. > > > > Another idea is to have memory context for storing decoded tuples per > > transaction. That way, we can ensure that all memory blocks of the > > context are freed when the transaction is evicted or completed. I > > think that this idea would be simpler and worth considering, so I > > attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the > > decoded tuple data is created individually when the corresponding WAL > > record is decoded but is released collectively when it is released > > (e.g., transaction eviction), the bump memory context would fit the > > best this case. One exception is that we immediately free the decoded > > tuple data if the transaction is already aborted. However, in this > > case, I think it is possible to skip the WAL decoding in the first > > place. > > > > One thing we need to consider is that the number of transaction > > entries and memory contexts in the reorderbuffer could be very high > > since it has entries for both top-level transaction entries and > > sub-transactions. To deal with that, the patch changes that decoded > > tuples of a subtransaction are stored into its top-level transaction's > > tuple context. > > > > Won't that impact the calculation for ReorderBufferLargestTXN() which > can select either subtransaction or top-level xact? Yeah, I missed that we could evict only sub-transactions when choosing the largest transaction. We need to find a better solution. > > > We always evict sub-transactions and its top-level > > transaction at the same time, I think it would not be a problem. > > Checking performance degradation due to using many memory contexts > > would have to be done. > > > > The generation context has been introduced in commit a4ccc1ce which > claims that it has significantly reduced logical decoding's memory > usage and improved its performance. Won't changing it requires us to > validate all the cases which led us to use generation context in the > first place? Agreed. Will investigate the thread and check the cases. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > We have several reports that logical decoding uses memory much more > > > than logical_decoding_work_mem[1][2][3]. For instance in one of the > > > reports[1], even though users set logical_decoding_work_mem to > > > '256MB', a walsender process was killed by OOM because of using more > > > than 4GB memory. > > > > > > As per the discussion in these threads so far, what happened was that > > > there was huge memory fragmentation in rb->tup_context. > > > rb->tup_context uses GenerationContext with 8MB memory blocks. We > > > cannot free memory blocks until all memory chunks in the block are > > > freed. If there is a long-running transaction making changes, its > > > changes could be spread across many memory blocks and we end up not > > > being able to free memory blocks unless the long-running transaction > > > is evicted or completed. Since we don't account fragmentation, block > > > header size, nor chunk header size into per-transaction memory usage > > > (i.e. txn->size), rb->size could be less than > > > logical_decoding_work_mem but the actual allocated memory in the > > > context is much higher than logical_decoding_work_mem. > > > > > > > It is not clear to me how the fragmentation happens. Is it because of > > some interleaving transactions which are even ended but the memory > > corresponding to them is not released? > > In a generation context, we can free a memory block only when all > memory chunks there are freed. Therefore, individual tuple buffers are > already pfree()'ed but the underlying memory blocks are not freed. > I understood this part but didn't understand the cases leading to this problem. For example, if there is a large (and only) transaction in the system that allocates many buffers for change records during decoding, in the end, it should free memory for all the buffers allocated in the transaction. So, wouldn't that free all the memory chunks corresponding to the memory blocks allocated? My guess was that we couldn't free all the chunks because there were small interleaving transactions that allocated memory but didn't free it before the large transaction ended. > > Can we try reducing the size of > > 8MB memory blocks? The comment atop allocation says: "XXX the > > allocation sizes used below pre-date generation context's block > > growing code. These values should likely be benchmarked and set to > > more suitable values.", so do we need some tuning here? > > Reducing the size of the 8MB memory block would be one solution and > could be better as it could be back-patchable. It would mitigate the > problem but would not resolve it. I agree to try reducing it and do > some benchmark tests. If it reasonably makes the problem less likely > to happen, it would be a good solution. > makes sense. -- With Regards, Amit Kapila.
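For reference, the allocation that the quoted XXX comment sits atop is the tuple context set up in ReorderBufferAllocate(), roughly as below (paraphrased from reorderbuffer.c); reducing the block size would just mean passing smaller sizes here.

/*
 * Paraphrased from ReorderBufferAllocate(): decoded tuple data for all
 * transactions goes into this single generation context, and
 * SLAB_LARGE_BLOCK_SIZE is 8MB, so a single long-lived chunk can keep a
 * whole 8MB block from being freed.
 */
buffer->tup_context = GenerationContextCreate(new_ctx,
                                              "Tuples",
                                              SLAB_LARGE_BLOCK_SIZE,
                                              SLAB_LARGE_BLOCK_SIZE,
                                              SLAB_LARGE_BLOCK_SIZE);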
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > We have several reports that logical decoding uses memory much more > > > > than logical_decoding_work_mem[1][2][3]. For instance in one of the > > > > reports[1], even though users set logical_decoding_work_mem to > > > > '256MB', a walsender process was killed by OOM because of using more > > > > than 4GB memory. > > > > > > > > As per the discussion in these threads so far, what happened was that > > > > there was huge memory fragmentation in rb->tup_context. > > > > rb->tup_context uses GenerationContext with 8MB memory blocks. We > > > > cannot free memory blocks until all memory chunks in the block are > > > > freed. If there is a long-running transaction making changes, its > > > > changes could be spread across many memory blocks and we end up not > > > > being able to free memory blocks unless the long-running transaction > > > > is evicted or completed. Since we don't account fragmentation, block > > > > header size, nor chunk header size into per-transaction memory usage > > > > (i.e. txn->size), rb->size could be less than > > > > logical_decoding_work_mem but the actual allocated memory in the > > > > context is much higher than logical_decoding_work_mem. > > > > > > > > > > It is not clear to me how the fragmentation happens. Is it because of > > > some interleaving transactions which are even ended but the memory > > > corresponding to them is not released? > > > > In a generation context, we can free a memory block only when all > > memory chunks there are freed. Therefore, individual tuple buffers are > > already pfree()'ed but the underlying memory blocks are not freed. > > > > I understood this part but didn't understand the cases leading to this > problem. For example, if there is a large (and only) transaction in > the system that allocates many buffers for change records during > decoding, in the end, it should free memory for all the buffers > allocated in the transaction. So, wouldn't that free all the memory > chunks corresponding to the memory blocks allocated? My guess was that > we couldn't free all the chunks because there were small interleaving > transactions that allocated memory but didn't free it before the large > transaction ended. We haven't actually checked with the person who reported the problem, so this is just a guess, but I think there were concurrent transactions, including long-running INSERT transactions. For example, suppose a transaction that inserts 10 million rows and many OLTP-like (short) transactions are running at the same time. The scenario I thought of was that one 8MB Generation Context Block contains 1MB of the large insert transaction changes, and the other 7MB contains short OLTP transaction changes. If there are just 256 such blocks, even after all short-transactions have completed, the Generation Context will have allocated 2GB of memory until we decode the commit record of the large transaction, but rb->size will say 256MB. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Can we try reducing the size of > > > 8MB memory blocks? The comment atop allocation says: "XXX the > > > allocation sizes used below pre-date generation context's block > > > growing code. These values should likely be benchmarked and set to > > > more suitable values.", so do we need some tuning here? > > > > Reducing the size of the 8MB memory block would be one solution and > > could be better as it could be back-patchable. It would mitigate the > > problem but would not resolve it. I agree to try reducing it and do > > some benchmark tests. If it reasonably makes the problem less likely > > to happen, it would be a good solution. > > > > makes sense. I've done some benchmark tests for three different code bases with different test cases. In short, reducing the generation memory context block size to 8kB seems to be promising; it mitigates the problem while keeping a similar performance. Here are three code bases that I used: * head: current head code. * per-tx-bump: the proposed idea (with a slight change; each sub and top-level transactions have its own bump memory context to store decoded tuples). * 8kb-mem-block: same as head except for changing the generation memory block size from 8MB to 8kB. And here are test cases and results: 1. Memory usage check I've run the test that I shared before and checked the maximum amount of memory allocated in the reorderbuffer context shown by MemoryContextMemAllocated(). Here are results: head: 2.1GB (while rb->size showing 43MB) per-tx-bump: 50MB (while rb->size showing 43MB) 8kb-mem-block: 54MB (while rb->size showing 43MB) I've confirmed that the excessive memory usage issue didn't happen in the per-tx-bump case and the 8kb-mem-block cases. 2. Decoding many sub transactions IIUC this kind of workload was a trigger to make us invent the Generation Context for logical decoding[1]. The single top-level transaction has 1M sub-transactions each of which insert a tuple. Here are results: head: 31694.163 ms (00:31.694) per-tx-bump: 32661.752 ms (00:32.662) 8kb-mem-block: 31834.872 ms (00:31.835) The head and 8kb-mem-block showed similar results whereas I see there is a bit of regression on per-tx-bump. I think this is because of the overhead of creating and deleting memory contexts for each sub-transactions. 3. Decoding a big transaction The next test case I did is to decode a single big transaction that inserts 10M rows. I set logical_decoding_work_mem large enough to avoid spilling behavior. Here are results: head: 19859.113 ms (00:19.859) per-tx-bump: 19422.308 ms (00:19.422) 8kb-mem-block: 19923.600 ms (00:19.924) There were no big differences. FYI, I also checked the maximum memory usage for this test case as well: head: 1.53GB per-tx-bump: 1.4GB 8kb-mem-block: 1.53GB The per-tx-bump used a bit lesser memory probably thanks to bump memory contexts. 4. Decoding many short transactions. The last test case I did is to decode a bunch of short pgbench transactions (10k transactions). Here are results: head: 31694.163 ms (00:31.694) per-tx-bump: 32661.752 ms (00:32.662) 8kb-mem-block: Time: 31834.872 ms (00:31.835) I can see a similar trend of the test case #2 above. Overall, reducing the generation context memory block size to 8kB seems to be promising. 
And using the bump memory context per transaction didn't bring performance improvement than I expected in these cases. Regards, [1] https://www.postgresql.org/message-id/flat/20160706185502.1426.28143@wrigleys.postgresql.org -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I've done some benchmark tests for three different code bases with > different test cases. In short, reducing the generation memory context > block size to 8kB seems to be promising; it mitigates the problem > while keeping a similar performance. Did you try any sizes between 8KB and 8MB? 1000x reduction seems quite large a jump. There is additional overhead from having more blocks. It means more malloc() work and more free() work when deleting a context. It would be nice to see some numbers with all powers of 2 between 8KB and 8MB. I imagine the returns are diminishing as the block size is reduced further. Another alternative idea would be to defragment transactions with a large number of changes after they grow to some predefined size. Defragmentation would just be a matter of performing palloc/memcpy/pfree for each change. If that's all done at once, all the changes for that transaction would be contiguous in memory. If you're smart about what the trigger point is for performing the defragmentation then I imagine there's not much risk of performance regression for the general case. For example, you might only want to trigger it when MemoryContextMemAllocated() for the generation context exceeds logical_decoding_work_mem by some factor and only do it for transactions where the size of the changes exceeds some threshold. > Overall, reducing the generation context memory block size to 8kB > seems to be promising. And using the bump memory context per > transaction didn't bring performance improvement than I expected in > these cases. Having a bump context per transaction would cause a malloc() every time you create a new context and a free() each time you delete the context when cleaning up the reorder buffer for the transaction. GenerationContext has a "freeblock" field that it'll populate instead of freeing a block. That block will be reused next time a new block is required. For truly FIFO workloads that never need an oversized block, all new blocks will come from the freeblock field once the first block becomes unused. See the comments in GenerationFree(). I expect this is why bump contexts are slower than the generation context for your short transaction workload. David
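To make the defragmentation idea concrete, a hedged sketch follows (not from any patch in this thread). It assumes a change's tuples live in change->data.tp.oldtuple/newtuple as single-chunk HeapTuples, the way ReorderBufferGetTupleBuf() allocates them; a real version would also have to cover the remaining change kinds (speculative inserts, TOAST chunks, etc.).

/*
 * Hedged sketch of defragmentation: re-allocate each change's tuple data so
 * the surviving chunks become contiguous again, then free the scattered
 * originals.  Change kinds other than INSERT/UPDATE/DELETE are skipped here.
 */
static void
ReorderBufferDefragTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    dlist_iter  iter;

    dlist_foreach(iter, &txn->changes)
    {
        ReorderBufferChange *change =
            dlist_container(ReorderBufferChange, node, iter.cur);
        HeapTuple  *slots[2];
        int         i;

        if (change->action != REORDER_BUFFER_CHANGE_INSERT &&
            change->action != REORDER_BUFFER_CHANGE_UPDATE &&
            change->action != REORDER_BUFFER_CHANGE_DELETE)
            continue;

        slots[0] = &change->data.tp.oldtuple;
        slots[1] = &change->data.tp.newtuple;

        for (i = 0; i < 2; i++)
        {
            HeapTuple   old = *slots[i];
            HeapTuple   copy;
            Size        len;

            if (old == NULL)
                continue;

            len = HEAPTUPLESIZE + old->t_len;
            copy = (HeapTuple) MemoryContextAlloc(rb->tup_context, len);
            memcpy(copy, old, len);
            /* header and data are one chunk; re-point t_data at the copy */
            copy->t_data = (HeapTupleHeader) ((char *) copy + HEAPTUPLESIZE);
            pfree(old);
            *slots[i] = copy;
        }
    }
}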
On 2024/09/19 8:53, Masahiko Sawada wrote: > On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> >>> On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> Can we try reducing the size of >>>> 8MB memory blocks? The comment atop allocation says: "XXX the >>>> allocation sizes used below pre-date generation context's block >>>> growing code. These values should likely be benchmarked and set to >>>> more suitable values.", so do we need some tuning here? >>> >>> Reducing the size of the 8MB memory block would be one solution and >>> could be better as it could be back-patchable. It would mitigate the >>> problem but would not resolve it. I agree to try reducing it and do >>> some benchmark tests. If it reasonably makes the problem less likely >>> to happen, it would be a good solution. >>> >> >> makes sense. > > I've done some benchmark tests for three different code bases with > different test cases. In short, reducing the generation memory context > block size to 8kB seems to be promising; it mitigates the problem > while keeping a similar performance. Sounds good! I believe the memory bloat issue in the reorder buffer should be considered a bug. Since this solution isn’t too invasive, I think it’s worth considering back-patching it to older versions. Then, if we find a better approach, we can apply it to v18 or later. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote: > > On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've done some benchmark tests for three different code bases with > > different test cases. In short, reducing the generation memory context > > block size to 8kB seems to be promising; it mitigates the problem > > while keeping a similar performance. > > Did you try any sizes between 8KB and 8MB? 1000x reduction seems > quite large a jump. There is additional overhead from having more > blocks. It means more malloc() work and more free() work when deleting > a context. It would be nice to see some numbers with all powers of 2 > between 8KB and 8MB. I imagine the returns are diminishing as the > block size is reduced further. > Good idea. > Another alternative idea would be to defragment transactions with a > large number of changes after they grow to some predefined size. > Defragmentation would just be a matter of performing > palloc/memcpy/pfree for each change. If that's all done at once, all > the changes for that transaction would be contiguous in memory. If > you're smart about what the trigger point is for performing the > defragmentation then I imagine there's not much risk of performance > regression for the general case. For example, you might only want to > trigger it when MemoryContextMemAllocated() for the generation context > exceeds logical_decoding_work_mem by some factor and only do it for > transactions where the size of the changes exceeds some threshold. > After collecting the changes that exceed 'logical_decoding_work_mem', one can choose to stream the transaction and free the changes to avoid hitting this problem, however, we can use that or some other constant to decide the point of defragmentation. The other point we need to think in this idea is whether we actually need any defragmentation at all. This will depend on whether there are concurrent transactions being decoded. This would require benchmarking to see the performance impact. -- With Regards, Amit Kapila.
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I've done other benchmarking tests while changing the memory block > sizes from 8kB to 8MB. I measured the execution time of logical > decoding of one transaction that inserted 10M rows. I set > logical_decoding_work_mem large enough to avoid spilling behavior. In > this scenario, we allocate many memory chunks while decoding the > transaction and resulting in calling more malloc() in smaller memory > block sizes. Here are results (an average of 3 executions): I was interested in seeing the memory consumption with the test that was causing an OOM due to the GenerationBlock fragmentation. You saw bad results with 8MB blocks and good results with 8KB blocks. The measure that's interesting here is the MemoryContextMemAllocated() for the GenerationContext in question. > The fact that we're using rb->size and txn->size to choose the > transactions to evict could make this idea less attractive. Even if we > defragment transactions, rb->size and txn->size don't change. > Therefore, it doesn't mean we can avoid evicting transactions due to > full of logical_decoding_work_mem, but just mean the amount of > allocated memory might have been reduced. I had roughly imagined that you'd add an extra field to store txn->size as of the last defragmentation and only defrag the transaction when ReorderBufferTXN->size is, say, double that last size. Of course, if the first defragmentation was enough to drop MemoryContextMemAllocated() below the concerning threshold above logical_decoding_work_mem then further defrags wouldn't be done, at least, until such times as the MemoryContextMemAllocated() became concerning again. If you didn't want to use a full Size variable for the defragmentation size, you could just store a uint8 recording which power of 2 ReorderBufferTXN->size was when it was last defragmented. There are plenty of holes in that struct that could be filled without enlarging it. In general, it's a bit annoying to have to code around this GenerationContext fragmentation issue. Unfortunately, AllocSet could suffer from fragmentation issues too if lots of chunks end up on freelists that are never reused, so using another context type might just create a fragmentation issue for a different workload. David
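A minimal sketch of that trigger, assuming a hypothetical last_defrag_size field on ReorderBufferTXN (no such field exists today) and purely illustrative 2x factors:

/*
 * Hedged sketch of the defrag trigger described above.  txn->last_defrag_size
 * is a hypothetical field updated whenever the transaction is defragmented;
 * logical_decoding_work_mem is in kB, hence the multiplication by 1024.
 */
static bool
ReorderBufferShouldDefrag(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    Size    allocated = MemoryContextMemAllocated(rb->tup_context, true);

    /* Only bother once the context is bloated well past the limit ... */
    if (allocated < (Size) logical_decoding_work_mem * 1024L * 2)
        return false;

    /* ... and this transaction has roughly doubled since its last defrag. */
    return txn->size >= txn->last_defrag_size * 2;
}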
On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > > On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've done some benchmark tests for three different code bases with > > > > different test cases. In short, reducing the generation memory context > > > > block size to 8kB seems to be promising; it mitigates the problem > > > > while keeping a similar performance. > > > > > > Did you try any sizes between 8KB and 8MB? 1000x reduction seems > > > quite large a jump. There is additional overhead from having more > > > blocks. It means more malloc() work and more free() work when deleting > > > a context. It would be nice to see some numbers with all powers of 2 > > > between 8KB and 8MB. I imagine the returns are diminishing as the > > > block size is reduced further. > > > > > > > Good idea. > > Agreed. > > I've done other benchmarking tests while changing the memory block > sizes from 8kB to 8MB. I measured the execution time of logical > decoding of one transaction that inserted 10M rows. I set > logical_decoding_work_mem large enough to avoid spilling behavior. In > this scenario, we allocate many memory chunks while decoding the > transaction and resulting in calling more malloc() in smaller memory > block sizes. Here are results (an average of 3 executions): > > 8kB: 19747.870 ms > 16kB: 19780.025 ms > 32kB: 19760.575 ms > 64kB: 19772.387 ms > 128kB: 19825.385 ms > 256kB: 19781.118 ms > 512kB: 19808.138 ms > 1MB: 19757.640 ms > 2MB: 19801.429 ms > 4MB: 19673.996 ms > 8MB: 19643.547 ms > > Interestingly, there were no noticeable differences in the execution > time. I've checked the number of allocated memory blocks in each case > and more blocks are allocated in smaller block size cases. For > example, when the logical decoding used the maximum memory (about > 1.5GB), we allocated about 80k blocks in 8kb memory block size case > and 80 blocks in 8MB memory block cases. > What exactly do these test results mean? Do you want to prove that there is no regression by using smaller block sizes? -- With Regards, Amit Kapila.
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've done other benchmarking tests while changing the memory block > > sizes from 8kB to 8MB. I measured the execution time of logical > > decoding of one transaction that inserted 10M rows. I set > > logical_decoding_work_mem large enough to avoid spilling behavior. In > > this scenario, we allocate many memory chunks while decoding the > > transaction and resulting in calling more malloc() in smaller memory > > block sizes. Here are results (an average of 3 executions): > > I was interested in seeing the memory consumption with the test that > was causing an OOM due to the GenerationBlock fragmentation. > +1. That test will be helpful. > > The fact that we're using rb->size and txn->size to choose the > > transactions to evict could make this idea less attractive. Even if we > > defragment transactions, rb->size and txn->size don't change. > > Therefore, it doesn't mean we can avoid evicting transactions due to > > full of logical_decoding_work_mem, but just mean the amount of > > allocated memory might have been reduced. > > I had roughly imagined that you'd add an extra field to store > txn->size when the last fragmentation was done and only defrag the > transaction when the ReorderBufferTXN->size is, say, double the last > size when the changes were last defragmented. Of course, if the first > defragmentation was enough to drop MemoryContextMemAllocated() below > the concerning threshold above logical_decoding_work_mem then further > defrags wouldn't be done, at least, until such times as the > MemoryContextMemAllocated() became concerning again. If you didn't > want to a full Size variable for the defragmentation size, you could > just store a uint8 to store which power of 2 ReorderBufferTXN->size > was when it was last defragmented. There are plenty of holds in that > struct that could be filled without enlarging the struct. > > In general, it's a bit annoying to have to code around this > GenerationContext fragmentation issue. > Right, and I am also slightly afraid that this may not cause some regression in other cases where defrag wouldn't help. -- With Regards, Amit Kapila.
On Thu, Sep 19, 2024 at 10:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've done other benchmarking tests while changing the memory block > > > sizes from 8kB to 8MB. I measured the execution time of logical > > > decoding of one transaction that inserted 10M rows. I set > > > logical_decoding_work_mem large enough to avoid spilling behavior. In > > > this scenario, we allocate many memory chunks while decoding the > > > transaction and resulting in calling more malloc() in smaller memory > > > block sizes. Here are results (an average of 3 executions): > > > > I was interested in seeing the memory consumption with the test that > > was causing an OOM due to the GenerationBlock fragmentation. > > > > +1. That test will be helpful. Sure. Here are results of peak memory usage in bytes reported by MemoryContextMemAllocated() (when rb->size shows 43MB): 8kB: 52,371,328 16kB: 52,887,424 32kB: 53,428,096 64kB: 55,099,264 128kB: 86,163,328 256kB: 149,340,032 512kB: 273,334,144 1MB: 523,419,520 2MB: 1,021,493,120 4MB: 1,984,085,888 8MB: 2,130,886,528 Probably we can increase the size to 64kB? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > In general, it's a bit annoying to have to code around this > > GenerationContext fragmentation issue. > > Right, and I am also slightly afraid that this may not cause some > regression in other cases where defrag wouldn't help. Yeah, that's certainly a possibility. I was hoping that MemoryContextMemAllocated() being much larger than logical_work_mem could only happen when there is fragmentation, but certainly, you could be wasting effort trying to defrag transactions where the changes all arrive in WAL consecutively and there is no defragmentation. It might be some other large transaction that's causing the context's allocations to be fragmented. I don't have any good ideas on how to avoid wasting effort on non-problematic transactions. Maybe there's something that could be done if we knew the LSN of the first and last change and the gap between the LSNs was much larger than the WAL space used for this transaction. That would likely require tracking way more stuff than we do now, however. With the smaller blocks idea, I'm a bit concerned that using smaller blocks could cause regressions on systems that are better at releasing memory back to the OS after free() as no doubt malloc() would often be slower on those systems. There have been some complaints recently about glibc being a bit too happy to keep hold of memory after free() and I wondered if that was the reason why the small block test does not cause much of a performance regression. I wonder how the small block test would look on Mac, FreeBSD or Windows. I think it would be risky to assume that all is well with reducing the block size after testing on a single platform. David
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote: > > On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > In general, it's a bit annoying to have to code around this > > > GenerationContext fragmentation issue. > > > > Right, and I am also slightly afraid that this may not cause some > > regression in other cases where defrag wouldn't help. > > Yeah, that's certainly a possibility. I was hoping that > MemoryContextMemAllocated() being much larger than logical_work_mem > could only happen when there is fragmentation, but certainly, you > could be wasting effort trying to defrag transactions where the > changes all arrive in WAL consecutively and there is no > defragmentation. It might be some other large transaction that's > causing the context's allocations to be fragmented. I don't have any > good ideas on how to avoid wasting effort on non-problematic > transactions. Maybe there's something that could be done if we knew > the LSN of the first and last change and the gap between the LSNs was > much larger than the WAL space used for this transaction. That would > likely require tracking way more stuff than we do now, however. > With more information tracking, we could avoid some non-problematic transactions but still, it would be difficult to predict that we didn't harm many cases because to make the memory non-contiguous, we only need a few interleaving small transactions. We can try to think of ideas for implementing defragmentation in our code if we first can prove that smaller block sizes cause problems. > With the smaller blocks idea, I'm a bit concerned that using smaller > blocks could cause regressions on systems that are better at releasing > memory back to the OS after free() as no doubt malloc() would often be > slower on those systems. There have been some complaints recently > about glibc being a bit too happy to keep hold of memory after free() > and I wondered if that was the reason why the small block test does > not cause much of a performance regression. I wonder how the small > block test would look on Mac, FreeBSD or Windows. I think it would be > risky to assume that all is well with reducing the block size after > testing on a single platform. > Good point. We need extensive testing on different platforms, as you suggest, to verify if smaller block sizes caused any regressions. -- With Regards, Amit Kapila.
On Fri, Sep 20, 2024 at 10:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Sep 19, 2024 at 10:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > > On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've done other benchmarking tests while changing the memory block > > > > sizes from 8kB to 8MB. I measured the execution time of logical > > > > decoding of one transaction that inserted 10M rows. I set > > > > logical_decoding_work_mem large enough to avoid spilling behavior. In > > > > this scenario, we allocate many memory chunks while decoding the > > > > transaction and resulting in calling more malloc() in smaller memory > > > > block sizes. Here are results (an average of 3 executions): > > > > > > I was interested in seeing the memory consumption with the test that > > > was causing an OOM due to the GenerationBlock fragmentation. > > > > > > > +1. That test will be helpful. > > Sure. Here are results of peak memory usage in bytes reported by > MemoryContextMemAllocated() (when rb->size shows 43MB): > > 8kB: 52,371,328 > 16kB: 52,887,424 > 32kB: 53,428,096 > 64kB: 55,099,264 > 128kB: 86,163,328 > 256kB: 149,340,032 > 512kB: 273,334,144 > 1MB: 523,419,520 > 2MB: 1,021,493,120 > 4MB: 1,984,085,888 > 8MB: 2,130,886,528 > > Probably we can increase the size to 64kB? > Yeah, but before deciding on a particular size, we need more testing on different platforms as suggested by David. -- With Regards, Amit Kapila.
On Thu, Sep 19, 2024 at 10:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > > > > On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I've done some benchmark tests for three different code bases with > > > > > different test cases. In short, reducing the generation memory context > > > > > block size to 8kB seems to be promising; it mitigates the problem > > > > > while keeping a similar performance. > > > > > > > > Did you try any sizes between 8KB and 8MB? 1000x reduction seems > > > > quite large a jump. There is additional overhead from having more > > > > blocks. It means more malloc() work and more free() work when deleting > > > > a context. It would be nice to see some numbers with all powers of 2 > > > > between 8KB and 8MB. I imagine the returns are diminishing as the > > > > block size is reduced further. > > > > > > > > > > Good idea. > > > > Agreed. > > > > I've done other benchmarking tests while changing the memory block > > sizes from 8kB to 8MB. I measured the execution time of logical > > decoding of one transaction that inserted 10M rows. I set > > logical_decoding_work_mem large enough to avoid spilling behavior. In > > this scenario, we allocate many memory chunks while decoding the > > transaction and resulting in calling more malloc() in smaller memory > > block sizes. Here are results (an average of 3 executions): > > > > 8kB: 19747.870 ms > > 16kB: 19780.025 ms > > 32kB: 19760.575 ms > > 64kB: 19772.387 ms > > 128kB: 19825.385 ms > > 256kB: 19781.118 ms > > 512kB: 19808.138 ms > > 1MB: 19757.640 ms > > 2MB: 19801.429 ms > > 4MB: 19673.996 ms > > 8MB: 19643.547 ms > > > > Interestingly, there were no noticeable differences in the execution > > time. I've checked the number of allocated memory blocks in each case > > and more blocks are allocated in smaller block size cases. For > > example, when the logical decoding used the maximum memory (about > > 1.5GB), we allocated about 80k blocks in 8kb memory block size case > > and 80 blocks in 8MB memory block cases. > > > > What exactly do these test results mean? Do you want to prove that > there is no regression by using smaller block sizes? Yes, there was no noticeable performance regression at least in this test scenario. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sun, Sep 22, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > In general, it's a bit annoying to have to code around this > > > > GenerationContext fragmentation issue. > > > > > > Right, and I am also slightly afraid that this may not cause some > > > regression in other cases where defrag wouldn't help. > > > > Yeah, that's certainly a possibility. I was hoping that > > MemoryContextMemAllocated() being much larger than logical_work_mem > > could only happen when there is fragmentation, but certainly, you > > could be wasting effort trying to defrag transactions where the > > changes all arrive in WAL consecutively and there is no > > defragmentation. It might be some other large transaction that's > > causing the context's allocations to be fragmented. I don't have any > > good ideas on how to avoid wasting effort on non-problematic > > transactions. Maybe there's something that could be done if we knew > > the LSN of the first and last change and the gap between the LSNs was > > much larger than the WAL space used for this transaction. That would > > likely require tracking way more stuff than we do now, however. > > > > With more information tracking, we could avoid some non-problematic > transactions but still, it would be difficult to predict that we > didn't harm many cases because to make the memory non-contiguous, we > only need a few interleaving small transactions. We can try to think > of ideas for implementing defragmentation in our code if we first can > prove that smaller block sizes cause problems. > > > With the smaller blocks idea, I'm a bit concerned that using smaller > > blocks could cause regressions on systems that are better at releasing > > memory back to the OS after free() as no doubt malloc() would often be > > slower on those systems. There have been some complaints recently > > about glibc being a bit too happy to keep hold of memory after free() > > and I wondered if that was the reason why the small block test does > > not cause much of a performance regression. I wonder how the small > > block test would look on Mac, FreeBSD or Windows. I think it would be > > risky to assume that all is well with reducing the block size after > > testing on a single platform. > > > > Good point. We need extensive testing on different platforms, as you > suggest, to verify if smaller block sizes caused any regressions. +1. I'll do the same test on my Mac as well. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Sep 27, 2024 at 12:39 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Mon, 23 Sept 2024 at 09:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > > On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > In general, it's a bit annoying to have to code around this > > > > > GenerationContext fragmentation issue. > > > > > > > > Right, and I am also slightly afraid that this may not cause some > > > > regression in other cases where defrag wouldn't help. > > > > > > Yeah, that's certainly a possibility. I was hoping that > > > MemoryContextMemAllocated() being much larger than logical_work_mem > > > could only happen when there is fragmentation, but certainly, you > > > could be wasting effort trying to defrag transactions where the > > > changes all arrive in WAL consecutively and there is no > > > defragmentation. It might be some other large transaction that's > > > causing the context's allocations to be fragmented. I don't have any > > > good ideas on how to avoid wasting effort on non-problematic > > > transactions. Maybe there's something that could be done if we knew > > > the LSN of the first and last change and the gap between the LSNs was > > > much larger than the WAL space used for this transaction. That would > > > likely require tracking way more stuff than we do now, however. > > > > > > > With more information tracking, we could avoid some non-problematic > > transactions but still, it would be difficult to predict that we > > didn't harm many cases because to make the memory non-contiguous, we > > only need a few interleaving small transactions. We can try to think > > of ideas for implementing defragmentation in our code if we first can > > prove that smaller block sizes cause problems. > > > > > With the smaller blocks idea, I'm a bit concerned that using smaller > > > blocks could cause regressions on systems that are better at releasing > > > memory back to the OS after free() as no doubt malloc() would often be > > > slower on those systems. There have been some complaints recently > > > about glibc being a bit too happy to keep hold of memory after free() > > > and I wondered if that was the reason why the small block test does > > > not cause much of a performance regression. I wonder how the small > > > block test would look on Mac, FreeBSD or Windows. I think it would be > > > risky to assume that all is well with reducing the block size after > > > testing on a single platform. > > > > > > > Good point. We need extensive testing on different platforms, as you > > suggest, to verify if smaller block sizes caused any regressions. > > I did similar tests on Windows. rb_mem_block_size was changed from 8kB > to 8MB. Below table shows the result (average of 5 runs) and Standard > Deviation (of 5 runs) for each block-size. 
> > =============================================== > block-size | Average time (ms) | Standard Deviation (ms) > ------------------------------------------------------------------------------------- > 8kb | 12580.879 ms | 144.6923467 > 16kb | 12442.7256 ms | 94.02799006 > 32kb | 12370.7292 ms | 97.7958552 > 64kb | 11877.4888 ms | 222.2419142 > 128kb | 11828.8568 ms | 129.732941 > 256kb | 11801.086 ms | 20.60030913 > 512kb | 12361.4172 ms | 65.27390105 > 1MB | 12343.3732 ms | 80.84427202 > 2MB | 12357.675 ms | 79.40017604 > 4MB | 12395.8364 ms | 76.78273689 > 8MB | 11712.8862 ms | 50.74323039 > ============================================== > > From the results, I think there is a small regression for small block size. > > I ran the tests in git bash. I have also attached the test script. Thank you for testing on Windows! I've run the same benchmark on Mac (Sonoma 14.7, M1 Pro): 8kB: 4852.198 ms 16kB: 4822.733 ms 32kB: 4776.776 ms 64kB: 4851.433 ms 128kB: 4804.821 ms 256kB: 4781.778 ms 512kB: 4776.486 ms 1MB: 4783.456 ms 2MB: 4770.671 ms 4MB: 4785.800 ms 8MB: 4747.447 ms I can see there is a small regression for small block sizes. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Sep 27, 2024 at 10:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Sep 27, 2024 at 12:39 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > On Mon, 23 Sept 2024 at 09:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > > > > On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > > In general, it's a bit annoying to have to code around this > > > > > > GenerationContext fragmentation issue. > > > > > > > > > > Right, and I am also slightly afraid that this may not cause some > > > > > regression in other cases where defrag wouldn't help. > > > > > > > > Yeah, that's certainly a possibility. I was hoping that > > > > MemoryContextMemAllocated() being much larger than logical_work_mem > > > > could only happen when there is fragmentation, but certainly, you > > > > could be wasting effort trying to defrag transactions where the > > > > changes all arrive in WAL consecutively and there is no > > > > defragmentation. It might be some other large transaction that's > > > > causing the context's allocations to be fragmented. I don't have any > > > > good ideas on how to avoid wasting effort on non-problematic > > > > transactions. Maybe there's something that could be done if we knew > > > > the LSN of the first and last change and the gap between the LSNs was > > > > much larger than the WAL space used for this transaction. That would > > > > likely require tracking way more stuff than we do now, however. > > > > > > > > > > With more information tracking, we could avoid some non-problematic > > > transactions but still, it would be difficult to predict that we > > > didn't harm many cases because to make the memory non-contiguous, we > > > only need a few interleaving small transactions. We can try to think > > > of ideas for implementing defragmentation in our code if we first can > > > prove that smaller block sizes cause problems. > > > > > > > With the smaller blocks idea, I'm a bit concerned that using smaller > > > > blocks could cause regressions on systems that are better at releasing > > > > memory back to the OS after free() as no doubt malloc() would often be > > > > slower on those systems. There have been some complaints recently > > > > about glibc being a bit too happy to keep hold of memory after free() > > > > and I wondered if that was the reason why the small block test does > > > > not cause much of a performance regression. I wonder how the small > > > > block test would look on Mac, FreeBSD or Windows. I think it would be > > > > risky to assume that all is well with reducing the block size after > > > > testing on a single platform. > > > > > > > > > > Good point. We need extensive testing on different platforms, as you > > > suggest, to verify if smaller block sizes caused any regressions. > > > > I did similar tests on Windows. rb_mem_block_size was changed from 8kB > > to 8MB. Below table shows the result (average of 5 runs) and Standard > > Deviation (of 5 runs) for each block-size. 
> > > > =============================================== > > block-size | Average time (ms) | Standard Deviation (ms) > > ------------------------------------------------------------------------------------- > > 8kb | 12580.879 ms | 144.6923467 > > 16kb | 12442.7256 ms | 94.02799006 > > 32kb | 12370.7292 ms | 97.7958552 > > 64kb | 11877.4888 ms | 222.2419142 > > 128kb | 11828.8568 ms | 129.732941 > > 256kb | 11801.086 ms | 20.60030913 > > 512kb | 12361.4172 ms | 65.27390105 > > 1MB | 12343.3732 ms | 80.84427202 > > 2MB | 12357.675 ms | 79.40017604 > > 4MB | 12395.8364 ms | 76.78273689 > > 8MB | 11712.8862 ms | 50.74323039 > > ============================================== > > > > From the results, I think there is a small regression for small block size. > > > > I ran the tests in git bash. I have also attached the test script. > > Thank you for testing on Windows! I've run the same benchmark on Mac > (Sonoma 14.7, M1 Pro): > > 8kB: 4852.198 ms > 16kB: 4822.733 ms > 32kB: 4776.776 ms > 64kB: 4851.433 ms > 128kB: 4804.821 ms > 256kB: 4781.778 ms > 512kB: 4776.486 ms > 1MB: 4783.456 ms > 2MB: 4770.671 ms > 4MB: 4785.800 ms > 8MB: 4747.447 ms > > I can see there is a small regression for small block sizes. > So, decoding a large transaction with many smaller allocations can have ~2.2% overhead with a smaller block size (say 8Kb vs 8MB). In real workloads, we will have fewer such large transactions or a mix of small and large transactions. That will make the overhead much less visible. Does this mean that we should invent some strategy to defrag the memory at some point during decoding or use any other technique? I don't find this overhead above the threshold to invent something fancy. What do others think? -- With Regards, Amit Kapila.
On Tue, Oct 1, 2024 at 5:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Sep 27, 2024 at 10:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Sep 27, 2024 at 12:39 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > On Mon, 23 Sept 2024 at 09:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > > > > > > On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote: > > > > > > > In general, it's a bit annoying to have to code around this > > > > > > > GenerationContext fragmentation issue. > > > > > > > > > > > > Right, and I am also slightly afraid that this may not cause some > > > > > > regression in other cases where defrag wouldn't help. > > > > > > > > > > Yeah, that's certainly a possibility. I was hoping that > > > > > MemoryContextMemAllocated() being much larger than logical_work_mem > > > > > could only happen when there is fragmentation, but certainly, you > > > > > could be wasting effort trying to defrag transactions where the > > > > > changes all arrive in WAL consecutively and there is no > > > > > defragmentation. It might be some other large transaction that's > > > > > causing the context's allocations to be fragmented. I don't have any > > > > > good ideas on how to avoid wasting effort on non-problematic > > > > > transactions. Maybe there's something that could be done if we knew > > > > > the LSN of the first and last change and the gap between the LSNs was > > > > > much larger than the WAL space used for this transaction. That would > > > > > likely require tracking way more stuff than we do now, however. > > > > > > > > > > > > > With more information tracking, we could avoid some non-problematic > > > > transactions but still, it would be difficult to predict that we > > > > didn't harm many cases because to make the memory non-contiguous, we > > > > only need a few interleaving small transactions. We can try to think > > > > of ideas for implementing defragmentation in our code if we first can > > > > prove that smaller block sizes cause problems. > > > > > > > > > With the smaller blocks idea, I'm a bit concerned that using smaller > > > > > blocks could cause regressions on systems that are better at releasing > > > > > memory back to the OS after free() as no doubt malloc() would often be > > > > > slower on those systems. There have been some complaints recently > > > > > about glibc being a bit too happy to keep hold of memory after free() > > > > > and I wondered if that was the reason why the small block test does > > > > > not cause much of a performance regression. I wonder how the small > > > > > block test would look on Mac, FreeBSD or Windows. I think it would be > > > > > risky to assume that all is well with reducing the block size after > > > > > testing on a single platform. > > > > > > > > > > > > > Good point. We need extensive testing on different platforms, as you > > > > suggest, to verify if smaller block sizes caused any regressions. > > > > > > I did similar tests on Windows. rb_mem_block_size was changed from 8kB > > > to 8MB. Below table shows the result (average of 5 runs) and Standard > > > Deviation (of 5 runs) for each block-size. 
> > >
> > > ===============================================
> > > block-size | Average time (ms) | Standard Deviation (ms)
> > > -------------------------------------------------------------------------------------
> > > 8kb        | 12580.879 ms      | 144.6923467
> > > 16kb       | 12442.7256 ms     | 94.02799006
> > > 32kb       | 12370.7292 ms     | 97.7958552
> > > 64kb       | 11877.4888 ms     | 222.2419142
> > > 128kb      | 11828.8568 ms     | 129.732941
> > > 256kb      | 11801.086 ms      | 20.60030913
> > > 512kb      | 12361.4172 ms     | 65.27390105
> > > 1MB        | 12343.3732 ms     | 80.84427202
> > > 2MB        | 12357.675 ms      | 79.40017604
> > > 4MB        | 12395.8364 ms     | 76.78273689
> > > 8MB        | 11712.8862 ms     | 50.74323039
> > > ==============================================
> > >
> > > From the results, I think there is a small regression for small block sizes.
> > >
> > > I ran the tests in git bash. I have also attached the test script.
> >
> > Thank you for testing on Windows! I've run the same benchmark on Mac
> > (Sonoma 14.7, M1 Pro):
> >
> > 8kB: 4852.198 ms
> > 16kB: 4822.733 ms
> > 32kB: 4776.776 ms
> > 64kB: 4851.433 ms
> > 128kB: 4804.821 ms
> > 256kB: 4781.778 ms
> > 512kB: 4776.486 ms
> > 1MB: 4783.456 ms
> > 2MB: 4770.671 ms
> > 4MB: 4785.800 ms
> > 8MB: 4747.447 ms
> >
> > I can see there is a small regression for small block sizes.
>
> So, decoding a large transaction with many smaller allocations can have
> ~2.2% overhead with a smaller block size (say 8kB vs 8MB). In real
> workloads, we will have fewer such large transactions, or a mix of small
> and large transactions, which will make the overhead much less visible.
> Does this mean that we should invent some strategy to defrag the memory
> at some point during decoding, or use some other technique? I don't
> think this overhead is large enough to justify inventing something
> fancy. What do others think?

I agree that the overhead will be much less visible in real workloads.
+1 to use a smaller block (i.e. 8kB). It's easy to backpatch to old
branches (if we agree) and to revert the change in case something
happens.

BTW, I've read the discussions about inventing the generational memory
context[1][2], but I could not find any discussion of the memory block
size. It seems that we have used 8MB memory blocks since the first
patch.

[1] https://www.postgresql.org/message-id/20160706185502.1426.28143%40wrigleys.postgresql.org
[2] https://www.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
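As an aside on David Rowley's idea quoted above of comparing a
transaction's LSN span with the WAL space its changes occupy, a rough
sketch of such a heuristic is shown below. This is purely illustrative
and not part of any patch in this thread; the function name and the 10x
threshold are invented, and it assumes first_lsn and final_lsn are
already populated for the transaction.

/*
 * Illustrative sketch only (not from any posted patch): guess whether a
 * transaction's decoded changes are likely spread across many
 * generation-context blocks by comparing the WAL distance spanned by its
 * changes with the amount of decoded data accounted to it.  If other
 * transactions' records were interleaved in WAL, the LSN span will be
 * much larger than the transaction's own footprint.
 */
static bool
ReorderBufferTXNLooksFragmented(ReorderBufferTXN *txn)
{
	uint64		lsn_span;

	if (XLogRecPtrIsInvalid(txn->first_lsn) ||
		XLogRecPtrIsInvalid(txn->final_lsn))
		return false;			/* not enough information yet */

	lsn_span = txn->final_lsn - txn->first_lsn;

	/* Arbitrary threshold: LSN span 10x larger than accounted memory. */
	return lsn_span > (uint64) txn->size * 10;
}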
RE: Using per-transaction memory contexts for storing decoded tuples
From: "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>
Dear Sawada-san, Amit,

> > So, decoding a large transaction with many smaller allocations can have
> > ~2.2% overhead with a smaller block size (say 8kB vs 8MB). In real
> > workloads, we will have fewer such large transactions, or a mix of small
> > and large transactions, which will make the overhead much less visible.
> > Does this mean that we should invent some strategy to defrag the memory
> > at some point during decoding, or use some other technique? I don't
> > think this overhead is large enough to justify inventing something
> > fancy. What do others think?
>
> I agree that the overhead will be much less visible in real workloads.
> +1 to use a smaller block (i.e. 8kB). It's easy to backpatch to old
> branches (if we agree) and to revert the change in case something
> happens.

I also felt okay. Just to confirm - you do not push rb_mem_block_size
patch and just replace SLAB_LARGE_BLOCK_SIZE -> SLAB_DEFAULT_BLOCK_SIZE,
right? It seems that only reorderbuffer.c uses the LARGE macro so that it
can be removed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
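For readers following along, the two block-size macros under discussion
are defined in src/include/utils/memutils.h; as I understand it, the
values are as follows (shown for reference, check the header in your
tree):

#define SLAB_DEFAULT_BLOCK_SIZE		(8 * 1024)			/* 8kB */
#define SLAB_LARGE_BLOCK_SIZE		(8 * 1024 * 1024)	/* 8MB */

So the proposal is simply to switch rb->tup_context from 8MB blocks to
8kB blocks, without adopting the rb_mem_block_size knob that was used
only for the benchmarks.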
On Wed, Oct 2, 2024 at 9:42 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san, Amit,
>
> > > So, decoding a large transaction with many smaller allocations can have
> > > ~2.2% overhead with a smaller block size (say 8kB vs 8MB). In real
> > > workloads, we will have fewer such large transactions, or a mix of small
> > > and large transactions, which will make the overhead much less visible.
> > > Does this mean that we should invent some strategy to defrag the memory
> > > at some point during decoding, or use some other technique? I don't
> > > think this overhead is large enough to justify inventing something
> > > fancy. What do others think?
> >
> > I agree that the overhead will be much less visible in real workloads.
> > +1 to use a smaller block (i.e. 8kB). It's easy to backpatch to old
> > branches (if we agree) and to revert the change in case something
> > happens.
>
> I also felt okay. Just to confirm - you do not push rb_mem_block_size
> patch and just replace SLAB_LARGE_BLOCK_SIZE -> SLAB_DEFAULT_BLOCK_SIZE,
> right?

Right.

> It seems that only reorderbuffer.c uses the LARGE macro so that it can
> be removed.

I'm going to keep the LARGE macro since extensions might be using it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024/10/03 13:47, Masahiko Sawada wrote:
>>> I agree that the overhead will be much less visible in real workloads.
>>> +1 to use a smaller block (i.e. 8kB).

+1

>>> It's easy to backpatch to old
>>> branches (if we agree)

+1

>> It seems that
>> only reorderbuffer.c uses the LARGE macro so that it can be removed.
>
> I'm going to keep the LARGE macro since extensions might be using it.

Yes, for the back-patch. But in the master branch, we basically don't
need to maintain this kind of compatibility?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Oct 3, 2024 at 2:46 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2024/10/03 13:47, Masahiko Sawada wrote:
> >>> I agree that the overhead will be much less visible in real workloads.
> >>> +1 to use a smaller block (i.e. 8kB).
>
> +1
>
> >>> It's easy to backpatch to old
> >>> branches (if we agree)
>
> +1
>
> >> It seems that
> >> only reorderbuffer.c uses the LARGE macro so that it can be removed.
> >
> > I'm going to keep the LARGE macro since extensions might be using it.
>
> Yes, for the back-patch. But in the master branch, we basically don't
> need to maintain this kind of compatibility?

Yes, but as for this macro specifically, I thought that it might be
better to keep it, since it avoids breaking extensions unnecessarily and
it seems natural to have it as an option for slab contexts.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
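To illustrate why removing the macro could break third-party code, a
hypothetical extension might size one of its own memory contexts with
SLAB_LARGE_BLOCK_SIZE. The names, struct, and sizes below are invented
purely for illustration:

#include "postgres.h"
#include "utils/memutils.h"

/* Hypothetical extension cache entry; illustrative only. */
typedef struct MyCacheEntry
{
	int64		key;
	char		payload[64];
} MyCacheEntry;

static MemoryContext my_cache_context = NULL;

static void
my_cache_init(void)
{
	/*
	 * A slab context sized with the LARGE block size; this compiles only
	 * as long as memutils.h keeps defining SLAB_LARGE_BLOCK_SIZE.
	 */
	my_cache_context = SlabContextCreate(TopMemoryContext,
										 "my extension cache",
										 SLAB_LARGE_BLOCK_SIZE,
										 sizeof(MyCacheEntry));
}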
On 2024/10/04 3:32, Masahiko Sawada wrote:
> Yes, but as for this macro specifically, I thought that it might be
> better to keep it, since it avoids breaking extensions unnecessarily and
> it seems natural to have it as an option for slab contexts.

If the macro has value, I'm okay with leaving it as is.

Are you planning to post the patch?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Oct 10, 2024 at 8:04 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2024/10/04 3:32, Masahiko Sawada wrote:
> > Yes, but as for this macro specifically, I thought that it might be
> > better to keep it, since it avoids breaking extensions unnecessarily and
> > it seems natural to have it as an option for slab contexts.
>
> If the macro has value, I'm okay with leaving it as is.
>
> Are you planning to post the patch?

Yes, I'll post the patch soon.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Oct 11, 2024 at 3:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Please find the attached patches.
>

@@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
   */
   buffer->tup_context = GenerationContextCreate(new_ctx,
                                                 "Tuples",
-                                                SLAB_LARGE_BLOCK_SIZE,
-                                                SLAB_LARGE_BLOCK_SIZE,
-                                                SLAB_LARGE_BLOCK_SIZE);
+                                                SLAB_DEFAULT_BLOCK_SIZE,
+                                                SLAB_DEFAULT_BLOCK_SIZE,
+                                                SLAB_DEFAULT_BLOCK_SIZE);

Shouldn't we change the comment atop this change [1] which states that
we should benchmark the existing values?

One more thing: we keep the max size as SLAB_DEFAULT_BLOCK_SIZE instead
of letting blocks grow as we do with ALLOCSET_DEFAULT_SIZES, so we should
probably add a comment explaining why we chose to make the max_size the
same as the init_size.

BTW, can we try using SLAB_LARGE_BLOCK_SIZE as the max size once? Does it
lead to the same problem with concurrent transactions, where freeing the
larger blocks could be a problem? If so, we can at least write a comment
for future reference.

[1] -
/*
 * XXX the allocation sizes used below pre-date generation context's block
 * growing code. These values should likely be benchmarked and set to
 * more suitable values.
 */

--
With Regards,
Amit Kapila.
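For reference on the ALLOCSET_DEFAULT_SIZES comparison: an aset.c context
created with that macro starts with small blocks and lets them grow up to
8MB, whereas the patch pins the generation context's init and max block
size to the same value so blocks never grow. The relevant memutils.h
definitions are, as far as I know, roughly the following (shown for
reference; check the header in your tree):

#define ALLOCSET_DEFAULT_MINSIZE	0
#define ALLOCSET_DEFAULT_INITSIZE	(8 * 1024)
#define ALLOCSET_DEFAULT_MAXSIZE	(8 * 1024 * 1024)
#define ALLOCSET_DEFAULT_SIZES \
	ALLOCSET_DEFAULT_MINSIZE, ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE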
On Tue, Oct 15, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Oct 13, 2024 at 11:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Oct 11, 2024 at 3:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Please find the attached patches.
> > >
>
> Thank you for reviewing the patch!
>
> > @@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
> >    */
> >    buffer->tup_context = GenerationContextCreate(new_ctx,
> >                                                  "Tuples",
> > -                                                SLAB_LARGE_BLOCK_SIZE,
> > -                                                SLAB_LARGE_BLOCK_SIZE,
> > -                                                SLAB_LARGE_BLOCK_SIZE);
> > +                                                SLAB_DEFAULT_BLOCK_SIZE,
> > +                                                SLAB_DEFAULT_BLOCK_SIZE,
> > +                                                SLAB_DEFAULT_BLOCK_SIZE);
> >
> > Shouldn't we change the comment atop this change [1] which states that
> > we should benchmark the existing values?
>
> Agreed.
>

Can we slightly tweak the comments as follows so that it doesn't sound
like a fix for a bug?

/* To minimize memory fragmentation caused by long-running
transactions with changes spanning multiple memory blocks, we use a
single fixed-size memory block for decoded tuple storage. The tests
showed that the default memory block size maintains logical decoding
performance without causing fragmentation due to concurrent
transactions. One might think that we can use the max size as
SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
the memory fragmentation.

Other than that, the changes in the patch look good to me. Note that I
haven't tested the patch myself, but the test results shown by you and
others in the thread seem sufficient to proceed with this change.

--
With Regards,
Amit Kapila.
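Combining the hunk and the suggested comment above, the resulting code
in ReorderBufferAllocate() would look roughly as follows. This is a
sketch assembled from what is quoted in this thread; the committed
wording and whitespace may differ slightly.

	/*
	 * To minimize memory fragmentation caused by long-running transactions
	 * with changes spanning multiple memory blocks, we use a single
	 * fixed-size memory block for decoded tuple storage.  The tests showed
	 * that the default memory block size maintains logical decoding
	 * performance without causing fragmentation due to concurrent
	 * transactions.  One might think that we can use the max size as
	 * SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help
	 * resolve the memory fragmentation.
	 */
	buffer->tup_context = GenerationContextCreate(new_ctx,
												  "Tuples",
												  SLAB_DEFAULT_BLOCK_SIZE,
												  SLAB_DEFAULT_BLOCK_SIZE,
												  SLAB_DEFAULT_BLOCK_SIZE);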
On Wed, Oct 16, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Oct 15, 2024 at 9:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Oct 15, 2024 at 11:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Sun, Oct 13, 2024 at 11:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, Oct 11, 2024 at 3:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > Please find the attached patches.
> > > > >
> > >
> > > Thank you for reviewing the patch!
> > >
> > > > @@ -343,9 +343,9 @@ ReorderBufferAllocate(void)
> > > >    */
> > > >    buffer->tup_context = GenerationContextCreate(new_ctx,
> > > >                                                  "Tuples",
> > > > -                                                SLAB_LARGE_BLOCK_SIZE,
> > > > -                                                SLAB_LARGE_BLOCK_SIZE,
> > > > -                                                SLAB_LARGE_BLOCK_SIZE);
> > > > +                                                SLAB_DEFAULT_BLOCK_SIZE,
> > > > +                                                SLAB_DEFAULT_BLOCK_SIZE,
> > > > +                                                SLAB_DEFAULT_BLOCK_SIZE);
> > > >
> > > > Shouldn't we change the comment atop this change [1] which states that
> > > > we should benchmark the existing values?
> > >
> > > Agreed.
> > >
> >
> > Can we slightly tweak the comments as follows so that it doesn't sound
> > like a fix for a bug?
> >
> > /* To minimize memory fragmentation caused by long-running
> > transactions with changes spanning multiple memory blocks, we use a
> > single fixed-size memory block for decoded tuple storage. The tests
> > showed that the default memory block size maintains logical decoding
> > performance without causing fragmentation due to concurrent
> > transactions. One might think that we can use the max size as
> > SLAB_LARGE_BLOCK_SIZE but the test also showed it doesn't help resolve
> > the memory fragmentation.
>
> Agreed. I've incorporated your comment in the latest patches. I'm
> going to push them today, barring any objections or further comments.

Pushed. Thank you all for reviewing and testing the patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com