Improving the memory allocator - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Improving the memory allocator |
Date | |
Msg-id | 201104252345.33408.andres@anarazel.de Whole thread Raw |
Responses |
Re: Improving the memory allocator
Re: Improving the memory allocator Re: Improving the memory allocator |
List | pgsql-hackers |
Hi, In the thread http://archives.postgresql.org/message-id/4DA69D60.4000108@2ndquadrant.com / "Single client performance on trivial SELECTs" I wondered whether it might be possible to increase the performance of the current allocator by using some slab allocator like concept. While avoiding doing my tax returns and having a long and delayed train ride I started to play around. The profile I used in this case was: pgbench -h /tmp/ -p5433 -s 30 pgbench -S -T 20 I hacked together a patch which records every allocation to a file. That patch obviously slows everything down quite a bit so that I only get about a third of the normal tps. The current profile for that looks like that (hope thats recognizable). I guess the only unclear column is "compile_time". Thats the result of __builtin_constant_p(size) for the size argument of the various allocation operations which is true for the case where size is known at compile time. action | file | func | line | size | compile_time | count --------------------------+------------------+--------------------------------------+-------+-------+--------------+-------- chunk_alloc | list.c | new_list | 68 | 16 | t | 940753chunk_alloc | list.c | new_list | 72 | 24 | t |940753 chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 202908 chunk_alloc_zero | bitmapset.c | bms_make_singleton | 189 | 8 | f | 129122 chunk_alloc_zero_aligned | value.c | makeString | 55 | 16 | t | 129122 chunk_alloc_zero_aligned | makefuncs.c | makeVar | 73 | 40 | t | 110676 chunk_alloc_zero_aligned | makefuncs.c | makeTargetEntry | 216 | 48 | t | 92230 chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 73799 chunk_alloc | bitmapset.c | bms_copy | 119 | 8 |f | 73784 chunk_alloc | lock.c | LockAcquireExtended | 559 | 128| t | 55441 chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2036 | 40 | t | 55338 chunk_alloc | snapmgr.c | PushActiveSnapshot | 282 | 24 | t | 55338 chunk_alloc | nbtsearch.c | _bt_search | 121| 24 | t | 36908 chunk_alloc | resowner.c | ResourceOwnerEnlargeCatCacheRefs | 612 | 128 | t | 36893 chunk_alloc | resowner.c | ResourceOwnerEnlargeRelationRefs | 754 | 128 | t | 36893 chunk_alloc_zero | resowner.c | ResourceOwnerCreate | 135 | 168 | t | 36893 chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2027 | 40 | t | 36892 chunk_alloc | snapmgr.c | CopySnapshot | 217 | 72 | f | 36892 chunk_alloc_zero_aligned | copyfuncs.c | _copyVar | 1041 | 40 | t | 36892 chunk_alloc_zero_aligned | equivclass.c | add_eq_member | 452 | 32 | t | 36892 chunk_alloc_zero_aligned | execQual.c | ExecInitExpr | 4231 | 24 | t | 36892 chunk_alloc_zero_aligned | execTuples.c | MakeTupleTableSlot | 118 | 96 | t | 36892 chunk_alloc_zero_aligned | gram.y | makeColumnRef | 12286 | 24 | t | 36892 chunk_alloc | list.c | new_head_cell | 93 | 16 | t | 18529 chunk_alloc | genam.c | RelationGetIndexScan | 77 | 104 | t | 18489 chunk_alloc | nbtree.c |btbeginscan | 385 | 6608 | t | 18489 chunk_alloc | tupdesc.c | CreateTemplateTupleDesc | 63 | 160 | f | 18479 chunk_alloc | genam.c | RelationGetIndexScan | 89 | 72 | f | 18471 chunk_alloc | nbtree.c | btbeginscan | 388 | 72 | f | 18471 chunk_alloc | resowner.c | ResourceOwnerEnlargeBuffers | 524 | 64 | t | 18448 chunk_alloc |trigger.c | AfterTriggerBeginXact | 3607 | 112 | t | 18447 chunk_alloc | trigger.c | AfterTriggerBeginXact | 3619 | 192 | t | 18447 splitted to contexts: self | action | file | func | line | size | compile_time | count -----------------------------------------+--------------------------+------------------+--------------------------------------+-------+-------+--------------+-------- MessageContext | chunk_alloc | list.c | new_list | 68 | 16 | t | 848520 MessageContext | chunk_alloc | list.c | new_list | 72 | 24 | t | 848520 MessageContext | chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 166016 MessageContext | chunk_alloc_zero | bitmapset.c | bms_make_singleton | 189 | 8 | f | 129122 MessageContext | chunk_alloc_zero_aligned| value.c | makeString | 55 | 16 | t | 129122MessageContext | chunk_alloc_zero_aligned | makefuncs.c | makeVar | 73 | 40 | t | 110676 ExecutorState | chunk_alloc | list.c | new_list | 68 | 16 | t | 92230 ExecutorState | chunk_alloc | list.c | new_list | 72 | 24 | t | 92230 MessageContext | chunk_alloc_zero_aligned | makefuncs.c | makeTargetEntry | 216 | 48 | t | 92230 MessageContext | chunk_alloc | bitmapset.c | bms_copy | 119 | 8 | f | 73784TopMemoryContext | chunk_alloc | lock.c | LockAcquireExtended | 559 | 128 | t | 55441 MessageContext | chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2036 | 40 | t | 55338 TopTransactionContext | chunk_alloc | snapmgr.c | PushActiveSnapshot | 282 | 24 | t | 55338 MessageContext | chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 36894 TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeCatCacheRefs | 612 | 128 | t | 36893TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeRelationRefs | 754 | 128 | t | 36893 TopMemoryContext | chunk_alloc_zero | resowner.c | ResourceOwnerCreate | 135 | 168 | t | 36893ExecutorState | chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 36892 ExecutorState | chunk_alloc | nbtsearch.c | _bt_search | 121 | 24 | t | 36892 ExecutorState | chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 36892 ExecutorState | chunk_alloc_zero_aligned | execQual.c | ExecInitExpr | 4231 | 24 | t | 36892 ExecutorState | chunk_alloc_zero_aligned| execTuples.c | MakeTupleTableSlot | 118 | 96 | t | 36892MessageContext | chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2027 | 40 | t | 36892 MessageContext | chunk_alloc_zero_aligned | copyfuncs.c | _copyVar | 1041 | 40 | t | 36892 MessageContext | chunk_alloc_zero_aligned | equivclass.c | add_eq_member | 452 | 32 | t | 36892 MessageContext | chunk_alloc_zero_aligned | gram.y | makeColumnRef | 12286 | 24 | t | 36892 TopTransactionContext | chunk_alloc | snapmgr.c | CopySnapshot | 217 | 72 | f | 36892 TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeBuffers | 524 | 64 | t | 18448 ExecutorState | chunk_alloc | genam.c | RelationGetIndexScan | 77 | 104 | t | 18447 ExecutorState | chunk_alloc | genam.c | RelationGetIndexScan | 89 | 72 | f | 18447 ExecutorState | chunk_alloc | nbtree.c | btbeginscan | 385 | 6608 | t | 18447 ExecutorState | chunk_alloc | nbtree.c | btbeginscan | 388 | 72 | f | 18447 MessageContext | chunk_alloc | list.c | new_head_cell | 93 | 16 | t | 18447 TopTransactionContext | chunk_alloc | trigger.c | AfterTriggerBeginXact | 3607 | 112 | t | 18447 TopTransactionContext | chunk_alloc | trigger.c | AfterTriggerBeginXact | 3619 | 192 | t | 18447 1: For that allocation profile I found that one significant part of the costs is AllocSetFreeIndex (which is inlined in optimized mode, but still). If we would move more stuff to the headers those very common cases with compile time known sizes could eliminate the AllocSetFreeIndex computation at runtime as its only dependent on the size. The problem with that is obviously that it would violate the whole mctx.c abstraction as it has to be known at compile time which memory manager is used. Are we willing to reduce that abstraction? 2: aset.c fragments horribly for longliving contexts. I haven't generated sensibly accessible data for that as I currently just store it in an sql database and sql (at least my queries :P) isn't really that efficient for figuring out when how much was allocated from that: Table "public.allocations" Column | Type | Modifiers --------------+---------+----------------------------------------------------------id | integer | not null defaultnextval('allocations_id_seq'::regclass) action | text | parent | text | self | text | size | bigint | addr | bigint | file | text | func | text | line | integer | compile_time | boolean | schema when the data is sensibly large. A sensible benchmark easily produces three digit million rows... I guess some custom implementation for evaluating that data should be way much better at that. That fragmentation causes two main problems. 1 space wastage and 2. (more importantly imo) cache misses. 3: Given the frequentness of very small allocations the current space overhead of at least 16 bytes on 64bit Platforms seems quite high. I don't think its too hard to devise a scheme (actually I have started that) where the overhead is only sizeof(*void). But again that would reduce the the abstraction because it would make it harder to implement something like pfree/repalloc/GetMemoryChunkContext which need to figure out the context from a pointer. If we had a separate api for small allocations we could even devise a schema where a single chunk wouldn't have any memory overhead, but that seems to complicated for now. So after all this my question basically is: How important do we think the mctx.c abstraction is? I personally found the speed of AllocSetAlloc being the most limiting factor in several workloads so I am willing to sacrifice some abstraction to gain speed here. Especially as I hope its possible to write a single allocator which is "good enough" for everything. I currently started working on two pieces: * A new memory manager implementation. For that the equivalent ofAllocSetFreeIdx seems to be the biggest limit so far. Itsnot really readyfor anything yet though. * recording and replaying MemoryContext* to get a somewhat reasonableand reproducable micro benchmark. That unfortunatelydoesn't model theeffects of cache misses that well but I got no better idea so far. Any opinions, input, ideas? Thanks for reading, Andres
pgsql-hackers by date: