Bulk loading performance improvements - Mailing list pgsql-hackers

From:    Simon Riggs
Subject: Bulk loading performance improvements
Msg-id:  1204037198.4252.305.camel@ebony.site
List:    pgsql-hackers
Looking at the profile for COPY and then a profile for CREATE TABLE AS
SELECT (CTAS), there is clearly too much overhead caused by inserting the
rows one at a time.

Flat profile of CTAS (two columns of this output removed for clarity):

Each sample counts as 0.01 seconds.
  %   cumulative   self
 time   seconds   seconds     calls
 5.63      0.52     0.52  21302604  LWLockRelease
 5.63      1.04     0.52  21193900  LWLockAcquire
 4.77      1.48     0.44  10386370  hash_any
 4.77      1.92     0.44  10139843  ReadBuffer_common
 4.66      2.35     0.43  10000842  PinBuffer
 4.55      2.77     0.42  10000001  ExecProject
 4.44      3.18     0.41  10000014  heap_insert
 4.23      3.57     0.39  10143770  UnpinBuffer
 3.79      3.92     0.35  10000057  MarkBufferDirty
 3.36      4.23     0.31  10000001  ExecMakeFunctionResult
 3.25      4.53     0.30  10000014  RelationGetBufferForTuple
 2.71      4.78     0.25  10000028  heap_fill_tuple
 2.60      5.02     0.24  10528273  hash_search_with_hash_value
 2.38      5.24     0.22  10000042  PageAddItem
 2.38      5.46     0.22  10000000  heap_form_tuple
 1.73      5.82     0.16  10143788  ResourceOwnerForgetBuffer
 1.73      5.98     0.16  10000014  pgstat_count_heap_insert
 1.63      6.13     0.15  20280380  LockBuffer
 1.63      6.28     0.15  10139682  ReleaseBuffer
 1.63      6.43     0.15  10000001  ExecProcNode
 1.52      6.57     0.14  10004864  AllocSetAlloc
 1.52      6.71     0.14  10003013  AllocSetFree
 1.52      6.85     0.14  10000003  ExecEvalConst
 1.52      6.99     0.14  10000000  intorel_receive
 1.41      7.12     0.13  20139096  BufferGetBlockNumber
 1.41      7.25     0.13  10000029  CacheInvalidateHeapTuple

The above profile is probably fairly worst-case, and differs between CTAS
and COPY. The overhead also shrinks as row length increases. Thin tables
are an important use case, though, since many associative (many-to-many)
tables are larger than the tables they link together, and those tables
are often just two integer columns. So the above results are relevant to
many large database tables. The system is beginning to be I/O bound
during these operations, but CPU savings are still helpful.

If we loaded data one block at a time, it seems we would make the
following calls once per block rather than once per row. That would
reduce these calls to perhaps 1% of their current cost, depending upon
row length:

 5.63      0.52     0.52  21302604  LWLockRelease
 5.63      1.04     0.52  21193900  LWLockAcquire
 4.77      1.48     0.44  10386370  hash_any
 4.77      1.92     0.44  10139843  ReadBuffer_common
 4.66      2.35     0.43  10000842  PinBuffer
 4.44      3.18     0.41  10000014  heap_insert
 4.23      3.57     0.39  10143770  UnpinBuffer
 3.79      3.92     0.35  10000057  MarkBufferDirty
 3.25      4.53     0.30  10000014  RelationGetBufferForTuple
 1.73      5.82     0.16  10143788  ResourceOwnerForgetBuffer
 1.73      5.98     0.16  10000014  pgstat_count_heap_insert
 1.63      6.13     0.15  20280380  LockBuffer
 1.63      6.28     0.15  10139682  ReleaseBuffer
 1.41      7.12     0.13  20139096  BufferGetBlockNumber
 1.41      7.25     0.13  10000029  CacheInvalidateHeapTuple

Together those account for more than 50% of CPU. Doing that would mean
keeping the buffer pinned and locked while we do everything else during
loading, which is difficult to achieve if there are indexes on the
table, and of questionable value in concurrent situations.

Discussing this with Heikki: if we instead just kept the buffer pinned,
but locked and unlocked it for each insert, we would still save most of
the above. Only the following possible savings would then remain
unrealised:

 5.63      0.52     0.52  21302604  LWLockRelease
 5.63      1.04     0.52  21193900  LWLockAcquire
 3.79      3.92     0.35  10000057  MarkBufferDirty
 1.73      5.98     0.16  10000014  pgstat_count_heap_insert
 1.63      6.13     0.15  20280380  LockBuffer
 1.63      6.28     0.15  10139682  ReleaseBuffer

These account for about 18% of CPU.
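To make the pin-retention idea concrete, here is a rough C sketch (not
the actual patch) of an insert loop that keeps the target buffer pinned
across rows and takes the content lock only around each insert.
ReadBuffer(), LockBuffer(), MarkBufferDirty() and ReleaseBuffer() are the
real bufmgr entry points; the input loop, the free-space test and the
'rel'/'targetBlock' variables are illustrative only.

    /* needs storage/bufmgr.h; keep one pin per block, lock per row */
    Buffer  buf = InvalidBuffer;

    while (got_another_tuple())                 /* illustrative */
    {
        if (!BufferIsValid(buf) || block_is_full)   /* schematic test */
        {
            if (BufferIsValid(buf))
                ReleaseBuffer(buf);         /* unpin the filled block */
            buf = ReadBuffer(rel, targetBlock); /* pin once per block */
        }

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        /* ... PageAddItem(), per-tuple WAL record, etc ... */
        MarkBufferDirty(buf);
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }

    if (BufferIsValid(buf))
        ReleaseBuffer(buf);

With this shape, ReadBuffer_common/PinBuffer/UnpinBuffer and their hash
lookups drop from once per row to once per block, while LockBuffer and
MarkBufferDirty stay per-row, matching the ~18% residue above.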
So the proposals for improving COPY are:

1) When running a COPY operation into a table with indexes and/or
triggers defined, we will do this for each block:

 * identify the block to insert into
 * pin it and keep it pinned
 * perform all inserts until full
 * unpin it

I'm posting a patch now to pgsql-patches that does this: "Bulk Insert
tuning". I measure gains of around 20% for CTAS and 15-17% for COPY,
though the variability of the results is fairly high.

2) When running CREATE TABLE AS SELECT, or COPY into a table with zero
indexes and zero triggers (yet), we will *also* do these actions for each
block:

 * load data into a private buffer, then when full
 * identify the block to insert into
 * lock the buffer, memcpy() the block into place, set tuple pointers,
   unlock
 * mark the buffer dirty once at the end
 * copy the whole buffer to WAL

That will remove all of the block lock/unlock overhead, reduce WAL
locking and reduce the overall WAL volume from large loads. (A rough
sketch of this appears at the end of this mail.)

Objective: reduce CPU overhead for initial loads by around 25-40%, i.e.
15-20% more benefit than achieved by (1) above.

3) Maintain a full-block list within the buffer manager -- like the
opposite of a freelist. Any bulk operation that finishes writing a block
adds it to the fulllist. bgwriter cleans blocks on the fulllist first,
before it starts its normal clock sweep, up to a limit of half of the
blocks it is allowed to clean. This ensures that backends which are the
source of many dirty blocks are targeted more accurately by the bgwriter
(but not exclusively), and potentially improves the chances of the
OS/disk controller being able to combine writes and perform sequential
write I/O.

Objective: improve the hit rate of bgwriter during bulk insert
operations.

4) Allow bulk insert operations to utilise a BufferAccessStrategy, so
that they reuse the same buffers in the bufferpool and avoid spoiling the
cache for everybody else. On its own this will slow down bulk operations,
since they will be forced to write their own dirty blocks; together with
(3) it should actually speed up bulk ops. (Also sketched at the end of
this mail.)

Objective: avoid cache spoiling and attempt to improve L2 cache usage.

These would be done as separate patches, in that order. I'm assuming that
we'll be able to tune index maintenance as well, though I'll leave that
to Heikki and/or Itagaki.

Comments?

-- 
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
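To make (2) concrete, a rough sketch, assuming a table with no indexes or
triggers: the page is assembled in backend-local memory and published
with a single lock/dirty/WAL cycle per block. PageInit(),
BufferGetPage(), MarkBufferDirty() and P_NEW are the real interfaces
(storage/bufpage.h, storage/bufmgr.h); the tuple filling and the WAL
record are schematic.

    char    local_page[BLCKSZ];     /* backend-private block image */
    Buffer  buf;

    PageInit((Page) local_page, BLCKSZ, 0);
    /* ... PageAddItem() incoming tuples into local_page until full ... */

    buf = ReadBuffer(rel, P_NEW);   /* extend relation by one block */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    memcpy(BufferGetPage(buf), local_page, BLCKSZ);
    MarkBufferDirty(buf);           /* dirtied once per block, not per row */
    /* ... emit one WAL record covering the whole page image ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);

And for (4), a sketch assuming a new bulk-write strategy type (the name
BAS_BULKWRITE is hypothetical here) alongside the existing BAS_BULKREAD
and BAS_VACUUM ring strategies; GetAccessStrategy(),
ReadBufferWithStrategy() and FreeAccessStrategy() are the existing
bufmgr calls:

    BufferAccessStrategy strategy;

    strategy = GetAccessStrategy(BAS_BULKWRITE);    /* hypothetical type */

    /* pass 'strategy' down so each block the load touches comes from a
     * small private ring of buffers instead of sweeping the whole pool */
    buf = ReadBufferWithStrategy(rel, targetBlock, strategy);
    /* ... insert as usual ... */

    FreeAccessStrategy(strategy);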