Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager - Mailing list pgsql-hackers

From | Andres Freund
Subject | Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Date |
Msg-id | 20200212045437.ya2d3pgpwssn3xcc@alap3.anarazel.de
In response to | Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager (Amit Kapila <amit.kapila16@gmail.com>)
Responses | Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
List | pgsql-hackers
Hi,

On 2020-02-11 08:01:34 +0530, Amit Kapila wrote:
> I don't see much downside with the patch, rather there is a
> performance increase of 3-9% in various scenarios.

As I wrote in [1] I started to look at this patch. My problem with it
is that it just seems like the wrong direction architecturally to me.
There are two main aspects to this:

1) It basically builds another, more lightweight but less capable, lock
manager that can lock more objects than we can have distinct locks for.
It is faster because it uses *one* hashtable without conflict handling,
because it has fewer lock modes, and because it doesn't support
detecting deadlocks. And probably some other things.

2) A lot of the contention around file extension comes from us doing
multiple expensive things under one lock (determining the current
relation size, searching for a victim buffer, extending the file), and
in tiny increments (growing a 1TB table by 8kb). This patch doesn't
address that at all.

I've focused on 1) in the email referenced above ([1]). Here I'll focus
on 2).

To quantify my concerns I instrumented postgres to measure the time
spent on the various operations that are part of extending a file (all
per process). The hardware is a pretty fast NVMe drive, with unlogged
tables, on a 20 core / 40 thread machine. The workload is copying a
scale 10 pgbench_accounts into an unindexed, unlogged table using
pgbench.
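Roughly, the per-backend timing was collected with instrumentation of
the following shape (a sketch only; the counter names, the helper
function and the report are illustrative, not the actual patch):

/*
 * Sketch of per-backend instrumentation around relation extension.
 * Hypothetical names; the real numbers also break out "filesystem"
 * (presumably around smgrextend()) and "buffersearch" (presumably
 * around the victim buffer search), accumulated the same way inside
 * the lock-held section.
 */
#include "postgres.h"

#include "portability/instr_time.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* accumulated per backend over the whole run */
static double extension_lock_wait_secs = 0;
static double extension_lock_held_secs = 0;

static void
extend_relation_instrumented(Relation relation)
{
	instr_time	start,
				acquired,
				released,
				wait,
				held;

	INSTR_TIME_SET_CURRENT(start);
	LockRelationForExtension(relation, ExclusiveLock);
	INSTR_TIME_SET_CURRENT(acquired);

	/*
	 * ... the work done under the lock goes here: determine current
	 * relation size, find a victim buffer, smgrextend(), PageInit() ...
	 */

	UnlockRelationForExtension(relation, ExclusiveLock);
	INSTR_TIME_SET_CURRENT(released);

	wait = acquired;
	INSTR_TIME_SUBTRACT(wait, start);		/* time blocked on the lock */
	held = released;
	INSTR_TIME_SUBTRACT(held, acquired);	/* time holding the lock */

	extension_lock_wait_secs += INSTR_TIME_GET_DOUBLE(wait);
	extension_lock_held_secs += INSTR_TIME_GET_DOUBLE(held);
}

/* emitted once per backend, e.g. at backend exit */
static void
report_extension_times(void)
{
	elog(LOG, "extension time: lock wait: %.2f lock held: %.2f",
		 extension_lock_wait_secs, extension_lock_held_secs);
}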
Here are the instrumentation numbers for various client counts,
measuring for just 20s:

1 client:
LOG:  extension time: lock wait: 0.00 lock held: 3.19 filesystem: 1.29 buffersearch: 1.58

2 clients:
LOG:  extension time: lock wait: 0.47 lock held: 2.99 filesystem: 1.24 buffersearch: 1.43
LOG:  extension time: lock wait: 0.60 lock held: 3.05 filesystem: 1.23 buffersearch: 1.50

4 clients:
LOG:  extension time: lock wait: 3.92 lock held: 2.69 filesystem: 1.10 buffersearch: 1.29
LOG:  extension time: lock wait: 4.40 lock held: 2.02 filesystem: 0.81 buffersearch: 0.93
LOG:  extension time: lock wait: 3.86 lock held: 2.59 filesystem: 1.06 buffersearch: 1.22
LOG:  extension time: lock wait: 4.00 lock held: 2.65 filesystem: 1.08 buffersearch: 1.26

8 clients:
LOG:  extension time: lock wait: 6.94 lock held: 1.74 filesystem: 0.70 buffersearch: 0.80
LOG:  extension time: lock wait: 7.16 lock held: 1.81 filesystem: 0.73 buffersearch: 0.82
LOG:  extension time: lock wait: 6.93 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG:  extension time: lock wait: 7.08 lock held: 1.87 filesystem: 0.76 buffersearch: 0.86
LOG:  extension time: lock wait: 6.95 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG:  extension time: lock wait: 6.88 lock held: 2.01 filesystem: 0.83 buffersearch: 0.93
LOG:  extension time: lock wait: 6.94 lock held: 2.02 filesystem: 0.82 buffersearch: 0.93
LOG:  extension time: lock wait: 7.02 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89

16 clients:
LOG:  extension time: lock wait: 10.37 lock held: 0.88 filesystem: 0.36 buffersearch: 0.39
LOG:  extension time: lock wait: 10.53 lock held: 0.90 filesystem: 0.37 buffersearch: 0.40
LOG:  extension time: lock wait: 10.72 lock held: 1.01 filesystem: 0.42 buffersearch: 0.45
LOG:  extension time: lock wait: 10.45 lock held: 1.25 filesystem: 0.52 buffersearch: 0.55
LOG:  extension time: lock wait: 10.66 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG:  extension time: lock wait: 10.50 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG:  extension time: lock wait: 10.53 lock held: 1.19 filesystem: 0.49 buffersearch: 0.53
LOG:  extension time: lock wait: 10.57 lock held: 1.22 filesystem: 0.50 buffersearch: 0.53
LOG:  extension time: lock wait: 10.72 lock held: 1.17 filesystem: 0.48 buffersearch: 0.52
LOG:  extension time: lock wait: 10.67 lock held: 1.32 filesystem: 0.55 buffersearch: 0.58
LOG:  extension time: lock wait: 10.95 lock held: 0.92 filesystem: 0.38 buffersearch: 0.40
LOG:  extension time: lock wait: 10.81 lock held: 1.24 filesystem: 0.51 buffersearch: 0.56
LOG:  extension time: lock wait: 10.62 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG:  extension time: lock wait: 11.14 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG:  extension time: lock wait: 11.20 lock held: 0.96 filesystem: 0.39 buffersearch: 0.42
LOG:  extension time: lock wait: 10.75 lock held: 1.41 filesystem: 0.58 buffersearch: 0.63

(The "lock held" times at 16 clients sum to 0.88 + 0.90 + 1.01 + 1.25 +
0.94 + 1.27 + 1.19 + 1.22 + 1.17 + 1.32 + 0.92 + 1.24 + 1.27 + 0.94 +
0.96 + 1.41 = 17.89 seconds.)

In *none* of these cases does the drive get even close to being
saturated. Not even to 1/3.

If you compare the total time spent with the lock held against the
total time of the test, it becomes obvious very quickly that we spend
the majority of the total time with the lock held:

client count  1:  3.18/20 = 0.16
client count  2:  6.04/20 = 0.30
client count  4:  9.95/20 = 0.50
client count  8: 15.30/20 = 0.76
client count 16: 17.89/20 = 0.89

In other words, the reason that relation extension scales terribly
isn't, to any significant degree, that the locking is slow. It's that
we hold the lock for the majority of the benchmark's time, starting at
just 4 clients. Focusing on making the locking faster is optimizing the
wrong thing: Amdahl's law restricts the possible benefit to a pretty
small amount.

Looking at a CPU time profile (i.e. it does not include the time spent
waiting for a lock once sleeping in the kernel) for time spent within
RelationGetBufferForTuple():

- 19.16%  0.29%  postgres  postgres  [.] RelationGetBufferForTuple
   - 18.88% RelationGetBufferForTuple
      - 13.18% ReadBufferExtended
         - ReadBuffer_common
            + 5.02% mdextend
            + 4.77% FlushBuffer.part.0
            + 0.61% BufTableLookup
              0.52% __memset_avx2_erms
      + 1.65% PageInit
      - 1.18% LockRelationForExtension
         - 1.16% LockAcquireExtended
            - 1.07% WaitOnLock
               - 1.01% ProcSleep
                  - 0.88% WaitLatchOrSocket
                       0.52% WaitEventSetWait
        0.65% RecordAndGetPageWithFreeSpace

The same workload, using an assert-enabled build to get a
simpler-to-interpret profile:

- 13.28%  0.19%  postgres  postgres  [.] RelationGetBufferForTuple
   - 13.09% RelationGetBufferForTuple
      - 8.35% RelationAddExtraBlocks
         - 7.67% ReadBufferBI
            - 7.54% ReadBufferExtended
               - 7.52% ReadBuffer_common
                  - 3.64% BufferAlloc
                     + 2.39% FlushBuffer
                     + 0.27% BufTableLookup
                     + 0.24% BufTableDelete
                     + 0.15% LWLockAcquire
                       0.14% StrategyGetBuffer
                     + 0.13% BufTableHashCode
                  - 2.96% smgrextend
                       + mdextend
                  + 0.52% __memset_avx2_erms
                  + 0.14% smgrnblocks
                    0.11% __GI___clock_gettime (inlined)
         + 0.57% RecordPageWithFreeSpace
      - 1.23% RecordAndGetPageWithFreeSpace
         - 1.03% fsm_set_and_search
            + 0.50% fsm_readbuf
            + 0.20% LockBuffer
            + 0.18% UnlockReleaseBuffer
              0.11% fsm_set_avail
           0.19% fsm_search
      - 0.86% ReadBufferBI
         - 0.72% ReadBufferExtended
            - ReadBuffer_common
               - 0.58% BufferAlloc
                  + 0.20% BufTableLookup
                  + 0.10% LWLockAcquire
      + 0.81% PageInit
      - 0.67% LockRelationForExtension
         - 0.67% LockAcquire
            - LockAcquireExtended
               + 0.60% WaitOnLock

Which, I think, pretty clearly shows a few things:

1) It's crucial to move acquiring a victim buffer to the outside of the
extension lock, as for COPY acquiring the victim buffer will commonly
cause a buffer to have to be written out, due to the ring buffer. This
is even more crucial when using a logged table, as the writeout then
will often also trigger a WAL flush.

While doing so will sometimes add a round of acquiring the buffer
mapping locks, having to do the FlushBuffer() while holding the
extension lock is a huge problem. This would also move a good bit of
the cost of finding (i.e. clock sweep / ring buffer replacement) and
invalidating the old buffer mapping out of the lock.

2) We need to make smgrwrite() more efficient, as it is costing a lot
of time. A small additional experiment shows the cost of doing 8kb
writes. I wrote a small program that just iteratively writes a 32GB
file:

pwrite using 8kb blocks:
0.24user 17.88system 0:18.16 elapsed 99%CPU

pwrite using 128kb blocks:
0.00user 16.71system 0:17.01 elapsed 98%CPU

pwrite using 256kb blocks:
0.00user 15.95system 0:16.03 elapsed 99%CPU

pwritev() using 16 8kb blocks to write 128kb at once:
0.02user 15.94system 0:16.09 elapsed 99%CPU

pwritev() using 32 8kb blocks to write 256kb at once:
0.01user 14.90system 0:14.93 elapsed 99%CPU

pwritev() using 128 8kb blocks to write 1MB at once:
0.00user 13.96system 0:13.96 elapsed 99%CPU

If I instead just use posix_fallocate() with 8kb blocks:
0.28user 23.49system 0:23.78elapsed 99%CPU (0avgtext+0avgdata 1212maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps

If I instead just use posix_fallocate() with 32 8kb blocks:
0.01user 1.18system 0:01.19elapsed 99%CPU (0avgtext+0avgdata 1200maxresident)k
0inputs+0outputs (0major+67minor)pagefaults 0swaps

Obviously fallocate doesn't quite have the same behaviour, and may
incur a bit of extra overhead for a later write.

Using a version that instead uses O_DIRECT + async IO, I get (but only
when also doing posix_fallocate in larger chunks):
0.05user 5.53system 0:12.53 elapsed 44%CPU

So we get considerably higher write throughput, at considerably lower
CPU usage (because DMA replaces the CPU doing a memcpy()).

So it looks like extending the file with posix_fallocate() might be a
winner, but only if we actually can do so in larger chunks than 8kb at
once. Alternatively it could be worthwhile to rejigger things so we
don't first extend the file with zeroes, just to then immediately
overwrite them with actual content. For some users it's probably
possible to pre-generate a page with contents when extending the file
(that would need fiddling with block numbers etc).
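For reference, the write-size comparison above can be reproduced with a
small standalone program along these lines (a sketch; the file name,
the batch-size handling and the flags are my choices, not necessarily
the exact program used):

/*
 * Write a 32GB file either with pwrite() of single 8kb blocks, or with
 * pwritev() batching N 8kb blocks per syscall.
 *
 * Build: cc -O2 -o write_bench write_bench.c
 * Run:   /usr/bin/time ./write_bench testfile 1    (8kb pwrite case)
 *        /usr/bin/time ./write_bench testfile 32   (256kb pwritev case)
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ		8192
#define FILE_SIZE	(32UL * 1024 * 1024 * 1024)		/* 32GB */
#define MAX_BATCH	128

int
main(int argc, char **argv)
{
	static char block[BLCKSZ];	/* zero-filled, like a freshly extended page */
	struct iovec iov[MAX_BATCH];
	const char *path = (argc > 1) ? argv[1] : "testfile";
	int			nblocks = (argc > 2) ? atoi(argv[2]) : 1;
	off_t		off = 0;
	int			fd;

	if (nblocks < 1 || nblocks > MAX_BATCH)
	{
		fprintf(stderr, "batch size must be between 1 and %d\n", MAX_BATCH);
		return 1;
	}

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	for (int i = 0; i < nblocks; i++)
	{
		iov[i].iov_base = block;
		iov[i].iov_len = BLCKSZ;
	}

	while (off < (off_t) FILE_SIZE)
	{
		ssize_t		ret;

		if (nblocks == 1)
			ret = pwrite(fd, block, BLCKSZ, off);	/* one 8kb block per syscall */
		else
			ret = pwritev(fd, iov, nblocks, off);	/* nblocks * 8kb per syscall */

		/*
		 * The posix_fallocate() variants above would instead do
		 * posix_fallocate(fd, off, (off_t) BLCKSZ * nblocks) here.
		 */

		if (ret < 0)
		{
			perror("write");
			return 1;
		}
		off += (off_t) BLCKSZ * nblocks;
	}

	close(fd);
	return 0;
}

The pwrite-with-larger-blocks numbers above correspond to passing a
single bigger buffer to pwrite() instead; the structure is otherwise
the same.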
3) We should move the PageInit() that's currently done with the
extension lock held to the outside. Since we get the buffer with
RBM_ZERO_AND_LOCK these days, that should be safe. Also, we don't need
to zero the entire buffer both in RelationGetBufferForTuple()'s
PageInit() and in ReadBuffer_common() before calling smgrextend().

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20200211042229.msv23badgqljrdg2%40alap3.anarazel.de