Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager - Mailing list pgsql-hackers

From | Andres Freund
Subject | Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Date |
Msg-id | 20200212045437.ya2d3pgpwssn3xcc@alap3.anarazel.de
In response to | Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager (Amit Kapila <amit.kapila16@gmail.com>)
Responses | Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
List | pgsql-hackers
Hi,

On 2020-02-11 08:01:34 +0530, Amit Kapila wrote:
> I don't see much downside with the patch, rather there is a
> performance increase of 3-9% in various scenarios.

As I wrote in [1] I started to look at this patch. My problem with it
is that it just seems like the wrong direction architecturally to me.
There are two main aspects to this:

1) It basically builds another, more lightweight but less capable, lock
manager that can lock more objects than we can have distinct locks for.
It is faster because it uses *one* hashtable without conflict handling,
because it has fewer lock modes, and because it doesn't support
detecting deadlocks. And probably some other things.

2) A lot of the contention around file extension comes from us doing
multiple expensive things under one lock (determining the current
relation size, searching for a victim buffer, extending the file), and
in tiny increments (growing a 1TB table by 8kb). This patch doesn't
address that at all.

I've focused on 1) in the email referenced above ([1]). Here I'll focus
on 2).

To quantify my concerns I instrumented postgres to measure the time
spent on the various operations that are part of extending a file (all
per process). The hardware is a pretty fast NVMe drive, with unlogged
tables, on a 20 core / 40 thread machine. The workload is copying a
scale 10 pgbench_accounts into an unindexed, unlogged table using
pgbench.
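Roughly, the per-backend timing was collected with instrumentation of
the following shape (a sketch only; the counter names, the helper
function and the report are illustrative, not the actual patch):

/*
 * Sketch of per-backend instrumentation around relation extension.
 * Hypothetical names; the real numbers also break out "filesystem"
 * (presumably around smgrextend()) and "buffersearch" (presumably
 * around the victim buffer search), accumulated the same way inside
 * the lock-held section.
 */
#include "postgres.h"

#include "portability/instr_time.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* accumulated per backend over the whole run */
static double extension_lock_wait_secs = 0;
static double extension_lock_held_secs = 0;

static void
extend_relation_instrumented(Relation relation)
{
	instr_time	start,
				acquired,
				released,
				wait,
				held;

	INSTR_TIME_SET_CURRENT(start);
	LockRelationForExtension(relation, ExclusiveLock);
	INSTR_TIME_SET_CURRENT(acquired);

	/*
	 * ... the work done under the lock goes here: determine current
	 * relation size, find a victim buffer, smgrextend(), PageInit() ...
	 */

	UnlockRelationForExtension(relation, ExclusiveLock);
	INSTR_TIME_SET_CURRENT(released);

	wait = acquired;
	INSTR_TIME_SUBTRACT(wait, start);		/* time blocked on the lock */
	held = released;
	INSTR_TIME_SUBTRACT(held, acquired);	/* time holding the lock */

	extension_lock_wait_secs += INSTR_TIME_GET_DOUBLE(wait);
	extension_lock_held_secs += INSTR_TIME_GET_DOUBLE(held);
}

/* emitted once per backend, e.g. at backend exit */
static void
report_extension_times(void)
{
	elog(LOG, "extension time: lock wait: %.2f lock held: %.2f",
		 extension_lock_wait_secs, extension_lock_held_secs);
}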
Here are the instrumentation numbers for various client counts,
measuring for just 20s:

1 client:
LOG:  extension time: lock wait: 0.00 lock held: 3.19 filesystem: 1.29 buffersearch: 1.58

2 clients:
LOG:  extension time: lock wait: 0.47 lock held: 2.99 filesystem: 1.24 buffersearch: 1.43
LOG:  extension time: lock wait: 0.60 lock held: 3.05 filesystem: 1.23 buffersearch: 1.50

4 clients:
LOG:  extension time: lock wait: 3.92 lock held: 2.69 filesystem: 1.10 buffersearch: 1.29
LOG:  extension time: lock wait: 4.40 lock held: 2.02 filesystem: 0.81 buffersearch: 0.93
LOG:  extension time: lock wait: 3.86 lock held: 2.59 filesystem: 1.06 buffersearch: 1.22
LOG:  extension time: lock wait: 4.00 lock held: 2.65 filesystem: 1.08 buffersearch: 1.26

8 clients:
LOG:  extension time: lock wait: 6.94 lock held: 1.74 filesystem: 0.70 buffersearch: 0.80
LOG:  extension time: lock wait: 7.16 lock held: 1.81 filesystem: 0.73 buffersearch: 0.82
LOG:  extension time: lock wait: 6.93 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG:  extension time: lock wait: 7.08 lock held: 1.87 filesystem: 0.76 buffersearch: 0.86
LOG:  extension time: lock wait: 6.95 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG:  extension time: lock wait: 6.88 lock held: 2.01 filesystem: 0.83 buffersearch: 0.93
LOG:  extension time: lock wait: 6.94 lock held: 2.02 filesystem: 0.82 buffersearch: 0.93
LOG:  extension time: lock wait: 7.02 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89

16 clients:
LOG:  extension time: lock wait: 10.37 lock held: 0.88 filesystem: 0.36 buffersearch: 0.39
LOG:  extension time: lock wait: 10.53 lock held: 0.90 filesystem: 0.37 buffersearch: 0.40
LOG:  extension time: lock wait: 10.72 lock held: 1.01 filesystem: 0.42 buffersearch: 0.45
LOG:  extension time: lock wait: 10.45 lock held: 1.25 filesystem: 0.52 buffersearch: 0.55
LOG:  extension time: lock wait: 10.66 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG:  extension time: lock wait: 10.50 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG:  extension time: lock wait: 10.53 lock held: 1.19 filesystem: 0.49 buffersearch: 0.53
LOG:  extension time: lock wait: 10.57 lock held: 1.22 filesystem: 0.50 buffersearch: 0.53
LOG:  extension time: lock wait: 10.72 lock held: 1.17 filesystem: 0.48 buffersearch: 0.52
LOG:  extension time: lock wait: 10.67 lock held: 1.32 filesystem: 0.55 buffersearch: 0.58
LOG:  extension time: lock wait: 10.95 lock held: 0.92 filesystem: 0.38 buffersearch: 0.40
LOG:  extension time: lock wait: 10.81 lock held: 1.24 filesystem: 0.51 buffersearch: 0.56
LOG:  extension time: lock wait: 10.62 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG:  extension time: lock wait: 11.14 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG:  extension time: lock wait: 11.20 lock held: 0.96 filesystem: 0.39 buffersearch: 0.42
LOG:  extension time: lock wait: 10.75 lock held: 1.41 filesystem: 0.58 buffersearch: 0.63

(The "lock held" times at 16 clients sum to 0.88 + 0.90 + 1.01 + 1.25 +
0.94 + 1.27 + 1.19 + 1.22 + 1.17 + 1.32 + 0.92 + 1.24 + 1.27 + 0.94 +
0.96 + 1.41 = 17.89 seconds.)

In *none* of these cases does the drive get even close to being
saturated. Not even to 1/3.

If you compare the total time spent with the lock held against the
total time of the test, it becomes obvious very quickly that we spend
the majority of the total time with the lock held:

client count  1:  3.18/20 = 0.16
client count  2:  6.04/20 = 0.30
client count  4:  9.95/20 = 0.50
client count  8: 15.30/20 = 0.76
client count 16: 17.89/20 = 0.89

In other words, the reason that relation extension scales terribly
isn't, to any significant degree, that the locking is slow. It's that
we hold the lock for the majority of the benchmark's time, starting at
just 4 clients. Focusing on making the locking faster is optimizing the
wrong thing: Amdahl's law restricts the possible benefit to a pretty
small amount.

Looking at a CPU time profile (i.e. it does not include the time spent
waiting for a lock once sleeping in the kernel) for time spent within
RelationGetBufferForTuple():

- 19.16%  0.29%  postgres  postgres  [.] RelationGetBufferForTuple
   - 18.88% RelationGetBufferForTuple
      - 13.18% ReadBufferExtended
         - ReadBuffer_common
            + 5.02% mdextend
            + 4.77% FlushBuffer.part.0
            + 0.61% BufTableLookup
              0.52% __memset_avx2_erms
      + 1.65% PageInit
      - 1.18% LockRelationForExtension
         - 1.16% LockAcquireExtended
            - 1.07% WaitOnLock
               - 1.01% ProcSleep
                  - 0.88% WaitLatchOrSocket
                       0.52% WaitEventSetWait
        0.65% RecordAndGetPageWithFreeSpace

The same workload, using an assert-enabled build to get a
simpler-to-interpret profile:

- 13.28%  0.19%  postgres  postgres  [.] RelationGetBufferForTuple
   - 13.09% RelationGetBufferForTuple
      - 8.35% RelationAddExtraBlocks
         - 7.67% ReadBufferBI
            - 7.54% ReadBufferExtended
               - 7.52% ReadBuffer_common
                  - 3.64% BufferAlloc
                     + 2.39% FlushBuffer
                     + 0.27% BufTableLookup
                     + 0.24% BufTableDelete
                     + 0.15% LWLockAcquire
                       0.14% StrategyGetBuffer
                     + 0.13% BufTableHashCode
                  - 2.96% smgrextend
                       + mdextend
                  + 0.52% __memset_avx2_erms
                  + 0.14% smgrnblocks
                    0.11% __GI___clock_gettime (inlined)
         + 0.57% RecordPageWithFreeSpace
      - 1.23% RecordAndGetPageWithFreeSpace
         - 1.03% fsm_set_and_search
            + 0.50% fsm_readbuf
            + 0.20% LockBuffer
            + 0.18% UnlockReleaseBuffer
              0.11% fsm_set_avail
           0.19% fsm_search
      - 0.86% ReadBufferBI
         - 0.72% ReadBufferExtended
            - ReadBuffer_common
               - 0.58% BufferAlloc
                  + 0.20% BufTableLookup
                  + 0.10% LWLockAcquire
      + 0.81% PageInit
      - 0.67% LockRelationForExtension
         - 0.67% LockAcquire
            - LockAcquireExtended
               + 0.60% WaitOnLock

Which, I think, pretty clearly shows a few things:

1) It's crucial to move acquiring a victim buffer to the outside of the
extension lock, as for COPY acquiring the victim buffer will commonly
cause a buffer to have to be written out, due to the ring buffer. This
is even more crucial when using a logged table, as the writeout then
will often also trigger a WAL flush.

While doing so will sometimes add a round of acquiring the buffer
mapping locks, having to do the FlushBuffer() while holding the
extension lock is a huge problem. This would also move a good bit of
the cost of finding (i.e. clock sweep / ring buffer replacement) and
invalidating the old buffer mapping out of the lock.

2) We need to make smgrwrite() more efficient, as it is costing a lot
of time. A small additional experiment shows the cost of doing 8kb
writes. I wrote a small program that just iteratively writes a 32GB
file:

pwrite using 8kb blocks:
0.24user 17.88system 0:18.16 elapsed 99%CPU

pwrite using 128kb blocks:
0.00user 16.71system 0:17.01 elapsed 98%CPU

pwrite using 256kb blocks:
0.00user 15.95system 0:16.03 elapsed 99%CPU

pwritev() using 16 8kb blocks to write 128kb at once:
0.02user 15.94system 0:16.09 elapsed 99%CPU

pwritev() using 32 8kb blocks to write 256kb at once:
0.01user 14.90system 0:14.93 elapsed 99%CPU

pwritev() using 128 8kb blocks to write 1MB at once:
0.00user 13.96system 0:13.96 elapsed 99%CPU

If I instead just use posix_fallocate() with 8kb blocks:
0.28user 23.49system 0:23.78elapsed 99%CPU (0avgtext+0avgdata 1212maxresident)k
0inputs+0outputs (0major+66minor)pagefaults 0swaps

If I instead just use posix_fallocate() with 32 8kb blocks:
0.01user 1.18system 0:01.19elapsed 99%CPU (0avgtext+0avgdata 1200maxresident)k
0inputs+0outputs (0major+67minor)pagefaults 0swaps

Obviously fallocate doesn't quite have the same behaviour, and may
incur a bit of extra overhead for a later write.

Using a version that instead uses O_DIRECT + async IO, I get (but only
when also doing posix_fallocate in larger chunks):
0.05user 5.53system 0:12.53 elapsed 44%CPU

So we get considerably higher write throughput, at considerably lower
CPU usage (because DMA replaces the CPU doing a memcpy()).

So it looks like extending the file with posix_fallocate() might be a
winner, but only if we actually can do so in larger chunks than 8kb at
once. Alternatively it could be worthwhile to rejigger things so we
don't first extend the file with zeroes, just to then immediately
overwrite them with actual content. For some users it's probably
possible to pre-generate a page with contents when extending the file
(that would need fiddling with block numbers etc).
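For reference, the write-size comparison above can be reproduced with a
small standalone program along these lines (a sketch; the file name,
the batch-size handling and the flags are my choices, not necessarily
the exact program used):

/*
 * Write a 32GB file either with pwrite() of single 8kb blocks, or with
 * pwritev() batching N 8kb blocks per syscall.
 *
 * Build: cc -O2 -o write_bench write_bench.c
 * Run:   /usr/bin/time ./write_bench testfile 1    (8kb pwrite case)
 *        /usr/bin/time ./write_bench testfile 32   (256kb pwritev case)
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ		8192
#define FILE_SIZE	(32UL * 1024 * 1024 * 1024)		/* 32GB */
#define MAX_BATCH	128

int
main(int argc, char **argv)
{
	static char block[BLCKSZ];	/* zero-filled, like a freshly extended page */
	struct iovec iov[MAX_BATCH];
	const char *path = (argc > 1) ? argv[1] : "testfile";
	int			nblocks = (argc > 2) ? atoi(argv[2]) : 1;
	off_t		off = 0;
	int			fd;

	if (nblocks < 1 || nblocks > MAX_BATCH)
	{
		fprintf(stderr, "batch size must be between 1 and %d\n", MAX_BATCH);
		return 1;
	}

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	for (int i = 0; i < nblocks; i++)
	{
		iov[i].iov_base = block;
		iov[i].iov_len = BLCKSZ;
	}

	while (off < (off_t) FILE_SIZE)
	{
		ssize_t		ret;

		if (nblocks == 1)
			ret = pwrite(fd, block, BLCKSZ, off);	/* one 8kb block per syscall */
		else
			ret = pwritev(fd, iov, nblocks, off);	/* nblocks * 8kb per syscall */

		/*
		 * The posix_fallocate() variants above would instead do
		 * posix_fallocate(fd, off, (off_t) BLCKSZ * nblocks) here.
		 */

		if (ret < 0)
		{
			perror("write");
			return 1;
		}
		off += (off_t) BLCKSZ * nblocks;
	}

	close(fd);
	return 0;
}

The pwrite-with-larger-blocks numbers above correspond to passing a
single bigger buffer to pwrite() instead; the structure is otherwise
the same.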
3) We should move the PageInit() that's currently done with the
extension lock held to the outside. Since we get the buffer with
RBM_ZERO_AND_LOCK these days, that should be safe. Also, we don't need
to zero the entire buffer both in RelationGetBufferForTuple()'s
PageInit() and in ReadBuffer_common() before calling smgrextend().

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20200211042229.msv23badgqljrdg2%40alap3.anarazel.de