Re: measuring lwlock-related latency spikes - Mailing list pgsql-hackers
| From | Robert Haas | 
|---|---|
| Subject | Re: measuring lwlock-related latency spikes | 
| Date | |
| Msg-id | CA+TgmoY073r4-Hp7C3ULat2oLPFZ=udueC+EPj4T+xS_ZWwiTA@mail.gmail.com Whole thread Raw | 
| In response to | Re: measuring lwlock-related latency spikes (Robert Haas <robertmhaas@gmail.com>) | 
| Responses | Re: measuring lwlock-related latency spikes Re: measuring lwlock-related latency spikes Re: measuring lwlock-related latency spikes | 
| List | pgsql-hackers | 
On Sat, Mar 31, 2012 at 9:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've discovered a bug in my code that was causing it to print at most
> 2 histogram buckets per lwlock, which obviously means that my previous
> results were totally inaccurate.  Ah, the joys of benchmarking.  I
> found the problem when I added code to log a message any time an
> lwlock wait exceeded a certain time threshold, and it fired far more
> often than the previous results would have indicated.  In particular,
> it turns out that long waits for WALInsertLock are extremely common
> and not, as the previous results appeared to indicated, unheard-of.
> I'm rerunning my tests now and will post the updated,
> hopefully-accurate results when that's done.
>
> /me attempts to remove egg from face.
All right, so with the aforementioned bug fixed (see attached, revised
patch), there are now massive latency spikes popping out all over the
place: on my latest run, there were 377 distinct lwlocks that took >=
1s to acquire on at least one occasion during this 30-minute run.
Some of those locks, of course, had more than one problem event.  In
total, somebody waited >= 1 s for a lock 4897 times during this test.
These break down as follows.  Note that the "blocked by" is the person
who had most recently acquired the lock as of the start of the wait,
and is not necessarily solely responsible for the full duration of the
wait due to shared locks and queue jumping.
      1 waited at heapam.c:1651 blocked by bufmgr.c:2475
      1 waited at heapam.c:2844 blocked by heapam.c:2758
      1 waited at hio.c:335 blocked by heapam.c:1651
      1 waited at hio.c:336 blocked by hio.c:336
      1 waited at indexam.c:521 blocked by hio.c:345
      1 waited at xlog.c:2090 blocked by xlog.c:2090
      2 waited at bufmgr.c:1671 blocked by bufmgr.c:2475
      2 waited at indexam.c:521 blocked by heapam.c:3464
      2 waited at nbtpage.c:650 blocked by nbtinsert.c:124
      2 waited at xlog.c:1502 blocked by xlog.c:2090
      2 waited at xlog.c:2241 blocked by xlog.c:1502
      3 waited at slru.c:310 blocked by slru.c:404
      4 waited at indexam.c:521 blocked by hio.c:335
      4 waited at indexam.c:521 blocked by hio.c:336
      4 waited at xlog.c:2241 blocked by xlog.c:2090
      6 waited at hio.c:336 blocked by heapam.c:2758
     12 waited at indexam.c:521 blocked by bufmgr.c:2475
     20 waited at xlog.c:2090 blocked by xlog.c:2241
     26 waited at heapam.c:2758 blocked by indexam.c:521
     29 waited at heapam.c:2758 blocked by heapam.c:2758
     80 waited at xlog.c:1502 blocked by xlog.c:2241
     89 waited at indexam.c:521 blocked by heapam.c:2758
    115 waited at varsup.c:65 blocked by varsup.c:65
   1540 waited at slru.c:310 blocked by slru.c:526
   2948 waited at xlog.c:909 blocked by xlog.c:909
xlog.c:909 is the LWLockAcquire of WALInsertLock from within
XLogInsert.  slru.c:310 is in SimpleLruWaitIO(), where we attempt to
grab the SLRU buffer lock in shared mode after releasing the control
lock.  slru.c:526 is in SlruInternalWritePage(), where we hold the
buffer lock while writing the page.  This is commit
194b5ea3d0722f94e8a6ba9cec03b858cc8c9370, if you want to look up an of
the other line numbers.
If I filter for waits greater than 8s, a somewhat different picture emerges:
      2 waited at indexam.c:521 blocked by bufmgr.c:2475
    212 waited at slru.c:310 blocked by slru.c:526
In other words, some of the waits for SLRU pages to be written are...
really long.  There were 126 that exceeded 10 seconds and 56 that
exceeded 12 seconds.  "Painful" is putting it mildly.
I suppose one interesting question is to figure out if there's a way I
can optimize the disk configuration in this machine, or the Linux I/O
scheduler, or something, so as to reduce the amount of time it spends
waiting for the disk.  But the other thing is why we're waiting for
SLRU page writes to begin with.  My guess based on previous testing is
that what's happening here is (1) we examine a tuple on an old page
and decide we must look up its XID, (2) the relevant CLOG page isn't
in cache so we decide to read it, but (3) the page we decide to evict
happens to be dirty, so we have to write it first.  That sure seems
like something that a smart background writer ought to be able to fix
for us.  Simon previously posted a patch for that:
http://archives.postgresql.org/pgsql-hackers/2012-01/msg00571.php
...but the testing I did at the time didn't seem to show a real clear benefit:
http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results
The obvious question here is: was that a problem with the patch, or a
problem with my testing methodology, or is it just that the
performance characteristics of the machine I used for that test (Nate
Boley's 32-core AMD box) were different from this one (IBM POWER7)?  I
don't know, and I think I'm out of time to play with this for this
weekend, but I'll investigate further when time permits.  Other
thoughts welcome.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
		
	Attachment
pgsql-hackers by date: