Re: measuring lwlock-related latency spikes - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: measuring lwlock-related latency spikes
Date:
Msg-id: CA+TgmoY073r4-Hp7C3ULat2oLPFZ=udueC+EPj4T+xS_ZWwiTA@mail.gmail.com
In response to: Re: measuring lwlock-related latency spikes (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: measuring lwlock-related latency spikes
           Re: measuring lwlock-related latency spikes
           Re: measuring lwlock-related latency spikes
List: pgsql-hackers
On Sat, Mar 31, 2012 at 9:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've discovered a bug in my code that was causing it to print at most
> 2 histogram buckets per lwlock, which obviously means that my previous
> results were totally inaccurate.  Ah, the joys of benchmarking.  I
> found the problem when I added code to log a message any time an
> lwlock wait exceeded a certain time threshold, and it fired far more
> often than the previous results would have indicated.  In particular,
> it turns out that long waits for WALInsertLock are extremely common
> and not, as the previous results appeared to indicate, unheard-of.
> I'm rerunning my tests now and will post the updated,
> hopefully-accurate results when that's done.
>
> /me attempts to remove egg from face.

All right, so with the aforementioned bug fixed (see attached, revised patch), there are now massive latency spikes popping out all over the place: on my latest run, there were 377 distinct lwlocks that took >= 1s to acquire on at least one occasion during this 30-minute run.  Some of those locks, of course, had more than one problem event.  In total, somebody waited >= 1 s for a lock 4897 times during this test.  These break down as follows.  Note that the "blocked by" is the person who had most recently acquired the lock as of the start of the wait, and is not necessarily solely responsible for the full duration of the wait, due to shared locks and queue jumping.

      1 waited at heapam.c:1651 blocked by bufmgr.c:2475
      1 waited at heapam.c:2844 blocked by heapam.c:2758
      1 waited at hio.c:335 blocked by heapam.c:1651
      1 waited at hio.c:336 blocked by hio.c:336
      1 waited at indexam.c:521 blocked by hio.c:345
      1 waited at xlog.c:2090 blocked by xlog.c:2090
      2 waited at bufmgr.c:1671 blocked by bufmgr.c:2475
      2 waited at indexam.c:521 blocked by heapam.c:3464
      2 waited at nbtpage.c:650 blocked by nbtinsert.c:124
      2 waited at xlog.c:1502 blocked by xlog.c:2090
      2 waited at xlog.c:2241 blocked by xlog.c:1502
      3 waited at slru.c:310 blocked by slru.c:404
      4 waited at indexam.c:521 blocked by hio.c:335
      4 waited at indexam.c:521 blocked by hio.c:336
      4 waited at xlog.c:2241 blocked by xlog.c:2090
      6 waited at hio.c:336 blocked by heapam.c:2758
     12 waited at indexam.c:521 blocked by bufmgr.c:2475
     20 waited at xlog.c:2090 blocked by xlog.c:2241
     26 waited at heapam.c:2758 blocked by indexam.c:521
     29 waited at heapam.c:2758 blocked by heapam.c:2758
     80 waited at xlog.c:1502 blocked by xlog.c:2241
     89 waited at indexam.c:521 blocked by heapam.c:2758
    115 waited at varsup.c:65 blocked by varsup.c:65
   1540 waited at slru.c:310 blocked by slru.c:526
   2948 waited at xlog.c:909 blocked by xlog.c:909

xlog.c:909 is the LWLockAcquire of WALInsertLock from within XLogInsert.  slru.c:310 is in SimpleLruWaitIO(), where we attempt to grab the SLRU buffer lock in shared mode after releasing the control lock.  slru.c:526 is in SlruInternalWritePage(), where we hold the buffer lock while writing the page.  This is commit 194b5ea3d0722f94e8a6ba9cec03b858cc8c9370, if you want to look up any of the other line numbers.

If I filter for waits greater than 8s, a somewhat different picture emerges:

      2 waited at indexam.c:521 blocked by bufmgr.c:2475
    212 waited at slru.c:310 blocked by slru.c:526

In other words, some of the waits for SLRU pages to be written are... really long.  There were 126 that exceeded 10 seconds and 56 that exceeded 12 seconds.  "Painful" is putting it mildly.
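[For illustration only: the following is a minimal standalone C sketch of the measurement idea described above (time each lock acquisition, feed the wait into a power-of-two histogram, and log any wait past a threshold).  It is not the attached patch; the lock, the bucket scheme, the 1-second threshold, and all names here are assumptions invented for the sketch.]

/*
 * Standalone sketch, not the attached patch: time a lock acquisition,
 * put the wait into a power-of-two microsecond histogram, and log any
 * wait that exceeds a threshold.  Uses a plain pthread mutex rather
 * than a real lwlock; all names and thresholds are illustrative.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

#define NBUCKETS            32
#define LOG_THRESHOLD_USEC  1000000     /* log waits >= 1 second */

/* bucket i roughly counts waits in [2^i, 2^(i+1)) microseconds */
static uint64_t wait_histogram[NBUCKETS];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t
now_usec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (uint64_t) tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Acquire the lock, recording how long we waited for it. */
static void
timed_acquire(const char *waiter_location)
{
    uint64_t    start = now_usec();
    uint64_t    elapsed;
    int         bucket = 0;

    pthread_mutex_lock(&lock);
    elapsed = now_usec() - start;

    /* Find the power-of-two bucket this wait falls into. */
    while (bucket < NBUCKETS - 1 &&
           elapsed >= ((uint64_t) 1 << (bucket + 1)))
        bucket++;
    wait_histogram[bucket]++;

    if (elapsed >= LOG_THRESHOLD_USEC)
        fprintf(stderr, "long lock wait at %s: %llu usec\n",
                waiter_location, (unsigned long long) elapsed);
}

int
main(void)
{
    int i;

    /* Uncontended acquisitions, just to populate the histogram. */
    for (i = 0; i < 1000; i++)
    {
        timed_acquire("example.c:42");
        pthread_mutex_unlock(&lock);
    }

    for (i = 0; i < NBUCKETS; i++)
        if (wait_histogram[i] != 0)
            printf("bucket %d (< %llu usec): %llu waits\n",
                   i, (unsigned long long) ((uint64_t) 1 << (i + 1)),
                   (unsigned long long) wait_histogram[i]);
    return 0;
}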
I suppose one interesting question is to figure out if there's a way I can optimize the disk configuration in this machine, or the Linux I/O scheduler, or something, so as to reduce the amount of time it spends waiting for the disk.  But the other thing is why we're waiting for SLRU page writes to begin with.  My guess based on previous testing is that what's happening here is (1) we examine a tuple on an old page and decide we must look up its XID, (2) the relevant CLOG page isn't in cache so we decide to read it, but (3) the page we decide to evict happens to be dirty, so we have to write it first.  That sure seems like something that a smart background writer ought to be able to fix for us.  Simon previously posted a patch for that:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg00571.php

...but the testing I did at the time didn't seem to show a real clear benefit:

http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

The obvious question here is: was that a problem with the patch, or a problem with my testing methodology, or is it just that the performance characteristics of the machine I used for that test (Nate Boley's 32-core AMD box) were different from this one (IBM POWER7)?  I don't know, and I think I'm out of time to play with this for this weekend, but I'll investigate further when time permits.  Other thoughts welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
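[For illustration only: a minimal standalone C sketch of the hypothesized sequence (1)-(3) above, not PostgreSQL's actual SLRU code.  It models a toy write-back page cache where a read miss that selects a dirty victim must perform a synchronous write before the read can proceed; the slot count, page numbers, and eviction policy are invented for the example.  The point is only that a background writer that cleans dirty pages in advance takes that write off the reader's critical path.]

/*
 * Illustrative sketch only (not PostgreSQL's SLRU code): a tiny
 * write-back page cache in which a read miss that picks a dirty
 * victim must write that page out before the read can proceed --
 * the stall hypothesized above.  All structures and sizes invented.
 */
#include <stdbool.h>
#include <stdio.h>

#define NSLOTS 8

struct slot
{
    int     page_no;    /* which page is cached here, or -1 if empty */
    bool    dirty;      /* must be written back before reuse */
};

static struct slot cache[NSLOTS];

/* Stand-ins for real I/O; in the scenario above these are disk waits. */
static void
write_page(int page_no)
{
    printf("  sync write of dirty page %d (reader waits here)\n", page_no);
}

static void
read_page(int page_no)
{
    printf("  read page %d\n", page_no);
}

/* Return the cache slot holding page_no, faulting it in if necessary. */
static int
fetch_page(int page_no)
{
    int i;
    int victim = 0;

    for (i = 0; i < NSLOTS; i++)
        if (cache[i].page_no == page_no)
            return i;                   /* hit: no I/O at all */

    /* Miss: pick a victim -- first empty slot, else slot 0. */
    for (i = 0; i < NSLOTS; i++)
        if (cache[i].page_no == -1)
        {
            victim = i;
            break;
        }

    if (cache[victim].page_no != -1 && cache[victim].dirty)
        write_page(cache[victim].page_no);      /* the painful step */

    read_page(page_no);
    cache[victim].page_no = page_no;
    cache[victim].dirty = false;
    return victim;
}

int
main(void)
{
    int i;

    /* Fill the cache; slot 0 holds a page that has been dirtied. */
    for (i = 0; i < NSLOTS; i++)
    {
        cache[i].page_no = 100 + i;
        cache[i].dirty = false;
    }
    cache[0].dirty = true;

    printf("lookup of uncached page 999:\n");
    fetch_page(999);    /* evicts slot 0: must write page 100 first */
    return 0;
}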
Attachment