Re: High SYS CPU - need advise - Mailing list pgsql-general
From | Vlad Marchenko
---|---
Subject | Re: High SYS CPU - need advise
Date | 
Msg-id | 50ACF33D.20609@gmail.com
In response to | Re: High SYS CPU - need advise (Merlin Moncure <mmoncure@gmail.com>)
Responses | Re: High SYS CPU - need advise, Re: High SYS CPU - need advise
List | pgsql-general
Update on my problem: despite pgbouncer, the problem still occurs on my end.

Also, an interesting observation - I ran several tests with pgbench, using
queries that I think are prone to trigger the high-sys-cpu stall. What I
noticed is that when pgbench is started in prepared mode, the system behaves
fine during the stress test (high user cpu - 85-90%, low sys cpu - 5-7%) and
TPS is high. But when I used non-prepared modes, the sys cpu portion jumps to
40% (and tps drops dramatically as well, though that is understandable). The
test queries are pretty long (10kb+), with a couple of outer joins across
1000-record tables with indexes. Maybe we are looking in the wrong place and
the issue is somewhere within the planner/parser? Is there some extensive
locking used in there?

Another observation - it's harder to trigger the high-sys-cpu stall on a
freshly restarted postgresql; if it has been running for a while, it's much
easier to do.

-- vlad

On 11/19/12 8:33 AM, Merlin Moncure wrote:
> On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko@gmail.com> wrote:
>>> ok, I've applied that patch and ran. The stall started around 13:50:45...50
>>> and lasted until the end
>>>
>>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>>
>> That isn't as much log as I expected. But I guess only the tip of the
>> iceberg gets logged, it could be that most spinlocks are contended but
>> then get acquired just before the sleep (and log) would have happened.
>>
>> I'm not sure how to figure out just how much spinning is going on
>> below the surface, but my gut feeling is there isn't enough of it to
>> explain the amount of slowdown you are seeing (and it would probably
>> be accounted for as user time rather than system time)
>
> yes. classic spinlock contention symptoms are high user cpu and almost
> no work getting done. something else is going on here?
>
>> Maybe we could whip up a dummy standalone program that uses PG's
>> spinlock code in an intentionally contended way and see how your
>> system reports resource usage for that. Maybe your system reports
>> fighting for cache lines or whatever is going on behind the scenes as
>> system time rather than user time, or something.
> I don't think this is the case. It's looking more and more like
> scheduler issues, although it's unclear if that is due to some kind of
> regression or just symptoms of backends piling up. I'm also starting
> to wonder if the spinlock contention we are seeing (which according to
> your patch really isn't all that much) is just symptomatic of
> scheduler issues.
>
>> It would be a little easier to figure out what was going on there if
>> the log included the pid (%p) so we can see if it is the same process
>> doing a bunch of consecutive sleeps, or a bunch of different
>> processes. Also, logging the millisecond (%m rather than %t) would be
>> helpful.
>>
>> But the way that the point of contention jumps around a lot points to
>> a scheduler issue rather than a postgres-specific issue. A backend
>> gets pre-empted while holding a spinlock, and then the kernel is
>> resistant to rescheduling it even though all other processes are piled
>> up behind it. But again, that theory would lead to more log lines
>> than we see, I would think.
> yeah. I asked for some longer strace logs but then everyone went home
> for the weekend.
>
>> Anyway, pgbouncer might help by ensuring that a connection gets
>> pre-empted by pgbouncer at a point where it has no locks, rather than
>> by the kernel at a point where it does hold locks.
> yes -- I think pgbouncer is the key here because it's going to tell us
> if the scheduler is the culprit here. If you look carefully at the
> strace logs, you'll see that system call times become unpredictable as
> they are getting scheduled out...particularly yielding calls like
> select(). So we need to determine if this is normal high load
> behavior or if there is an underlying mechanic.
>
>> Any chance you could try an older kernel and see if that reduces the problem?
> This is one thing to try. Personally I think pgbouncer/transaction
> mode is low hanging fruit if it 'drops in' to OP's app. If the problem
> goes away this will pretty much point the finger at Linux AFAICT (in
> which case we can start messing around with the kernel).
>
> merlin
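For reference, the prepared vs. non-prepared comparison Vlad describes maps onto pgbench's `-M` (protocol mode) switch. The commands below are only a sketch: the script name, client/thread counts, duration and database name are placeholders, not the values used in Vlad's tests.

```sh
# Non-prepared ("simple") protocol: the full query text is sent, parsed and
# planned on every execution -- the case that showed ~40% sys CPU.
pgbench -M simple -f big_join_query.sql -c 32 -j 4 -T 300 testdb

# Prepared protocol: the query is parsed/planned once per client connection,
# then only executed -- the case that behaved fine (high user, low sys CPU).
pgbench -M prepared -f big_join_query.sql -c 32 -j 4 -T 300 testdb
```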
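Jeff's request to log the PID (%p) and millisecond timestamps (%m) translates to a log_line_prefix change along these lines; the exact prefix string is just one reasonable choice, not something prescribed in the thread.

```sh
# postgresql.conf -- log_line_prefix is reloadable, no restart needed
# (pg_ctl reload, or: SELECT pg_reload_conf();)
log_line_prefix = '%m [%p] '   # %m = timestamp with milliseconds, %p = process ID
```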
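Merlin's "pgbouncer/transaction mode" suggestion amounts to a pgbouncer.ini along the lines below, with the application connecting to port 6432 instead of 5432. Addresses, the database name, pool size and auth file path are placeholders to adapt to the actual setup.

```ini
[databases]
; placeholder connection string -- point it at the real database
testdb = host=127.0.0.1 port=5432 dbname=testdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: the server connection is returned to the pool at each
; transaction end, i.e. at a point where the session holds no locks
pool_mode = transaction
; keep the pool small (roughly the core count) so backends do not pile up
default_pool_size = 20
max_client_conn = 500
```

One caveat, relevant because prepared mode performed better in the pgbench tests: with pool_mode = transaction, session state such as server-side prepared statements does not survive across transactions, so an application relying on them would need to prepare per transaction or use session pooling instead.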