Re: High SYS CPU - need advise - Mailing list pgsql-general
From | Vlad Marchenko
---|---
Subject | Re: High SYS CPU - need advise
Date | 
Msg-id | 50ACF33D.20609@gmail.com
In response to | Re: High SYS CPU - need advise (Merlin Moncure <mmoncure@gmail.com>)
Responses | Re: High SYS CPU - need advise, Re: High SYS CPU - need advise
List | pgsql-general
Update on my problem: despite pgbouncer, the problem still occurs on my end.

Also, an interesting observation - I ran several tests with pgbench, using
queries that I think are prone to trigger the high-sys-cpu stall. What I
noticed is that when pgbench is started in prepared mode, the system behaves
fine during the stress test (high user cpu - 85-90%, low sys cpu - 5-7%) and
TPS is high. But when I used non-prepared modes, the sys cpu portion jumps to
40% (and tps drops dramatically as well, though that is understandable). The
test queries are pretty long (10kb+), with a couple of outer joins across
1000-record tables with indexes. Maybe we are looking in the wrong place and
the issue is somewhere within the planner/parser? Is there some extensive
locking used in there?

Another observation - it's harder to trigger the high-sys-cpu stall on a
freshly restarted postgresql; if it has been running for a while, it's much
easier to do.

-- vlad

On 11/19/12 8:33 AM, Merlin Moncure wrote:
> On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko@gmail.com> wrote:
>>> ok, I've applied that patch and ran. The stall started around 13:50:45...50
>>> and lasted until the end
>>>
>>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>>
>> That isn't as much log as I expected. But I guess only the tip of the
>> iceberg gets logged, it could be that most spinlocks are contended but
>> then get acquired just before the sleep (and log) would have happened.
>>
>> I'm not sure how to figure out just how much spinning is going on
>> below the surface, but my gut feeling is there isn't enough of it to
>> explain the amount of slowdown you are seeing (and it would probably
>> be accounted for as user time rather than system time)
>
> yes. classic spinlock contention symptoms are high user cpu and almost
> no work getting done. something else is going on here?
>
>> Maybe we could whip up a dummy standalone program that uses PG's
>> spinlock code in an intentionally contended way and see how your
>> system reports resource usage for that. Maybe your system reports
>> fighting for cache lines or whatever is going on behind the scenes as
>> system time rather than user time, or something.
> I don't think this is the case. It's looking more and more like
> scheduler issues, although it's unclear if that is due to some kind of
> regression or just symptoms of backends piling up. I'm also starting
> to wonder if the spinlock contention we are seeing (which according to
> your patch really isn't all that much) is just symptomatic of
> scheduler issues.
>
>> It would be a little easier to figure out what was going on there if
>> the log included the pid (%p) so we can see if it is the same process
>> doing a bunch of consecutive sleeps, or a bunch of different
>> processes. Also, logging the millisecond (%m rather than %t) would be
>> helpful.
>>
>> But the way that the point of contention jumps around a lot points to
>> a scheduler issue rather than a postgres-specific issue. A backend
>> gets pre-empted while holding a spinlock, and then the kernel is
>> resistant to rescheduling it even though all other processes are piled
>> up behind it. But again, that theory would lead to more log lines
>> than we see, I would think.
> yeah. I asked for some longer strace logs but then everyone went home
> for the weekend.
>
>> Anyway, pgbouncer might help by ensuring that a connection gets
>> pre-empted by pgbouncer at a point where it has no locks, rather than
>> by the kernel at a point where it does hold locks.
> yes -- I think pgbouncer is the key here because it's going to tell us
> if the scheduler is the culprit here. If you look carefully at the
> strace logs, you'll see that system call times become unpredictable as
> they are getting scheduled out...particularly yielding calls like
> select(). So we need to determine if this is normal high load
> behavior or if there is an underlying mechanic.
>
>> Any chance you could try an older kernel and see if that reduces the problem?
> This is one thing to try. Personally I think pgbouncer/transaction
> mode is low hanging fruit if it 'drops in' to OP's app. If the problem
> goes away this will pretty much point the finger at Linux AFAICT (in
> which case we can start messing around with the kernel).
>
> merlin
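For reference, the prepared vs. non-prepared comparison Vlad describes maps onto pgbench's `-M` (protocol mode) switch. The commands below are only a sketch: the script name, client/thread counts, duration and database name are placeholders, not the values used in Vlad's tests.

```sh
# Non-prepared ("simple") protocol: the full query text is sent, parsed and
# planned on every execution -- the case that showed ~40% sys CPU.
pgbench -M simple -f big_join_query.sql -c 32 -j 4 -T 300 testdb

# Prepared protocol: the query is parsed/planned once per client connection,
# then only executed -- the case that behaved fine (high user, low sys CPU).
pgbench -M prepared -f big_join_query.sql -c 32 -j 4 -T 300 testdb
```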
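Jeff's request to log the PID (%p) and millisecond timestamps (%m) translates to a log_line_prefix change along these lines; the exact prefix string is just one reasonable choice, not something prescribed in the thread.

```sh
# postgresql.conf -- log_line_prefix is reloadable, no restart needed
# (pg_ctl reload, or: SELECT pg_reload_conf();)
log_line_prefix = '%m [%p] '   # %m = timestamp with milliseconds, %p = process ID
```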
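Merlin's "pgbouncer/transaction mode" suggestion amounts to a pgbouncer.ini along the lines below, with the application connecting to port 6432 instead of 5432. Addresses, the database name, pool size and auth file path are placeholders to adapt to the actual setup.

```ini
[databases]
; placeholder connection string -- point it at the real database
testdb = host=127.0.0.1 port=5432 dbname=testdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: the server connection is returned to the pool at each
; transaction end, i.e. at a point where the session holds no locks
pool_mode = transaction
; keep the pool small (roughly the core count) so backends do not pile up
default_pool_size = 20
max_client_conn = 500
```

One caveat, relevant because prepared mode performed better in the pgbench tests: with pool_mode = transaction, session state such as server-side prepared statements does not survive across transactions, so an application relying on them would need to prepare per transaction or use session pooling instead.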