Re: Two Necessary Kernel Tweaks for Linux Systems - Mailing list pgsql-performance
From | AJ Weber
Subject | Re: Two Necessary Kernel Tweaks for Linux Systems
Date |
Msg-id | 50EC7C24.2020203@comcast.net
In response to | Re: Two Necessary Kernel Tweaks for Linux Systems (Shaun Thomas <sthomas@optionshouse.com>)
Responses | Re: Two Necessary Kernel Tweaks for Linux Systems
List | pgsql-performance
When I checked these, both settings exist on my CentOS 6.x host
(2.6.32-279.5.1.el6.x86_64). However, autogroup_enabled was already
set to 0. (migration_cost was set to the 0.5ms default noted in the
OP.) So I don't know if this is strictly limited to kernel 3.0.

Is there an "easy" way to tell what scheduler my OS is using?

-AJ

On 1/8/2013 2:32 PM, Shaun Thomas wrote:
> On 01/08/2013 01:04 PM, Scott Marlowe wrote:
>
>> Assembly language on the brain. Of course I meant NOOP.
>
> Ok, in that case, these are completely separate things. For IO
> scheduling, there's the Completely Fair Queue (CFQ), NOOP, Deadline,
> and so on.
>
> For process scheduling, at least recently, there's the Completely Fair
> Scheduler or nothing. So far as I can tell, there is no alternative
> process scheduler. Just as I can't find an alternative memory manager
> that I can tell to stop flushing my freaking active file cache due to
> phantom memory pressure. ;)
>
> The tweaks I was discussing in this thread effectively do two things:
>
> 1. Stop process grouping by TTY.
>
> On servers, this really is a net performance loss, especially on
> heavily forked apps like PG. System % is about 5% lower since the
> scheduler is doing less work, but at the cost of less spreading across
> available CPUs. Our systems see a 30% performance hit with grouping
> enabled; others may see more or less.
>
> 2. Less aggressive process scheduling.
>
> The O(log N) scheduler heuristics collapse at high process counts for
> some reason, causing the scheduler to spend more and more time
> planning CPU assignments until it spirals completely out of control.
> I've seen this behavior on kernels from 3.0 straight through 3.5, so
> it looks like an inherent weakness of CFS. By increasing migration
> cost, we make the scheduler do less work less often, so that weird
> 70+% system CPU spike vanishes.
>
> My guess is the increased migration cost basically offsets the point
> at which the scheduler would freak out. I've tested up to 2000
> connections, and it responds fine, whereas before we were seeing
> flaky results as early as 700 connections.
>
> My guess as to why? I think it's due to VSZ as perceived by the
> scheduler. To swap processes, it also has to preload the L2 and L3
> caches for the assigned process. As the number of PG connections
> increases, all with their own VSZ/RSS allocations, the scheduler has
> more thinking to do. At the point when the sum of VSZ/RSS eclipses
> the amount of available RAM, the scheduler loses nearly all
> decision-making ability and craps its pants.
>
> This would also explain why I'm seeing something similar with memory.
> At high connection counts, even though %used is fine and we have over
> 40GB free for caching, VSZ/RSS are both way bigger than the available
> cache, so memory pressure causes kswapd to continuously purge the
> active cache pool into inactive, and inactive into free, all while
> the device attempts to refill the active pool. It's an IO feedback
> loop, and it starts around the same number of connections that used
> to make the process scheduler die. Too much of a coincidence, in my
> opinion.
>
> But unlike the process scheduler, there are no good knobs to turn
> that will fix the memory manager's behavior. At least, not in 3.0,
> 3.2, or 3.4 kernels.
>
> But I freely admit I'm just speculating based on observed behavior. I
> know neither jack nor squat about internal kernel mechanics. Anyone
> who actually *isn't* talking out of his ass is free to interject. :)
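P.S. For anyone else wondering about the "which scheduler" question
above: the active I/O scheduler and the CFS knobs discussed in this
thread can be read straight out of /sys and /proc. Below is a minimal
sketch of my own (not from the original posts), assuming the usual
RHEL/CentOS-style paths; newer kernels expose the migration-cost knob
as sched_migration_cost_ns rather than sched_migration_cost, so the
script tries both.

#!/usr/bin/env python
# Sketch: show the per-device I/O scheduler and the CFS tunables
# discussed in this thread. Requires no root; missing files are skipped.
import glob
import os

def read_first_line(path):
    """Return the first line of a /proc or /sys file, or None if absent."""
    try:
        with open(path) as f:
            return f.readline().strip()
    except IOError:
        return None

# I/O scheduler: one file per block device; the active elevator is the
# entry shown in brackets, e.g. "noop anticipatory deadline [cfq]".
for queue in glob.glob('/sys/block/*/queue/scheduler'):
    dev = queue.split(os.sep)[3]
    print('%-8s %s' % (dev, read_first_line(queue)))

# Process-scheduler (CFS) knobs mentioned above.
for knob in ('sched_autogroup_enabled',   # 0 = TTY autogrouping disabled
             'sched_migration_cost',      # ns; 500000 is the 0.5 ms default
             'sched_migration_cost_ns'):  # name used by newer kernels
    value = read_first_line('/proc/sys/kernel/' + knob)
    if value is not None:
        print('kernel.%s = %s' % (knob, value))

The same kernel.sched_* values also show up in "sysctl -a" output, which
is probably the quickest manual check.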