Re: CPU spikes and transactions - Mailing list pgsql-performance

From Merlin Moncure
Subject Re: CPU spikes and transactions
Date
Msg-id CAHyXU0yXYpCXN4=81ZDRQu-oGzrcq2qNAXDpyz4oiQPPAGk4ew@mail.gmail.com
In response to Re: CPU spikes and transactions  (Dave Owens <dave@teamunify.com>)
List pgsql-performance
On Tue, May 13, 2014 at 6:04 PM, Dave Owens <dave@teamunify.com> wrote:
Hi,

Apologies for resurrecting this old thread, but it seems like this is better than starting a new conversation.

We are now running 9.1.13 and have doubled the CPU and memory.  So 2x 16-core Opteron 6276 (32 cores total), and 64GB memory.  shared_buffers set to 20GB, effective_cache_size set to 40GB.

We were able to record perf data during the latest incident of high CPU utilization. perf report is below:

Samples: 31M of event 'cycles', Event count (approx.): 16289978380877 
 44.74%       postmaster  [kernel.kallsyms]             [k] _spin_lock_irqsave                                     
 15.03%       postmaster  postgres                      [.] 0x00000000002ea937                                     
  3.14%       postmaster  postgres                      [.] s_lock                                                 
  2.30%       postmaster  [kernel.kallsyms]             [k] compaction_alloc                                       
  2.21%       postmaster  postgres                      [.] HeapTupleSatisfiesMVCC                                 
  1.75%       postmaster  postgres                      [.] hash_search_with_hash_value                            
  1.25%       postmaster  postgres                      [.] ExecScanHashBucket                                     
  1.20%       postmaster  postgres                      [.] SHMQueueNext                                           
  1.05%       postmaster  postgres                      [.] slot_getattr                                           
  1.04%             init  [kernel.kallsyms]             [k] native_safe_halt                                       
  0.73%       postmaster  postgres                      [.] LWLockAcquire                                          
  0.59%       postmaster  [kernel.kallsyms]             [k] page_fault                                             
  0.52%       postmaster  postgres                      [.] ExecQual                                               
  0.40%       postmaster  postgres                      [.] ExecStoreTuple                                         
  0.38%       postmaster  postgres                      [.] ExecScan                                               
  0.37%       postmaster  postgres                      [.] check_stack_depth                                      
  0.35%       postmaster  postgres                      [.] SearchCatCache                                         
  0.35%       postmaster  postgres                      [.] CheckForSerializableConflictOut                        
  0.34%       postmaster  postgres                      [.] LWLockRelease                                          
  0.30%       postmaster  postgres                      [.] _bt_checkkeys                                          
  0.28%       postmaster  libc-2.12.so                  [.] memcpy                                                 
  0.27%       postmaster  [kernel.kallsyms]             [k] get_pageblock_flags_group                              
  0.27%       postmaster  postgres                      [.] int4eq                                                 
  0.27%       postmaster  postgres                      [.] heap_page_prune_opt                                    
  0.27%       postmaster  postgres                      [.] pgstat_init_function_usage                             
  0.26%       postmaster  [kernel.kallsyms]             [k] _spin_lock                                             
  0.25%       postmaster  postgres                      [.] _bt_compare                                            
  0.24%       postmaster  postgres                      [.] pgstat_end_function_usage

...please let me know if we need to produce the report differently to be useful.

We will begin reducing shared_buffers incrementally over the coming days.


This is definitely pointing at THP (transparent huge pages) compaction, which is increasingly emerging as a possible culprit for CPU spikes that appear suddenly and resolve just as suddenly.  The evidence I see is:

*) Lots of time in kernel
*) "compaction_alloc"
*) otherwise normal postgres profile (not lots of time in s_lock, LWLock, or other weird things)
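
To make the three checks above repeatable, here is a minimal Python sketch that totals kernel versus user time from `perf report` text and looks for `compaction_alloc`. The `summarize` function, the `ROW` pattern, and the `SAMPLE` excerpt are illustrative names, not part of perf or postgres. The sketch assumes the column layout shown in the report above and, as a simplification, counts every symbol not tagged `[k]` as user space.

```python
import re

# Matches perf report rows such as:
#   44.74%  postmaster  [kernel.kallsyms]  [k] _spin_lock_irqsave
# Assumption: the layout is "percent, command, object, [tag], symbol".
ROW = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)")


def summarize(report_text):
    """Return (totals, symbols) for the rows found in report_text.

    totals maps "kernel"/"user" to summed percentages; symbols maps each
    symbol name to its percentage. Non-matching lines are skipped.
    """
    totals = {"kernel": 0.0, "user": 0.0}
    symbols = {}
    for line in report_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header line or anything not shaped like a row
        pct = float(m.group(1))
        # Simplification: anything not tagged [k] is treated as user space.
        space = "kernel" if m.group(4) == "k" else "user"
        totals[space] += pct
        symbols[m.group(5)] = pct
    return totals, symbols


# First four rows of the report quoted above.
SAMPLE = """\
 44.74%       postmaster  [kernel.kallsyms]             [k] _spin_lock_irqsave
 15.03%       postmaster  postgres                      [.] 0x00000000002ea937
  3.14%       postmaster  postgres                      [.] s_lock
  2.30%       postmaster  [kernel.kallsyms]             [k] compaction_alloc
"""

if __name__ == "__main__":
    totals, symbols = summarize(SAMPLE)
    print("kernel %.2f%%, user %.2f%%" % (totals["kernel"], totals["user"]))
    print("compaction_alloc present:", "compaction_alloc" in symbols)
    print("s_lock share: %.2f%%" % symbols.get("s_lock", 0.0))
```

A large kernel share combined with `compaction_alloc` and a small `s_lock` share matches the pattern described above. It is a hint rather than proof.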


Please check the THP setting (see here: http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadaoop-workloads/); the same problem has been reported with various other workloads.   If it is enabled, consider disabling it...this will revert to pre-Linux 6 behavior.  If you are going to attack this from the point of view of lowering shared_buffers, do not bother with incremental steps...head straight for 2GB or it's unlikely the problem will be fixed.   THP compaction is not a postgres problem...MySQL is affected, as are other server platforms.  If THP is indeed causing the problem, it couldn't hurt to get on the horn with the Linux guys.  Last I heard they claimed this kind of thing was fixed, but I don't know where things stand now.
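
One way to check the THP setting is to read the sysfs control files. The kernel marks the active choice with square brackets, for example `[always] madvise never`. The Python sketch below assumes the two usual locations (the mainline path and the RHEL 6 variant); neither may exist on a given kernel. The helper names `active_choice` and `read_thp_settings` are illustrative.

```python
import os
import re

# Assumed sysfs locations: mainline kernels use the first directory,
# RHEL 6 kernels the second. Either may be absent on a given system.
THP_DIRS = (
    "/sys/kernel/mm/transparent_hugepage",
    "/sys/kernel/mm/redhat_transparent_hugepage",
)


def active_choice(text):
    """Return the bracketed value from e.g. '[always] madvise never', else None."""
    m = re.search(r"\[([^\]]+)\]", text)
    return m.group(1) if m else None


def read_thp_settings(dirs=THP_DIRS):
    """Map each readable 'enabled'/'defrag' file to its active value."""
    result = {}
    for d in dirs:
        for name in ("enabled", "defrag"):
            path = os.path.join(d, name)
            try:
                with open(path) as f:
                    result[path] = active_choice(f.read())
            except OSError:
                continue  # file missing or unreadable on this kernel
    return result


if __name__ == "__main__":
    settings = read_thp_settings()
    if not settings:
        print("no THP sysfs files found")
    for path, value in sorted(settings.items()):
        print(path, "->", value)
```

If `enabled` reports `always`, writing `never` to that file as root turns THP off until the next reboot. Making the change persistent depends on the distribution.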

merlin
