Re: BUG #9635: Wal sender process is using 100% CPU - Mailing list pgsql-bugs
From: Heikki Linnakangas
Subject: Re: BUG #9635: Wal sender process is using 100% CPU
Date:
Msg-id: 53691CB3.3020908@vmware.com
In response to: Re: BUG #9635: Wal sender process is using 100% CPU (Jamie Koceniak <jkoceniak@mediamath.com>)
Responses: Re: BUG #9635: Wal sender process is using 100% CPU
List: pgsql-bugs
On 05/06/2014 07:44 PM, Jamie Koceniak wrote:
> Hi Heikki,
>
> Attached is a trace route for one of the pids that was running at 100% cpu.
> Note: when the wal_senders hit 100% cpu, our slave servers tend to lag the master for up to around 3-4 minutes until the wal_sender cpu utilization drops. This is very problematic for us since we are trying to use the slaves for reads in our application.
>
> Here are the perf top results:
> 39.49% libz.so.1.2.7 [.] 0x000000000000301e
> 32.97% postgres [.] 0x00000000000cf150
> 6.50% postgres [.] nocachegetattr
> 2.63% postgres [.] btint4cmp
> 1.64% postgres [.] _start
> 1.40% [unknown] [.] 0x00007f9dfca6a34b
> 1.06% libc-2.13.so [.] 0x000000000012934b
> 0.94% postgres [.] slot_getattr
> 0.85% libcrypto.so.1.0.0 [.] 0x000000000009f48f
> 0.69% postgres [.] ExecProject
> 0.67% libz.so.1.2.7 [.] adler32
> 0.66% postgres [.] GetMemoryChunkSpace
> 0.58% [kernel] [k] copy_user_generic_string
> 0.45% postgres [.] heap_fill_tuple
> 0.38% postgres [.] execTuplesMatch
> 0.27% postgres [.] FunctionCall2Coll
> 0.27% postgres [.] ExecProcNode
> 0.25% postgres [.] heap_tuple_untoast_attr
> 0.22% postgres [.] MemoryContextAlloc
> 0.21% postgres [.] heap_form_minimal_tuple
> 0.21% postgres [.] pfree
> 0.19% postgres [.] heap_compute_data_size

Hmm, these results seem to include all processes, not just the WAL sender (a sketch for narrowing perf to a single PID follows below).

> Here is some data using perf record and perf report:
>
> + 33.93% 193019 postgres postgres [.] 0x0000000000369e27
> + 30.06% 173825 postgres libz.so.1.2.7 [.] 0x00000000000036c0
> + 10.21% 60618 postgres [unknown] [.] 0x00007f9dfc2fd529
> + 4.81% 27275 postgres postgres [.] nocachegetattr
> + 2.55% 14443 postgres postgres [.] btint4cmp
> + 1.22% 6916 postgres postgres [.] ExecProject
> + 1.20% 6837 postgres postgres [.] _start
> + 1.10% 6495 postgres libc-2.13.so [.] 0x000000000012b8e9
> + 0.84% 4760 postgres postgres [.] heap_fill_tuple
> + 0.74% 4265 postgres [kernel.kallsyms] [k] copy_user_generic_string
> + 0.71% 4001 postgres postgres [.] GetMemoryChunkSpace
> + 0.66% 3881 postgres libcrypto.so.1.0.0 [.] 0x0000000000081ec7
> + 0.63% 3673 postgres libz.so.1.2.7 [.] adler32
> + 0.43% 2448 postgres postgres [.] slot_getattr
> + 0.40% 2277 postgres postgres [.] ExecProcNode
> + 0.40% 2250 postgres postgres [.] heap_form_minimal_tuple
> + 0.37% 2078 postgres postgres [.] heap_compute_data_size
> + 0.31% 1739 postgres postgres [.] ExecScan
> + 0.27% 1779 postgres [kernel.kallsyms] [k] page_fault
> + 0.24% 1374 postgres postgres [.] LogicalTapeWrite
> + 0.24% 4206 swapper [kernel.kallsyms] [k] intel_idle
> + 0.21% 1206 postgres postgres [.] slot_getsomeattrs

Same here.

> Also, here are our wal settings on the master.
> Do these settings appear ok?
>
> wal_block_size                | 8192        |
> wal_buffers                   | 16MB        |
> wal_keep_segments             | 5000        |
> wal_level                     | hot_standby |
> wal_receiver_status_interval  | 10s         |
> wal_segment_size              | 16MB        |
> wal_sender_delay              | 1s          |
> wal_sync_method               | fdatasync   |
> wal_writer_delay              | 200ms       |

Looks reasonable to me.
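Since the profiles above cover every process on the box, here is a minimal sketch for pointing perf at just the WAL sender. The PID is a placeholder, and the exact process title shown by ps varies by PostgreSQL version:

    # Find the WAL sender PID (ps typically shows it as "postgres: wal sender process ...")
    ps aux | grep "[w]al sender"

    # Profile only that PID (12345 is a placeholder), sampling for about 30 seconds
    perf top -p 12345
    perf record -g -p 12345 -- sleep 30
    perf report

That should make it much clearer whether the libz and libcrypto samples really come from the WAL sender or from ordinary backends.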
So, the strace output repeats this, with small variations in the number of bytes written in each sendto() call:

recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
read(7, ..., 131072) = 131072
sendto(10, ..., 3690, 0, NULL, 0) = 3690
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3653, 0, NULL, 0) = 3653
sendto(10, ..., 3637, 0, NULL, 0) = 3637
sendto(10, ..., 3621, 0, NULL, 0) = 3621
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 53, 0, NULL, 0) = 53
getppid() = 5474

The read() reads 128 kB of WAL from the file in pg_xlog, which is then sent to the client. Now, it's pretty surprising that the number of bytes written with sendto() doesn't match the number of bytes read from the file. Unless, that is, you're using SSL with compression. That would also explain the high amount of CPU time spent in libz in the perf output.

Try disabling SSL compression with the sslcompression=0 option in primary_conninfo, or disable SSL altogether if the network is trustworthy enough for that.

- Heikki
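For completeness, a sketch of what the standby's recovery.conf could look like with compression turned off. The host, port, user, and password values are placeholders, and the sslcompression connection parameter requires a libpq new enough to support it:

    # recovery.conf on the standby (connection values are placeholders)
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com port=5432 user=replicator password=secret sslmode=require sslcompression=0'

Alternatively, setting sslmode to 'disable' (if the network is trusted, as noted above) avoids both the compression and the encryption overhead; the standby needs a restart for changes to recovery.conf to take effect.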