pgbench vs. wait events - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | pgbench vs. wait events |
| Date | |
| Msg-id | CA+Tgmoav9Q5v5ZGT3+wP_1tQjT6TGYXrwrDcTRrWimC+ZY7RRA@mail.gmail.com |
| Responses | Re: pgbench vs. wait events (7 replies) |
| List | pgsql-hackers |
Hi,
I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench. I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = on.
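For concreteness, each run boiled down to something along these lines; the settings above went into postgresql.conf, while the -j thread count and the exact pgbench invocations below are my reconstruction rather than anything quoted from the original runs:

# initialize once per scale factor (add --unlogged-tables for the unlogged runs)
pgbench -i -s 3000
# one 30-minute run with 32 clients
pgbench -c 32 -j 32 -T 1800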
During each run, I ran this psql script in another window and captured the output:
\t
select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid()
\watch 0.5
Then, I used a little shell-scripting to count up the number of times each wait event occurred in the output.
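The counting itself is nothing fancy; a minimal sketch of the kind of pipeline I mean, assuming the \watch output was captured to a hypothetical file named wait_events.txt:

# drop blank lines, squeeze whitespace, then count the distinct (type, event) pairs
grep -v '^[[:space:]]*$' wait_events.txt | sed 's/  */ /g' | sort | uniq -c | sort -n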
First, I tried scale factor 3000 with 32 clients and got these results:
     1 LWLockTranche | buffer_mapping
     9 LWLockNamed | CLogControlLock
    14 LWLockNamed | ProcArrayLock
    16 Lock | tuple
    25 LWLockNamed | CheckpointerCommLock
    49 LWLockNamed | WALBufMappingLock
   122 LWLockTranche | clog
   182 Lock | transactionid
   287 LWLockNamed | XidGenLock
  1300 Client | ClientRead
  1375 LWLockTranche | buffer_content
  3990 Lock | extend
 21014 LWLockNamed | WALWriteLock
 28497 |
 58279 LWLockTranche | wal_insert
tps = 1150.803133 (including connections establishing)
What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload. Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test. The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:
     1 BufferPin | BufferPin
     1 LWLockTranche | lock_manager
     2 LWLockTranche | buffer_mapping
     8 LWLockNamed | ProcArrayLock
     9 LWLockNamed | CheckpointerCommLock
     9 LWLockNamed | CLogControlLock
    11 LWLockTranche | buffer_content
    37 LWLockTranche | clog
   153 Lock | tuple
   388 LWLockNamed | XidGenLock
   827 Lock | transactionid
  1267 Client | ClientRead
 20631 Lock | extend
 91767 |
tps = 1223.239416 (including connections establishing)
If you don't look at the TPS number, these results look like a vast
improvement. The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture. However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O. This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.
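As a crude cross-check from the shell (a sketch, not something captured during these runs), one can count the postgres processes sitting in uninterruptible sleep:

# state D = uninterruptible sleep, which here almost always means waiting on disk I/O;
# backend processes normally show up with the command name "postgres"
ps -eo state,comm | awk '$1 == "D" && $2 == "postgres" {n++} END {print n+0}'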
Next, I tried lowering the scale factor to something that fits in shared buffers.
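A scale-factor-300 pgbench database is only roughly 4-5GB, so it fits comfortably within shared_buffers = 8GB; a quick sanity check along these lines, run in the pgbench database, makes that easy to verify:

-- compare the database size against shared_buffers
select pg_size_pretty(pg_database_size(current_database()));
show shared_buffers;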
Here are the results at scale factor 300:
    14 Lock | tuple
    22 LWLockTranche | lock_manager
    39 LWLockNamed | WALBufMappingLock
   331 LWLockNamed | CLogControlLock
   461 LWLockNamed | ProcArrayLock
   536 Lock | transactionid
   552 Lock | extend
   716 LWLockTranche | buffer_content
   763 LWLockNamed | XidGenLock
  2113 LWLockNamed | WALWriteLock
  6190 LWLockTranche | wal_insert
 25002 Client | ClientRead
 78466 |
tps = 27651.562835 (including connections establishing)
Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work. ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare. Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
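(Concretely: 25002 ClientRead samples out of roughly 115,000 total samples in this run works out to about 22%.)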
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples. It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.
However, I was curious about what's going on with CLogControlLock,
pursuant to the other thread on that topic, so I reran this with
unlogged tables. At scale factor 300, 32 clients, unlogged tables, I
get these results:
     6 LWLockTranche | wal_insert
    12 Lock | tuple
    26 LWLockTranche | lock_manager
    39 LWLockTranche | buffer_content
   353 LWLockNamed | CLogControlLock
   377 Lock | extend
   491 Lock | transactionid
   500 LWLockNamed | ProcArrayLock
   696 LWLockNamed | XidGenLock
 27685 Client | ClientRead
 84930 |
WAL contention is eliminated, and CLog contention doesn't increase.
Next, I tried ramping up the client count to see what effect that had.
These next three results are all at scale factor 300 with unlogged
tables.
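Schematically, these are just the same kind of run with the client count swept upward, along the lines of (again my reconstruction, not a quote from the actual runs):

for c in 64 96 192; do pgbench -c $c -j $c -T 1800; done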
64 clients:
     1 BufferPin | BufferPin
     2 LWLockTranche | buffer_mapping
    51 LWLockTranche | wal_insert
    52 Lock | relation
   132 LWLockTranche | lock_manager
   154 LWLockTranche | buffer_content
   209 Lock | tuple
   940 Lock | extend
  1412 LWLockNamed | XidGenLock
  2337 LWLockNamed | CLogControlLock
  2813 LWLockNamed | ProcArrayLock
  3859 Lock | transactionid
 83760 Client | ClientRead
133840 |
tps = 36672.484524 (including connections establishing)
96 clients:
     1 LWLockTranche | proc
    53 LWLockTranche | wal_insert
   305 LWLockTranche | buffer_content
   391 LWLockTranche | lock_manager
   797 Lock | tuple
  1074 Lock | extend
  2492 LWLockNamed | XidGenLock
  4185 LWLockNamed | ProcArrayLock
  6003 LWLockNamed | CLogControlLock
 11348 Lock | transactionid
142881 Client | ClientRead
174011 |
tps = 37214.257661 (including connections establishing)
192 clients:
     1 LWLockTranche | buffer_mapping
     2 BufferPin | BufferPin
     4 LWLockTranche | clog
     8 LWLockTranche | proc
    87 LWLockTranche | wal_insert
   708 LWLockTranche | buffer_content
  1892 LWLockTranche | lock_manager
  2225 Lock | extend
  4796 LWLockNamed | ProcArrayLock
  5105 LWLockNamed | XidGenLock
  7982 Lock | tuple
 11882 LWLockNamed | CLogControlLock
 64091 Lock | transactionid
275481 |
308627 Client | ClientRead
tps = 35018.906863 (including connections establishing)
There are a couple of interesting things here. First, CLogControlLock
contention becomes more significant as the number of clients goes
up. At 32 clients, it is the 6th-most frequent wait event; at 192
clients, it's moved up to 4th, and it's clearly growing at a faster
rate than some of the others. As we go from 32 to 64 to 96 to 192
clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
it goes from being only 2/3 of ProcArrayLock to being more than twice
ProcArrayLock. However, in absolute terms, it's still not very
significant. Even at 192 clients, where it's the fourth-most-common
wait event, it appears in only 1.7% of samples, which means backends
are only waiting for it about 1.7% of the time. It appears that, on
this system, this workload is NEVER going to get meaningfully faster
by improving CLogControlLock.
Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources. Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time. Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.
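(To make the birthday-paradox point concrete: the default TPC-B-like pgbench transaction updates, among other things, one of the scale-factor-many pgbench_branches rows, so at scale factor 300 there are only 300 branch rows to choose from. With 192 clients in flight, the probability that no two of them pick the same branch is the product of (1 - i/300) for i = 1..191, which is vanishingly small, so row conflicts there are essentially guaranteed.)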
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company