Re: Analysis of ganged WAL writes - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: Analysis of ganged WAL writes
Date:
Msg-id: 14533.1033945650@sss.pgh.pa.us
In response to: Analysis of ganged WAL writes (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Analysis of ganged WAL writes
           Re: Analysis of ganged WAL writes
List: pgsql-hackers
I said:
> There is a simple error
> in the current code that is easily corrected: in XLogFlush(), the
> wait to acquire WALWriteLock should occur before, not after, we try
> to acquire WALInsertLock and advance our local copy of the write
> request pointer.  (To be exact, xlog.c lines 1255-1269 in CVS tip
> ought to be moved down to before line 1275, inside the "if" that
> tests whether we are going to call XLogWrite.)

That patch was not quite right, as it didn't actually flush the
later-arriving data.  The correct patch is

*** src/backend/access/transam/xlog.c.orig	Thu Sep 26 18:58:33 2002
--- src/backend/access/transam/xlog.c	Sun Oct 6 18:45:57 2002
***************
*** 1252,1279 ****
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
- 		/* if something was added to log cache then try to flush this too */
- 		if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
- 		{
- 			XLogCtlInsert *Insert = &XLogCtl->Insert;
- 			uint32		freespace = INSERT_FREESPACE(Insert);
-
- 			if (freespace < SizeOfXLogRecord)	/* buffer is full */
- 				WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 			else
- 			{
- 				WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 				WriteRqstPtr.xrecoff -= freespace;
- 			}
- 			LWLockRelease(WALInsertLock);
- 		}
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = WriteRqstPtr;
! 			WriteRqst.Flush = record;
  			XLogWrite(WriteRqst);
  		}
  		LWLockRelease(WALWriteLock);
--- 1252,1284 ----
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
!
! 				if (freespace < SizeOfXLogRecord)	/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
  			XLogWrite(WriteRqst);
  		}
  		LWLockRelease(WALWriteLock);

To test this, I made a modified version of pgbench in which each
transaction consists of a simple
	insert into table_NNN values(0);
where each client thread has a separate insertion target table.
This is about the simplest transaction I could think of that would
generate a WAL record each time.

Running this modified pgbench with postmaster parameters
	postmaster -i -N 120 -B 1000 --wal_buffers=250
and all other configuration settings at default, CVS tip code gives me
a pretty consistent 115-118 transactions per second for anywhere from
1 to 100 pgbench client threads.  This is exactly what I expected,
since the database (including WAL file) is on a 7200 RPM SCSI drive.
The theoretical maximum rate of sync'd writes to the WAL file is
therefore 120 per second (one per disk revolution), but we lose a
little because once in awhile the disk has to seek to a data file.
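To spell out the arithmetic behind that ceiling (a minimal
back-of-the-envelope sketch, purely illustrative and not part of the
patch):

/*
 * At 7200 RPM the platter passes under the head 7200/60 = 120 times a
 * second, so a single stream of synchronous WAL flushes can commit at
 * most ~120 transactions per second.  If the backend holding
 * WALWriteLock can gang k later-arriving commit records into the same
 * write, the aggregate ceiling scales toward 120 * k.
 */
#include <stdio.h>

int
main(void)
{
	const double rpm = 7200.0;
	const double flushes_per_sec = rpm / 60.0;	/* one sync'd write per revolution */
	int			k;

	for (k = 1; k <= 5; k++)
		printf("%d commit(s) ganged per flush -> ceiling ~%.0f tps\n",
			   k, flushes_per_sec * k);
	return 0;
}

In other words, the only way past ~120 tps on this drive is to get more
than one backend's commit record into each physical write, which is
exactly what the patched XLogFlush() tries to do.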
Inserting the above patch, and keeping all else the same, I get:

$ mybench -c 1 -t 10000 bench1
number of clients: 1
number of transactions per client: 10000
number of transactions actually processed: 10000/10000
tps = 116.694205 (including connections establishing)
tps = 116.722648 (excluding connections establishing)

$ mybench -c 5 -t 2000 -S -n bench1
number of clients: 5
number of transactions per client: 2000
number of transactions actually processed: 10000/10000
tps = 282.808341 (including connections establishing)
tps = 283.656898 (excluding connections establishing)

$ mybench -c 10 -t 1000 bench1
number of clients: 10
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
tps = 443.131083 (including connections establishing)
tps = 447.406534 (excluding connections establishing)

$ mybench -c 50 -t 200 bench1
number of clients: 50
number of transactions per client: 200
number of transactions actually processed: 10000/10000
tps = 416.154173 (including connections establishing)
tps = 436.748642 (excluding connections establishing)

$ mybench -c 100 -t 100 bench1
number of clients: 100
number of transactions per client: 100
number of transactions actually processed: 10000/10000
tps = 336.449110 (including connections establishing)
tps = 405.174237 (excluding connections establishing)

CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients
to <10% idle at 10 or more.

So this does seem to be a nice win, and unless I hear objections
I will apply it ...

			regards, tom lane
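As a rough cross-check (again only an illustrative sketch, replaying
the per-run tps figures above, rounded, against the 120-writes/sec
ceiling), the implied number of commit records ganged into each WAL
flush comes out to roughly 3-4 once there are 10 or more clients:

#include <stdio.h>

int
main(void)
{
	const double flushes_per_sec = 120.0;	/* ceiling of the 7200 RPM drive */
	const int	clients[] = {1, 5, 10, 50, 100};
	const double tps[] = {116.7, 283.7, 447.4, 436.7, 405.2};	/* rounded from the runs above */
	int			i;

	for (i = 0; i < 5; i++)
		printf("%3d clients: %6.1f tps -> ~%.1f commits per flush\n",
			   clients[i], tps[i], tps[i] / flushes_per_sec);
	return 0;
}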