Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock) - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Msg-id: 4F54EEBA.5070504@enterprisedb.com
In response to: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock) (Fujii Masao <masao.fujii@gmail.com>)
Responses: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
List: pgsql-hackers
On 21.02.2012 13:19, Fujii Masao wrote:
> In some places, the spinlock "insertpos_lck" is taken while another
> spinlock "info_lck" is being held. Is this OK? What if unfortunately
> inner spinlock takes long to be taken?

Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt it's a big issue in practice. We had the same pattern before the patch, just with WALInsertLock instead of insertpos_lck. Holding a spinlock longer is much worse than holding a lwlock longer, but nevertheless I don't think that's a problem.

If it does become a problem, we could provide a second copy of RedoRecPtr that could be read while holding info_lck, and would be allowed to lag behind slightly, while requiring insertpos_lck to read/update the main shared copy of RedoRecPtr. The only place that reads RedoRecPtr while holding info_lck is GetRedoRecPtr(), which would be happy with a value that lags behind by a few milliseconds. We could still update that copy right after releasing insertpos_lck, so the delay between the two would be tiny.

> + * An xlog-switch record consumes all the remaining space on the
> + * WAL segment. We have already reserved it for us, but we still need
> + * to make sure it's been allocated and zeroed in the WAL buffers so
> + * that when the caller (or someone else) does XLogWrite(), it can
> + * really write out all the zeros.
>
> Why do we need to write out all the remaining space with zeros? In
> current master, we don't do that. A recovery code ignores the data
> following XLOG_SWITCH record, so I don't think that's required.

It's to keep the logic simpler. Before the patch, an xlog-switch simply initialized the next page to insert to in the WAL buffers as the first page of the next segment. With this patch, we rely on a simple linear mapping from an XLogRecPtr to the WAL buffer that should contain that page (see XLogRecPtrToBufIdx()).
Such a mapping is not possible if you sometimes skip over pages in the WAL buffers, so we allocate buffers for those empty pages, too. Note that this means an xlog-switch can blow through all your WAL buffers. We could probably optimize that so that you don't need to actually write() and fsync() all the zeros, perhaps by setting a flag on the WAL buffer to indicate that it only contains padding for an xlog-switch. However, I don't see any easy way to avoid blowing the cache. I'm thinking that xlog switching happens so seldom, and typically on a fairly idle system, that we don't need to optimize it much. I guess we should measure the impact, though..

> +	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
> +	 * if the xlog switch had no work to do, ie. if we were already at the
> +	 * beginning of a new XLOG segment. You can check if RecPtr points to
> +	 * beginning of a segment if you want to keep the distinction.
> +	 */
> +	TRACE_POSTGRESQL_XLOG_SWITCH();
>
> If so, RecPtr (or the flag indicating whether the xlog switch has no
> work to do) should be given to TRACE_POSTGRESQL_XLOG_SWITCH() as an argument?

I think I'll just move that call, so that the current behavior is retained.

> The followings are trivial comments:

Thanks, fixed these!

On 22.02.2012 03:34, Fujii Masao wrote:
> When I ran the long-running performance test, I encountered the following
> panic error.
>
> PANIC: could not find WAL buffer for 0/FF000000
>
> 0/FF000000 is the xlog file boundary, so the patch seems to handle
> the xlog file boundary incorrectly. In the patch, current insertion lsn
> is advanced by directly incrementing XLogRecPtr.xrecoff as follows.
> But to handle the xlog file boundary correctly, we should use
> XLByteAdvance() for that, instead?

Thanks, fixed this, too.

I made the locking a bit more strict in WaitXLogInsertionsToFinish(), so that it grabs insertpos_lck to read nextslot.
I previously thought that was not necessary, on the assumption that reading/writing an int32 is atomic, but I'm afraid there might be memory-ordering issues where the CurrPos of the most recent slot has not yet become visible to other backends, while the advancing of nextslot has. That particular issue would be very hard to hit in practice, so I don't know if it explains the recovery failures that Jeff saw. I got the test script running (thanks for that, Jeff!), but unfortunately have not seen any failures yet (aside from the issue with crossing the xlogid boundary), with either this or the older version of the patch.

Attached is a new version of the patch.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com