Re: Reduce ProcArrayLock contention - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: Reduce ProcArrayLock contention
Date |
Msg-id | CA+TgmobiSW7Ox7zVHHmT8__th+GL4WJM9guzjzk4bAm=8Bt3Hw@mail.gmail.com
In response to | Re: Reduce ProcArrayLock contention (Andres Freund <andres@anarazel.de>)
Responses | Re: Reduce ProcArrayLock contention
List | pgsql-hackers
On Wed, Aug 19, 2015 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
> There's a bunch of issues with those two blocks afaics:
>
> 1) The first block (in one backend) might read INVALID_PGPROCNO before
> ever locking the semaphore if a second backend quickly enough writes
> INVALID_PGPROCNO. That way the semaphore will be permanently out of
> "balance".

Yes, this is a clear bug.  I think the fix is to do one unconditional
PGSemaphoreLock(&proc->sem) just prior to the loop.

> 2) There's no memory barriers around dealing with nextClearXidElem in
> the first block. Afaics there needs to be a read barrier before
> returning, otherwise it's e.g. not guaranteed that the woken up
> backend sees its own xid set to InvalidTransactionId.

I can't believe it actually works like that.  Surely a semaphore
operation is a full barrier.  Otherwise this could fail:

P1: a = 0;
P1: PGSemaphoreLock(&P1);
P2: a = 1;
P2: PGSemaphoreUnlock(&P1);
P1: Assert(a == 1);

Do you really think that fails anywhere?  I'd find that shocking.  And
if it's true, then let's put the barrier in PGSemaphore(L|Unl)ock,
because everybody who uses semaphores for anything is going to have
the same problem.  This is exactly what semaphores are for.  I find it
extremely hard to believe that user-level code that uses OS-provided
synchronization primitives has to insert its own barriers also.  I
think we can assume that if I do something and wake you up, you'll see
everything I did before waking you.

(This would of course be needed if we didn't add the PGSemaphoreLock
contemplated by the previous point, because then there'd be no
OS-level synchronization primitive in some cases.  But since we have
to add that anyway, I think this is a non-issue.)

> 3) If a follower returns before the leader has actually finished
> waking that respective backend up we can get into trouble:
>
> Consider what happens if such a follower enqueues in another
> transaction. It is not, as far as I could find out, guaranteed on all
> types of cpus that a third backend can actually see nextClearXidElem
> as INVALID_PGPROCNO. That'd likely require SMT/HT cores and multiple
> sockets. If the write to nextClearXidElem is entered into the local
> store buffer (leader #1) a hyper-threaded process (leader #2) can
> possibly see it (store forwarding) while another backend doesn't
> yet.

Let's call the process that does two commits P, and its PGPROC
structure proc.  For the first commit, L1 is the leader; for the
second, L2 is the leader.  The intended sequence of operations on
nextClearXidElem is:

(1) P sets proc->nextClearXidElem
(2) L1 clears proc->nextClearXidElem
(3) P sets proc->nextClearXidElem
(4) L2 clears proc->nextClearXidElem

P uses an atomic compare-and-exchange operation on
procglobal->nextClearXidElem after step (1), and L1 can't attempt step
(2) until it uses an atomic compare-and-exchange operation on
procglobal->nextClearXidElem -- that's how it finds out the identity
of P.  Since a compare-and-exchange operation is a full barrier, those
two compare-and-exchange operations form a barrier pair, and (2) must
happen after (1).

Synchronization between (2) and (3) is based on the assumption that P
must do PGSemaphoreLock and L1 must do PGSemaphoreUnlock.  I assume,
as noted above, that those are barriers, that they form a pair, and
thus (3) must happen after (2).  (4) must happen after (3) for the
same reasons (2) must happen after (1).  So I don't see the problem.
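To make that concrete, here's roughly the follower side I have in
mind, with the unconditional PGSemaphoreLock added before the loop.
This is simplified pseudo-C written here for discussion, not the patch
verbatim; the handoff of latestXid to the leader and re-issuing any
unrelated semaphore wakeups we might absorb are both elided:

static void
ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
{
	PROC_HDR   *procglobal = ProcGlobal;
	uint32		nextidx;

	/* stash latestXid where the leader can find it -- elided here */

	/* Push ourselves onto the head of the list (steps 1/3 above). */
	nextidx = pg_atomic_read_u32(&procglobal->nextClearXidElem);
	for (;;)
	{
		pg_atomic_write_u32(&proc->nextClearXidElem, nextidx);

		/* Full barrier; pairs with the leader's atomic op on this word. */
		if (pg_atomic_compare_exchange_u32(&procglobal->nextClearXidElem,
										   &nextidx,
										   (uint32) proc->pgprocno))
			break;
	}

	if (nextidx != INVALID_PGPROCNO)
	{
		/*
		 * We're a follower.  The leader does exactly one
		 * PGSemaphoreUnlock(&proc->sem) for us, so consume one wakeup
		 * unconditionally before testing the flag; otherwise, if the
		 * leader has already cleared proc->nextClearXidElem, we'd skip
		 * the loop and leave the semaphore permanently out of balance
		 * (your point 1).
		 */
		PGSemaphoreLock(&proc->sem);
		while (pg_atomic_read_u32(&proc->nextClearXidElem) != INVALID_PGPROCNO)
			PGSemaphoreLock(&proc->sem);	/* re-sleep after unrelated wakeups */
		return;
	}

	/* Otherwise we're the leader; that side is sketched further down. */
}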
The "third backend", which I guess is L2, doesn't need to observe proc->nextClearXidElem until P has reset it; and procglobal->nextClearXidElem is only manipulated with atomics, so access to that location had better be totally ordered. > Going through the patch: > > +/* > + * ProcArrayGroupClearXid -- group XID clearing > + * > + * When we cannot immediately acquire ProcArrayLock in exclusive mode at > + * commit time, add ourselves to a list of processes that need their XIDs > + * cleared. The first process to add itself to the list will acquire > + * ProcArrayLock in exclusive mode and perform ProcArrayEndTransactionInternal > + * on behalf of all group members. This avoids a great deal of context > + * switching when many processes are trying to commit at once, since the lock > + * only needs to be handed from the last share-locker to one process waiting > + * for the exclusive lock, rather than to each one in turn. > + */ > +static void > +ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid) > +{ > > This comment, in my opinion, is rather misleading. If you have a > workload that primarily is slowed down due to transaction commits, this > patch doesn't actually change the number of context switches very > much. Previously all backends enqueued in the lwlock and got woken up > one-by-one. Safe backends 'jumping' the queue while a lock has been > released but the woken up backend doesn't yet run, there'll be exactly > as many context switches as today. > > The difference is that only one backend has to actually acquire the > lock. So what has changed is the number of times, and the total > duration, the lock is actually held in exclusive mode. I don't mind if you want to improve the phrasing of the comment. I think my thinking was warped by versions of the patch that used LWLockAcquireOrWait. In those versions, every time somebody committed, all of the waiting backends woke up and fought over who got to be leader. That was a lot less good than this approach. So the comment *should* be pointing out that we have avoided that danger, but it should not be making it sound like that's the advantage vs. not having the mechanism at all. > + /* Support for group XID clearing. */ > + volatile pg_atomic_uint32 nextClearXidElem; > > + /* First pgproc waiting for group XID clear */ > + volatile pg_atomic_uint32 nextClearXidElem; > > Superfluous volatiles. Hmm, I thought I needed that to avoid compiler warnings, but I guess not. And I see that lwlock.h doesn't mention volatile. So, yeah, we can remove that. > I don't think it's a good idea to use the variable name in PROC_HDR and > PGPROC, it's confusing. It made sense to me, but I don't care that much if you want to change it. > How hard did you try checking whether this causes regressions? This > increases the number of atomics in the commit path a fair bit. I doubt > it's really bad, but it seems like a good idea to benchmark something > like a single full-throttle writer and a large number of readers. Hmm, that's an interesting point. My analysis was that it really only increased the atomics in the cases where we otherwise would have gone to sleep, and I figured that the extra atomics were a small price to pay for not sleeping. One writer many readers is a good case to test, though, because the new code will get used a lot but we'll never actually manage a group commit. So that is worth testing. It's worth point out, though, that your lwlock.c atomics changes have the same effect. 
It's worth pointing out, though, that your lwlock.c atomics changes
have the same effect.  Before, if there were any shared lockers on an
LWLock and an exclusive-locker came along, the exclusive locker would
take and release the spinlock and go to sleep.  Now, it will use CAS
to try to acquire the lock, then acquire and release the spinlock to
add itself to the wait queue, then use CAS to attempt the lock again,
and then go to sleep - unless of course the second lock acquisition
succeeds, in which case it will first acquire and release the spinlock
yet again to take itself back out of the list of waiters.

The changes we made to the clock sweep are more of the same.  Instead
of taking an lwlock, sweeping the clock hand across many buffers, and
releasing the lwlock, we now perform an atomic operation for every
buffer.  (A sketch of what that looks like is at the end of this
message.)  That's obviously "worse" in terms of the total number of
atomic operations, and if you construct a case where every backend
sweeps 10 or 20 buffers yet none of them would have contended with
each other, it might lose.  But it's actually much better and more
scalable in the real world.

I think this is a pattern we'll see with atomics over and over again:
more atomic operations overall in order to reduce the length of time
for which particular resources are unavailable to other processes.
Obviously, we need to be careful about that, and we need to be
particularly careful not to regress the single-client performance,
which is one thing that I liked about this patch - if ProcArrayLock
can be taken in exclusive mode immediately, there's no difference vs.
older versions.  So I think in general this is going to shift pretty
smoothly from the old behavior in low client-count situations to the
group behavior in high client-count situations, but it does seem
possible that if you have say 1 writer and 100 readers you could lose.
I don't think you'll lose much, but let's test that.
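On the clock sweep point above, the tick is now roughly the following
(simplified from memory; the wraparound and complete-pass statistics
bookkeeping still take a spinlock in the real code):

	static inline uint32
	ClockSweepTick(void)
	{
		uint32		victim;

		/*
		 * One atomic per tick: advance the hand without holding any
		 * lock, instead of sweeping many buffers under one LWLock.
		 */
		victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

		if (victim >= NBuffers)
		{
			/* Wrapped around; statistics update under spinlock omitted. */
			victim = victim % NBuffers;
		}
		return victim;
	}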
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company