Thread: Reducing the cost of sinval messaging
I happened to be looking at sinvaladt.c and noticed the loop added in commit b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4:

    /*
     * Now that the maxMsgNum change is globally visible, we give everyone
     * a swift kick to make sure they read the newly added messages.
     * Releasing SInvalWriteLock will enforce a full memory barrier, so
     * these (unlocked) changes will be committed to memory before we exit
     * the function.
     */
    for (i = 0; i < segP->lastBackend; i++)
    {
        ProcState *stateP = &segP->procState[i];

        stateP->hasMessages = true;
    }

This seems fairly awful when there are a large number of backends. Why could we not remove the hasMessages flags again, and change the unlocked test

    if (!stateP->hasMessages)
        return 0;

into

    if (stateP->nextMsgNum == segP->maxMsgNum &&
        !stateP->resetState)
        return 0;

If we are looking at stale shared state, this test might produce a false positive, but I don't see why it's any less safe than testing a hasMessages boolean.

            regards, tom lane
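For readers following the thread without sinvaladt.c open, here are the two fast-path tests being compared, wrapped in a standalone sketch. The *Sketch types and helper names are illustrative only; the field names are the ones used in the real code.

    #include <stdbool.h>

    /* Pared-down stand-ins for the per-backend and global sinval state. */
    typedef struct
    {
        int  nextMsgNum;    /* next message number this backend will read */
        bool resetState;    /* true once this backend has fallen too far behind */
        bool hasMessages;   /* "you may have mail" flag set by writers */
    } ProcStateSketch;

    typedef struct
    {
        int maxMsgNum;      /* next message number to be assigned */
    } SISegSketch;

    /*
     * Current form: the reader's test is a single flag, but every writer must
     * set hasMessages for all lastBackend slots -- the O(N) loop quoted above.
     */
    static bool
    can_skip_current(ProcStateSketch *stateP)
    {
        return !stateP->hasMessages;
    }

    /*
     * Proposed form: infer "nothing new for me" from the (possibly stale)
     * shared counters, so writers no longer have to touch every backend's slot.
     */
    static bool
    can_skip_proposed(ProcStateSketch *stateP, SISegSketch *segP)
    {
        return stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState;
    }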
On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I happened to be looking at sinvaladt.c and noticed the loop added in > commit b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4: > > /* > * Now that the maxMsgNum change is globally visible, we give everyone > * a swift kick to make sure they read the newly added messages. > * Releasing SInvalWriteLock will enforce a full memory barrier, so > * these (unlocked) changes will be committed to memory before we exit > * the function. > */ > for (i = 0; i < segP->lastBackend; i++) > { > ProcState *stateP = &segP->procState[i]; > > stateP->hasMessages = true; > } > > This seems fairly awful when there are a large number of backends. We did quite a bit of testing at the time and found that it was a clear improvement over what we had been doing previously. Of course, that's not to say we can't or shouldn't improve it further. > Why could we not remove the hasMessages flags again, and change the > unlocked test > > if (!stateP->hasMessages) > return 0; > > into > > if (stateP->nextMsgNum == segP->maxMsgNum && > !stateP->resetState) > return 0; > > If we are looking at stale shared state, this test might produce a > false positive, but I don't see why it's any less safe than testing a > hasMessages boolean. It was discussed at the time: http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+N_g@mail.gmail.com http://www.postgresql.org/message-id/13077.1311702188@sss.pgh.pa.us -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Why could we not remove the hasMessages flags again, and change the >> unlocked test >> >> if (!stateP->hasMessages) >> return 0; >> >> into >> >> if (stateP->nextMsgNum == segP->maxMsgNum && >> !stateP->resetState) >> return 0; >> >> If we are looking at stale shared state, this test might produce a >> false positive, but I don't see why it's any less safe than testing a >> hasMessages boolean. > It was discussed at the time: > http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+N_g@mail.gmail.com > http://www.postgresql.org/message-id/13077.1311702188@sss.pgh.pa.us Neither of those messages seem to me to bear on this point. AFAICS, the unlocked hasMessages test has a race condition, which the comment just above it argues isn't a problem in practice. What I propose above would have an absolutely indistinguishable race condition, which again wouldn't be a problem in practice if you believe the comment's argument. regards, tom lane
On Mon, Oct 27, 2014 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Why could we not remove the hasMessages flags again, and change the >>> unlocked test >>> >>> if (!stateP->hasMessages) >>> return 0; >>> >>> into >>> >>> if (stateP->nextMsgNum == segP->maxMsgNum && >>> !stateP->resetState) >>> return 0; >>> >>> If we are looking at stale shared state, this test might produce a >>> false positive, but I don't see why it's any less safe than testing a >>> hasMessages boolean. > >> It was discussed at the time: > >> http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+N_g@mail.gmail.com >> http://www.postgresql.org/message-id/13077.1311702188@sss.pgh.pa.us > > Neither of those messages seem to me to bear on this point. AFAICS, > the unlocked hasMessages test has a race condition, which the comment > just above it argues isn't a problem in practice. What I propose above > would have an absolutely indistinguishable race condition, which again > wouldn't be a problem in practice if you believe the comment's argument. Yes, the read from hasMessages has a race condition, which is in fact not a problem in practice if you believe the comment's argument. However, your proposed formulation has a second race condition, which is discussed in those emails, and which is precisely why we rejected the design you're now proposing three years ago. That second race condition is that the read from stateP->nextMsgNum doesn't have to happen at the same time as the read from stateP->resetState. No CPU or compiler barrier separates them, so either can happen before the other. SICleanupQueue updates both values, so you have the problem first described precisely by Noah: # (In theory, you could # hit a case where the load of resetState gives an ancient "false" just as the # counters wrap to match. While it practically seems very unlikely that this would ever really happen, some guy named Tom Lane told us we should worry about it, so I did. That led to this new design: http://www.postgresql.org/message-id/CA+TgmobefJvyQDjVBb6TSs996s8ZKvyRzTtPx0ChYo3hve52tg@mail.gmail.com ...which is what ended up getting committed. Personally, I think trying to further optimize this is most likely a waste of time. I'm quite sure that there's a whole lot of stuff in the sinval code that could be further optimized, but it executes so infrequently that I believe it doesn't matter. Noah beat on this pretty hard at the time and found that O(N) loop you're complaining about - in his testing - never went above 0.26% of CPU time and reduced overall CPU usage even in a sinval-intensive workload with 50 clients per core: http://www.postgresql.org/message-id/20110729030514.GB30354@tornado.leadboat.com If you've got some evidence that this is a real problem as opposed to a hypothetical one, we could certainly reopen the debate about whether the possibility that the loads from stateP->nextMsgNum and stateP->resetState will be widely enough spaced in time for a quarter-million messages to be queued up in the meantime is sufficiently unlikely to ignore. However, absent evidence of a tangible performance problem, I'd be quite disinclined to throw the guaranteed safety of the current approach out the window. We could do some testing where we crank that 262,144 value down to progressively lower numbers until the race condition actually manifests, just to see how far we are from disaster.
But presumably the goalposts there shift every time we get a new generation of hardware; and I don't much like coding on the basis of "eh, seems unlikely the CPU will ever delay a read by *that* much." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
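A minimal standalone sketch of that second race condition, with the unordered loads written out. Field names are borrowed from sinvaladt.c; the *Sketch type and helper name are illustrative, and the structure is a simplified stand-in rather than the real one.

    #include <stdbool.h>

    typedef struct
    {
        int  nextMsgNum;    /* per-backend read position */
        bool resetState;    /* set by SICleanupQueue when a backend falls too far behind */
    } ProcStateSketch;

    static bool
    unlocked_can_skip(ProcStateSketch *stateP, int *maxMsgNum)
    {
        /*
         * Nothing orders these three loads: neither the compiler nor the CPU
         * is obliged to perform the resetState load at the same time as the
         * message-number loads.  Since SICleanupQueue updates both the
         * counters and resetState, an arbitrarily old "false" here could in
         * principle be paired with counters that have since wrapped into
         * apparent equality -- the "ancient false" case quoted from Noah.
         */
        int  next  = stateP->nextMsgNum;
        int  max   = *maxMsgNum;
        bool reset = stateP->resetState;

        return next == max && !reset;
    }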
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 27, 2014 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Neither of those messages seem to me to bear on this point. AFAICS, >> the unlocked hasMessages test has a race condition, which the comment >> just above it argues isn't a problem in practice. What I propose above >> would have an absolutely indistinguishable race condition, which again >> wouldn't be a problem in practice if you believe the comment's argument. > Yes, the read from hasMessages has a race condition, which is in fact > not a problem in practice if you believe the comment's argument. > However, your proposed formulation has a second race condition, which > is discussed in those emails, and which is precisely why we rejected > the design you're now proposing three years ago. That second race > condition is that the read from stateP->nextMsgNum doesn't have to > happen at the same time as the read from stateP->resetState. No CPU or > compiler barrier separates them, so either can happen before the > other. SICleanupQueue updates both values, so you have the problem > first described precisely by Noah: > # (In theory, you could > # hit a case where the load of resetState gives an ancient "false" just as the > # counters wrap to match. > While it practically seems very unlikely that this would ever really > happen, some guy named Tom Lane told us we should worry about it, so I > did. That argument is nonsense. I complained about a lack of close analysis, but with close analysis I think this is perfectly safe; or at least no less safe than what's there now, with its not terribly bulletproof assumption that a suitable memory barrier has happened recently. You'll grant that reads and writes of the integer msgNum variables are atomic, I trust (if not, we've got lots of bugs elsewhere). So what we have to worry about does not include totally-corrupted values, but only stale values, including possibly out-of-order reads. Ignoring the possibility of MSGNUMWRAPAROUND wraparound for a moment, only our own process ever changes nextMsgNum, so we should certainly see a current value of that. So the argument for a hazard is that maxMsgNum could have been advanced 2^32 times since our process last read messages, and then we come along and see that value, but we *don't* see our resetState flag set even though that happened 2^32 - MAXNUMMESSAGES messages ago, and was certainly flushed into shared memory shortly thereafter. That's nonsense. It's especially nonsense if you assume, as the existing code already does, that the reading process "recently" executed a read barrier. Now, as to what happens when MSGNUMWRAPAROUND wraparound occurs: that code decrements all msgNum variables whether they belong to hopelessly-behind backends or not. That's why the time until a possible chance match is 2^32 messages and not something else. However, since the proposed new unlocked test might come along and see a partially-complete wraparound update, it could see either nextMsgNum or maxMsgNum offset by MSGNUMWRAPAROUND compared to the other. What would most likely happen is that this would make for a false decision that they're unequal, resulting in a useless trip through the locked part of SIGetDataEntries, which is harmless. 
If our backend had fallen behind by exactly 2^32-MSGNUMWRAPAROUND messages or 2^32+MSGNUMWRAPAROUND messages, we could make a false conclusion that nextMsgNum and maxMsgNum are equal --- but, again, our resetState must have become set a billion or so sinval messages back, and it's just not credible that we're not going to see that flag set. So I still say that the current code is wasting a lot of cycles for no actual gain in safety. > Personally, I think trying to further optimize this is most likely a > waste of time. I'm quite sure that there's a whole lot of stuff in > the sinval code that could be further optimized, but it executes so > infrequently that I believe it doesn't matter. Maybe not today, but it's not hard to imagine usage patterns where this would result in O(N^2) total work in the sinval code. > If you've got some evidence that this is a real problem as opposed to > a hypothetical one, we could certainly reopen the debate about whether > ignoring the possibility that the loads from stateP->nextMsgNum and > stateP->resetState will be widely-enough spaced in time for a > quarter-million messages to be queued up in the meantime is > sufficiently unlikely to ignore. It's a billion messages, not a quarter million... > However, absent evidence of a > tangible performance problem, I'd be quite disinclined to throw the > guaranteed safety of the current approach out the window. I fail to find any "guaranteed safety" in the current approach. It's dependent on the assumption of a recently-executed read barrier in the reading backend, which assumption is sufficient for what I'm proposing as well. Another thought here is that the whole thing becomes moot if we stick a read barrier into the "unlocked" test. We did not have that option during the last go-round with this code, but I'm inclined to think we ought to revisit sinvaladt.c with that tool in our belts anyway; replacing the spinlock around maxMsgNum with suitable read/write barriers has been on the to-do list since forever. regards, tom lane
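For context, the wraparound step being argued about looks roughly like the following standalone sketch, paraphrased from SICleanupQueue in sinvaladt.c. The *Sketch types, the fixed-size array, and the 262,144 constant (the value cited upthread) are stand-ins, not the real definitions.

    /* Placeholder for MSGNUMWRAPAROUND; see sinvaladt.c for the real definition. */
    #define MSGNUMWRAPAROUND 262144

    typedef struct
    {
        int nextMsgNum;                 /* per-backend read position */
    } ProcStateSketch;

    typedef struct
    {
        int             minMsgNum;
        int             maxMsgNum;
        int             lastBackend;
        ProcStateSketch procState[64];  /* sized arbitrarily for the sketch */
    } SISegSketch;

    /*
     * Every counter, global and per-backend, is shifted down by the same
     * amount.  An unlocked reader that catches this update half-done can
     * therefore see nextMsgNum and maxMsgNum offset from each other by
     * exactly MSGNUMWRAPAROUND -- most likely a harmless false "unequal"
     * that just costs one extra trip through the locked path, as argued
     * in the message above.
     */
    static void
    wraparound_sketch(SISegSketch *segP)
    {
        int     i;

        if (segP->minMsgNum >= MSGNUMWRAPAROUND)
        {
            segP->minMsgNum -= MSGNUMWRAPAROUND;
            segP->maxMsgNum -= MSGNUMWRAPAROUND;
            for (i = 0; i < segP->lastBackend; i++)
                segP->procState[i].nextMsgNum -= MSGNUMWRAPAROUND;
        }
    }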
On Mon, Oct 27, 2014 at 8:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > That argument is nonsense. I complained about a lack of close analysis, > but with close analysis I think this is perfectly safe; or at least no > less safe than what's there now, with its not terribly bulletproof > assumption that a suitable memory barrier has happened recently. I don't think there's any reason to think that assumption is any less than clad in iron, if not tungsten. The entire sinval mechanism relies for its correctness on heavyweight locking: at transaction commit, we queue up sinval messages before releasing locks; and we check for sinval messages - at least - after acquiring locks. For this to be correct, the lock held when queueing up a sinval message must be strong enough to prevent backends who might need to see that message from acquiring a lock of their own. In that way, by the time a process manages to acquire a lock, it can be confident that the invalidation messages it needs to see have already been queued. Since a heavyweight lock acquisition or release is a memory barrier many times over, it's fair to say that by the time the process queueing the invalidation messages releases locks, the messages it wrote must be committed to memory. And conversely, when a process has just acquired a lock, any invalidations committed to memory by another backend will be visible to it by the time the lock acquisition has completed. The only way this isn't a sufficient interlock is if the lock acquired by the second backend doesn't conflict with the lock held by the first one. But if that's the case, the whole system is broken far beyond the ability of a memory barrier to add or detract. > You'll grant that reads and writes of the integer msgNum variables are > atomic, I trust (if not, we've got lots of bugs elsewhere). So what we > have to worry about does not include totally-corrupted values, but only > stale values, including possibly out-of-order reads. Check and check. > Ignoring the > possibility of MSGNUMWRAPAROUND wraparound for a moment, only our own > process ever changes nextMsgNum, so we should certainly see a current > value of that. OK... > So the argument for a hazard is that maxMsgNum could have > been advanced 2^32 times since our process last read messages, and then we > come along and see that value, but we *don't* see our resetState flag set > even though that happened 2^32 - MAXNUMMESSAGES messages ago, and was > certainly flushed into shared memory shortly thereafter. That's nonsense. > It's especially nonsense if you assume, as the existing code already does, > that the reading process "recently" executed a read barrier. I'm not sure I follow this part. > Now, as to what happens when MSGNUMWRAPAROUND wraparound occurs: that code > decrements all msgNum variables whether they belong to hopelessly-behind > backends or not. That's why the time until a possible chance match is > 2^32 messages and not something else. However, since the proposed new > unlocked test might come along and see a partially-complete wraparound > update, it could see either nextMsgNum or maxMsgNum offset by > MSGNUMWRAPAROUND compared to the other. I definitely agree with that last sentence, which I think is the key point, but I don't understand how 2^32 comes into it. It seems to me that a chance match is a possibility every MSGNUMWRAPAROUND messages, precisely because one might be offset by that amount compared to the other. 
> If our backend had fallen behind by exactly 2^32-MSGNUMWRAPAROUND > messages or 2^32+MSGNUMWRAPAROUND messages, we could make a false > conclusion that nextMsgNum and maxMsgNum are equal --- but, again, our > resetState must have become set a billion or so sinval messages back, and > it's just not credible that we're not going to see that flag set. I'd like a more precise argument than "not credible". How many sinval messages back does it have to be before we consider it safe? Why that number and not twice as many or half as many? > So I still say that the current code is wasting a lot of cycles for > no actual gain in safety > >> Personally, I think trying to further optimize this is most likely a >> waste of time. I'm quite sure that there's a whole lot of stuff in >> the sinval code that could be further optimized, but it executes so >> infrequently that I believe it doesn't matter. > > Maybe not today, but it's not hard to imagine usage patterns where this > would result in O(N^2) total work in the sinval code. To all of this I say: show me the profile or usage pattern where this actually matters. We worked hard when this patch was in development and under review to find a usage pattern where this had a significant impact and were unable to do so. > Another thought here is that the whole thing becomes moot if we stick a > read barrier into the "unlocked" test. We did not have that option during > the last go-round with this code, but I'm inclined to think we ought to > revisit sinvaladt.c with that tool in our belts anyway; replacing the > spinlock around maxMsgNum with suitable read/write barriers has been on > the to-do list since forever. I agree we should replace the spinlock with the appropriate barrier primitives, for cleanliness if nothing else, but I bet inserting a read barrier into the fast-path that currently uses no barriers or locks at all would have a very measurable negative effect on performance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
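For concreteness, here is roughly what the read-barrier idea for the unlocked test could look like, written as a standalone sketch. In PostgreSQL itself the fence would be pg_read_barrier() from storage/barrier.h; a compiler builtin stands in here so the sketch compiles on its own, and the *Sketch type and helper name are illustrative rather than proposed code.

    #include <stdbool.h>

    /* Stand-in for pg_read_barrier(); an acquire fence suffices for the sketch. */
    #define read_barrier()  __atomic_thread_fence(__ATOMIC_ACQUIRE)

    typedef struct
    {
        int  nextMsgNum;            /* our own read position */
        bool resetState;            /* set when we fall too far behind */
    } ProcStateSketch;

    static bool
    unlocked_can_skip_with_barrier(ProcStateSketch *stateP, int *maxMsgNum)
    {
        int max = *maxMsgNum;       /* read the shared queue head first... */

        read_barrier();             /* ...and only afterwards our reset flag, so
                                     * that if we see a recent maxMsgNum we also
                                     * see any resetState that was set before it
                                     * was written */

        return stateP->nextMsgNum == max && !stateP->resetState;
    }

Whether the cost of that fence on the fast path is acceptable is exactly the question raised in the message above.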