Closed Bug 1259134 Opened 9 years ago Closed 9 years ago

Make mozilla::CondVar resilient to spurious wakeup

Categories

(Core :: XPCOM, defect)

defect
Not set
critical

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox48 --- affected

People

(Reporter: mayhemer, Assigned: mayhemer)

Details

Attachments

(1 file)

According findings in bug 1257611 (see its c#7), mozilla::CondVar::Wait() may return spuriously w/o actual notification on it being made. This also applies to mozilla::Monitor::Wait that simply forwards to its CondVar. We should make CondVar resilient to this to take the burden of a specific notification flag and loop around every Wait() call from its consumers. And also because it can be easily omitted - which is a serious bug!
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Attached patch v1Splinter Review
This patch does (have to) a little bit more: - protects against spurious wake up - makes Wait() return when Notify() or NotifyAll() is called before Wait() is actually entered on the waiting thread - handles spurious wake up and timeout < inf (checks the sleep time, and if less than the required timeout, loops again with accordingly adjusted timeout value) How does it work: - there is mNotify flag - initially false - we have a sentinel loop while this flag is false - we set this flag in Notify() and NotifyAll() - we drop this flag after the sentinel loop exited - there is also mNotifyAllGeneration - it's a counter of number of calls to NotifyAll - we need this, since when more than one thread wait for the same cond var, the first woken up drops the mNotify flag and others would just continue the loop since they'd believe it was just a spurious wake up - but we break the sentinel loop when the generation value is different from the one when we entered Wait() There is tho one inconsistency with the functionality of being able to call NotifyAll() sooner than Wait() has been entered on the other thread. Normally (no patch), this will not wake the thread. However, the mNotify flag raised in NotifyAll() will prevent PR_WaitCondVar be even entered and let the thread do its job (as if it has been notified). This works well in case of a single waiting thread, but has an issue when there are more threads waiting: the first one entering Wait() will exit immediately, but others will wait. I kind think this is inconsistent and maybe the reversed sequencing (call of Notify() before Wait() while Wait() should not block) should be allowed only for Notify() and not for NotifyAll(). Technically it means nothing more then just remove of mNotify = true from NotifyAll. Note that the Notify before Wait bit is part of this patch because otherwise this patch would not be needed at all. Consumers would then need their own sentinel loop over wait anyway. I forgot about this issue more then once myself - I actually deadlocked the other thread. Solution was always the loop and flag. This patch should remove that burden altogether. What do you think? try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=557e69998658
Attachment #8734355 - Flags: review?(nfroyd)
Comment on attachment 8734355 [details] [diff] [review] v1 Review of attachment 8734355 [details] [diff] [review]: ----------------------------------------------------------------- I haven't thought too hard about whether the NotifyAll parts of this patch work, or whether mixing Notify and NotifyAll does the right thing, but the Notify part seems to have at least one issue, see below. ::: xpcom/glue/CondVar.h @@ +132,1 @@ > return PR_NotifyCondVar(mCvar) == PR_SUCCESS ? NS_OK : NS_ERROR_FAILURE; I don't think this works in all cases. Consider the following sequence of events: T1: Wait() on CV T2: Wait() on CV T3: Notify() on CV, mNotify is now true T4: Notify() on CV, mNotify is now true T1: wakes up, sets mNotify to false T2: wakes up, sees that mNotify is false, and goes back to waiting We now have a lost Notify() that nobody is going to handle. @@ +186,5 @@ > + // only one of them is going to wake up. > + // > + // XXX: should we rather be consistent and allow this functionality > + // only for Notify? > + bool mNotify; At the very least, these additional members need to be Atomic<>.
Attachment #8734355 - Flags: review?(nfroyd) → review-
(In reply to Nathan Froyd [:froydnj] from comment #2) > Comment on attachment 8734355 [details] [diff] [review] > v1 > > Review of attachment 8734355 [details] [diff] [review]: > ----------------------------------------------------------------- > > I haven't thought too hard about whether the NotifyAll parts of this patch > work, or whether mixing Notify and NotifyAll does the right thing, but the > Notify part seems to have at least one issue, see below. > > ::: xpcom/glue/CondVar.h > @@ +132,1 @@ > > return PR_NotifyCondVar(mCvar) == PR_SUCCESS ? NS_OK : NS_ERROR_FAILURE; > > I don't think this works in all cases. Consider the following sequence of > events: > > T1: Wait() on CV > T2: Wait() on CV > T3: Notify() on CV, mNotify is now true > T4: Notify() on CV, mNotify is now true > T1: wakes up, sets mNotify to false > T2: wakes up, sees that mNotify is false, and goes back to waiting > > We now have a lost Notify() that nobody is going to handle. Good catch, I missed that one. Thanks for checking on this so quickly. Seems like doing what I do for NotifyAll() also for Notify() is the way to go here: T1: Wait() on CV, gen = 0, mGen = 0 T2: Wait() on CV, gen = 0, mGen = 0 T3: Notify() on CV, mGen = 1 T4: Notify() on CV, mGen = 2 T1: wakes up, gen(0) != mGen(2) -> exit T2: wakes up, gen(0) != mGen(2) -> exit and to verify: T1: Wait() on CV, gen = 0, mGen = 0 T2: Wait() on CV, gen = 0, mGen = 0 T3: Notify() on CV, mGen = 1 T1: wakes up, gen(0) != mGen(1) -> exit .. T3: Notify() on CV, mGen = 2 T2: wakes up, gen(0) != mGen(2) -> exit makes sense? also nicely answers the dilemma of notify/notifyAll separate logic. only problem could be an 2^32 overlap of the counter when a certain condvar is actually woken up exactly after 2^32 notifications (mGen == gen). but that seems extremely unlikely. > > @@ +186,5 @@ > > + // only one of them is going to wake up. > > + // > > + // XXX: should we rather be consistent and allow this functionality > > + // only for Notify? > > + bool mNotify; > > At the very least, these additional members need to be Atomic<>. No, they are always accessed under the lock. It's not explicitly visible here, but when you call on any Wait() or Notify*() you must hold the paired lock over it. Worth a comment.
Flags: needinfo?(nfroyd)
The approach with just counting generations is not resilient to spurious wake up in the following scenario: T1: Wait() T2: Wait() T3: Notify() T2 may now spuriously wake up and exit from Wait() at any time. I'm getting out of ideas here. Candidate for WONTFIX? Nathan?
Honestly, I don't see a lot of places in the code that Wait() without a surrounding loop, which suggests that the right place to handle this is in more application-y code rather than library code. So yeah, let's mark this WONTFIX for now and we can revisit if need be.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: needinfo?(nfroyd)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: