AsyncExecute helper classes can complete() before they are added to mPendingEvents

RESOLVED FIXED in mozilla1.9.1b2

Status

()

defect
RESOLVED FIXED
11 years ago
7 years ago

People

(Reporter: asuth, Assigned: sdwilsh)

Tracking

({intermittent-failure})

Trunk
mozilla1.9.1b2
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 4 obsolete attachments)

davida has seen the AsyncExecute destructor's NS_ASSERTION(mPendingEvents.Count() == 0, "Still pending events!") fire a bunch of times in succession.  Based on the fact that there was high latency at the time I am inclined to believe that a full GC with cycle collector was happening.  I assume some AsyncExecutes had completed execution, so were no longer directly alive, but were being sustained by a CompletionNotifier's reference count.  Based on my understanding of the rather undocumented nsRefPtr's source, I doubt that the cycle collector is able to follow the nsRefPtr link from the 'live' CompletionNotifier to the AsyncExecute.

I don't have ironclad proof of this, but my previous auditing of the code had suggested it was otherwise impossible for this to happen without memory corruption.
Assignee

Comment 1

11 years ago
We shouldn't be touched by the cycle collector because we don't tell it about us.  I'm quite confused as to how we managed to have mPendingEvents.Count() be greater non-zero when we hit the constructor.  Every single pending event we post ends up holding a reference to the AsyncExecute object...
So, I never realized that mCallingThread->Dispatch is called before the mPendingEventsMutex is taken and mPendingEvents.AppendObject is called.

I had the debugger trigger on the assertion.  mPendingEvents has 3 completed CallbackResultNotifiers in it (they are not canceled and mCompletionNotifier was nulled out).  Each of their reference counts was 1.  This suggests that they must have run to completion before AsyncExecute got a chance to add them to mPendingEvents.

Is there any reason not to move the call to dispatch inside the mutex?

ps: Apologies for crying wolf, I confess I was hoping to get the problem solved for me without firing up a debugger or taking a break to properly read up on the cycle collector ;).  (It turns out I only had to read a few lines, of course...)
Summary: Notifier use of nsRefPtr rather than nsCOMPtr can result in live AsyncExecute being cycled collected → AsyncExecute helper classes can complete() before they are added to mPendingEvents
(In reply to comment #2)
> Is there any reason not to move the call to dispatch inside the mutex?

Yes, you shouldn't call out to other modules while holding a lock as it is a deadlock hazard. But there's definitely a race here that needs to be addressed.
Maybe as simple as this?

+  nsRefPtr<CallbackResultNotifier> notifier = ...
+  {
+    nsAutoLock mutex(mPendingEventsMutex);
+    (void)mPendingEvents.AppendObject(notifier);
+  }
+
+  nsresult status = mCallingThread->Dispatch(notifier, NS_DISPATCH_NORMAL);
+
+  if (NS_FAILED(status)) {
+    nsAutoLock mutex(mPendingEventsMutex);
+    (void)mPendingEvents.RemoveObject(notifier);
+  }
Assignee

Comment 5

11 years ago
comment 4 is the right approach here I think
Assignee

Comment 6

11 years ago
in fact, r=sdwilsh for a patch that does what comment 4 says, but with no newline after mCallingThread->Dispatch(...)
Assignee

Comment 7

11 years ago
Posted patch v1.0 (obsolete) — Splinter Review
This is what bent said, but in patch form.  Needs bug 458756 applied to apply cleanly I suspect.
Attachment #342474 - Flags: review+
Assignee

Updated

11 years ago
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
OS: Mac OS X → All
Hardware: PC → All
Target Milestone: --- → mozilla1.9.1b2
Shouldn't the patch also address the identical case for CompletionNotifier and the similar, but more complicated, case for ErrorNotifier?  (ErrorNotifier::Dispatch creates the ErrorNotifier instance and dispatches it before actually returning control to the AsyncExecute instance who maintains mPendingEvents.)
Assignee

Comment 9

11 years ago
hmm, yeah - those need it too.
Assignee

Comment 10

11 years ago
I think tinderbox is seeing this occasionally.
Assignee: bent.mozilla → sdwilsh
Assignee

Updated

11 years ago
Whiteboard: [needs patch]
Assignee

Comment 11

11 years ago
Posted patch v2.0 (obsolete) — Splinter Review
OK.  This is a much better solution.  All event dispatching now goes through Notify, and I've added NotifyError and NotifyResults.  Also renamed Complete to NotifyComplete.  There is now only one place in the code where pending events are added, and two where they are removed, making this logic much simpler to follow.
Attachment #342474 - Attachment is obsolete: true
Attachment #343473 - Flags: review?(bent.mozilla)
Assignee

Comment 12

11 years ago
Comment on attachment 343473 [details] [diff] [review]
v2.0

Just to be clear...

>-          {
>-            nsAutoLock mutex(mStateMutex);
>-            mState = ERROR;
>-          }
>+          nsAutoLock mutex(mStateMutex);
>+          mState = ERROR;
This change is unrelated, but I noticed we were not holding the state mutex, which is an invariant of calling NotifyComplete.
Assignee

Updated

11 years ago
Whiteboard: [needs patch] → [has patch][needs review bent]
Comment on attachment 343473 [details] [diff] [review]
v2.0

I would recommend using the atomic set operations rather than locking all over just to set an integer value. That would let you always lock inside the methods you'd like to rather than assuming the caller does the right thing.
Assignee

Comment 14

11 years ago
(In reply to comment #13)
> (From update of attachment 343473 [details] [diff] [review])
> I would recommend using the atomic set operations rather than locking all over
> just to set an integer value. That would let you always lock inside the methods
> you'd like to rather than assuming the caller does the right thing.
I hadn't done that originally because I wasn't sure if we really have atomic reads on all platforms (I'm pretty sure we don't actually unless we use the volatile keyword, which we can't actually use with PR_AtomicSet), but I've got a patch coming with that.
Assignee

Comment 15

11 years ago
Posted patch v2.1 (obsolete) — Splinter Review
Attachment #343473 - Attachment is obsolete: true
Attachment #344011 - Flags: review?(bent.mozilla)
Attachment #343473 - Flags: review?(bent.mozilla)
Assignee

Updated

11 years ago
Duplicate of this bug: 460242
Assignee

Comment 17

11 years ago
OK - this bug really sucks, for the record.  It is quite possible to call cancel when we've already finished.  In the past, it would throw (I'd say that is not desirable), but with the approach that bent suggested to me, it means we end up having it be quite possible to say that our reason for completion in handleCompletion is REASON_FINISHED instead of REASON_CANCELED.

I almost think that we have to keep the mState lock and throw if we try to cancel when we are not currently in a pending state (or at least atomically set on the background, and read only on the calling thread).  Regardless, the test needs to be fixed to reflect that...
Assignee

Comment 18

11 years ago
But that also seems like a really bad idea.  I'm not sure what the right behavior is here...
Assignee

Comment 19

11 years ago
Posted patch v3.0 (obsolete) — Splinter Review
This stuff is hard...

This has an interface change for mozIStoragePendingStatement::cancel.  It now returns a boolean indicating if it successfully canceled or not.  Discussed this at length with bent, and it makes our tests more reliable.
Attachment #344011 - Attachment is obsolete: true
Attachment #344560 - Flags: review?(bent.mozilla)
Attachment #344011 - Flags: review?(bent.mozilla)
Attachment #344560 - Flags: review?(bent.mozilla) → review-
Comment on attachment 344560 [details] [diff] [review]
v3.0

>+   * Cancels a pending statement, if possible.  This will only fail if you try
>+   * cancel more than once.

Typo, should be "try to cancel".

>     for (PRUint32 i = 0; i < mStatements.Length(); i++) {

So, for the record, I think this while-within-a-for-loop construct is tough to follow and that it could be simplified by breaking the while loop into a separate function. I further think that would let you use the scoped locks more effectively. But since we're tight on time and it looks like you've done everything correctly I guess that can wait for a followup.

>+#ifdef DEBUG
>+    PRBool onCallingThread = PR_FALSE;
>+    (void)mCallingThread->IsOnCurrentThread(&onCallingThread);
>+    NS_ASSERTION(onCallingThread, "Not canceling from the calling thread!");
>+#endif

Handy! Maybe file a followup about adding these in more places :)

>+    // If we have already canceled, we have an error, but always indicate that
>+    // we are trying to cancel.
>+    NS_ENSURE_FALSE(PR_AtomicSet(&mTryToCancel, PR_TRUE), NS_ERROR_UNEXPECTED);

I'd avoid calling functions within any of the nsDebug macros. Just make a stack variable.

And don't use the PR_AtomicSet on a PRBool. That will break when PRBool and PRInt32 are no longer typedef'd to the same thing.

>+    // Establish if we can cancel
>     {
>+      nsAutoLock mutex(mLock);
>+      *_successful = (mState == PENDING);
>     }

Your cancel could have already succeeded, need to also check mState == CANCELED. Or move the set for mTryToCancel inside the mutex after you've tested.

>+    nsAutoLock::DestroyLock(mLock);
>+    mLock = nsnull;

Nulling this out in the destructor isn't necessary

>     if (mCallback) {
>       nsRefPtr<CompletionNotifier> completionEvent =
>+        new CompletionNotifier(mCallback, mState);
>+      NS_ENSURE_TRUE(completionEvent, NS_ERROR_OUT_OF_MEMORY);
>+
>+      (void)mCallingThread->Dispatch(completionEvent, NS_DISPATCH_NORMAL);
> 
>       // We no longer own mCallback (the CompletionNotifier takes ownership).
>       mCallback = nsnull;

I'd move this above the dispatch, then, to avoid any funny races.

>+  nsresult NotifyError(PRInt32 aErrorCode, const char *aMessage)

I can't find a single place where you care about the return value in any of the Notify functions except NotifyComplete... You're always discarding the return. Maybe make them return void?

I'd also document that all these Notify functions expect to *not* have the lock in place when called.

And I was thinking about it, it would probably be trivial to subclass nsAutoLock in debug builds to expose the locked/unlocked status so you could assert that. Followup fodder.

>+  PRBool mTryToCancel;

I think I like 'mCancelRequested' or 'mCancelAttempted' better... Your call though.

>+   * This is the lock that protects our state from changing.

Maybe list the variables it protects instead since that's pretty vague.
Assignee

Comment 21

11 years ago
Posted patch v3.1Splinter Review
Addresses review comments
Attachment #344560 - Attachment is obsolete: true
Attachment #344679 - Flags: review?(bent.mozilla)
Assignee

Comment 22

11 years ago
er...with the exception of the nulling out in the destructor which I've fixed locally.
Comment on attachment 344679 [details] [diff] [review]
v3.1

>+      // We need to indicate that we want to try and cancel now.
>+      mCancelRequested = PR_TRUE;
>+
>+      // Establish if we can cancel
>+      *_successful = (mState == PENDING);

Add some comments here about why you're canceling the previously queued results even if you're going to return false.
Attachment #344679 - Flags: review?(bent.mozilla) → review+
Assignee

Comment 24

11 years ago
http://hg.mozilla.org/mozilla-central/rev/d3cdb61f7d29
Status: ASSIGNED → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → FIXED
Whiteboard: [has patch][needs review bent]
It looks like CompletionNotifier::cancel was accidentally left around?
Assignee

Comment 26

11 years ago
huh, missed one then.  Could you please file a bug to remove that?
(In reply to comment #26)

I filed bug 461902.
Assignee

Comment 28

11 years ago
So, I do not think we managed to get this race condition nailed out given a recent tinderbox failure:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1225387432.1225397005.17478.gz#err0

The test in question is setup to be fault tolerant of our reason being either finished or canceled.  It's that, or we've managed to say we can cancel, but do not properly update state somewhere.
Assignee

Comment 29

11 years ago
I know what's up with comment 28 - I'll file a bug about it later.  For now there is a sticky note on my desk on what to do.
Whiteboard: [orange]
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.