Closed Bug 458998 Opened 16 years ago Closed 16 years ago

AsyncExecute helper classes can complete() before they are added to mPendingEvents

Categories

(Toolkit :: Storage, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.1b2

People

(Reporter: asuth, Assigned: sdwilsh)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 4 obsolete files)

davida has seen the AsyncExecute destructor's NS_ASSERTION(mPendingEvents.Count() == 0, "Still pending events!") fire a bunch of times in succession.  Based on the fact that there was high latency at the time I am inclined to believe that a full GC with cycle collector was happening.  I assume some AsyncExecutes had completed execution, so were no longer directly alive, but were being sustained by a CompletionNotifier's reference count.  Based on my understanding of the rather undocumented nsRefPtr's source, I doubt that the cycle collector is able to follow the nsRefPtr link from the 'live' CompletionNotifier to the AsyncExecute.

I don't have ironclad proof of this, but my previous auditing of the code had suggested it was otherwise impossible for this to happen without memory corruption.
We shouldn't be touched by the cycle collector because we don't tell it about us.  I'm quite confused as to how we managed to have mPendingEvents.Count() be greater non-zero when we hit the constructor.  Every single pending event we post ends up holding a reference to the AsyncExecute object...
So, I never realized that mCallingThread->Dispatch is called before the mPendingEventsMutex is taken and mPendingEvents.AppendObject is called.

I had the debugger trigger on the assertion.  mPendingEvents has 3 completed CallbackResultNotifiers in it (they are not canceled and mCompletionNotifier was nulled out).  Each of their reference counts was 1.  This suggests that they must have run to completion before AsyncExecute got a chance to add them to mPendingEvents.

Is there any reason not to move the call to dispatch inside the mutex?

ps: Apologies for crying wolf, I confess I was hoping to get the problem solved for me without firing up a debugger or taking a break to properly read up on the cycle collector ;).  (It turns out I only had to read a few lines, of course...)
Summary: Notifier use of nsRefPtr rather than nsCOMPtr can result in live AsyncExecute being cycled collected → AsyncExecute helper classes can complete() before they are added to mPendingEvents
(In reply to comment #2)
> Is there any reason not to move the call to dispatch inside the mutex?

Yes, you shouldn't call out to other modules while holding a lock as it is a deadlock hazard. But there's definitely a race here that needs to be addressed.
Maybe as simple as this?

+  nsRefPtr<CallbackResultNotifier> notifier = ...
+  {
+    nsAutoLock mutex(mPendingEventsMutex);
+    (void)mPendingEvents.AppendObject(notifier);
+  }
+
+  nsresult status = mCallingThread->Dispatch(notifier, NS_DISPATCH_NORMAL);
+
+  if (NS_FAILED(status)) {
+    nsAutoLock mutex(mPendingEventsMutex);
+    (void)mPendingEvents.RemoveObject(notifier);
+  }
comment 4 is the right approach here I think
in fact, r=sdwilsh for a patch that does what comment 4 says, but with no newline after mCallingThread->Dispatch(...)
Attached patch v1.0 (obsolete) — Splinter Review
This is what bent said, but in patch form.  Needs bug 458756 applied to apply cleanly I suspect.
Attachment #342474 - Flags: review+
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
OS: Mac OS X → All
Hardware: PC → All
Target Milestone: --- → mozilla1.9.1b2
Shouldn't the patch also address the identical case for CompletionNotifier and the similar, but more complicated, case for ErrorNotifier?  (ErrorNotifier::Dispatch creates the ErrorNotifier instance and dispatches it before actually returning control to the AsyncExecute instance who maintains mPendingEvents.)
hmm, yeah - those need it too.
I think tinderbox is seeing this occasionally.
Assignee: bent.mozilla → sdwilsh
Whiteboard: [needs patch]
Attached patch v2.0 (obsolete) — Splinter Review
OK.  This is a much better solution.  All event dispatching now goes through Notify, and I've added NotifyError and NotifyResults.  Also renamed Complete to NotifyComplete.  There is now only one place in the code where pending events are added, and two where they are removed, making this logic much simpler to follow.
Attachment #342474 - Attachment is obsolete: true
Attachment #343473 - Flags: review?(bent.mozilla)
Comment on attachment 343473 [details] [diff] [review]
v2.0

Just to be clear...

>-          {
>-            nsAutoLock mutex(mStateMutex);
>-            mState = ERROR;
>-          }
>+          nsAutoLock mutex(mStateMutex);
>+          mState = ERROR;
This change is unrelated, but I noticed we were not holding the state mutex, which is an invariant of calling NotifyComplete.
Whiteboard: [needs patch] → [has patch][needs review bent]
Comment on attachment 343473 [details] [diff] [review]
v2.0

I would recommend using the atomic set operations rather than locking all over just to set an integer value. That would let you always lock inside the methods you'd like to rather than assuming the caller does the right thing.
(In reply to comment #13)
> (From update of attachment 343473 [details] [diff] [review])
> I would recommend using the atomic set operations rather than locking all over
> just to set an integer value. That would let you always lock inside the methods
> you'd like to rather than assuming the caller does the right thing.
I hadn't done that originally because I wasn't sure if we really have atomic reads on all platforms (I'm pretty sure we don't actually unless we use the volatile keyword, which we can't actually use with PR_AtomicSet), but I've got a patch coming with that.
Attached patch v2.1 (obsolete) — Splinter Review
Attachment #343473 - Attachment is obsolete: true
Attachment #344011 - Flags: review?(bent.mozilla)
Attachment #343473 - Flags: review?(bent.mozilla)
OK - this bug really sucks, for the record.  It is quite possible to call cancel when we've already finished.  In the past, it would throw (I'd say that is not desirable), but with the approach that bent suggested to me, it means we end up having it be quite possible to say that our reason for completion in handleCompletion is REASON_FINISHED instead of REASON_CANCELED.

I almost think that we have to keep the mState lock and throw if we try to cancel when we are not currently in a pending state (or at least atomically set on the background, and read only on the calling thread).  Regardless, the test needs to be fixed to reflect that...
But that also seems like a really bad idea.  I'm not sure what the right behavior is here...
Attached patch v3.0 (obsolete) — Splinter Review
This stuff is hard...

This has an interface change for mozIStoragePendingStatement::cancel.  It now returns a boolean indicating if it successfully canceled or not.  Discussed this at length with bent, and it makes our tests more reliable.
Attachment #344011 - Attachment is obsolete: true
Attachment #344560 - Flags: review?(bent.mozilla)
Attachment #344011 - Flags: review?(bent.mozilla)
Attachment #344560 - Flags: review?(bent.mozilla) → review-
Comment on attachment 344560 [details] [diff] [review]
v3.0

>+   * Cancels a pending statement, if possible.  This will only fail if you try
>+   * cancel more than once.

Typo, should be "try to cancel".

>     for (PRUint32 i = 0; i < mStatements.Length(); i++) {

So, for the record, I think this while-within-a-for-loop construct is tough to follow and that it could be simplified by breaking the while loop into a separate function. I further think that would let you use the scoped locks more effectively. But since we're tight on time and it looks like you've done everything correctly I guess that can wait for a followup.

>+#ifdef DEBUG
>+    PRBool onCallingThread = PR_FALSE;
>+    (void)mCallingThread->IsOnCurrentThread(&onCallingThread);
>+    NS_ASSERTION(onCallingThread, "Not canceling from the calling thread!");
>+#endif

Handy! Maybe file a followup about adding these in more places :)

>+    // If we have already canceled, we have an error, but always indicate that
>+    // we are trying to cancel.
>+    NS_ENSURE_FALSE(PR_AtomicSet(&mTryToCancel, PR_TRUE), NS_ERROR_UNEXPECTED);

I'd avoid calling functions within any of the nsDebug macros. Just make a stack variable.

And don't use the PR_AtomicSet on a PRBool. That will break when PRBool and PRInt32 are no longer typedef'd to the same thing.

>+    // Establish if we can cancel
>     {
>+      nsAutoLock mutex(mLock);
>+      *_successful = (mState == PENDING);
>     }

Your cancel could have already succeeded, need to also check mState == CANCELED. Or move the set for mTryToCancel inside the mutex after you've tested.

>+    nsAutoLock::DestroyLock(mLock);
>+    mLock = nsnull;

Nulling this out in the destructor isn't necessary

>     if (mCallback) {
>       nsRefPtr<CompletionNotifier> completionEvent =
>+        new CompletionNotifier(mCallback, mState);
>+      NS_ENSURE_TRUE(completionEvent, NS_ERROR_OUT_OF_MEMORY);
>+
>+      (void)mCallingThread->Dispatch(completionEvent, NS_DISPATCH_NORMAL);
> 
>       // We no longer own mCallback (the CompletionNotifier takes ownership).
>       mCallback = nsnull;

I'd move this above the dispatch, then, to avoid any funny races.

>+  nsresult NotifyError(PRInt32 aErrorCode, const char *aMessage)

I can't find a single place where you care about the return value in any of the Notify functions except NotifyComplete... You're always discarding the return. Maybe make them return void?

I'd also document that all these Notify functions expect to *not* have the lock in place when called.

And I was thinking about it, it would probably be trivial to subclass nsAutoLock in debug builds to expose the locked/unlocked status so you could assert that. Followup fodder.

>+  PRBool mTryToCancel;

I think I like 'mCancelRequested' or 'mCancelAttempted' better... Your call though.

>+   * This is the lock that protects our state from changing.

Maybe list the variables it protects instead since that's pretty vague.
Attached patch v3.1Splinter Review
Addresses review comments
Attachment #344560 - Attachment is obsolete: true
Attachment #344679 - Flags: review?(bent.mozilla)
er...with the exception of the nulling out in the destructor which I've fixed locally.
Comment on attachment 344679 [details] [diff] [review]
v3.1

>+      // We need to indicate that we want to try and cancel now.
>+      mCancelRequested = PR_TRUE;
>+
>+      // Establish if we can cancel
>+      *_successful = (mState == PENDING);

Add some comments here about why you're canceling the previously queued results even if you're going to return false.
Attachment #344679 - Flags: review?(bent.mozilla) → review+
http://hg.mozilla.org/mozilla-central/rev/d3cdb61f7d29
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Whiteboard: [has patch][needs review bent]
It looks like CompletionNotifier::cancel was accidentally left around?
huh, missed one then.  Could you please file a bug to remove that?
(In reply to comment #26)

I filed bug 461902.
So, I do not think we managed to get this race condition nailed out given a recent tinderbox failure:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1225387432.1225397005.17478.gz#err0

The test in question is setup to be fault tolerant of our reason being either finished or canceled.  It's that, or we've managed to say we can cancel, but do not properly update state somewhere.
I know what's up with comment 28 - I'll file a bug about it later.  For now there is a sticky note on my desk on what to do.
Whiteboard: [orange]
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: