Closed Bug 458998 Opened 16 years ago Closed 16 years ago

AsyncExecute helper classes can complete() before they are added to mPendingEvents

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla1.9.1b2

People

(Reporter: asuth, Assigned: sdwilsh)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 4 obsolete files)

v1.0 16 years ago Shawn Wilsher :sdwilsh 837 bytes, patch	sdwilsh : review+	Details \| Diff \| Splinter Review
v2.0 16 years ago Shawn Wilsher :sdwilsh 11.03 KB, patch		Details \| Diff \| Splinter Review
v2.1 16 years ago Shawn Wilsher :sdwilsh 14.05 KB, patch		Details \| Diff \| Splinter Review
v3.0 16 years ago Shawn Wilsher :sdwilsh 23.11 KB, patch	bent.mozilla : review-	Details \| Diff \| Splinter Review
v3.1 16 years ago Shawn Wilsher :sdwilsh 23.91 KB, patch	bent.mozilla : review+	Details \| Diff \| Splinter Review

Andrew Sutherland [:asuth] (he/him)

Reporter

Description

•

16 years ago

davida has seen the AsyncExecute destructor's NS_ASSERTION(mPendingEvents.Count() == 0, "Still pending events!") fire a bunch of times in succession.  Based on the fact that there was high latency at the time I am inclined to believe that a full GC with cycle collector was happening.  I assume some AsyncExecutes had completed execution, so were no longer directly alive, but were being sustained by a CompletionNotifier's reference count.  Based on my understanding of the rather undocumented nsRefPtr's source, I doubt that the cycle collector is able to follow the nsRefPtr link from the 'live' CompletionNotifier to the AsyncExecute.

I don't have ironclad proof of this, but my previous auditing of the code had suggested it was otherwise impossible for this to happen without memory corruption.

Shawn Wilsher :sdwilsh

Assignee

Comment 1

•

16 years ago

We shouldn't be touched by the cycle collector because we don't tell it about us.  I'm quite confused as to how we managed to have mPendingEvents.Count() be greater non-zero when we hit the constructor.  Every single pending event we post ends up holding a reference to the AsyncExecute object...

Andrew Sutherland [:asuth] (he/him)

Reporter

Comment 2

•

16 years ago

So, I never realized that mCallingThread->Dispatch is called before the mPendingEventsMutex is taken and mPendingEvents.AppendObject is called.

I had the debugger trigger on the assertion.  mPendingEvents has 3 completed CallbackResultNotifiers in it (they are not canceled and mCompletionNotifier was nulled out).  Each of their reference counts was 1.  This suggests that they must have run to completion before AsyncExecute got a chance to add them to mPendingEvents.

Is there any reason not to move the call to dispatch inside the mutex?

ps: Apologies for crying wolf, I confess I was hoping to get the problem solved for me without firing up a debugger or taking a break to properly read up on the cycle collector ;).  (It turns out I only had to read a few lines, of course...)

Summary: Notifier use of nsRefPtr rather than nsCOMPtr can result in live AsyncExecute being cycled collected → AsyncExecute helper classes can complete() before they are added to mPendingEvents

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 3

•

16 years ago

(In reply to comment #2)
> Is there any reason not to move the call to dispatch inside the mutex?

Yes, you shouldn't call out to other modules while holding a lock as it is a deadlock hazard. But there's definitely a race here that needs to be addressed.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 4

•

16 years ago

Maybe as simple as this?

+  nsRefPtr<CallbackResultNotifier> notifier = ...
+  {
+    nsAutoLock mutex(mPendingEventsMutex);
+    (void)mPendingEvents.AppendObject(notifier);
+  }
+
+  nsresult status = mCallingThread->Dispatch(notifier, NS_DISPATCH_NORMAL);
+
+  if (NS_FAILED(status)) {
+    nsAutoLock mutex(mPendingEventsMutex);
+    (void)mPendingEvents.RemoveObject(notifier);
+  }

Shawn Wilsher :sdwilsh

Assignee

Comment 5

•

16 years ago

comment 4 is the right approach here I think

Shawn Wilsher :sdwilsh

Assignee

Comment 6

•

16 years ago

in fact, r=sdwilsh for a patch that does what comment 4 says, but with no newline after mCallingThread->Dispatch(...)

Shawn Wilsher :sdwilsh

Assignee

Comment 7

•

16 years ago

Attached patch v1.0 (obsolete) — Details — Splinter Review

This is what bent said, but in patch form.  Needs bug 458756 applied to apply cleanly I suspect.

Attachment #342474 - Flags: review+

Shawn Wilsher :sdwilsh

Assignee

Updated

•

16 years ago

Assignee: nobody → bent.mozilla

Status: NEW → ASSIGNED

OS: Mac OS X → All

Hardware: PC → All

Target Milestone: --- → mozilla1.9.1b2

Andrew Sutherland [:asuth] (he/him)

Reporter

Comment 8

•

16 years ago

Shouldn't the patch also address the identical case for CompletionNotifier and the similar, but more complicated, case for ErrorNotifier?  (ErrorNotifier::Dispatch creates the ErrorNotifier instance and dispatches it before actually returning control to the AsyncExecute instance who maintains mPendingEvents.)

Shawn Wilsher :sdwilsh

Assignee

Comment 9

•

16 years ago

hmm, yeah - those need it too.

Shawn Wilsher :sdwilsh

Assignee

Comment 10

•

16 years ago

I think tinderbox is seeing this occasionally.

Assignee: bent.mozilla → sdwilsh

Shawn Wilsher :sdwilsh

Assignee

Updated

•

16 years ago

Whiteboard: [needs patch]

Shawn Wilsher :sdwilsh

Assignee

Comment 11

•

16 years ago

Attached patch v2.0 (obsolete) — Details — Splinter Review

OK.  This is a much better solution.  All event dispatching now goes through Notify, and I've added NotifyError and NotifyResults.  Also renamed Complete to NotifyComplete.  There is now only one place in the code where pending events are added, and two where they are removed, making this logic much simpler to follow.

Attachment #342474 - Attachment is obsolete: true

Attachment #343473 - Flags: review?(bent.mozilla)

Shawn Wilsher :sdwilsh

Assignee

Comment 12

•

16 years ago

Comment on attachment 343473 [details] [diff] [review]
v2.0

Just to be clear...

>-          {
>-            nsAutoLock mutex(mStateMutex);
>-            mState = ERROR;
>-          }
>+          nsAutoLock mutex(mStateMutex);
>+          mState = ERROR;
This change is unrelated, but I noticed we were not holding the state mutex, which is an invariant of calling NotifyComplete.

Shawn Wilsher :sdwilsh

Assignee

Updated

•

16 years ago

Whiteboard: [needs patch] → [has patch][needs review bent]

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 13

•

16 years ago

Comment on attachment 343473 [details] [diff] [review]
v2.0

I would recommend using the atomic set operations rather than locking all over just to set an integer value. That would let you always lock inside the methods you'd like to rather than assuming the caller does the right thing.

Shawn Wilsher :sdwilsh

Assignee

Comment 14

•

16 years ago

(In reply to comment #13)
> (From update of attachment 343473 [details] [diff] [review])
> I would recommend using the atomic set operations rather than locking all over
> just to set an integer value. That would let you always lock inside the methods
> you'd like to rather than assuming the caller does the right thing.
I hadn't done that originally because I wasn't sure if we really have atomic reads on all platforms (I'm pretty sure we don't actually unless we use the volatile keyword, which we can't actually use with PR_AtomicSet), but I've got a patch coming with that.

Shawn Wilsher :sdwilsh

Assignee

Comment 15

•

16 years ago

Attached patch v2.1 (obsolete) — Details — Splinter Review

Attachment #343473 - Attachment is obsolete: true

Attachment #344011 - Flags: review?(bent.mozilla)

Attachment #343473 - Flags: review?(bent.mozilla)

Serge Gautherie (:sgautherie)

Updated

•

16 years ago

Blocks: 438871

Shawn Wilsher :sdwilsh

Assignee

Comment 17

•

16 years ago

OK - this bug really sucks, for the record.  It is quite possible to call cancel when we've already finished.  In the past, it would throw (I'd say that is not desirable), but with the approach that bent suggested to me, it means we end up having it be quite possible to say that our reason for completion in handleCompletion is REASON_FINISHED instead of REASON_CANCELED.

I almost think that we have to keep the mState lock and throw if we try to cancel when we are not currently in a pending state (or at least atomically set on the background, and read only on the calling thread).  Regardless, the test needs to be fixed to reflect that...

Shawn Wilsher :sdwilsh

Assignee

Comment 18

•

16 years ago

But that also seems like a really bad idea.  I'm not sure what the right behavior is here...

Shawn Wilsher :sdwilsh

Assignee

Comment 19

•

16 years ago

Attached patch v3.0 (obsolete) — Details — Splinter Review

This stuff is hard...

This has an interface change for mozIStoragePendingStatement::cancel.  It now returns a boolean indicating if it successfully canceled or not.  Discussed this at length with bent, and it makes our tests more reliable.

Attachment #344011 - Attachment is obsolete: true

Attachment #344560 - Flags: review?(bent.mozilla)

Attachment #344011 - Flags: review?(bent.mozilla)

Ben Turner (not reading bugmail, use the needinfo flag!)

Updated

•

16 years ago

Attachment #344560 - Flags: review?(bent.mozilla) → review-

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 20

•

16 years ago

Comment on attachment 344560 [details] [diff] [review]
v3.0

>+   * Cancels a pending statement, if possible.  This will only fail if you try
>+   * cancel more than once.

Typo, should be "try to cancel".

>     for (PRUint32 i = 0; i < mStatements.Length(); i++) {

So, for the record, I think this while-within-a-for-loop construct is tough to follow and that it could be simplified by breaking the while loop into a separate function. I further think that would let you use the scoped locks more effectively. But since we're tight on time and it looks like you've done everything correctly I guess that can wait for a followup.

>+#ifdef DEBUG
>+    PRBool onCallingThread = PR_FALSE;
>+    (void)mCallingThread->IsOnCurrentThread(&onCallingThread);
>+    NS_ASSERTION(onCallingThread, "Not canceling from the calling thread!");
>+#endif

Handy! Maybe file a followup about adding these in more places :)

>+    // If we have already canceled, we have an error, but always indicate that
>+    // we are trying to cancel.
>+    NS_ENSURE_FALSE(PR_AtomicSet(&mTryToCancel, PR_TRUE), NS_ERROR_UNEXPECTED);

I'd avoid calling functions within any of the nsDebug macros. Just make a stack variable.

And don't use the PR_AtomicSet on a PRBool. That will break when PRBool and PRInt32 are no longer typedef'd to the same thing.

>+    // Establish if we can cancel
>     {
>+      nsAutoLock mutex(mLock);
>+      *_successful = (mState == PENDING);
>     }

Your cancel could have already succeeded, need to also check mState == CANCELED. Or move the set for mTryToCancel inside the mutex after you've tested.

>+    nsAutoLock::DestroyLock(mLock);
>+    mLock = nsnull;

Nulling this out in the destructor isn't necessary

>     if (mCallback) {
>       nsRefPtr<CompletionNotifier> completionEvent =
>+        new CompletionNotifier(mCallback, mState);
>+      NS_ENSURE_TRUE(completionEvent, NS_ERROR_OUT_OF_MEMORY);
>+
>+      (void)mCallingThread->Dispatch(completionEvent, NS_DISPATCH_NORMAL);
> 
>       // We no longer own mCallback (the CompletionNotifier takes ownership).
>       mCallback = nsnull;

I'd move this above the dispatch, then, to avoid any funny races.

>+  nsresult NotifyError(PRInt32 aErrorCode, const char *aMessage)

I can't find a single place where you care about the return value in any of the Notify functions except NotifyComplete... You're always discarding the return. Maybe make them return void?

I'd also document that all these Notify functions expect to *not* have the lock in place when called.

And I was thinking about it, it would probably be trivial to subclass nsAutoLock in debug builds to expose the locked/unlocked status so you could assert that. Followup fodder.

>+  PRBool mTryToCancel;

I think I like 'mCancelRequested' or 'mCancelAttempted' better... Your call though.

>+   * This is the lock that protects our state from changing.

Maybe list the variables it protects instead since that's pretty vague.

Shawn Wilsher :sdwilsh

Assignee

Comment 21

•

16 years ago

Attached patch v3.1 — Details — Splinter Review

Addresses review comments

Attachment #344560 - Attachment is obsolete: true

Attachment #344679 - Flags: review?(bent.mozilla)

Shawn Wilsher :sdwilsh

Assignee

Comment 22

•

16 years ago

er...with the exception of the nulling out in the destructor which I've fixed locally.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 23

•

16 years ago

Comment on attachment 344679 [details] [diff] [review]
v3.1

>+      // We need to indicate that we want to try and cancel now.
>+      mCancelRequested = PR_TRUE;
>+
>+      // Establish if we can cancel
>+      *_successful = (mState == PENDING);

Add some comments here about why you're canceling the previously queued results even if you're going to return false.

Attachment #344679 - Flags: review?(bent.mozilla) → review+

Shawn Wilsher :sdwilsh

Assignee

Comment 24

•

16 years ago

http://hg.mozilla.org/mozilla-central/rev/d3cdb61f7d29

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Whiteboard: [has patch][needs review bent]

Andrew Sutherland [:asuth] (he/him)

Reporter

Comment 25

•

16 years ago

It looks like CompletionNotifier::cancel was accidentally left around?

Shawn Wilsher :sdwilsh

Assignee

Comment 26

•

16 years ago

huh, missed one then.  Could you please file a bug to remove that?

Serge Gautherie (:sgautherie)

Updated

•

16 years ago

Blocks: 461902

Serge Gautherie (:sgautherie)

Comment 27

•

16 years ago

(In reply to comment #26)

I filed bug 461902.

Shawn Wilsher :sdwilsh

Assignee

Comment 28

•

16 years ago

So, I do not think we managed to get this race condition nailed out given a recent tinderbox failure:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1225387432.1225397005.17478.gz#err0

The test in question is setup to be fault tolerant of our reason being either finished or canceled.  It's that, or we've managed to say we can cancel, but do not properly update state somewhere.

Shawn Wilsher :sdwilsh

Assignee

Comment 29

•

16 years ago

I know what's up with comment 28 - I'll file a bug about it later.  For now there is a sticky note on my desk on what to do.

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

15 years ago

Whiteboard: [orange]

Nobody; OK to take it and work on it

Updated

•

12 years ago

Keywords: intermittent-failure

Nobody; OK to take it and work on it

Updated

•

12 years ago

Whiteboard: [orange]

You need to log in before you can comment on or make changes to this bug.