Open Bug 1334383 Opened 3 years ago Updated 2 months ago

Intermittent application crashed [@ TimerThread::Run]

Categories

(Core :: DOM: Core & HTML, defect, P5)

defect

Tracking

()

ASSIGNED

People

(Reporter: intermittent-bug-filer, Assigned: froydnj)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Crash Data

Attachments

(3 files)

This is crashing on the nsTimer assertion requiring timers to be released on their target thread:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#509

We're hitting this because Workers now use an nsIEventTarget that can fail dispatch if the worker has shutdown.

The assertion seems to ignore the fact, though, that timers can be canceled.  The WorkerPrivate is gone and can't take a dispatch any more, but before it gets to this state it cancels its timers.  Presumably the cancel is triggered just after we start the machinery to do this dispatch on TimerThread.

It seems like we have a couple different options:

1) Relax the assertion in the case where the timer was canceled.
2) Add more locking so its impossible for a cancel to occur while we are in the middle of trying to dispatch the timer.

Nathan, what do you think?
Flags: needinfo?(nfroyd)
So the racy sequence is:

WorkerThread: Cancel timer
TimerThread: Start the process of trying to dispatch the timer
WorkerThread: Release the timer reference
TimerThread: Try to dispatch timer, but WorkerThread is not accepting new runnables
TimerThread: Release the timer reference, and discover we're on the wrong thread

or something to that effect, is that correct?  I am not in favor of relaxing that assertion, because we don't want things being destroyed on the timer thread in general.  But I'm having a hard time seeing what more locking we could add to try to avoid this situation.  Can you explain?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #2)
> WorkerThread: Cancel timer
> TimerThread: Start the process of trying to dispatch the timer
> WorkerThread: Release the timer reference
> TimerThread: Try to dispatch timer, but WorkerThread is not accepting new
> runnables
> TimerThread: Release the timer reference, and discover we're on the wrong
> thread

Yes, exactly.  Although maybe its more like the first two are swapped:

TimerThread: Start the process of trying to dispatch the timer
WorkerThread: Cancel timer
WorkerThread: Release the timer reference
TimerThread: Try to dispatch timer, but WorkerThread is not accepting new runnables
TimerThread: Release the timer reference, and discover we're on the wrong thread

TimerThread could lock a mutex specific to the timer in step 1.  That would make WorkerThread block when it tries to cancel the timer.

Right now it seems Cancel() is busted.  It should guarantee that either:

1) The timer is canceled and will not be dispatched.
2) The timer was already dispatched and the Cancel is a no-op.

Currently it seems its possible to reach a third possibility:

3) The timer is canceled but nsTimerThread still needs to dispatch it.
Flags: needinfo?(nfroyd)
> 1) The timer is canceled and will not be dispatched.
> 2) The timer was already dispatched and the Cancel is a no-op.

I guess these should really be:

1) The timer is canceled, will not be dispatched, and timer thread has dropped its ref.
2) The timer was already dispatched, dropped its ref, and the Cancel is a no-op.
(In reply to Ben Kelly [:bkelly] from comment #3)
> (In reply to Nathan Froyd [:froydnj] from comment #2)
> > WorkerThread: Cancel timer
> > TimerThread: Start the process of trying to dispatch the timer
> > WorkerThread: Release the timer reference
> > TimerThread: Try to dispatch timer, but WorkerThread is not accepting new
> > runnables
> > TimerThread: Release the timer reference, and discover we're on the wrong
> > thread
> 
> Yes, exactly.  Although maybe its more like the first two are swapped:
> 
> TimerThread: Start the process of trying to dispatch the timer
> WorkerThread: Cancel timer
> WorkerThread: Release the timer reference
> TimerThread: Try to dispatch timer, but WorkerThread is not accepting new
> runnables
> TimerThread: Release the timer reference, and discover we're on the wrong
> thread
> 
> TimerThread could lock a mutex specific to the timer in step 1.  That would
> make WorkerThread block when it tries to cancel the timer.

How long do you want that mutex to be locked?  Until the target thread accepts the Dispatch() of the timer runnable?  Locking around Dispatch() seems to be asking for bad things to happen, though maybe this is safe with the current worker code...

> Right now it seems Cancel() is busted.

I *think* bug 1328643 addresses this concern.

Alternatively, maybe the worker should cancel all its timers *before* it stops accepting events?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #5)
> Alternatively, maybe the worker should cancel all its timers *before* it
> stops accepting events?

I believe this is what we do.  We cancel our GC timer in the worker Killing state here:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#4692

And we accept control runnables for dispatch until we are in the Dead state here:

https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#2550

Now, we do immediately set the state to Dead after cancelling the timer runnable, but the worker is accepting dispatches up until right after we call Cancel().
If nsTimerImpl::Cancel() dropped the callback references then perhaps we could just assert that the callbacks refs were released on the target thread instead of asserting where the nsTimerImpl itself is released.
Actually, we do release the callback in Cancel():

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#318

Nathan, what do you think about asserting we are on the target thread here:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.h#74

And then dropping the assertion in comment 1.

I think the really important thing is that the caller's callback objects are released on the target thread.  It doesn't matter so much where the nsTimerImpl itself is released.
See proposal in comment 8.  ^^^
Flags: needinfo?(nfroyd)
(In reply to Ben Kelly [:bkelly] from comment #8)
> Nathan, what do you think about asserting we are on the target thread here:
> 
> https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.h#74

I think this is a fine idea.  I am a little skeptical about us consistently calling Cancel on the target thread, though, see bug 1328643.  I'm pretty sure a short:

1. grep for SetTarget calls;
2. check for Cancel calls on said objects
3. find inconsistencies between where Cancel should be called and where it is called

exercise would be fruitful in finding bugs.

> I think the really important thing is that the caller's callback objects are
> released on the target thread.  It doesn't matter so much where the
> nsTimerImpl itself is released.

Yes.
Flags: needinfo?(nfroyd)
One thing we could try to do is avoid this unlock in TimerThread::PostTimerEvent():

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#716

The comments suggest the only reason we unlock around the nsIEventTarget::Dispatch() is to avoid deadlock on self dispatch.  We could do a nsIEventTarget::IsOnCurrentThread() and only unlock if it returns true.

Nathan, what do you think?
Flags: needinfo?(nfroyd)
I keep losing this bug and have to search for it, so assigning to myself.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #12)
> One thing we could try to do is avoid this unlock in
> TimerThread::PostTimerEvent():
> 
> https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.
> cpp#716
> 
> The comments suggest the only reason we unlock around the
> nsIEventTarget::Dispatch() is to avoid deadlock on self dispatch.  We could
> do a nsIEventTarget::IsOnCurrentThread() and only unlock if it returns true.
> 
> Nathan, what do you think?

I know I reviewed the patch that added that comment, but I'm unsure the problem it's describing exists: the timer can't be targeted at the timer thread, because the timer thread object itself is not nsIEventTarget, and even the mThread member isn't really a generally-accessible nsIEventTarget.  Even if you were dispatching at the mThread member, the event wouldn't run, because mThread isn't exactly running a normal event loop.

So your proposed scheme seems almost safe, except...We might deadlock if something's Dispatch() method attempted to re-enter the TimerThread code by adding or removing timers, which doesn't seem so far-fetched.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #14)
> So your proposed scheme seems almost safe, except...We might deadlock if
> something's Dispatch() method attempted to re-enter the TimerThread code by
> adding or removing timers, which doesn't seem so far-fetched.

An nsIEventTarget::Dispatch() that tries to use nsITimer again?  That seems very strange to me for an event target.  Can we just document that code shouldn't do that?
Also it seems we could make nsITimer::Init() return an error if its called on the timer thread.  What do you think?
Flags: needinfo?(nfroyd)
See Also: 1266747
(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #16)
> Also it seems we could make nsITimer::Init() return an error if its called
> on the timer thread.  What do you think?

I think this is completely reasonable.  I am less enthused about trying to restrict what you can do in nsIEventTarget::Dispatch.
Flags: needinfo?(nfroyd)
Crash Signature: [@ TimerThread::Run]
Summary: Intermittent dom/tests/mochitest/fetch/test_fetch_basic_http_sw_empty_reroute.html | application crashed [@ TimerThread::Run] → Intermittent application crashed [@ TimerThread::Run]
Duplicate of this bug: 1334701
Duplicate of this bug: 1338341
Duplicate of this bug: 1341814
Duplicate of this bug: 1344641
Duplicate of this bug: 1349384
Duplicate of this bug: 1351592
Duplicate of this bug: 1352587
Duplicate of this bug: 1354257
Byron, this bug (and the numerous dupes floating around out there) are really frequent. Any chance you could take a look?
Flags: needinfo?(docfaraday)
I know what this is and its on my list.  Its been a low priority for me because it didn't look that frequent and I believe its a DEBUG-only thing.  I don't think the assertions are actually a problem for release build in this case.

I'll try to get to this in FF55 time frame.
Flags: needinfo?(docfaraday)
I have a patch for this on another bug, let me link.
Duplicate of this bug: 1355315
Should I go ahead and move the patch over here?
Flags: needinfo?(bkelly)
(In reply to Byron Campen [:bwc] from comment #41)
> Should I go ahead and move the patch over here?

So, I talked to Nathan about loosening the assertion if Dispatch() failed there after a cancel.  He didn't like it previously.

I think we need to extend the lock over the Dispatch() call so that the Worker thread can't die while a timer is trying to dispatch to it.  This needs probably some extra flag or something on the timer target that says "lock over this target...".  Maybe a QI'able interface on the event target.
Flags: needinfo?(bkelly)
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #42)
> (In reply to Byron Campen [:bwc] from comment #41)
> > Should I go ahead and move the patch over here?
> 
> So, I talked to Nathan about loosening the assertion if Dispatch() failed
> there after a cancel.  He didn't like it previously.
> 
> I think we need to extend the lock over the Dispatch() call so that the
> Worker thread can't die while a timer is trying to dispatch to it.  This
> needs probably some extra flag or something on the timer target that says
> "lock over this target...".  Maybe a QI'able interface on the event target.

I don't see the value of doing this. If a thread stops accepting runnables while a timer is about to dispatch, so what?
(In reply to Byron Campen [:bwc] from comment #43)
> I don't see the value of doing this. If a thread stops accepting runnables
> while a timer is about to dispatch, so what?

Well, if Cancel() was not called on the nsITimer then this is a definite bug.  Many nsITimer callbacks and closures are not thread safe to release.  So we must ensure it gets released on the target thread.

For these particular crashes, though, the worker code is calling Cancel().  So I don't think this particular case is a problem for release builds.  We can't just relax the assertions, though, or other real problems would no longer be caught.
I will try to write a patch for this today.
Comment on attachment 8873028 [details] [diff] [review]
P1 Add the nsILockable mix-in interface. r=froydnj

Add a mix-in interface that nsIEventTargets can implement to signal that Dispatch() can be called while a lock is held.
Attachment #8873028 - Flags: review?(nfroyd)
Comment on attachment 8873029 [details] [diff] [review]
P2 Make TimerThread::PostTimerEvent() keep its lock if the event target implements nsILockable. r=froydnj

This patch makes TimerThread call Dispatch() with the lock held if the target implements nsILockable.
Attachment #8873029 - Flags: review?(nfroyd)
Comment on attachment 8873030 [details] [diff] [review]
P3 Make WorkerControlEventTarget implement nsILockable. r=froydnj

Make the Worker event target used for timers implement nsILockable.  It does not call into nsITimer directly, so deadlock is not possible.

I believe this should fix the assertions.  If TimerThread starts processing a worker nsITimer it will complete the Dispatch() before Cancel() can be called.  If the worker shuts down before the TimerThread gets to the timer, then Cancel() will be called and TimerThread will ignore it.
Attachment #8873030 - Flags: review?(nfroyd)
Duplicate of this bug: 1359152
Duplicate of this bug: 1354258
Duplicate of this bug: 1352591
Duplicate of this bug: 1351852
Duplicate of this bug: 1351451
Duplicate of this bug: 1351128
Duplicate of this bug: 1349768
Duplicate of this bug: 1349371
Duplicate of this bug: 1341381
Duplicate of this bug: 1339984
Duplicate of this bug: 1338710
Duplicate of this bug: 1337579
Duplicate of this bug: 1336250
Duplicate of this bug: 1336237
Duplicate of this bug: 1334745
Comment on attachment 8873028 [details] [diff] [review]
P1 Add the nsILockable mix-in interface. r=froydnj

Review of attachment 8873028 [details] [diff] [review]:
-----------------------------------------------------------------

This whole intermittent and its solution is just ugly, there's no way around it.  But I think a generic nsILockable interface promises too much: safe to invoke all of its methods while holding a lock?  That's a pretty tall order, and we'd have to QI just to make sure that we have one of these things, which is relatively expensive.

I think the better way forward is to define an infallible noscript method on nsIEventTarget like isWorkerControlThread, which is more targeted and more efficient.  It'd be ideal if we could implement this proposed method in just one place, but I don't see a good way for that to work.  Ideally the implementation using that method should be obvious; ping me on IRC if it's not.
Attachment #8873028 - Flags: review?(nfroyd) → review-
Attachment #8873029 - Flags: review?(nfroyd)
Attachment #8873030 - Flags: review?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #71)
> I think the better way forward is to define an infallible noscript method on
> nsIEventTarget like isWorkerControlThread, which is more targeted and more
> efficient.  It'd be ideal if we could implement this proposed method in just
> one place, but I don't see a good way for that to work.  Ideally the
> implementation using that method should be obvious; ping me on IRC if it's
> not.

I really disagree about making TimerThread look for something worker specific.  This is not worker specific.  There is nothing that stops other event targets from entering a shutdown state.

For example, this same problem can occur doing:

1. Create some private nsThread for a feature
2. Setup a timer targeting that nsThread
3. Feature calls nsITimer::Cancel() and then shutsdown the thread so nsThread::mEventsAreDoomed is set to true.
4. The nsThread::Dispatch() will start returning failure.

The TimerThread either needs to:

1. Lock around Dispatch() to avoid letting the Cancel() through
2. Accept that you might fail to dispatch an nsTimer if its Cancel() method has been called

Personally I think (2) would be fine and we could just loosen the assertion to permit canceled nsTimers.  I seem to recall you don't like that idea, though.

Can you suggest something you would accept that is not Worker specific?
Flags: needinfo?(nfroyd)
I would also be happy to rename my nsILockable interface to nsIDispatchLockableByTimerThread or something else specific.
Moving this over to Nathan as the xpcom owner.  I'm happy to help here, but I think he needs to determine which was he wants this to go first.  Just didn't want to leave myself assigned since I'm not actively working it at the moment.
Assignee: bkelly → nfroyd
Duplicate of this bug: 1374866
Duplicate of this bug: 1393991
See Also: → 1405225
Duplicate of this bug: 1410058
Duplicate of this bug: 1442320
Duplicate of this bug: 1443570
Duplicate of this bug: 1446147
Duplicate of this bug: 1446590
Duplicate of this bug: 1447514
Duplicate of this bug: 1448651
Duplicate of this bug: 1450426
Duplicate of this bug: 1452162
Duplicate of this bug: 1452318
Duplicate of this bug: 1453106
Duplicate of this bug: 1460169
Duplicate of this bug: 1462205
Priority: -- → P5
Component: DOM → DOM: Core & HTML
Duplicate of this bug: 1531624
Duplicate of this bug: 1532259
Duplicate of this bug: 1535495
Duplicate of this bug: 1535930
Duplicate of this bug: 1536588
Duplicate of this bug: 1538490
Duplicate of this bug: 1541082
Duplicate of this bug: 1544235
Duplicate of this bug: 1545209
Duplicate of this bug: 1546909
Crash Signature: [@ TimerThread::Run] → [@ TimerThread::Run] [@ TimerThread::Run()]
Crash Signature: [@ TimerThread::Run] [@ TimerThread::Run()] → [@ TimerThread::Run] [@ TimerThread::Run()]
Flags: needinfo?(nfroyd)
Duplicate of this bug: 1551406
Duplicate of this bug: 1516152
Duplicate of this bug: 1561816
You need to log in before you can comment on or make changes to this bug.