Assertion failure: rc != 0 (destroyed timer off its target thread!) in simplecall-no-ssrcs.https.html WPT
Categories
(Core :: XPCOM, defect, P2)
Tracking
()
People
(Reporter: ng, Assigned: pehrsons)
References
Details
Attachments
(4 files, 1 obsolete file)
Assignee | ||
Comment 1•3 years ago
•
|
||
We fail the assert after the timer has fired but has failed to dispatch to the target thread. Might be in shutdown then. This should be fine as the user of the timer should do the final release when the timer is unarmed, or so I read the nsTimerImpl code. Could this be some race between firing the timer on the TimerThread and canceling+releasing (sync) it on some other thread? This AutoUnlock could be involved.
I'll try to figure out which timer this is, and maybe we can reason more about the usage of it.
Assignee | ||
Comment 2•3 years ago
|
||
It also comes in this shape for non-debug builds.
Then you just see this on treeherder:
PROCESS-CRASH | /webrtc/RTCRtpReceiver-getSynchronizationSources.https.html | application crashed [@ nsThread::ProcessNextEvent(bool, bool*)]
But the log follows up with
Mozilla crash reason: MOZ_ASSERT(rc != 0) (destroyed timer off its target thread!)
Assignee | ||
Comment 3•3 years ago
|
||
It's a DelayedRunnable. That's a bit of a blanket term. I'll try to figure out which DelayedRunnable.
Assignee | ||
Comment 4•3 years ago
|
||
One coming from a webrtc::QueuedTask
. Makes sense. I'll see if I can reason my way through this.
Assignee | ||
Comment 5•3 years ago
|
||
I think this could happen if:
- DelayedDispatch() inits the timer
- The same DelayedDispatch() then does a tail dispatch
- The TaskQueue is shut down (other thread) before the tail dispatching task has completed (so dispatching the tail fails)
Now the tail dispatcher drops the RefPtr to the DelayedRunnable (and the timer), and when the timer fires it fails dispatching the notification and releases the timer on the timer thread, and trips the assert that makes up this bug.
We should fix this in xpcom.
Assignee | ||
Comment 6•3 years ago
|
||
The hypothesis in comment 5 didn't yield any results.
But with some extra logging across DelayedRunnable, TaskQueue and friends (see treeherder) I get:
[task 2021-09-27T13:15:21.581Z] 13:15:21 INFO - GECKO(1599) | BeginShutdown TaskQueue 0x117b03f90
[task 2021-09-27T13:15:21.582Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesLocked TaskQueue 0x117b03f90
[task 2021-09-27T13:15:21.582Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesImpl TaskQueue 0x117b03f90 runnable 0x118dc7580
[task 2021-09-27T13:15:21.583Z] 13:15:21 INFO - GECKO(1599) | CancelTimer DelayedRunnable 0x118dc7580 mTimer 0x1175809e0
[task 2021-09-27T13:15:21.583Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesImpl TaskQueue 0x117b03f90 runnable 0x118dc7380
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | CancelTimer DelayedRunnable 0x118dc7380 mTimer 0x10105fe60
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | ~DelayedRunnable 0x118dc7580 mTimer 0x1175809e0
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | ~DelayedRunnable 0x118dc7380 mTimer 0x10105fe60
[task 2021-09-27T13:15:21.585Z] 13:15:21 INFO - GECKO(1599) | [Child 1600, Timer] WARNING: ##### Failed dispatch due to shutdown: file /builds/worker/checkouts/gecko/xpcom/threads/TaskQueue.cpp:49
[task 2021-09-27T13:15:21.586Z] 13:15:21 INFO - GECKO(1599) | MaybeResolveShutdown
[task 2021-09-27T13:15:21.586Z] 13:15:21 INFO - GECKO(1599) | ##### Baddie timer 'TaskQueueWrapper::CreateTaskRunner (webrtc::QueuedTask)' 0x0
[task 2021-09-27T13:15:21.587Z] 13:15:21 INFO - GECKO(1599) | Assertion failure: rc != 0 (destroyed timer off its target thread!), at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:463
Not by any means verified, but this smells like a race between canceling (on the target) and firing (on TimerThread) a timer.
Assignee | ||
Comment 7•3 years ago
|
||
There's a possible race where that Cancel() on the target Removes the timer while the TimerThread unlocks the monitor in PostTimerEvent (the one that fails to post). Then we end up releasing off-target although we've done our deed and canceled on-target.
It would be great if we could avoid that unlock. Byron, looks like you added this. Do you recall why that Dispatch could lock the monitor? That path doesn't seem obvious to me.
Comment 8•3 years ago
|
||
That unlock is there because there is (was?) code that makes the TimerThread itself a target of some timers, and it re-locks when trying to post. Let me figure out whether this horrible situation is still the case.
Comment 9•3 years ago
|
||
So, commenting out that unlock does not seem to make all hell break loose for mochitest:
https://treeherder.mozilla.org/jobs?repo=try&revision=4baa8883da973b3990eabc28586eb651b040b358
Locally, the timer gtests look fine, as do webrtc wpt.
It looks like this weird corner case is no longer happening in CI.
Comment 10•3 years ago
|
||
I guess if there was ever an implementation of nsIEventTarget::Dispatch that ended up hitting the timer API, we could end up deadlocking. For instance, if Dispatch does something that results in the destruction of some object that then cancels a timer, or even calls Cancel on the runnable synchronously.
Comment 11•3 years ago
|
||
Maybe we can de-risk this by adding some thread assertions.
Assignee | ||
Comment 12•3 years ago
|
||
Some history relating to this unlock over in bug 1334383. The concern seems to be that we'd deadlock if an event target's Dispatch()
would create or cancel a timer. That seems fair, although none of our event targets seem to do it at the moment.
The case we are hitting here is just like in bug 1334383 when the TimerThread fails to dispatch to the target, and the timer has been canceled already. Without the unlock this would be impossible, but maybe there's a more targeted approach that would also work.
Comment 13•3 years ago
|
||
So I added some thread assertions, and it seems we have some reentrancy:
The stacks are incomplete, so it is a little difficult to tell how we're going reentrant here. Maybe it will repro in pernosco? I cannot get a local reproduction so far.
Comment 14•3 years ago
|
||
Ok, that seems to just be happening during shutdown, which is fine. I think it would be ok to remove the unlock, although I think we should probably remove this too:
It is not too much of a stretch to imagine an event target that would synchronously call CancelableRunnable::Cancel inside Dispatch, which would then lead to a deadlock. Since the timer is already being fired, I don't know if there's any value to canceling it if the event target is not going to run events anymore. I'll run some tests on a patch that does these things.
Comment 15•3 years ago
•
|
||
Note: This is not based on the changesets from bug 1654112, since those changes still muddy the waters.
https://treeherder.mozilla.org/jobs?repo=try&revision=57b555a8bfd3b0d23597d9ec4c341b72c4872aa7
Comment 16•3 years ago
|
||
Comment 17•3 years ago
|
||
Comment 18•3 years ago
|
||
Also, remove an unneccessary nsTimerImpl::Cancel call in nsTimerEvent that
could cause deadlocks, and remove an unused function.
Updated•3 years ago
|
Comment 19•3 years ago
•
|
||
Only seeing one unknown orange in the full-tree mochitest push in comment 16. Here's a baseline push for comparison:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b11120b6534530a0ddbb858a8262661ee581ac5
This orange appears to be extremely rare, continuing to retrigger.
Edit: Even with 40 retriggers, no luck in making this happen again, either on baseline or with the fix.
Otherwise, try looks fine. It may be worth running the wpt suite also.
Assignee | ||
Comment 20•3 years ago
|
||
Assignee | ||
Comment 21•3 years ago
|
||
Assignee | ||
Comment 22•3 years ago
|
||
Tsan with MOZ_LOG=nsTimerImpl:5 reveals this race.
Comment 23•3 years ago
|
||
Here's a try push with the patch based on the libwebrtc update work:
Assignee | ||
Updated•3 years ago
|
Comment 24•3 years ago
|
||
Broad mochitest push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4787f71a2e2b269c16d2cf2da48aaa95fb60675f
Broad mochitest push based on libwebrtc update work:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dce10d2619ca9a143c8b3ed7c06c87666c680f18
Assignee | ||
Comment 25•3 years ago
|
||
I discussed our approaches with Nika on Matrix yesterday. She came up with another: moving the assertion from TimerThread to nsTimerImpl::~nsTimerImpl, since what we really want to assert against is nsTimerImpl going away without a Cancel, i.e. taking a valid mCallback with it. I have a patch doing the rounds on try atm.
Assignee | ||
Comment 26•3 years ago
|
||
Updated•3 years ago
|
Comment 27•3 years ago
|
||
Comment 28•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/84ae8311ae62
https://hg.mozilla.org/mozilla-central/rev/879bbc9fc804
https://hg.mozilla.org/mozilla-central/rev/5848b0ef4e55
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Description
•