Closed Bug 1729964 Opened 5 months ago Closed 4 months ago

Assertion failure: rc != 0 (destroyed timer off its target thread!) in simplecall-no-ssrcs.https.html WPT


(Core :: XPCOM, defect, P2)




95 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- wontfix
firefox92 --- wontfix
firefox93 --- wontfix
firefox94 --- wontfix
firefox95 --- fixed


(Reporter: ng, Assigned: pehrsons)




(4 files, 1 obsolete file)

We fail the assert after the timer has fired but has failed to dispatch to the target thread. Might be in shutdown then. This should be fine as the user of the timer should do the final release when the timer is unarmed, or so I read the nsTimerImpl code. Could this be some race between firing the timer on the TimerThread and canceling+releasing (sync) it on some other thread? This AutoUnlock could be involved.

I'll try to figure out which timer this is, and maybe we can reason more about the usage of it.

It also comes in this shape for non-debug builds.

Then you just see this on treeherder:

PROCESS-CRASH | /webrtc/RTCRtpReceiver-getSynchronizationSources.https.html | application crashed [@ nsThread::ProcessNextEvent(bool, bool*)]

But the log follows up with

Mozilla crash reason: MOZ_ASSERT(rc != 0) (destroyed timer off its target thread!)

It's a DelayedRunnable. That's a bit of a blanket term. I'll try to figure out which DelayedRunnable.

One coming from a webrtc::QueuedTask. Makes sense. I'll see if I can reason my way through this.

I think this could happen if:

Now the tail dispatcher drops the RefPtr to the DelayedRunnable (and the timer), and when the timer fires it fails dispatching the notification and releases the timer on the timer thread, and trips the assert that makes up this bug.

We should fix this in xpcom.

Assignee: nobody → apehrson

The hypothesis in comment 5 didn't yield any results.

But with some extra logging across DelayedRunnable, TaskQueue and friends (see treeherder) I get:

[task 2021-09-27T13:15:21.581Z] 13:15:21 INFO - GECKO(1599) | BeginShutdown TaskQueue 0x117b03f90
[task 2021-09-27T13:15:21.582Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesLocked TaskQueue 0x117b03f90
[task 2021-09-27T13:15:21.582Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesImpl TaskQueue 0x117b03f90 runnable 0x118dc7580
[task 2021-09-27T13:15:21.583Z] 13:15:21 INFO - GECKO(1599) | CancelTimer DelayedRunnable 0x118dc7580 mTimer 0x1175809e0
[task 2021-09-27T13:15:21.583Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesImpl TaskQueue 0x117b03f90 runnable 0x118dc7380
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | CancelTimer DelayedRunnable 0x118dc7380 mTimer 0x10105fe60
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | ~DelayedRunnable 0x118dc7580 mTimer 0x1175809e0
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | ~DelayedRunnable 0x118dc7380 mTimer 0x10105fe60
[task 2021-09-27T13:15:21.585Z] 13:15:21 INFO - GECKO(1599) | [Child 1600, Timer] WARNING: ##### Failed dispatch due to shutdown: file /builds/worker/checkouts/gecko/xpcom/threads/TaskQueue.cpp:49
[task 2021-09-27T13:15:21.586Z] 13:15:21 INFO - GECKO(1599) | MaybeResolveShutdown
[task 2021-09-27T13:15:21.586Z] 13:15:21 INFO - GECKO(1599) | ##### Baddie timer 'TaskQueueWrapper::CreateTaskRunner (webrtc::QueuedTask)' 0x0
[task 2021-09-27T13:15:21.587Z] 13:15:21 INFO - GECKO(1599) | Assertion failure: rc != 0 (destroyed timer off its target thread!), at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:463

Not by any means verified, but this smells like a race between canceling (on the target) and firing (on TimerThread) a timer.

There's a possible race where that Cancel() on the target Removes the timer while the TimerThread unlocks the monitor in PostTimerEvent (the one that fails to post). Then we end up releasing off-target although we've done our deed and canceled on-target.

It would be great if we could avoid that unlock. Byron, looks like you added this. Do you recall why that Dispatch could lock the monitor? That path doesn't seem obvious to me.

Flags: needinfo?(docfaraday)

That unlock is there because there is (was?) code that makes the TimerThread itself a target of some timers, and it re-locks when trying to post. Let me figure out whether this horrible situation is still the case.

Flags: needinfo?(docfaraday)

So, commenting out that unlock does not seem to make all hell break loose for mochitest:

Locally, the timer gtests look fine, as do webrtc wpt.

It looks like this weird corner case is no longer happening in CI.

I guess if there was ever an implementation of nsIEventTarget::Dispatch that ended up hitting the timer API, we could end up deadlocking. For instance, if Dispatch does something that results in the destruction of some object that then cancels a timer, or even calls Cancel on the runnable synchronously.

Maybe we can de-risk this by adding some thread assertions.

Some history relating to this unlock over in bug 1334383. The concern seems to be that we'd deadlock if an event target's Dispatch() would create or cancel a timer. That seems fair, although none of our event targets seem to do it at the moment.

The case we are hitting here is just like in bug 1334383 when the TimerThread fails to dispatch to the target, and the timer has been canceled already. Without the unlock this would be impossible, but maybe there's a more targeted approach that would also work.

See Also: → 1334383

So I added some thread assertions, and it seems we have some reentrancy:

The stacks are incomplete, so it is a little difficult to tell how we're going reentrant here. Maybe it will repro in pernosco? I cannot get a local reproduction so far.

Ok, that seems to just be happening during shutdown, which is fine. I think it would be ok to remove the unlock, although I think we should probably remove this too:

It is not too much of a stretch to imagine an event target that would synchronously call CancelableRunnable::Cancel inside Dispatch, which would then lead to a deadlock. Since the timer is already being fired, I don't know if there's any value to canceling it if the event target is not going to run events anymore. I'll run some tests on a patch that does these things.

Note: This is not based on the changesets from bug 1654112, since those changes still muddy the waters.

Also, remove an unneccessary nsTimerImpl::Cancel call in nsTimerEvent that
could cause deadlocks, and remove an unused function.

Attachment #9243298 - Attachment description: WIP: Bug 1729964: Stop unlocking in PostTimerEvent. → Bug 1729964: Stop unlocking in PostTimerEvent.

Only seeing one unknown orange in the full-tree mochitest push in comment 16. Here's a baseline push for comparison:

This orange appears to be extremely rare, continuing to retrigger.

Edit: Even with 40 retriggers, no luck in making this happen again, either on baseline or with the fix.

Otherwise, try looks fine. It may be worth running the wpt suite also.

Tsan with MOZ_LOG=nsTimerImpl:5 reveals this race.

Component: WebRTC → XPCOM

I discussed our approaches with Nika on Matrix yesterday. She came up with another: moving the assertion from TimerThread to nsTimerImpl::~nsTimerImpl, since what we really want to assert against is nsTimerImpl going away without a Cancel, i.e. taking a valid mCallback with it. I have a patch doing the rounds on try atm.

Attachment #9243451 - Attachment is obsolete: true
Pushed by
Fix linker error. r=baku
Move timer-release threading assert from TimerThread to nsTimerImpl. r=xpcom-reviewers,nika
Fix data race on static logging-only nsTimerImpl members. r=xpcom-reviewers,KrisWright
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch
You need to log in before you can comment on or make changes to this bug.