Closed Bug 1729964 Opened 3 years ago Closed 3 years ago

Assertion failure: rc != 0 (destroyed timer off its target thread!) in simplecall-no-ssrcs.https.html WPT

Tracking

()

Status:

RESOLVED FIXED

Milestone:

95 Branch

Tracking Flags:

Tracking

Status

firefox-esr78

---

wontfix

firefox-esr91

---

wontfix

firefox92

---

wontfix

firefox93

---

wontfix

firefox94

---

wontfix

firefox95

---

fixed

People

(Reporter: ng, Assigned: pehrsons)

References

Details

Attachments

(4 files, 1 obsolete file)

Bug 1729964: Stop unlocking in PostTimerEvent. 3 years ago Byron Campen [:bwc] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1729964 - Fix linker error. r?baku 3 years ago Andreas Pehrson [:pehrsons] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1729964 - During TimerThread dispatch track whether the dispatched timer was removed. r?#xpcom-reviewers 3 years ago Andreas Pehrson [:pehrsons] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1729964 - Fix data race on static logging-only nsTimerImpl members. r?#xpcom-reviewers 3 years ago Andreas Pehrson [:pehrsons] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1729964 - Move timer-release threading assert from TimerThread to nsTimerImpl. r?#xpcom-reviewers 3 years ago Andreas Pehrson [:pehrsons] 48 bytes, text/x-phabricator-request		Details \| Review

Nico Grunbaum [:ng, @chew:mozilla.org]

Reporter

Description

•

3 years ago

•

Edited

https://firefoxci.taskcluster-artifacts.net/O8BgxL5SSTqqIYWCcx8i_w/0/public/logs/live_backing.log

https://treeherder.mozilla.org/jobs?repo=try&revision=0a0ed471615895716fea092510e4d00dad3d629e&selectedTaskRun=O8BgxL5SSTqqIYWCcx8i_w.0

Also: https://treeherder.mozilla.org/logviewer?job_id=350968749&repo=try&lineNumber=32562

This may not be tied to the WPT.

Andreas Pehrson [:pehrsons]

Assignee

Comment 1

•

3 years ago

•

Edited

We fail the assert after the timer has fired but has failed to dispatch to the target thread. Might be in shutdown then. This should be fine as the user of the timer should do the final release when the timer is unarmed, or so I read the nsTimerImpl code. Could this be some race between firing the timer on the TimerThread and canceling+releasing (sync) it on some other thread? This AutoUnlock could be involved.

I'll try to figure out which timer this is, and maybe we can reason more about the usage of it.

Andreas Pehrson [:pehrsons]

Assignee

Comment 2

•

3 years ago

It also comes in this shape for non-debug builds.

Then you just see this on treeherder:

PROCESS-CRASH | /webrtc/RTCRtpReceiver-getSynchronizationSources.https.html | application crashed [@ nsThread::ProcessNextEvent(bool, bool*)]

But the log follows up with

Mozilla crash reason: MOZ_ASSERT(rc != 0) (destroyed timer off its target thread!)

Andreas Pehrson [:pehrsons]

Assignee

Comment 3

•

3 years ago

It's a DelayedRunnable. That's a bit of a blanket term. I'll try to figure out which DelayedRunnable.

Andreas Pehrson [:pehrsons]

Assignee

Comment 4

•

3 years ago

One coming from a webrtc::QueuedTask. Makes sense. I'll see if I can reason my way through this.

Andreas Pehrson [:pehrsons]

Assignee

Comment 5

•

3 years ago

I think this could happen if:

DelayedDispatch() inits the timer
The same DelayedDispatch() then does a tail dispatch
The TaskQueue is shut down (other thread) before the tail dispatching task has completed (so dispatching the tail fails)

Now the tail dispatcher drops the RefPtr to the DelayedRunnable (and the timer), and when the timer fires it fails dispatching the notification and releases the timer on the timer thread, and trips the assert that makes up this bug.

We should fix this in xpcom.

Assignee: nobody → apehrson

Status: NEW → ASSIGNED

status-firefox92: --- → wontfix

status-firefox93: --- → wontfix

status-firefox94: --- → affected

status-firefox-esr78: --- → wontfix

status-firefox-esr91: --- → wontfix

Andreas Pehrson [:pehrsons]

Assignee

Comment 6

•

3 years ago

The hypothesis in comment 5 didn't yield any results.

But with some extra logging across DelayedRunnable, TaskQueue and friends (see treeherder) I get:

[task 2021-09-27T13:15:21.581Z] 13:15:21 INFO - GECKO(1599) | BeginShutdown TaskQueue 0x117b03f90
[task 2021-09-27T13:15:21.582Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesLocked TaskQueue 0x117b03f90
[task 2021-09-27T13:15:21.582Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesImpl TaskQueue 0x117b03f90 runnable 0x118dc7580
[task 2021-09-27T13:15:21.583Z] 13:15:21 INFO - GECKO(1599) | CancelTimer DelayedRunnable 0x118dc7580 mTimer 0x1175809e0
[task 2021-09-27T13:15:21.583Z] 13:15:21 INFO - GECKO(1599) | CancelDelayedRunnablesImpl TaskQueue 0x117b03f90 runnable 0x118dc7380
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | CancelTimer DelayedRunnable 0x118dc7380 mTimer 0x10105fe60
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | ~DelayedRunnable 0x118dc7580 mTimer 0x1175809e0
[task 2021-09-27T13:15:21.584Z] 13:15:21 INFO - GECKO(1599) | ~DelayedRunnable 0x118dc7380 mTimer 0x10105fe60
[task 2021-09-27T13:15:21.585Z] 13:15:21 INFO - GECKO(1599) | [Child 1600, Timer] WARNING: ##### Failed dispatch due to shutdown: file /builds/worker/checkouts/gecko/xpcom/threads/TaskQueue.cpp:49
[task 2021-09-27T13:15:21.586Z] 13:15:21 INFO - GECKO(1599) | MaybeResolveShutdown
[task 2021-09-27T13:15:21.586Z] 13:15:21 INFO - GECKO(1599) | ##### Baddie timer 'TaskQueueWrapper::CreateTaskRunner (webrtc::QueuedTask)' 0x0
[task 2021-09-27T13:15:21.587Z] 13:15:21 INFO - GECKO(1599) | Assertion failure: rc != 0 (destroyed timer off its target thread!), at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:463

Not by any means verified, but this smells like a race between canceling (on the target) and firing (on TimerThread) a timer.

Andreas Pehrson [:pehrsons]

Assignee

Comment 7

•

3 years ago

There's a possible race where that Cancel() on the target Removes the timer while the TimerThread unlocks the monitor in PostTimerEvent (the one that fails to post). Then we end up releasing off-target although we've done our deed and canceled on-target.

It would be great if we could avoid that unlock. Byron, looks like you added this. Do you recall why that Dispatch could lock the monitor? That path doesn't seem obvious to me.

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Comment 8

•

3 years ago

That unlock is there because there is (was?) code that makes the TimerThread itself a target of some timers, and it re-locks when trying to post. Let me figure out whether this horrible situation is still the case.

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Comment 9

•

3 years ago

So, commenting out that unlock does not seem to make all hell break loose for mochitest:

https://treeherder.mozilla.org/jobs?repo=try&revision=4baa8883da973b3990eabc28586eb651b040b358

Locally, the timer gtests look fine, as do webrtc wpt.

It looks like this weird corner case is no longer happening in CI.

Byron Campen [:bwc]

Comment 10

•

3 years ago

I guess if there was ever an implementation of nsIEventTarget::Dispatch that ended up hitting the timer API, we could end up deadlocking. For instance, if Dispatch does something that results in the destruction of some object that then cancels a timer, or even calls Cancel on the runnable synchronously.

Byron Campen [:bwc]

Comment 11

•

3 years ago

Maybe we can de-risk this by adding some thread assertions.

Andreas Pehrson [:pehrsons]

Assignee

Comment 12

•

3 years ago

Some history relating to this unlock over in bug 1334383. The concern seems to be that we'd deadlock if an event target's Dispatch() would create or cancel a timer. That seems fair, although none of our event targets seem to do it at the moment.

The case we are hitting here is just like in bug 1334383 when the TimerThread fails to dispatch to the target, and the timer has been canceled already. Without the unlock this would be impossible, but maybe there's a more targeted approach that would also work.

Comment 13

•

3 years ago

So I added some thread assertions, and it seems we have some reentrancy:

https://treeherder.mozilla.org/jobs?repo=try&revision=0213dab91683dc510b422b4af7de7cfdd39a7b29&selectedTaskRun=YkpWi4-8Shqqwomb_avcJA.0

The stacks are incomplete, so it is a little difficult to tell how we're going reentrant here. Maybe it will repro in pernosco? I cannot get a local reproduction so far.

Byron Campen [:bwc]

Comment 14

•

3 years ago

Ok, that seems to just be happening during shutdown, which is fine. I think it would be ok to remove the unlock, although I think we should probably remove this too:

https://searchfox.org/mozilla-central/rev/3fa5cc437a4937c621ea068ba5dc246f75831633/xpcom/threads/TimerThread.cpp#122

It is not too much of a stretch to imagine an event target that would synchronously call CancelableRunnable::Cancel inside Dispatch, which would then lead to a deadlock. Since the timer is already being fired, I don't know if there's any value to canceling it if the event target is not going to run events anymore. I'll run some tests on a patch that does these things.

Byron Campen [:bwc]

Comment 15

•

3 years ago

•

Edited

Note: This is not based on the changesets from bug 1654112, since those changes still muddy the waters.

https://treeherder.mozilla.org/jobs?repo=try&revision=57b555a8bfd3b0d23597d9ec4c341b72c4872aa7

Byron Campen [:bwc]

Comment 16

•

3 years ago

Gtest try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ef5a00b96a6b250de9850fe351dc030957d7182f
Broader mochitest try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4dc50d2e1fd7ab8e336d4a6fdfe57d38afa000a2

Byron Campen [:bwc]

Comment 17

•

3 years ago

Webrtc wpt: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f22f9431cf32bebbf3d0b3e2a43c46ff04d1cbc2

Byron Campen [:bwc]

Comment 18

•

3 years ago

Attached file Bug 1729964: Stop unlocking in PostTimerEvent. — Details

Also, remove an unneccessary nsTimerImpl::Cancel call in nsTimerEvent that
could cause deadlocks, and remove an unused function.

Phabricator Automation

Updated

•

3 years ago

Attachment #9243298 - Attachment description: WIP: Bug 1729964: Stop unlocking in PostTimerEvent. → Bug 1729964: Stop unlocking in PostTimerEvent.

Byron Campen [:bwc]

Comment 19

•

3 years ago

•

Edited

Only seeing one unknown orange in the full-tree mochitest push in comment 16. Here's a baseline push for comparison:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b11120b6534530a0ddbb858a8262661ee581ac5

This orange appears to be extremely rare, continuing to retrigger.

Edit: Even with 40 retriggers, no luck in making this happen again, either on baseline or with the fix.

Otherwise, try looks fine. It may be worth running the wpt suite also.

Andreas Pehrson [:pehrsons]

Assignee

Comment 20

•

3 years ago

Attached file Bug 1729964 - Fix linker error. r?baku — Details

Andreas Pehrson [:pehrsons]

Assignee

Comment 21

•

3 years ago

Attached file Bug 1729964 - During TimerThread dispatch track whether the dispatched timer was removed. r?#xpcom-reviewers (obsolete) — Details

Andreas Pehrson [:pehrsons]

Assignee

Comment 22

•

3 years ago

Attached file Bug 1729964 - Fix data race on static logging-only nsTimerImpl members. r?#xpcom-reviewers — Details

Tsan with MOZ_LOG=nsTimerImpl:5 reveals this race.

Byron Campen [:bwc]

Comment 23

•

3 years ago

Here's a try push with the patch based on the libwebrtc update work:

https://treeherder.mozilla.org/jobs?repo=try&revision=21132a7009026e0ef2c4f0dee803da26c4f082b0&selectedTaskRun=Utb7FankSRiJsUTtAuxX1A.0

Andreas Pehrson [:pehrsons]

Assignee

Updated

•

3 years ago

Component: WebRTC → XPCOM

Byron Campen [:bwc]

Comment 24

•

3 years ago

Broad mochitest push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4787f71a2e2b269c16d2cf2da48aaa95fb60675f

Broad mochitest push based on libwebrtc update work:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dce10d2619ca9a143c8b3ed7c06c87666c680f18

Andreas Pehrson [:pehrsons]

Assignee

Comment 25

•

3 years ago

I discussed our approaches with Nika on Matrix yesterday. She came up with another: moving the assertion from TimerThread to nsTimerImpl::~nsTimerImpl, since what we really want to assert against is nsTimerImpl going away without a Cancel, i.e. taking a valid mCallback with it. I have a patch doing the rounds on try atm.

Andreas Pehrson [:pehrsons]

Assignee

Comment 26

•

3 years ago

Attached file Bug 1729964 - Move timer-release threading assert from TimerThread to nsTimerImpl. r?#xpcom-reviewers — Details

Phabricator Automation

Updated

•

3 years ago

Attachment #9243451 - Attachment is obsolete: true

Pulsebot

Comment 27

•

3 years ago

Pushed by pehrsons@gmail.com: https://hg.mozilla.org/integration/autoland/rev/84ae8311ae62 Fix linker error. r=baku https://hg.mozilla.org/integration/autoland/rev/879bbc9fc804 Move timer-release threading assert from TimerThread to nsTimerImpl. r=xpcom-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/5848b0ef4e55 Fix data race on static logging-only nsTimerImpl members. r=xpcom-reviewers,KrisWright

Marian-Vasile Laza

Comment 28

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/84ae8311ae62
https://hg.mozilla.org/mozilla-central/rev/879bbc9fc804
https://hg.mozilla.org/mozilla-central/rev/5848b0ef4e55

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox95: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 95 Branch

Andreas Pehrson [:pehrsons]

Assignee

Updated

•

3 years ago

status-firefox94: affected → wontfix

Andreas Pehrson [:pehrsons]

Assignee

Updated

•

3 years ago

See Also: → 1730665, 1355633, 1349768, 1017721, 963474, 1355315, 881413, 1633677, 1547201, 1544384, 1493909, 1489652, 1462894, 1460169, 1441137, 1440891, 1440890, 1439866, 1437751, 1432041, 1395380, 1393979, 1378530, 1341381, 941751

Comment hidden (Intermittent Failures Robot)

You need to log in before you can comment on or make changes to this bug.