Closed Bug 1754867 Opened 2 years ago Closed 2 years ago

Nullptr crash due to reentrancy in ~AutoTaskDispatcher

Categories

(Core :: XPCOM, defect, P2)


Tracking


RESOLVED FIXED
99 Branch
Tracking Status
firefox99 --- fixed

People

(Reporter: bwc, Assigned: bwc)

References

Details

Crash Data

Attachments

(1 file)

See Also: → 1748333
Crash Signature: AutoTaskDispatcher::DispatchTasksFor → [@ AutoTaskDispatcher::DispatchTasksFor ]
Crash Signature: [@ AutoTaskDispatcher::DispatchTasksFor ] → [@ mozilla::AutoTaskDispatcher::DispatchTasksFor ]

There are a few things we could blame here:

  1. This code results in ~AutoTaskDispatcher going reentrant and calling mTailDispatcher.DispatchTasksFor, because Maybe::reset invokes the d'tor first, and then marks itself as containing Nothing:

https://searchfox.org/mozilla-central/rev/9f61d854547cedbde0773b2893e4f925352be3b3/xpcom/threads/AbstractThread.cpp#196
https://searchfox.org/mozilla-central/rev/9f61d854547cedbde0773b2893e4f925352be3b3/mfbt/Maybe.h#639,642

This is pretty inconsistent with the way things like unique_ptr work, which is basically how we're using mTailDispatcher. I think that this is probably something we need to fix, one way or another, even if there are other things we should be fixing here.

  2. It is probably not ideal that we are setting up a situation where reentrancy could occur in the first place. It is harder to say what we could do to fix this, though.
Summary: Fuzzing nullptr crash due to reentrancy in ~AutoTaskDispatcher → Nullptr crash due to reentrancy in ~AutoTaskDispatcher

The core of the problem here is that Maybe::reset invokes the d'tor first, and then clears the value, whereas unique_ptr::reset does the opposite.
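To make the ordering hazard concrete, here's a minimal standalone sketch (not the actual MFBT/XPCOM code; Owner, Task, emplace, reset, and dispatchTasks are all made-up stand-ins) of a value whose destructor re-enters its owner while a Maybe-style reset() has destroyed the value but not yet cleared the "is some" flag:

```cpp
#include <cstdio>
#include <new>

struct Owner;

struct Task {
  Owner* mOwner;
  explicit Task(Owner* aOwner) : mOwner(aOwner) {}
  ~Task();  // re-enters the owner, mimicking tail dispatch from a destructor
};

// Stand-in for Maybe<Task>, with the ordering this bug is about:
// reset() runs the destructor first and clears the flag second.
struct Owner {
  alignas(Task) unsigned char mStorage[sizeof(Task)];
  bool mIsSome = false;

  void emplace() {
    new (mStorage) Task(this);
    mIsSome = true;
  }

  void reset() {
    if (mIsSome) {
      // The destructor runs while mIsSome is still true...
      reinterpret_cast<Task*>(mStorage)->~Task();
      mIsSome = false;
    }
  }

  void dispatchTasks() {
    if (mIsSome) {
      // ...so a reentrant call lands here and touches an object that is
      // mid-destruction. In the real code, this is where the crash happens.
      std::printf("re-entered while the value still looks alive\n");
    }
  }
};

Task::~Task() { mOwner->dispatchTasks(); }

int main() {
  Owner o;
  o.emplace();
  o.reset();  // the reentrant call observes a "live" value
  return 0;
}
```

A unique_ptr-style reset() does the opposite: it is specified to store the new (null) pointer first and only then delete the old object, so a reentrant observer sees the already-cleared state instead of a value that is mid-destruction.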

Assignee: nobody → docfaraday
Status: NEW → ASSIGNED

WebRTC jobs look OK, although this change could affect pretty much any feature.

Crash Signature: [@ mozilla::AutoTaskDispatcher::DispatchTasksFor ] → [@ mozilla::AutoTaskDispatcher::DispatchTasksFor ] [@ mozilla::AutoTaskDispatcher::DispatchTasksFor(mozilla::AbstractThread*)]

Seeing a lot more failures in test-windows7-32-qr/opt-mochitest-browser-chrome-fis-e10s-5 (including several novel-looking failures at the end of js/xpconnect/tests/browser):

With changes:
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=O9J8VfX1SMO2rix7BmGq0g.0&revision=d742e72b4e1f3a6c9583801f45b180f5caf86670

Without changes:
https://treeherder.mozilla.org/jobs?repo=try&duplicate_jobs=visible&revision=2ba69bced06adce1d6968022f34fb6ea473f37a2&selectedTaskRun=TUnn-v3iTM-DvwkRIns2pQ.0

Here's a try push focusing on js/xpconnect/tests/browser. Maybe I can get better logging of this problem on some other platform.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c9572c441295de22e708579498b95715e7f228d7

Yeah, that only seems to happen on windows7 jobs, and it does not appear that we have any windows7 debug jobs even with --full...


Any idea why this patch might be causing a new hang-on-finish in the xpconnect bc tests?

https://treeherder.mozilla.org/jobs?repo=try&revision=c9572c441295de22e708579498b95715e7f228d7&selectedTaskRun=dhoRVCkFTESygEZfZopjLw.0

Flags: needinfo?(nika)

I'm not sure why that would be happening. I do notice that the log has the line "[task 2022-02-14T23:20:25.127Z] 23:20:25 INFO - GECKO(3384) | [Exiting due to channel error." (https://treeherder.mozilla.org/logviewer?job_id=367873398&repo=try&lineNumber=1955), which suggests that the parent process crashed in some way and lost its connection to a child process unexpectedly (child processes log that when the parent process disappears unexpectedly and they quick-exit), though that doesn't line up super well with the application hanging.

Given that this failure is occurring on opt builds, we wouldn't have deadlock detection running, so one of the most likely options here is that we're encountering a deadlock somewhere in the parent process. Unfortunately, I don't really have any idea where that would be.
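(For illustration only; this is not Gecko's actual deadlock detector, and CheckedMutex is a made-up name. The point is just that debug-only bookkeeping can turn a silent self-deadlock into a loud assertion before the thread blocks:)

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical sketch: a non-recursive mutex with debug-only owner
// tracking. An opt build re-acquiring it on the same thread blocks
// forever; a debug build asserts first. (The bookkeeping is not fully
// thread-safe; this is illustration, not production code.)
class CheckedMutex {
  std::mutex mMutex;
#ifdef DEBUG
  std::thread::id mOwner{};  // default-constructed id means "nobody holds it"
#endif
 public:
  void lock() {
#ifdef DEBUG
    assert(mOwner != std::this_thread::get_id() &&
           "re-entrant lock on this thread: would deadlock");
#endif
    mMutex.lock();  // in an opt build, a self-deadlock just hangs here
#ifdef DEBUG
    mOwner = std::this_thread::get_id();
#endif
  }
  void unlock() {
#ifdef DEBUG
    mOwner = std::thread::id{};
#endif
    mMutex.unlock();
  }
};
```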

Flags: needinfo?(nika)

Hmm. The work in progress over in bug 1755318 makes this happen far more often on try:

https://treeherder.mozilla.org/jobs?repo=try&revision=46df533e014f3fda1f17c0d5f5cc0b447e658f52

I think this will have to block bug 1755318.

Blocks: 1755318
Severity: S3 → S2
Priority: P3 → P2

It seems that the issue does not reproduce when I hack win7 opt to be more in line with a debug build:

https://treeherder.mozilla.org/jobs?repo=try&revision=a7a34bddcf240948fa697573bcc7a8c25da1bc59

It is possible that disabling assertions prevents the issue, but when assertions are left enabled, we hit assertions and crash before we even run any tests.

It may be that I need to selectively enable just the debugging features I want (e.g., just deadlock detection).

Actually, there's another possibility; maybe the rebase I just did fixed the issue. Here's a try push for that:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=427fca1b66223e6904e0603f2884b4fdabeb6599

Edit: Well, that's a lucky break. The rebase has either stopped the issue or made it so uncommon that it doesn't show up in 20 runs.

Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/68192b01f87d
Use unique_ptr instead of Maybe for mTailDispatcher to avoid reentrancy. r=nika
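(The patch isn't reproduced here, but going by the commit message, the shape of the fix is roughly the following; SomeOwner and ClearTailDispatcher are hypothetical names, and std::unique_ptr stands in for whichever unique-pointer type the tree actually uses:)

```cpp
#include <memory>

class AutoTaskDispatcher { /* ... */ };

class SomeOwner {
  // Before (roughly): Maybe<AutoTaskDispatcher> mTailDispatcher;
  // After: a unique pointer. unique_ptr::reset() is specified to install
  // the new (null) pointer *before* deleting the old object, so a
  // reentrant call from ~AutoTaskDispatcher sees "no dispatcher" rather
  // than a half-destroyed one.
  std::unique_ptr<AutoTaskDispatcher> mTailDispatcher;

 public:
  void ClearTailDispatcher() { mTailDispatcher.reset(); }
};
```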
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch
Component: WebRTC: Audio/Video → XPCOM
Blocks: 1752959
No longer blocks: 1752959
See Also: → 1752959
