Closed Bug 1712357 Opened 3 years ago Closed 3 years ago

Intermittent org.mozilla.geckoview.test.NavigationDelegateTest.onNewSession_childShouldLoad | application crashed [@ mozilla::dom::ContentParent::Pid() const]

Categories

(Core :: DOM: Content Processes, defect)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox88 --- wontfix
firefox89 --- wontfix
firefox90 + fixed
firefox91 + fixed

People

(Reporter: mccr8, Assigned: mccr8)

References

Details

(4 keywords, Whiteboard: [post-critsmash-triage][adv-main90+r])

Crash Data

Attachments

(1 file)

As seen in bug 1710153, this UAF has continued. In bug 1709931 I landed a patch to ensure that the timer is shut down when the particular process manager is destroyed, but looking at the stack again I think the issue is that the ContentParent we're calling Pid() on has been destroyed.

One possible reason for this is that we failed to get the observer service. I'm not sure why it would be intermittently unavailable, but this is a unit test, and the priority manager might start up early, so I guess it could happen? It is interesting that the failures seem to be entirely in debug builds, even though we seem to also run this test with opt. If it is some startup race, maybe we lose more often with a slower debug build?

I'll try to confirm this hypothesis by asserting that the observer service exists in both the priority manager and the place in ContentParent where the observation notice is sent. I can work around this particular failure by making a direct call instead of using the observer service, though maybe there are other things that will need to be fixed if the issue is ContentParent failing to send ipc:content-shutdown.

I added asserts for the observer service, but the crash still happens: https://treeherder.mozilla.org/logviewer?job_id=340470750&repo=try&lineNumber=8096

I'll try taking a more direct approach and crash if we destroy a ContentParent while it is still referenced.

I think I've figured out what is going wrong here. I added some instrumentation to the process priority manager to track the content parent ids we've seen in ProcessPriorityManagerImpl::ObserveContentParentDestroyed, then asserts if we see one of those again later in GetParticularProcessPriorityManager. In a try run, that seemed to fail 100% of the time, with this stack:

0 ProcessPriorityManagerImpl::GetParticularProcessPriorityManager(mozilla::dom::ContentParent*) [ProcessPriorityManager.cpp : 405 + 0x29]
1 mozilla::ProcessPriorityManager::ActivityChanged(mozilla::dom::BrowserParent*, bool) [ProcessPriorityManager.cpp: 925 + 0x16]
2 mozilla::dom::BrowserParent::DestroyInternal() [BrowserParent.cpp: 603 + 0x5]
3 mozilla::dom::BrowserParent::Destroy() [BrowserParent.cpp: 628 + 0x8]
4 mozilla::dom::BrowserBridgeParent::Destroy() [BrowserBridgeParent.cpp: 132 + 0x10]
5 mozilla::ipc::IProtocol::DestroySubtree(mozilla::ipc::IProtocol::ActorDestroyReason) [ProtocolUtils.cpp: 623 + 0xc]
6 mozilla::dom::PBrowserBridgeParent::OnMessageReceived(IPC::Message const&) [PBrowserBridgeParent.cpp: 321 + 0x8]

So, the problem is that we tell the priority manager that a content parent is going away, which destroys the particular process priority manager (PPPM). Later, we destroy the browser parent. This makes the browser parent tell the priority manager that the browser parent is no longer active, which makes GetParticularProcessPriorityManager create a new PPPM for the content parent that has already been destroyed. This new PPPM has a priority of UNKNOWN, so we do a ScheduleResetPriority() to turn the priority to BACKGROUND, which creates a timer.

I'm not sure what goes wrong after that, but I expect there's some kind of race between destroying the ContentParent, running the timer, and canceling the timer. We must usually cancel all timers before we run the PPPM timer, as I can't see how we'd win the race otherwise.

I added the call to ActivityChanged to BrowserParent::Deactivated() in bug 1704545, so maybe that's the regressor? Though I'm not sure why the crash didn't show up earlier.

One fix would be to change ActivityChanged to not create a new PPPM if aIsActive is false. The goal of the call is really just to tell the priority manager to stop considering the BrowserParent when calculating the priority, but if the priority manager has no record of the process, clearly it is already not doing so.

I guess the easier thing to do is to just never create a new PPPM in GetParticularProcessPriorityManager if ContentParent::IsDead() returns true.

I looked in crash stats for at crashes that contain some of the frames here and I didn't see any evidence of UAF poison values.

Don't adjust the priority of ContentParents that are going away. r=kmag
https://hg.mozilla.org/mozilla-central/rev/671001812b0f

Group: dom-core-security → core-security-release
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Blocks: 1710153

Comment on attachment 9224481 [details]
Bug 1712357 - Don't adjust the priority of ContentParents that are going away.

Beta/Release Uplift Approval Request

  • User impact if declined: There's no evidence that this is affecting anybody in practice, but maybe it could cause a security issue on Windows, so I think it would be good to fix just in case.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It just stops us from running code in a situation where we really don't want to run it.
  • String changes made/needed: none
Attachment #9224481 - Flags: approval-mozilla-beta?

Comment on attachment 9224481 [details]
Bug 1712357 - Don't adjust the priority of ContentParents that are going away.

Approved for 90.0b4.

Attachment #9224481 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

It looks like there were at least 9 crashes like this on Fenix in the wild, down to the poison values. None after the 20210527092758 build.

bp-9ddecfe2-3f48-48bc-9346-5267f0210603

Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main90+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: