Closed Bug 1764878 Opened 3 years ago Closed 3 years ago

Crash in [@ mozilla::layers::APZTaskRunnable::QueueRequest]

Categories

(Core :: Panning and Zooming, defect)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
102 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox99 --- wontfix
firefox100 --- wontfix
firefox101 + fixed
firefox102 + fixed

People

(Reporter: gsvelto, Assigned: tnikkel)

References

(Regression)

Details

(Keywords: crash, csectype-uaf, sec-high, Whiteboard: [post-critsmash-triage][adv-main101+r])

Crash Data

Attachments

(1 file, 1 obsolete file)

Crash report: https://crash-stats.mozilla.org/report/index/52e28f90-7912-4c4f-ac4f-5dc130220409

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll mozilla::layers::APZTaskRunnable::QueueRequest gfx/layers/apz/util/APZTaskRunnable.cpp:80
1 xul.dll mozilla::layers::APZChild::RecvRequestContentRepaint gfx/layers/ipc/APZChild.cpp:46
2 xul.dll mozilla::layers::PAPZChild::OnMessageReceived ipc/ipdl/PAPZChild.cpp:232
3 xul.dll mozilla::layers::PCompositorManagerChild::OnMessageReceived ipc/ipdl/PCompositorManagerChild.cpp:414
4 xul.dll mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1528
5 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:780
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1187
7 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:85
8 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:373
9 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:355

This appears to be a use-after-free crash, the crashing address is 0xffffffffffffffff because we're hitting bug 1493342 but looking into the registers we clearly see the poison pattern.

Component: Layout → Panning and Zooming

The file in which the crash occurs, APZTaskRunnable.cpp, was added in bug 1730998, which landed in Firefox 95, and we have crash reports dating back to Firefox 95, so this seems like a regression introduced by that change.

Regressed by: 1730998

Set release status flags based on info from the regressing bug 1730998

:hiro, since you are the author of the regressor, bug 1730998, could you take a look?
For more information, please visit auto_nag documentation.

Flags: needinfo?(hikezoe.birchill)
Has Regression Range: --- → yes

Symptoms are sec-high and it's in the parent process which is worrying. The frequency is low happening by itself, but that doesn't mean it isn't easy to trigger if you know it's there.

Keywords: sec-high

It looks like the crash is on this line: mAPZTaskRunnable->QueueRequest(aRequest);

My guess, based on similar crashes in the past and not any knowledge of the code itself, is that something is spinning the event loop or otherwise causing mAPZTaskRunnable to get cleared and thus causing |this| to get destroyed.

Andrew, can you please point me out one of the bug numbers for the similar crashes? Now I have totally no idea how the crash happens. Thanks!

I'm not sure I have one off the top of my head, but the general issue is if you have a call like

mFoo->SomeMethod();

and SomeMethod spins the event loop, and it can run all sorts of weird stuff, including code that does mFoo = nullptr;, which will cause mFoo to get destroyed, so this inside SomeMethod now points to a dead object.

I don't know anything about the ipc implementation, but is it possible that we do something like send a delete msg (at normal priority) and then send a control priority msg (so it'd be handled before the delete msg I presume), and not handle that properly at some level?

Yeah, higher priority messages get handled first, I'm pretty sure. (Also there's no ordering guarantees across multiple top level protocols, though I'd guess that doesn't come into play here.)

I think I know why this is happening now. When we call mController->GetTopLevelPresShell() in APZTaskRunnable.cpp that seems like it'd just be returning a pointer, however it is not. ContentProcessController::GetTopLevelPresShell calls BrowserChild::GetTopLevelPresShell which calls BrowserChild::GetTopLevelDocument which what seems like a pure getter on nsIWebNavigation, however it will actually create a document if one does not exist

https://searchfox.org/mozilla-central/rev/4365e6ab34ca829c930b60252ff19a5f42856da0/docshell/base/nsIWebNavigation.idl#342

And I've run into that before where the creation of a doc from what seemed like a getter was causing one of my patches to fail. Anyways, that can get to nsDocShell::CreateAboutBlankContentViewer which can apparently kill the current docshell

https://searchfox.org/mozilla-central/rev/4365e6ab34ca829c930b60252ff19a5f42856da0/docshell/base/nsDocShell.cpp#6661

I audited all the C++ callers of nsIWebNavigation::GetDocument (there aren't that many), nsWebBrowser::SaveDocument we can just hold a refptr to make it safe (I don't know much about that function so I don't want to change it functionally). The other caller is BrowserChild::GetTopLevelDocument. I checked all users of BrowserChild::GetTopLevelDocument and they all seem better to not create a document. So not too hard to fix up the nsIWebNavigation::GetDocument callers.

However the root of the problem is the PermitUnload call in nsDocShell::CreateAboutBlankContentViewer which could destroy the current docshell in that funciton except for the strong pointer we hold in that function, but then the docshell will probably just die when we return from the function. Looking at all callers of PermitUnload it seems there are several places that could be unsafe (there's quite a lot of auditing to do it properly).

Attached patch thepatch (obsolete) — Splinter Review

This is the patch to fix the nsIWebNavigation::GetDocument callers.

Since there are potentially unfixed issues from the same root cause as I mentioned in comment 10, we may want to take steps to avoid making the issue too obvious.

Except for the toolkit/components/browser/nsWebBrowser.cpp change, the code could be landed from an open bug under the guise of simplifying/optimizing by not creating a document where one does not exist.

The toolkit/components/browser/nsWebBrowser.cpp is pretty clearly fixing some sort of security issue, but it's not super specific (someone reading the code was likely skip over the GetDocument call thinking it's just a simple getter, and perhaps focus on |mPersist->SaveDocument| call below which seems much more suspicious).

I'm not sure if we need to take that level of caution here or not.

Assignee: nobody → tnikkel

ni for question in comment 11. Do we want to split this patch out as described in comment 11 or do we not need that level of caution?

Flags: needinfo?(dveditz)
Flags: needinfo?(continuation)

Personally I don't think it is a huge deal, given that we don't know that these are even really exploitable. It doesn't seem to give much more information than somebody doing a regexp search on the codebase for things matching mFoo->, but maybe Tom has a different opinion.

Flags: needinfo?(continuation) → needinfo?(tom)

Should we just use GetExtantDoc instead of GetDocument instead?

No, I think the normal sec-approval landing process is fine. We can land it the week of the 9th for example.

Flags: needinfo?(tom)

This seems fine to land as-is. The strongThis looks like a typical UAF-prevention fix, and the first bits look vaguely like null de-ref prevention. Certainly don't hint at the doc-creation side-effects you're avoiding.

Flags: needinfo?(dveditz)

Sounds good. I'll get it into review and into the normal process.

Flags: needinfo?(hikezoe.birchill)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #14)

Should we just use GetExtantDoc instead of GetDocument instead?

Yes, I think we should use GetExtantDoc instead of GetDocument (that's what the patch does for the callers I'm comfortable changing behaviour of, which is most of them, the C++ ones at least, js callers should be safe at least but still side-effect causing). It would be great if we could get rid of the side-effect causing GetDocument altogether, but there might be places that depend on it. I'm all for followups to remove this foot-gun or make it harder for this to happen again.

Attached file Bug 1764878. r?emilio
Attachment #9273584 - Attachment is obsolete: true

Comment on attachment 9273737 [details]
Bug 1764878. r?emilio

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: the root change of switching a GetDocument function from that one creates a document if one does not exist to one that just gets any already existing doc is not immediately obvious, so this is less obvious then most sec patches
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?: trivial
  • How likely is this patch to cause regressions; how much testing does it need?: the core of the patch is changing a GetDocument function from that one creates a document (and thus can destroy things) if one does not exist to one that just gets any already existing doc. I audited all of the callers and they all seem like that is the behaviour they want, and I didn't see any way that removing that side effect would cause problems. There's still the possibility that the side effect of that document had some ripple effect through a long chain of events that was beneficial to us somehow that is pretty much impossible to reject with 100% certainty.
  • Is Android affected?: Yes
Attachment #9273737 - Flags: sec-approval?

Comment on attachment 9273737 [details]
Bug 1764878. r?emilio

Approved to land and uplift

Attachment #9273737 - Flags: sec-approval?
Attachment #9273737 - Flags: sec-approval+
Attachment #9273737 - Flags: approval-mozilla-esr91+
Attachment #9273737 - Flags: approval-mozilla-beta+

Already fixed and re-pushed.

Flags: needinfo?(tnikkel)
Group: layout-core-security → core-security-release
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 102 Branch
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main101+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: