Closed Bug 1443748 Opened 2 years ago Closed 11 months ago

Crash in mozilla::ipc::IPDLParamTraits<T>::Write

Categories

(Core :: IPC, defect, P1, critical)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
mozilla64
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 63+ fixed
firefox58 --- unaffected
firefox59 --- unaffected
firefox60 - wontfix
firefox61 - wontfix
firefox62 --- wontfix
firefox63 + fixed
firefox64 + fixed

People

(Reporter: calixte, Assigned: haik)

References

(Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [post-critsmash-triage][adv-main63+][adv-esr60.3+])

Crash Data

Attachments

(4 files, 1 obsolete file)

This bug was filed from the Socorro interface and is
report bp-e468b93b-cfe9-4200-b1f2-b6b490180307.
=============================================================

Top 10 frames of crashing thread:

0 XUL mozilla::ipc::IPDLParamTraits<mozilla::ipc::FileDescriptor>::Write ipc/glue/FileDescriptor.cpp:235
1 XUL std::__1::__function::__func<mozilla::net::PNeckoParent::OnMessageReceived ipc/glue/IPDLParamTraits.h:60
2 XUL mozilla::net::ExtensionJARFileOpener::SendBackFD clang/include/c++/v1/functional:1916
3 XUL mozilla::detail::RunnableMethodImpl<mozilla::net::ExtensionJARFileOpener*, nsresult  xpcom/threads/nsThreadUtils.h:1149
4 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1040
5 XUL NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:459
6 XUL nsBaseAppShell::NativeEventCallback widget/nsBaseAppShell.cpp:98
7 XUL nsAppShell::ProcessGeckoEvents widget/cocoa/nsAppShell.mm:436
8 CoreFoundation CoreFoundation@0xa7320 
9 CoreFoundation CoreFoundation@0x8821c 

=============================================================

There is 1 crash in nightly 60 with buildid 20180306100123. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1440511.

[1] https://hg.mozilla.org/mozilla-central/rev?node=5467d67a2577
Flags: needinfo?(nika)
I suspect this is not actually a new crash, but merely a signature change, since we moved some code around.  That being said, I can't find the previous crash...
I think what :froydnj said is probably true, but I also can't find the original crash.

I took a look anyway. It looks like a null pointer crash, and the only pointer which has any chance of being null there is the aActor pointer. This is what the code which uses it looks like (not linking to searchfox, because it's generated):

            WeakPtr<PNeckoParent> self__ = this;
            GetExtensionFDResolver resolver = [this, self__, id__, seqno__](const FileDescriptor& aParam) {
                if ((!(self__))) {
                    NS_WARNING("Not resolving response because actor is dead.");
                    return;
                }
                if ((mState) == (PNecko::__Dead)) {
                    NS_WARNING("Not resolving response because actor is destroyed.");
                    return;
                }
                bool resolve__ = true;
                FileDescriptor fd;
                fd = mozilla::Move(aParam);
                IPC::Message* reply__ = PNecko::Reply_GetExtensionFD(id__);
                WriteIPDLParam(reply__, self__, resolve__);
                // Sentinel = 'resolve__'
                (reply__)->WriteSentinel(3997392463);
                WriteIPDLParam(reply__, self__, fd);

The only way I can see that self__ could become null between the check at the top and the call to Write is if one of the other calls (like WriteIPDLParam or something) caused the PNecko to be destroyed, which I have no idea how it would happen?
Flags: needinfo?(nika)
Priority: -- → P3
Crash Signature: [@ mozilla::ipc::IPDLParamTraits<T>::Write] → [@ mozilla::ipc::IPDLParamTraits<T>::Write] [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::IPDLParamTraits<T>::Write]
OS: Mac OS X → All
this is a UAF crash in many cases (resetting the priority to re-evaluate under these circumstances).
Group: core-security
Keywords: csectype-uaf
Priority: P3 → --
Group: core-security → dom-core-security
:mystor can you take another look based on comment 3, since it seems this isn't actually null much of the time?
Flags: needinfo?(nika)
Sorry I don't have much time to look into this right now, I tried to think through the way this could happen, and I don't have any ideas.
Flags: needinfo?(nika)
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → INCOMPLETE
Why was this closed?
Flags: needinfo?(continuation)
Unactionable sec-high bugs make the list of open sec-high bug less useful.
Flags: needinfo?(continuation)
(In reply to Julien Cristau [:jcristau] from comment #6)
> Why was this closed?

The Dev says he has no ideas of how to fix this so this bug is just going to sit open until... In that situation, critsmash triage resolves them as incomplete so the list of sec-high (or critical) bugs is a list of bugs that are actually being worked on to a resolution.
No sense tracking a bug that isn't going anywhere. That said, Fx60 betas are still showing a pretty regular crash rate here...
Given the high crash rate here, it might be more appropriate to have this marked as stalled but to keep it open. However, given the below, I'm not yet sure we're stalled.

Peter Van Der Beken and I looked into this a bit and with Jeff Muizelaar's help, we were able to get to https://bit.ly/2LJRfGm vs https://bit.ly/2osbsHy (we weren't sure if the template parameter in the signature was catching too many unrelated crashes). It looks like the crashes with mozilla::net::ExtensionJARFileOpener::SendBackFD() on the stack are the most common. Haik touched that code last with reviews by Jim so I'll needinfo them here to see if they have ideas.
Status: RESOLVED → REOPENED
Flags: needinfo?(jmathies)
Flags: needinfo?(haftandilian)
Resolution: INCOMPLETE → ---
Duplicate of this bug: 1490677
I don't know what's causing this yet, but given what :nika and :philipp said above, I suspect this is due to inadequate synchronization in the NeckoParent handling of the async GetExtensionFD message. We handle that message by offloading work to another thread then jumping to the main thread to send back the result.

The GetExtensionFD message is sent from a child process when it needs to get the FD for an extension JAR file so it can open and read the JAR without having filesystem access to it directly. This async message is handled in the Necko parent process. The JAR file is opened in the parent process and the FD is sent back to the child. The JAR file is opened off the main thread in ExtensionJARFileOpener::OpenFile() and then the FD is sent back from the main thread in ExtensionJARFileOpener::SendBackFD(). Tear down of the NeckoParent needs to wait for any ExtensionJARFileOpener instances to complete, but that is not being done. As a fix, we could keep a list of outstanding ExtensionJARFileOpener instances and then wait for them to all complete duing teardown.

I need to do some more debugging to confirm that this is the problem. Taking this bug.
Assignee: nobody → haftandilian
Flags: needinfo?(haftandilian)
Priority: -- → P1
(In reply to Haik Aftandilian [:haik] from comment #12)
> Tear down of the
> NeckoParent needs to wait for any ExtensionJARFileOpener instances to
> complete, but that is not being done. As a fix, we could keep a list of
> outstanding ExtensionJARFileOpener instances and then wait for them to all
> complete duing teardown.

What I don't understand here is: if the actor is destroyed before the resolve function is called, the captured WeakPtr self__ should be null here: https://crash-stats.mozilla.com/sources/highlight/?url=https://gecko-generated-sources.s3.amazonaws.com/ccc17ebc155a410d6b3a59b3247a330b6e1af1ea1217cdeb22906fbcd899a89c5b1711fcfa6472957bff18c5957fe964a8d521b445c95db3e7e534fe63987c49/ipc/ipdl/PNeckoParent.cpp#L-2321

And in that case we shouldn't be able to reach the point where the crash is reported.  I was wondering if maybe something was happening on the wrong thread, but I didn't see any obvious way that would happen from looking at the code.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #13)
> What I don't understand here is:

See also comment #2.
A thing that's useful to know: on amd64, accesses to non-canonical addresses (where it's not a sign-extended 48-bit value) raise #GP, not #PF, and the CPU doesn't supply the address in question to the OS.  Linux and Mac report the address as 0, Windows as -1.

The 32-bit versions of this crash are almost all 0xe5e5e5e5 (or the 64-bit sign extension of that value).  Many of the 64-bit versions are the address we'd get if it were a non-canonical address, but some of them aren't.  This suggests comparing two crashes from the same Firefox build:

bp-677abd07-4382-4a24-94c1-1dad60180828 has crash address 0xe2d82c1351c; in the Raw Dump tab, rsi and rdi have the same value.

bp-5e96cbb7-72a6-4933-8bab-3765c0180825 reports crash address 0x0, but in the raw dump, rsi and rdi are 0xe5e5e5e5e5e5e5e5

So maybe the value we're getting from self__ is the result of loading from a dangling pointer, which would explain why it's not null.  Specifically:

Are we copy-constructing or destroying a copy of the resolver function off-main-thread?  It contains a non-thread-safe WeakPtr, so that could cause a lost update to the WeakReference refcount and explain what we're seeing.  WeakPtr has thread-safety assertions, but they're debug-only, so if this happens only in a timing-sensitive edge case they might not catch it.
bp-ccf7bb2a-d6b7-4ca1-bc27-ce8dc0180907 and bp-c8630a42-07c4-4c92-b6fb-ff4bd0180909 are from one of our builds, so I can do some diassembly.  The bad address really is the aActor argument to IPDLParamTraits<FileDescriptor>::Write, and it does appear to be result of converting self__ to a raw pointer (and is the same value checked against 0 at the start of the function).
Working with Jed, we've got an explanation for the crash. In short, the ExtensionJARFileOpener (and therefore the WeakPtr within the resolve closure) can be deleted off the main thread. WeakPtr isn't thread safe and in this case needs to be deleted on the main thread to avoid racing with other instances of WeakPtr<PNeckoParent> referring to the same PNeckoParent. This could cause the WeakReference's reference counting to be off which can cause it to be prematurely deleted.

ExtensionJARFileOpener gets deleted on the OpenFile thread if the SendBackFD Runnable it creates to send back the FD on the main thread completes before ExtensionJARFileOpener is released by the OpenFile thread.
For the record, I was wrong about the assertions.  Destroying a WeakPtr doesn't check thread safety; other operations (creation, assignment, dereference) do.
This patch clears the resolver on the main thread after it is used in SendBackFD. If the final Release of ExtensionJARFileOpener ends up being called after SendBackFD on the worker thread, the resolver will be already deleted and we'll avoid this crash because we won't be deleting a WeakPtr<PNeckoParent> off the main thread. I intentionally did not add comments because this is a security bug.

I haven't reproduced the reported crash, but I have reproduced the WeakPtr being deleted off the main thread which could lead to the crash. I'm still testing the patch.
Attachment #9009664 - Flags: review?(jld)
Flags: needinfo?(jmathies)
Comment on attachment 9009664 [details] [diff] [review]
Patch to avoid the crash be ensuring the PNeckoParent resolver is deleted on the main thread

Review of attachment 9009664 [details] [diff] [review]:
-----------------------------------------------------------------

I like the simplicity of this approach.

::: netwerk/protocol/res/ExtensionProtocolHandler.cpp
@@ +193,5 @@
>  private:
> +  virtual ~ExtensionJARFileOpener()
> +  {
> +    if (!NS_IsMainThread()) {
> +      MOZ_RELEASE_ASSERT(mResolve == nullptr);

This part of the patch seems like a relatively big hint about what the bug is, but whoever does the sec-approval would have a better idea of how much of a concern that is.

Specifically, this should work without the main thread check — as far as I can see, once we create one of these objects, every path either eventually calls SendBackFD or leaks it (if dispatch fails during shutdown).
Attachment #9009664 - Flags: review?(jld) → review+
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #20)
> Comment on attachment 9009664 [details] [diff] [review]
> ::: netwerk/protocol/res/ExtensionProtocolHandler.cpp
> @@ +193,5 @@
> >  private:
> > +  virtual ~ExtensionJARFileOpener()
> > +  {
> > +    if (!NS_IsMainThread()) {
> > +      MOZ_RELEASE_ASSERT(mResolve == nullptr);
> 
> This part of the patch seems like a relatively big hint about what the bug
> is, but whoever does the sec-approval would have a better idea of how much
> of a concern that is.

Thanks for the review. I'll err on the side of obfuscation and remove the new ASSERT. I'll use another bug to add more comments and add the ASSERT back once this lands.

> Specifically, this should work without the main thread check — as far as I
> can see, once we create one of these objects, every path either eventually
> calls SendBackFD or leaks it (if dispatch fails during shutdown).

I agree.
Removed the destructor ASSERT. Carrying forward r+.
Attachment #9009664 - Attachment is obsolete: true
Attachment #9010110 - Flags: review+
Attachment #9010110 - Flags: review+
Comment on attachment 9010110 [details] [diff] [review]
Patch to avoid the crash by ensuring the PNeckoParent resolver is deleted on the main thread

[Security approval request comment]
How easily could an exploit be constructed based on the patch?
Not easily. The ramifications of the bug aren't clear from the patch. The UAF occurs in the parent process, but the code is triggered from the child process via an IPC message. A compromised child process could repeatedly trigger the code to run in the parent and eventually hit the race condition causing the UAF. The compromised child would also have to control allocations in order to leverage the UAD.

I don't think web content or an extension could repeatedly trigger the code to run in the parent, hitting the UAF.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?
No, but please review the check-in comment.

Which older supported branches are affected by this flaw?
All (ESR60, Release, Beta, Nightly)

If not all supported branches, which bug introduced the flaw?

Do you have backports for the affected branches?
No

If not, how different, hard to create, and risky will they be?
The patches should all be the same 1-line change and should not be hard to create. Given the small size of the patch, I think the change is well understood and low risk.

How likely is this patch to cause regressions; how much testing does it need?
Low probability of causing regressions. Standard bug fix testing. Existing automated tests will exercise the changed code.
Attachment #9010110 - Flags: sec-approval?
Comment on attachment 9010110 [details] [diff] [review]
Patch to avoid the crash by ensuring the PNeckoParent resolver is deleted on the main thread

sec-approval+ for trunk. Let's get Beta and ESR60 patches nominated as well.
Attachment #9010110 - Flags: sec-approval? → sec-approval+
This problem was introduced by my fix for bug 1334550 "Proxy moz-extension protocol requests to the parent process".

Working on getting the other patches ready.
Attached patch Beta patchSplinter Review
Identical patch for Beta.
Attached patch Release patchSplinter Review
Identical patch for Release.
Attached patch ESR60 patchSplinter Review
Identical patch for ESR60.
Comment on attachment 9010773 [details] [diff] [review]
Beta patch

[Security approval request comment]
The code changes are the same as on Nightly.
See comment 23.
Attachment #9010773 - Flags: sec-approval?
Comment on attachment 9010774 [details] [diff] [review]
Release patch

[Security approval request comment]
The code changes are the same as on Nightly.
See comment 23.
Attachment #9010774 - Flags: sec-approval?
Comment on attachment 9010775 [details] [diff] [review]
ESR60 patch

[Security approval request comment]
The code changes are the same as on Nightly.
See comment 23.
Attachment #9010775 - Flags: sec-approval?
:RyanVM informed me sec-approval is only needed for the Nightly patch. I'll update patches to request standard approvals.
Comment on attachment 9010773 [details] [diff] [review]
Beta patch

Approval Request Comment
[Feature/Bug causing the regression]:
Bug 1334550

[User impact if declined]:
User is exposed to a UAF crash in the parent process.

[Is this code covered by automated tests?]:
Yes

[Has the fix been verified in Nightly?]:
No, not yet on Nightly due to it being a security bug.

[Needs manual test from QE? If yes, steps to reproduce]:
No specific tests that reproduce the crash. Try builds have been shown to exercise the changed code, but don't cause the crash because that requires a race condition that has not been reproduced.
 
[List of other uplifts needed for the feature/fix]:
None

[Is the change risky?]:
No

[Why is the change risky/not risky?]:
The change is small and well understood.

[String changes made/needed]:
None
Attachment #9010773 - Flags: sec-approval? → approval-mozilla-beta?
Comment on attachment 9010774 [details] [diff] [review]
Release patch

Approval Request Comment
[Feature/Bug causing the regression]:
Bug 1334550

[User impact if declined]:
User is exposed to a UAF crash in the parent process.

[Is this code covered by automated tests?]:
Yes

[Has the fix been verified in Nightly?]:
No, not yet on Nightly due to it being a security bug.

[Needs manual test from QE? If yes, steps to reproduce]:
No specific tests that reproduce the crash. Try builds have been shown to exercise the changed code, but don't cause the crash because that requires a race condition that has not been reproduced.
 
[List of other uplifts needed for the feature/fix]:
None

[Is the change risky?]:
No

[Why is the change risky/not risky?]:
The change is small and well understood.

[String changes made/needed]:
None
Attachment #9010774 - Flags: approval-mozilla-release?
Attachment #9010774 - Flags: sec-approval?
Comment on attachment 9010775 [details] [diff] [review]
ESR60 patch

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined:
sec-high
 
Fix Landed on Version:
Not yet landed.

Risk to taking this patch (and alternatives if risky):
Low risk due to the size of the change. It's in code used for loading JAR files which is used by extensions so it's frequently used. 

String or UUID changes made by this patch: 
None

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #9010775 - Flags: sec-approval? → approval-mozilla-esr60?
I filed bug 1493045 to assert thread-safety on WeakPtr destruction.  In this case, that would have caught the wrong-thread destruction even if it wouldn't have caused a lost update to the refcount, which is probably the much more common case, and hopefully (1) failed often enough that we'd have caught it in CI, and (2) pointed much more directly at the problem.

(I'm not linking the bugs with a See Also because I'd like to not put up a big sign saying the other bug is inspired by a live security issue.)
Comment on attachment 9010774 [details] [diff] [review]
Release patch

Unless this is being actively exploited it should go in 63.
Attachment #9010774 - Flags: approval-mozilla-release? → approval-mozilla-release-
Comment on attachment 9010773 [details] [diff] [review]
Beta patch

Uplift approved for 63 beta 9, thanks.
Attachment #9010773 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/mozilla-central/rev/bad3f3643548
Group: dom-core-security → core-security-release
Status: REOPENED → RESOLVED
Closed: Last year11 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
I also filed bug 1493342 about most of the 64-bit crashes having the wrong address.
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Comment on attachment 9010775 [details] [diff] [review]
ESR60 patch

Fixes a sec-high crash. Approved for ESR 60.3.
Attachment #9010775 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main63+][adv-esr60.3+]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.