Closed Bug 1602862 Opened 4 years ago Closed 3 years ago

Crash in [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes]

Categories

(Core :: IPC, defect, P2)

defect

Tracking

()

RESOLVED FIXED
90 Branch
Fission Milestone M8
Tracking Status
firefox-esr78 89+ fixed
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 + fixed
firefox90 + fixed

People

(Reporter: annyG, Assigned: jesup)

References

Details

(4 keywords, Whiteboard: [sec-survey][adv-main89+r][adv-esr78.11+r])

Crash Data

Attachments

(1 file)

This bug is for crash report bp-acb75d91-03c9-48e3-b66c-9aa1c0191210.

Top 10 frames of crashing thread:

0 libsystem_platform.dylib _platform_memmove$VARIANT$Haswell 
1 XUL Pickle::WriteBytes ipc/chromium/src/base/pickle.cc:630
2 XUL void mozilla::ipc::WriteIPDLParam<nsTArray<mozilla::Telemetry::KeyedHistogramAccumulation> const&> ipc/glue/IPDLParamTraits.h:60
3 XUL mozilla::dom::PContentChild::SendAccumulateChildKeyedHistograms ipc/ipdl/PContentChild.cpp:6566
4 XUL mozilla::TelemetryIPCAccumulator::IPCTimerFired toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:314
5 XUL mozilla::detail::RunnableFunction< xpcom/threads/nsThreadUtils.h:564
6 XUL mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:295
7 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1256
8 XUL NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
9 XUL mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87

Steps to reproduce (STR):
Remember to disable any adblockers. If this doesn't reproduce, one can also try following the steps in a private window.

  1. Go to https://www.soul-cycle.com/shop/ and click on any item
  2. Choose the size for the item and click 'add to bag'
  3. On the next page, click checkout
  4. Choose 'Continue as guest' and enter any email
  5. For shipping address, enter random letters for every field, then click 'Next'
  6. You will notice the input fields for billing information are greyed out (I believe that's bug 1556627)
  7. However, when you try to click around or try to type things (or sometimes even without it) the input iframes will soon crash. If that doesn't happen, you can also hard refresh the page.

Is IPDLParamTraits<nsTArray<T>> just missing a check for an empty array? I don't see how that case is handled, but it would be surprising that we wouldn't have hit this before...

No, I guess if the size of zero, then we don't hit the loop in BufferList<AllocPolicy>::WriteBytes.

Crash Signature: [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] → [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ acb75d91-03c9-48e3-b66c-9aa1c0191210]

I am currently trying to record it via rr to see what's happening

Crash Signature: [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ acb75d91-03c9-48e3-b66c-9aa1c0191210] → [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ acb75d91-03c9-48e3-b66c-9aa1c0191210]

I am not able to reproduce this on linux. I will try reproducing it on my mac by building Firefox myself.

I haven't been able to extract much from the crash report (the Breakpad symbol generation has made some unhelpful choices for how to deal with inlining, and we don't seem to have a copy of the full debug info), but:

  • It does seem to be a null source address for the memcpy, assuming I'm looking at the right instruction in my copy of the library
  • I notice that the nsTArray being sent is accessed from other threads. There's a mutex that's used while appending elements and taking the array to send it, but maybe I missed something and we can end up with a race where serializing sees a non-zero length and null element pointer.
Crash Signature: [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ acb75d91-03c9-48e3-b66c-9aa1c0191210] → [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes]

Most of the crashes with this signature are being called from mozilla::dom::PContentChild::SendAccumulateChildKeyedHistograms() (like in the original crash) but there's also one under PBackgroundStorageParent::SendOriginsHavingData() ( bp-4648566e-4f96-4f75-aeaa-db9760191209 ) and another under PBackgroundStorage::Preload ( bp-094e7eca-5ee9-4471-b497-84aac0191208 ). Those also involve code that deals with multiple threads, and arrays, so Jed's theory sounds reasonable to me.

Crash Signature: [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] → [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ __platform_memmove$VARIANT$Nehalem | Pickle::WriteBytes ]

I found a few variants.

Crash Signature: [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ __platform_memmove$VARIANT$Nehalem | Pickle::WriteBytes ] → [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ __platform_memmove$VARIANT$Nehalem | Pickle::WriteBytes ] [@ _platform_memmove$VARIANT$Merom | Pickle::WriteBytes ]

SendAccumulatedData has the comment "This method takes the lock only to double-buffer the batched telemetry. It releases the lock before calling out to IPC code which can (and does) Accumulate (which would deadlock)".

Ah, but that does swap out the data to a stack-local array. Ok.

Jed, to further my understanding of the code base can you please explain how you came to the following conclusion:

I notice that the nsTArray being sent is accessed from other threads. There's a mutex that's used while appending elements and taking the array to send it, but maybe I missed something and we can end up with a race where serializing sees a non-zero length and null element pointer.

How did you see that the nsTArray was being accessed from other threads?

Flags: needinfo?(jld)
Crash Signature: [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ __platform_memmove$VARIANT$Nehalem | Pickle::WriteBytes ] [@ _platform_memmove$VARIANT$Merom | Pickle::WriteBytes ] → [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ __platform_memmove$VARIANT$Nehalem | Pickle::WriteBytes ] [@ _platform_memmove$VARIANT$Merom | Pickle::WriteBytes ] [@ acb75d91-03c9-48e3-b66c-9aa1c0191210]
Crash Signature: [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ __platform_memmove$VARIANT$Nehalem | Pickle::WriteBytes ] [@ _platform_memmove$VARIANT$Merom | Pickle::WriteBytes ] [@ acb75d91-03c9-48e3-b66c-9aa1c0191210] → [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes] [@ __platform_memmove$VARIANT$Nehalem | Pickle::WriteBytes ] [@ _platform_memmove$VARIANT$Merom | Pickle::WriteBytes ]

(In reply to Anny Gakhokidze [:annyG] from comment #12)

Jed, to further my understanding of the code base can you please explain how you came to the following conclusion:

I notice that the nsTArray being sent is accessed from other threads. There's a mutex that's used while appending elements and taking the array to send it, but maybe I missed something and we can end up with a race where serializing sees a non-zero length and null element pointer.

How did you see that the nsTArray was being accessed from other threads?

Sorry; the way I wrote that was a little unclear. I don't know what was happening to the array in that specific case that crashed; inconsistency between the array's length and storage hints that a data race might be involved, but that's not conclusive. What I did was read the code starting at SendAccumulatedData (called from TelemetryIPCAccumulator::IPCTimerFired): it takes the gTelemetryIPCAccumulatorMutex while swapping global arrays with local variables, and I know that telemetry allows using probes on any thread. I also looked at uses of those global arrays, like gKeyedHistogramAccumulations; there's one call to AppendElement that I can see and it's protected by the same mutex (and, following the call graph, that locked AppendElement is used to implement Telemetry::Accumulate in a child process.)

Flags: needinfo?(jld)

We looked at this a little more in the IPC meeting. The null pointer probably isn't from the array, because the element type isn't a primitive numeric type. However, KeyedHistogramAccumulation contains an nsCString for the key, copy-constructed from whatever was passed into Telemetry::Accumulate. So, if something went wrong with the string ownership model, that could explain this. XPCOM strings use refcounted sharing, but the ref counts are thread-safe, so that's probably not it; see also bug 1353181. If there were a dependent string based on memory that didn't live long enough (because sadly this isn't Rust and we don't have lifetime types)… but would we see that as a null pointer dereference?

If we knew what histogram was involved (the HistogramID in the mId field), that could help narrow it down.

Priority: -- → P2

Jed, since I can reproduce this on my mac, I could add some prints to see what is happening at the time of the crash.

So here https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp#166 if I am to print aId, would that be helpful? Or should I print it elsewhere?

Flags: needinfo?(jld)

I can only reproduce this on mac using opt build unfortunately.. and I can't reproduce this without Fission (earlier mccr8 recommended I try to see how often this happens without fission).

This error still happens on latest nightly but less frequently. After the last step I have to refresh the page a couple of times while trying to click the input fields and type things. Sometimes it happens after I reopen the page without deleting any cookies or local storage data such that my items are still in the cart and I am taken to the billing information straight, and I try to enter things and then the input iframes crash. This does not happen often either.

(In reply to Anny Gakhokidze [:annyG] PTO till jan3rd from comment #15)

Jed, since I can reproduce this on my mac, I could add some prints to see what is happening at the time of the crash.

So here https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp#166 if I am to print aId, would that be helpful? Or should I print it elsewhere?

I was thinking later than that, like just before the call to SendAccumulateChildKeyedHistograms.

But on second thought I don't know if that will help. The problem seems to be getting a bad value for nsTStringRepr::mData, which suggests that the problem is how the nsTStrings themselves are being stored, and if that's the case then the specific histogram key wouldn't matter.

As comment #8 mentioned, there are a few crash reports for other messages (which all contain nsCString or nsString), and I've found a few more. They all seem to have an nsCString or nsString stored in an array, either directly or via a struct. So this might actually be a problem with nsTArray, and there might be non-string-related variants of it that we're not seeing here because they have different crash signatures.

Flags: needinfo?(jld)

I've looked at the latest crash reports, and none of them seem to be null pointers. It's hard to tell exactly what's going on in general without the raw minidumps because the crashing instruction is in a system library, but they all seem consistent with the crash being on a read from rsi. Specifically:

I also notice that, on the Mac crashes, r14 and rbx have a value (the same one) that usually has a similar pattern to the bad address but 32-bit, so it's probably the length. It's 0xe5e5e5e5 for bp-8ad0ba1f-35d7-4ee9-9b57-14ddb0191228 as expected, but also for bp-14122172-a9ce-4036-b959-7e50e0191228 (the page-crossing load).

In any case, this looks like a use-after-free. I still don't understand how it's possible from looking at the code, but it's hard to argue with the crash reports.

Group: dom-core-security

Uh. Is this another instance of nsTArray defaulting to memcpy rather than copy constructors biting us, somehow? We've certainly defined many IPDL types (and otherwise) to use copy constructors:

https://searchfox.org/mozilla-central/source/xpcom/ds/nsTArray.h#719-735

...except that the types being sent here (KeyedHistogramAccumulation) aren't defined in IPDL, and there's nothing particularly unusual about the member field types of KeyedHistogramAccumulation (integers and nsCString).

All the existing crashes involving arrays and strings (directly in the arrays or as member fields) does smell suspicious, though I can't see what the problem could be yet...

We were discussing this at the IPC meeting and Nika noticed that if an nsCString (or other nsTSubstring) is read from freed memory and the memory pattern for mDataFlags happens to have LITERAL set and REFCOUNTED clear, the pointer and length will just be copied without any dereferencing or other checks.

So, if some code somewhere else does a use-after-free of an nsCString value, then the bad bits could end up being copied (possibly multiple times) and not being caught until they propagate into IPC.

If that's the cause of this, then given the crash reports there could be multiple such bugs. Also, there would be other cases where the use-after-free didn't produce the right bit pattern and crashed earlier, but they'd have different crash signatures and there would be nothing to connect them to this bug.

We could try testing this by adding release assertions to try to catch bad values somewhere in the copy/Assign/AssignLiteral call chain, like checking for unassigned flags bits being set, or reading and discarding the data (some of it? all of it? only for purported literals?). That would probably be too expensive to ship, but it might find things in local testing.

Keywords: stalled

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #21)

We were discussing this at the IPC meeting and Nika noticed that if an nsCString (or other nsTSubstring) is read from freed memory and the memory pattern for mDataFlags happens to have LITERAL set and REFCOUNTED clear, the pointer and length will just be copied without any dereferencing or other checks.

This seems kind of plausible, but given the definitions of StringDataFlags:

https://searchfox.org/mozilla-central/source/xpcom/string/nsStringFlags.h

wouldn't a UAF string (not StringBuffer) have a flags of 0xe5e5, which would be TERMINATED | REFCOUNTED | LITERAL? Or is the presence of REFCOUNTED there a red herring, and the LITERAL takes precedence?

Is it worth considering rearranging the flags so that 0xe5e5 produces some combination that's liable to crash in a more recognizable way?

Jesup has some ideas to make forward progress here, un-stalling.

Assignee: nobody → rjesup
Keywords: stalled

(In reply to Nathan Froyd [:froydnj] from comment #22)

wouldn't a UAF string (not StringBuffer) have a flags of 0xe5e5, which would be TERMINATED | REFCOUNTED | LITERAL? Or is the presence of REFCOUNTED there a red herring, and the LITERAL takes precedence?

The copy code checks first for REFCOUNTED and then for LITERAL, so it would have to be literal and not refcounted. 0xe5e5 would send it into the refcounted path, which tries to access the refcount through mData.

The cases I looked at in comment #19 mostly didn't have the e5 pattern, which would make sense if the object needs to be at least partially overwritten as a new allocation to end up in this state.

I was wondering if this could be related to bug 1665411, because at least the original case with the telemetry histograms fits the pattern of a UAF on some data that's used/allocated/freed on different threads with what should be sufficient locking. But we also see this bug on Windows (e.g., bp-cb39403f-dfab-447a-8380-a8e790201028) and not just Mac, so it may not be.

Fission Milestone: M5 → ---
Fission Milestone: --- → MVP

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:jesup, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(rjesup)
Flags: needinfo?(jld)

No reports since September - maybe signature change?

Status: NEW → ASSIGNED
Fission Milestone: MVP → M8

It's worth noting that the majority of the crashes under those signatures don't appear to be doing UAF accesses. I could find only a handful that had the poison pattern somewhere in the registers.

Thanks for digging those up.

Here's one that does have a poison address: bp-2405096f-cb24-4f17-996f-3bcbf0210330

Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Extremely hard. There's no normal way to get invalid bits into the type. At most it might encourage someone to try to provoke the new assertions, but there's no way to know how to do so. The comments mention invalid bits, but that's going to be obvious with or without the comments.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Yes
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?: Should apply with no problems to all branches.
  • How likely is this patch to cause regressions; how much testing does it need?: Almost impossible to have regressions
Flags: needinfo?(rjesup)
Attachment #9192249 - Flags: sec-approval?

Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld

Approved to land and request uplift if that is desired.

Attachment #9192249 - Flags: sec-approval? → sec-approval+
Group: dom-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch

As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.

Please visit this google form to reply.

Flags: needinfo?(rjesup)
Whiteboard: [sec-survey]
Flags: needinfo?(jld)

This grafts cleanly all the way down to ESR78. Please nominate for Beta/ESR78 approval when you get a chance.

Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld

Beta/Release Uplift Approval Request

  • User impact if declined: No added saneness checks
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Just adds some assertions in obviously-bad cases (corruption)
  • String changes made/needed: none
Flags: needinfo?(rjesup)
Attachment #9192249 - Flags: approval-mozilla-beta?

Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration:
  • User impact if declined: Won't assert on corruption of strings
  • Fix Landed on Version: 90
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): checks for invalid flag bits being set
  • String or UUID changes made by this patch: none
Attachment #9192249 - Flags: approval-mozilla-esr78?

Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld

Uplift approved for beta and ESR, thanks

Attachment #9192249 - Flags: approval-mozilla-esr78?
Attachment #9192249 - Flags: approval-mozilla-esr78+
Attachment #9192249 - Flags: approval-mozilla-beta?
Attachment #9192249 - Flags: approval-mozilla-beta+
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-
Whiteboard: [sec-survey] → [sec-survey][adv-main89+r]
Whiteboard: [sec-survey][adv-main89+r] → [sec-survey][adv-main89+r][adv-esr78.11+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: