Crash in [@ _platform_memmove$VARIANT$Haswell | Pickle::WriteBytes]
Categories
(Core :: IPC, defect, P2)
Tracking
()
People
(Reporter: annyG, Assigned: jesup)
References
Details
(4 keywords, Whiteboard: [sec-survey][adv-main89+r][adv-esr78.11+r])
Crash Data
Attachments
(1 file)
47 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
pascalc
:
approval-mozilla-esr78+
tjr
:
sec-approval+
|
Details | Review |
This bug is for crash report bp-acb75d91-03c9-48e3-b66c-9aa1c0191210.
Top 10 frames of crashing thread:
0 libsystem_platform.dylib _platform_memmove$VARIANT$Haswell
1 XUL Pickle::WriteBytes ipc/chromium/src/base/pickle.cc:630
2 XUL void mozilla::ipc::WriteIPDLParam<nsTArray<mozilla::Telemetry::KeyedHistogramAccumulation> const&> ipc/glue/IPDLParamTraits.h:60
3 XUL mozilla::dom::PContentChild::SendAccumulateChildKeyedHistograms ipc/ipdl/PContentChild.cpp:6566
4 XUL mozilla::TelemetryIPCAccumulator::IPCTimerFired toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp:314
5 XUL mozilla::detail::RunnableFunction< xpcom/threads/nsThreadUtils.h:564
6 XUL mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:295
7 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1256
8 XUL NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
9 XUL mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87
Reporter | ||
Comment 1•4 years ago
•
|
||
Steps to reproduce (STR):
Remember to disable any adblockers. If this doesn't reproduce, one can also try following the steps in a private window.
- Go to https://www.soul-cycle.com/shop/ and click on any item
- Choose the size for the item and click 'add to bag'
- On the next page, click checkout
- Choose 'Continue as guest' and enter any email
- For shipping address, enter random letters for every field, then click 'Next'
- You will notice the input fields for billing information are greyed out (I believe that's bug 1556627)
- However, when you try to click around or try to type things (or sometimes even without it) the input iframes will soon crash. If that doesn't happen, you can also hard refresh the page.
Comment 3•4 years ago
|
||
Is IPDLParamTraits<nsTArray<T>> just missing a check for an empty array? I don't see how that case is handled, but it would be surprising that we wouldn't have hit this before...
Comment 4•4 years ago
|
||
No, I guess if the size of zero, then we don't hit the loop in BufferList<AllocPolicy>::WriteBytes.
Updated•4 years ago
|
Reporter | ||
Comment 5•4 years ago
|
||
I am currently trying to record it via rr to see what's happening
Reporter | ||
Comment 6•4 years ago
|
||
I am not able to reproduce this on linux. I will try reproducing it on my mac by building Firefox myself.
Comment 7•4 years ago
|
||
I haven't been able to extract much from the crash report (the Breakpad symbol generation has made some unhelpful choices for how to deal with inlining, and we don't seem to have a copy of the full debug info), but:
- It does seem to be a null source address for the
memcpy
, assuming I'm looking at the right instruction in my copy of the library - I notice that the
nsTArray
being sent is accessed from other threads. There's a mutex that's used while appending elements and taking the array to send it, but maybe I missed something and we can end up with a race where serializing sees a non-zero length and null element pointer.
Updated•4 years ago
|
Comment 8•4 years ago
|
||
Most of the crashes with this signature are being called from mozilla::dom::PContentChild::SendAccumulateChildKeyedHistograms() (like in the original crash) but there's also one under PBackgroundStorageParent::SendOriginsHavingData() ( bp-4648566e-4f96-4f75-aeaa-db9760191209 ) and another under PBackgroundStorage::Preload ( bp-094e7eca-5ee9-4471-b497-84aac0191208 ). Those also involve code that deals with multiple threads, and arrays, so Jed's theory sounds reasonable to me.
Updated•4 years ago
|
Comment 9•4 years ago
|
||
I found a few variants.
Comment 10•4 years ago
|
||
SendAccumulatedData has the comment "This method takes the lock only to double-buffer the batched telemetry. It releases the lock before calling out to IPC code which can (and does) Accumulate (which would deadlock)".
Comment 11•4 years ago
|
||
Ah, but that does swap out the data to a stack-local array. Ok.
Reporter | ||
Comment 12•4 years ago
|
||
Jed, to further my understanding of the code base can you please explain how you came to the following conclusion:
I notice that the nsTArray being sent is accessed from other threads. There's a mutex that's used while appending elements and taking the array to send it, but maybe I missed something and we can end up with a race where serializing sees a non-zero length and null element pointer.
How did you see that the nsTArray
was being accessed from other threads?
Reporter | ||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 13•4 years ago
|
||
(In reply to Anny Gakhokidze [:annyG] from comment #12)
Jed, to further my understanding of the code base can you please explain how you came to the following conclusion:
I notice that the nsTArray being sent is accessed from other threads. There's a mutex that's used while appending elements and taking the array to send it, but maybe I missed something and we can end up with a race where serializing sees a non-zero length and null element pointer.
How did you see that the
nsTArray
was being accessed from other threads?
Sorry; the way I wrote that was a little unclear. I don't know what was happening to the array in that specific case that crashed; inconsistency between the array's length and storage hints that a data race might be involved, but that's not conclusive. What I did was read the code starting at SendAccumulatedData
(called from TelemetryIPCAccumulator::IPCTimerFired
): it takes the gTelemetryIPCAccumulatorMutex
while swapping global arrays with local variables, and I know that telemetry allows using probes on any thread. I also looked at uses of those global arrays, like gKeyedHistogramAccumulations
; there's one call to AppendElement
that I can see and it's protected by the same mutex (and, following the call graph, that locked AppendElement
is used to implement Telemetry::Accumulate
in a child process.)
Comment 14•4 years ago
|
||
We looked at this a little more in the IPC meeting. The null pointer probably isn't from the array, because the element type isn't a primitive numeric type. However, KeyedHistogramAccumulation
contains an nsCString
for the key, copy-constructed from whatever was passed into Telemetry::Accumulate
. So, if something went wrong with the string ownership model, that could explain this. XPCOM strings use refcounted sharing, but the ref counts are thread-safe, so that's probably not it; see also bug 1353181. If there were a dependent string based on memory that didn't live long enough (because sadly this isn't Rust and we don't have lifetime types)… but would we see that as a null pointer dereference?
If we knew what histogram was involved (the HistogramID
in the mId
field), that could help narrow it down.
Reporter | ||
Comment 15•4 years ago
|
||
Jed, since I can reproduce this on my mac, I could add some prints to see what is happening at the time of the crash.
So here https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp#166 if I am to print aId
, would that be helpful? Or should I print it elsewhere?
Reporter | ||
Comment 16•4 years ago
|
||
I can only reproduce this on mac using opt build unfortunately.. and I can't reproduce this without Fission (earlier mccr8 recommended I try to see how often this happens without fission).
Reporter | ||
Comment 17•4 years ago
|
||
This error still happens on latest nightly but less frequently. After the last step I have to refresh the page a couple of times while trying to click the input fields and type things. Sometimes it happens after I reopen the page without deleting any cookies or local storage data such that my items are still in the cart and I am taken to the billing information straight, and I try to enter things and then the input iframes crash. This does not happen often either.
Comment 18•4 years ago
|
||
(In reply to Anny Gakhokidze [:annyG] PTO till jan3rd from comment #15)
Jed, since I can reproduce this on my mac, I could add some prints to see what is happening at the time of the crash.
So here https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/toolkit/components/telemetry/core/ipc/TelemetryIPCAccumulator.cpp#166 if I am to print
aId
, would that be helpful? Or should I print it elsewhere?
I was thinking later than that, like just before the call to SendAccumulateChildKeyedHistograms
.
But on second thought I don't know if that will help. The problem seems to be getting a bad value for nsTStringRepr::mData
, which suggests that the problem is how the nsTString
s themselves are being stored, and if that's the case then the specific histogram key wouldn't matter.
As comment #8 mentioned, there are a few crash reports for other messages (which all contain nsCString
or nsString
), and I've found a few more. They all seem to have an nsCString
or nsString
stored in an array, either directly or via a struct. So this might actually be a problem with nsTArray
, and there might be non-string-related variants of it that we're not seeing here because they have different crash signatures.
Comment 19•4 years ago
•
|
||
I've looked at the latest crash reports, and none of them seem to be null pointers. It's hard to tell exactly what's going on in general without the raw minidumps because the crashing instruction is in a system library, but they all seem consistent with the crash being on a read from rsi
. Specifically:
- Unmapped (but otherwise valid) addresses: bp-5981c630-71f5-4001-917a-4d8450191227, bp-448ec26e-d61a-4bbf-b2cc-059ea0191227
- Seemingly a vector load that crosses a page boundary and faults at the start of the second page: bp-14122172-a9ce-4036-b959-7e50e0191228
- “Addresses” that look like UTF-16 text: bp-6c8746ea-a453-40eb-8d92-d5d450191225, bp-15c74168-4c8e-497d-a185-181230191229, bp-195d3fd0-7dd9-4341-bb1a-d38ef0191229
- Last but not least, the memory poison pattern
0xe5e5e5e5e5e5e5e5
: bp-8ad0ba1f-35d7-4ee9-9b57-14ddb0191228
I also notice that, on the Mac crashes, r14
and rbx
have a value (the same one) that usually has a similar pattern to the bad address but 32-bit, so it's probably the length. It's 0xe5e5e5e5
for bp-8ad0ba1f-35d7-4ee9-9b57-14ddb0191228 as expected, but also for bp-14122172-a9ce-4036-b959-7e50e0191228 (the page-crossing load).
In any case, this looks like a use-after-free. I still don't understand how it's possible from looking at the code, but it's hard to argue with the crash reports.
Updated•4 years ago
|
Comment 20•4 years ago
|
||
Uh. Is this another instance of nsTArray
defaulting to memcpy
rather than copy constructors biting us, somehow? We've certainly defined many IPDL types (and otherwise) to use copy constructors:
https://searchfox.org/mozilla-central/source/xpcom/ds/nsTArray.h#719-735
...except that the types being sent here (KeyedHistogramAccumulation
) aren't defined in IPDL, and there's nothing particularly unusual about the member field types of KeyedHistogramAccumulation
(integers and nsCString
).
All the existing crashes involving arrays and strings (directly in the arrays or as member fields) does smell suspicious, though I can't see what the problem could be yet...
Comment 21•4 years ago
|
||
We were discussing this at the IPC meeting and Nika noticed that if an nsCString
(or other nsTSubstring
) is read from freed memory and the memory pattern for mDataFlags
happens to have LITERAL
set and REFCOUNTED
clear, the pointer and length will just be copied without any dereferencing or other checks.
So, if some code somewhere else does a use-after-free of an nsCString
value, then the bad bits could end up being copied (possibly multiple times) and not being caught until they propagate into IPC.
If that's the cause of this, then given the crash reports there could be multiple such bugs. Also, there would be other cases where the use-after-free didn't produce the right bit pattern and crashed earlier, but they'd have different crash signatures and there would be nothing to connect them to this bug.
We could try testing this by adding release assertions to try to catch bad values somewhere in the copy/Assign
/AssignLiteral
call chain, like checking for unassigned flags bits being set, or reading and discarding the data (some of it? all of it? only for purported literals?). That would probably be too expensive to ship, but it might find things in local testing.
Updated•4 years ago
|
Comment 22•4 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #21)
We were discussing this at the IPC meeting and Nika noticed that if an
nsCString
(or othernsTSubstring
) is read from freed memory and the memory pattern formDataFlags
happens to haveLITERAL
set andREFCOUNTED
clear, the pointer and length will just be copied without any dereferencing or other checks.
This seems kind of plausible, but given the definitions of StringDataFlags
:
https://searchfox.org/mozilla-central/source/xpcom/string/nsStringFlags.h
wouldn't a UAF string (not StringBuffer)
have a flags of 0xe5e5
, which would be TERMINATED | REFCOUNTED | LITERAL
? Or is the presence of REFCOUNTED
there a red herring, and the LITERAL
takes precedence?
Is it worth considering rearranging the flags so that 0xe5e5
produces some combination that's liable to crash in a more recognizable way?
Comment 23•4 years ago
|
||
Jesup has some ideas to make forward progress here, un-stalling.
Comment 24•4 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #22)
wouldn't a UAF string (not
StringBuffer)
have a flags of0xe5e5
, which would beTERMINATED | REFCOUNTED | LITERAL
? Or is the presence ofREFCOUNTED
there a red herring, and theLITERAL
takes precedence?
The copy code checks first for REFCOUNTED
and then for LITERAL
, so it would have to be literal and not refcounted. 0xe5e5
would send it into the refcounted path, which tries to access the refcount through mData
.
The cases I looked at in comment #19 mostly didn't have the e5
pattern, which would make sense if the object needs to be at least partially overwritten as a new allocation to end up in this state.
Comment 25•4 years ago
|
||
I was wondering if this could be related to bug 1665411, because at least the original case with the telemetry histograms fits the pattern of a UAF on some data that's used/allocated/freed on different threads with what should be sufficient locking. But we also see this bug on Windows (e.g., bp-cb39403f-dfab-447a-8380-a8e790201028) and not just Mac, so it may not be.
Updated•3 years ago
|
Assignee | ||
Comment 26•3 years ago
|
||
Updated•3 years ago
|
Comment 27•3 years ago
|
||
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:jesup, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 28•3 years ago
|
||
No reports since September - maybe signature change?
Comment 29•3 years ago
|
||
Comment 30•3 years ago
|
||
It's worth noting that the majority of the crashes under those signatures don't appear to be doing UAF accesses. I could find only a handful that had the poison pattern somewhere in the registers.
Comment 31•3 years ago
|
||
Thanks for digging those up.
Here's one that does have a poison address: bp-2405096f-cb24-4f17-996f-3bcbf0210330
Assignee | ||
Comment 32•3 years ago
|
||
Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Extremely hard. There's no normal way to get invalid bits into the type. At most it might encourage someone to try to provoke the new assertions, but there's no way to know how to do so. The comments mention invalid bits, but that's going to be obvious with or without the comments.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Yes
- Which older supported branches are affected by this flaw?: all
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: Yes
- If not, how different, hard to create, and risky will they be?: Should apply with no problems to all branches.
- How likely is this patch to cause regressions; how much testing does it need?: Almost impossible to have regressions
Comment 33•3 years ago
|
||
Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld
Approved to land and request uplift if that is desired.
Comment 34•3 years ago
|
||
Add checks for invalid StringClassFlags r=jld
https://hg.mozilla.org/integration/autoland/rev/630f37b668d88d30886d42c76e76ee0bef0afd0a
https://hg.mozilla.org/mozilla-central/rev/630f37b668d8
Comment 35•3 years ago
|
||
As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.
Please visit this google form to reply.
Updated•3 years ago
|
Comment 36•3 years ago
|
||
This grafts cleanly all the way down to ESR78. Please nominate for Beta/ESR78 approval when you get a chance.
Assignee | ||
Comment 37•3 years ago
|
||
Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld
Beta/Release Uplift Approval Request
- User impact if declined: No added saneness checks
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Just adds some assertions in obviously-bad cases (corruption)
- String changes made/needed: none
Assignee | ||
Comment 38•3 years ago
|
||
Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration:
- User impact if declined: Won't assert on corruption of strings
- Fix Landed on Version: 90
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): checks for invalid flag bits being set
- String or UUID changes made by this patch: none
Comment 39•3 years ago
|
||
Comment on attachment 9192249 [details]
Bug 1602862: Add checks for invalid StringClassFlags r=jld
Uplift approved for beta and ESR, thanks
Comment 40•3 years ago
|
||
uplift |
Comment 41•3 years ago
|
||
uplift |
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•2 years ago
|
Description
•