Closed
Bug 1406727
Opened 7 years ago
Closed 3 years ago
Crash in nsDisplayWrapList::Destroy
Categories
(Core :: Web Painting, defect, P2)
Tracking
()
RESOLVED
WORKSFORME
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | --- | wontfix |
firefox56 | --- | unaffected |
firefox57 | + | wontfix |
firefox58 | + | wontfix |
firefox59 | + | wontfix |
firefox60 | + | wontfix |
firefox61 | - | wontfix |
firefox62 | - | wontfix |
firefox63 | --- | wontfix |
firefox64 | --- | wontfix |
firefox65 | --- | wontfix |
firefox66 | --- | wontfix |
firefox67 | --- | wontfix |
firefox68 | --- | wontfix |
firefox69 | --- | wontfix |
People
(Reporter: philipp, Unassigned)
References
Details
(5 keywords)
Crash Data
Attachments
(6 files, 1 obsolete file)
47.92 KB,
patch
|
froydnj
:
review+
mstange
:
review+
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
2.83 KB,
patch
|
froydnj
:
review+
francois
:
review+
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
44.94 KB,
patch
|
froydnj
:
review+
gchang
:
approval-mozilla-beta-
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
6.34 KB,
patch
|
froydnj
:
review+
abillings
:
approval-mozilla-beta+
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
8.18 KB,
patch
|
mattwoodrow
:
review+
abillings
:
approval-mozilla-beta+
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
6.53 KB,
patch
|
mstange
:
review+
abillings
:
approval-mozilla-beta+
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-a501024d-a454-4a33-b035-c06fe0171008.
=============================================================
Crashing Thread (0)
Frame Module Signature Source
0 xul.dll nsDisplayWrapList::Destroy(nsDisplayListBuilder*) layout/painting/nsDisplayList.h:4015
1 xul.dll nsDisplayWrapList::Destroy(nsDisplayListBuilder*) layout/painting/nsDisplayList.h:4015
2 xul.dll nsDisplayWrapList::Destroy(nsDisplayListBuilder*) layout/painting/nsDisplayList.h:4015
3 xul.dll nsDisplayList::DeleteAll(nsDisplayListBuilder*) layout/painting/nsDisplayList.cpp:2413
4 xul.dll nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) layout/base/nsLayoutUtils.cpp:3888
5 xul.dll mozilla::PresShell::Paint(nsView*, nsRegion const&, unsigned int) layout/base/PresShell.cpp:6454
6 xul.dll nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) view/nsViewManager.cpp:480
7 xul.dll nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) view/nsViewManager.cpp:412
8 xul.dll nsViewManager::ProcessPendingUpdates() view/nsViewManager.cpp:1102
9 xul.dll nsRefreshDriver::Tick(__int64, mozilla::TimeStamp) layout/base/nsRefreshDriver.cpp:2082
10 xul.dll nsRefreshDriver::DoTick() layout/base/nsRefreshDriver.cpp:1526
11 xul.dll nsRefreshDriver::FinishedWaitingForTransaction() layout/base/nsRefreshDriver.cpp:2189
12 xul.dll nsRefreshDriver::NotifyTransactionCompleted(unsigned __int64) layout/base/nsRefreshDriver.cpp:2256
13 xul.dll mozilla::layers::ClientLayerManager::DidComposite(unsigned __int64, mozilla::TimeStamp const&, mozilla::TimeStamp const&) gfx/layers/client/ClientLayerManager.cpp:529
14 xul.dll mozilla::dom::TabChild::DidComposite(unsigned __int64, mozilla::TimeStamp const&, mozilla::TimeStamp const&) dom/ipc/TabChild.cpp:3079
15 xul.dll mozilla::layers::CompositorBridgeChild::RecvDidComposite(unsigned __int64 const&, unsigned __int64 const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&) gfx/layers/ipc/CompositorBridgeChild.cpp:540
16 xul.dll mozilla::layers::PCompositorBridgeChild::OnMessageReceived(IPC::Message const&) ipc/ipdl/PCompositorBridgeChild.cpp:1473
17 xul.dll mozilla::layers::PCompositorManagerChild::OnMessageReceived(IPC::Message const&) ipc/ipdl/PCompositorManagerChild.cpp:161
18 xul.dll mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) ipc/glue/MessageChannel.cpp:2119
19 xul.dll mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message&&) ipc/glue/MessageChannel.cpp:2049
20 xul.dll mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) ipc/glue/MessageChannel.cpp:1895
21 xul.dll mozilla::ipc::MessageChannel::MessageTask::Run() ipc/glue/MessageChannel.cpp:1928
22 xul.dll mozilla::SchedulerGroup::Runnable::Run() xpcom/threads/SchedulerGroup.cpp:396
23 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1039
24 xul.dll mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:97
25 xul.dll mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:301
26 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc:319
27 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc:299
28 xul.dll nsBaseAppShell::Run() widget/nsBaseAppShell.cpp:158
29 xul.dll nsAppShell::Run() widget/windows/nsAppShell.cpp:230
30 xul.dll XRE_RunAppShell() toolkit/xre/nsEmbedFunctions.cpp:880
31 xul.dll mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:269
32 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc:319
33 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc:299
34 xul.dll XRE_InitChildProcess(int, char** const, XREChildData const*) toolkit/xre/nsEmbedFunctions.cpp:705
35 xul.dll mozilla::BootstrapImpl::XRE_InitChildProcess(int, char** const, XREChildData const*) toolkit/xre/Bootstrap.cpp:65
36 firefox.exe content_process_main(mozilla::Bootstrap*, int, char** const) ipc/contentproc/plugin-container.cpp:63
37 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:115
38 firefox.exe __scrt_common_main_seh f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
39 kernel32.dll BaseThreadInitThunk
40 ntdll.dll __RtlUserThreadStart
41 ntdll.dll _RtlUserThreadStart
crash reports with this signature are newly showing up in firefox 57 - so far most of the time (86%) in the content process, from builds on windows & apparently in a codepath added in bug 1388162.
Updated•7 years ago
|
Flags: needinfo?(matt.woodrow)
Priority: -- → P2
Comment 1•7 years ago
|
||
How frequent is this crash?
I suspect this is a duplicate of bug 1198710, just with the new signature.
Flags: needinfo?(matt.woodrow)
Comment 2•7 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #1)
> How frequent is this crash?
>
> I suspect this is a duplicate of bug 1198710, just with the new signature.
A bit higher than bug 1198710 within the period of 57 beta.
It seems the display item address returned from RemoveBottm() could be invalid or contained invalid child items.
Do you have any idea how to progress the investigation?
Comment 3•7 years ago
|
||
[Tracking Requested - why for this release]:
UAF sec-critical regression in 57
Crashes are all wildptr and/or clear UAFs. Worse, they're READ, WRITE and EXEC crashes.
This may be a dup of bug 1198710 with a new signature -- that shows similar problems and probably should be marked as sec.
Comment 4•7 years ago
|
||
Tracking 57+ based on Randell's assessment in Comment 3. There are a few crashes in 58 as well.
The other signature nsDisplayList::DeleteAll in Bug 1198710 has crashes in 57 as well, but also has crashes dating back several releases.
Although this isn't high volume in beta, would be good to understand more about why these appear more severe in 57.
Comment 5•7 years ago
|
||
(In reply to Marcia Knous [:marcia - use ni] from comment #4)
> The other signature nsDisplayList::DeleteAll in Bug 1198710 has crashes in
> 57 as well, but also has crashes dating back several releases.
>
> Although this isn't high volume in beta, would be good to understand more
> about why these appear more severe in 57.
The other ones on bug 1198710 are sec crashes too, just with a different signature. Might be different bugs, but I'm betting on the same bug. We probably should mark that one as sec as well, though it's been open a long time
NI dveditz to decide on the other bug
Flags: needinfo?(dveditz)
Comment 6•7 years ago
|
||
It looks like we have a display item in the list that contains garbage memory. Trying to call Destroy() on it crashes.
Display items are all allocated in an arena, which we free once we've finished painting (and after this call).
If we had already free'd this display item, I'd still expect the memory contents to be valid since we don't do any sort of recycling of the arena memory, and the dtor shouldn't be writing garbage into our vtable pointer.
A few of the crashes (though not many) even look like they have the poison value (0xE5) written into them. I can't think of anywhere where we'd write the poison value into the still-valid arena.
So it seems like either have corruption, something else is clobbering the contents of this display item. Possibly an invalid static_cast? That doesn't explain where the poison value comes from though.
Another possibility is that the display item was allocated into a different arena, one that has been actually released to the OS at this point (and then can be clobbered/poisoned by another user of the same memory).
I can't see how this would happen though. We definitely have cases where we can recurse into nsLayoutUtils::PaintFrame which creates a new builder/arena, but this doesn't pass through the display list from the outer caller.
There would have to be some way of a display item, display list, or display list builder crossing this boundary, and then us adding an item to the wrong list.
I did a quick audit and can't find anywhere that we cache any of these items in a way that the wrong could access them, but it's possible I missed something.
Not sure how to make further progress without STR.
Comment 7•7 years ago
|
||
We could try this, it should at least narrow the options down.
Might not be ideal for performance, so I think we'd want to back this out once we get results (ideally in Nightly, but Beta at the latest).
Nathan, would you be ok with adding something like this? I could add lots of comments telling people they aren't allowed to use the new functions except for this bug, but we also don't want to draw too much attention.
Assignee: nobody → matt.woodrow
Attachment #8917655 -
Flags: feedback?
Updated•7 years ago
|
Attachment #8917655 -
Flags: feedback? → feedback?(nfroyd)
Updated•7 years ago
|
Group: core-security → layout-core-security
Comment 8•7 years ago
|
||
Could we do some ifdefs to restrict to nightly or to early-beta-and-earlier?
And/or add some code to check arena-it-was-allocated-from when elements are added? Or before releasing the list?
Flags: needinfo?(matt.woodrow)
Comment 9•7 years ago
|
||
On the other bug, I see this crash: https://crash-stats.mozilla.com/report/index/551eb5bf-6b3b-4e58-a5d1-187e20171007
That's an EXEC crash with an e5e5 address in 57b7 (and there are quite a few wildptr EXEC crashes there).
I'm going to mark it as sec also. I strongly suspect these two signatures are part of the same bug, but will leave separate for now.
Comment 10•7 years ago
|
||
Comment on attachment 8917655 [details] [diff] [review]
assert-display-item
Review of attachment 8917655 [details] [diff] [review]:
-----------------------------------------------------------------
I think we could do something like this, sure. Definitely says that something fishy is going on with this code, though. =/
I like jesup's idea about trying to restrict this to nightly/early beta.
::: xpcom/ds/ArenaAllocator.h
@@ +178,5 @@
> +
> + bool DebugContains(void* aPtr)
> + {
> + return aPtr >= reinterpret_cast<void*>(header.offset) &&
> + aPtr < reinterpret_cast<void*>(header.tail);
This should be written as:
uintptr_t p = reinterpret_cast<uintptr_t>(aPtr);
return p >= header.offset && p < header.tail;
due to rules in the standard about comparing pointers: aPtr doesn't necessarily point into the same object bounded by header.{offset,tail}, so it can't be compared to them.
Attachment #8917655 -
Flags: feedback?(nfroyd) → feedback+
(In reply to Nathan Froyd [:froydnj] from comment #10)
> Comment on attachment 8917655 [details] [diff] [review]
> assert-display-item
> ...
>
> I like jesup's idea about trying to restrict this to nightly/early beta.
MOZ_DIAGNOSTIC_ASSERT instead of MOZ_RELEASE_ASSERT? It'd hit nightly and dev edition, but not beta and release.
Flags: needinfo?(milan)
Comment 12•7 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #8)
> And/or add some code to check arena-it-was-allocated-from when elements are
> added? Or before releasing the list?
Checking the arena when elements are added is what we really want I think, but it's considerably bigger patch to write, so I thought this would be a good start to see if we're on the right track.
I can bite the bullet and write the bigger patch if you think it's worth it.
(In reply to Nathan Froyd [:froydnj] from comment #10)
> I think we could do something like this, sure. Definitely says that
> something fishy is going on with this code, though. =/
It does :( I'd want to remove this again once we figure out exactly what that is.
Flags: needinfo?(matt.woodrow)
Comment 13•7 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #12)
> Checking the arena when elements are added is what we really want I think,
> but it's considerably bigger patch to write, so I thought this would be a
> good start to see if we're on the right track.
>
> I can bite the bullet and write the bigger patch if you think it's worth it.
I'm far from the final arbiter here, but I think it is, since this is a LONG-standing sec-critical (see bug 1198710).
Comment hidden (me-too) |
Comment 15•7 years ago
|
||
Attachment #8917655 -
Attachment is obsolete: true
Attachment #8918686 -
Flags: review?(nfroyd)
Attachment #8918686 -
Flags: review?(mstange)
Comment 16•7 years ago
|
||
Comment on attachment 8918686 [details] [diff] [review]
assert-display-item
Review of attachment 8918686 [details] [diff] [review]:
-----------------------------------------------------------------
r+ on everything outside ArenaAllocator.h
Attachment #8918686 -
Flags: review?(mstange) → review+
Comment 17•7 years ago
|
||
Comment on attachment 8918686 [details] [diff] [review]
assert-display-item
Review of attachment 8918686 [details] [diff] [review]:
-----------------------------------------------------------------
r=me on the ArenaAllocator.h part, but we should talk about what the right test below is.
::: xpcom/ds/ArenaAllocator.h
@@ +183,5 @@
> +#ifdef MOZ_DIAGNOSTIC_ASSERT_ENABLED
> + bool DebugContains(void* aPtr)
> + {
> + return aPtr >= (this + 1) &&
> + reinterpret_cast<uintptr_t>(aPtr) < header.tail;
Wait, what? Oh, I see, because `this` is actually embedded in the arena, and testing:
header.offset <= aPtr && aPtr < header.tail
is not the correct test, because we'd be checking to see whether the pointer lives in the *unallocated* part of the arena.
So is the right test something like
(this + 1) <= aPtr && aPtr < header.offset
since that would be checking for things actually allocated? Or do we want to use the test written here?
Either way, I think we want to write the first part of the test as:
reinterpret_cast<uintptr_t>(aPtr) >= reinterpret_cast<uintptr_t>(this + 1)
for the aforementioned pointer comparison issues.
Attachment #8918686 -
Flags: review?(nfroyd) → review+
Comment 18•7 years ago
|
||
Comment on attachment 8918686 [details] [diff] [review]
assert-display-item
[Security approval request comment]
How easily could an exploit be constructed based on the patch?
Not easily, we still haven't figured out what the issue is.
Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?
It makes the outcome fairly obvious, but doesn't really help finding the cause (until we get crash reports).
Which older supported branches are affected by this flaw?
All.
Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?
No, but this code hasn't changed much, should be easy.
How likely is this patch to cause regressions; how much testing does it need?
Unlikely, just adding some new assertions.
Attachment #8918686 -
Flags: sec-approval?
Comment 19•7 years ago
|
||
Comment on attachment 8918686 [details] [diff] [review]
assert-display-item
sec-approval+ for trunk.
Is there any point in backporting this to beta if we're just adding assertions?
Flags: needinfo?(matt.woodrow)
Attachment #8918686 -
Flags: sec-approval? → sec-approval+
Comment 20•7 years ago
|
||
(In reply to Al Billings [:abillings] from comment #19)
> Comment on attachment 8918686 [details] [diff] [review]
> assert-display-item
>
> sec-approval+ for trunk.
> Is there any point in backporting this to beta if we're just adding
> assertions?
Beta just gets a much higher crash volume, so it'll make it quicker to track down. Looks like this happens a few times a week on Nightly though, so maybe there's no point.
Flags: needinfo?(matt.woodrow)
Comment 21•7 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #20)
> (In reply to Al Billings [:abillings] from comment #19)
> > Comment on attachment 8918686 [details] [diff] [review]
> > assert-display-item
> >
> > sec-approval+ for trunk.
> > Is there any point in backporting this to beta if we're just adding
> > assertions?
>
> Beta just gets a much higher crash volume, so it'll make it quicker to track
> down. Looks like this happens a few times a week on Nightly though, so maybe
> there's no point.
Are you sure? An assertion crash is a LOT better than a UAF
Flags: needinfo?(matt.woodrow)
Comment 22•7 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #21)
> (In reply to Matt Woodrow (:mattwoodrow) from comment #20)
> > (In reply to Al Billings [:abillings] from comment #19)
> > > Comment on attachment 8918686 [details] [diff] [review]
> > > assert-display-item
> > >
> > > sec-approval+ for trunk.
> > > Is there any point in backporting this to beta if we're just adding
> > > assertions?
> >
> > Beta just gets a much higher crash volume, so it'll make it quicker to track
> > down. Looks like this happens a few times a week on Nightly though, so maybe
> > there's no point.
>
> Are you sure? An assertion crash is a LOT better than a UAF
Well the changes are restricted to development builds since it's both a guess and a performance regression.
Protecting our beta users might be worth it, but we really just need to figure out the real issue ASAP and get it fixed so we can ship something to release.
Flags: needinfo?(matt.woodrow)
Comment 23•7 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/940a0c98e46c
https://hg.mozilla.org/mozilla-central/rev/2c77560532c9
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Updated•7 years ago
|
Group: layout-core-security → core-security-release
Comment 24•7 years ago
|
||
Calling this 57:wontfix given comment 22. Feel free to set it back to affected and nominate for Beta approval if I've misunderstood, though.
Updated•7 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 25•7 years ago
|
||
Crash reports:
bp-fec67bcc-475e-442a-9a1e-2912d0171028
bp-e6d6de36-221f-4b01-a759-e9e780171029
These are really weird, we're hitting the assertion when tearing down the list, or when sorting the list (which removes and then adds items from the same list).
At no point do we ever hit an assertion when actually adding a new item to the list.
I did another audit to try find locations where we might edit the list that aren't covered by assertions and couldn't find any.
It's possible that this is just a corruption issue.
Comment 26•7 years ago
|
||
Trying some extra logging to figure out exactly what the broken pointer is (if it's close to arena memory, or entirely unrelated), and also trying to see if we can catch the 'mAbove' linked list pointer being corrupt.
Attachment #8924049 -
Flags: review?(nfroyd)
Comment 27•7 years ago
|
||
Comment on attachment 8924049 [details] [diff] [review]
more-assertions
Review of attachment 8924049 [details] [diff] [review]:
-----------------------------------------------------------------
r=me, but you need to ask for data review:
https://wiki.mozilla.org/Firefox/Data_Collection
I think this falls into the "technical data" category, so review should be pretty painless.
::: layout/painting/nsDisplayList.h
@@ +1823,5 @@
> // destructors.
> protected:
> nsDisplayItemLink() : mAbove(nullptr) {}
> nsDisplayItemLink(const nsDisplayItemLink&) : mAbove(nullptr) {}
> + uint64_t mSentinel = 0xDEADBEEFDEADBEEF;
You may want to guard this with MOZ_DIAGNOSTIC_ASSERT_ENABLED, just in case we don't back this out when diagnostic asserts are expiring?
::: xpcom/ds/ArenaAllocator.h
@@ +134,5 @@
> }
> + std::stringstream log;
> + log << "Failed to find pointer " << aPtr << " within arena blocks: ";
> + for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) {
> + log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), ";
Dumping the pointer that we're looking for and the addresses of the arena in which we looked for it, so we can potentially tell why we missed it...OK, makes sense.
@@ +136,5 @@
> + log << "Failed to find pointer " << aPtr << " within arena blocks: ";
> + for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) {
> + log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), ";
> + }
> + MOZ_CRASH_UNSAFE_OOL(log.str().c_str());
Usages of this macro need a data-review to ensure we're not sending private data.
Attachment #8924049 -
Flags: review?(nfroyd) → review+
Comment 28•7 years ago
|
||
Comment on attachment 8924049 [details] [diff] [review]
more-assertions
Review of attachment 8924049 [details] [diff] [review]:
-----------------------------------------------------------------
(Sorry, I should have just done this as part of my review, redirecting your review request)
Requesting data-review for sending runtime-defined strings with MOZ_CRASH_UNSAFE_OOL: we're sending information about memory addresses of internal data structures. We believe this falls into Category 1 ("technical data"). This data is being collected to understand crashes we're seeing in the wild better and is being collected for nightly builds and early beta builds.
Attachment #8924049 -
Flags: review?(francois)
Updated•7 years ago
|
Comment 29•7 years ago
|
||
Comment on attachment 8924049 [details] [diff] [review]
more-assertions
Review of attachment 8924049 [details] [diff] [review]:
-----------------------------------------------------------------
datareview+ based on the description from comment 28.
Attachment #8924049 -
Flags: review?(francois) → review+
Comment 30•7 years ago
|
||
Comment on attachment 8924049 [details] [diff] [review]
more-assertions
[Security approval request comment]
How easily could an exploit be constructed based on the patch?
Not easily, we still haven't figured out what the issue is.
Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?
It makes the outcome fairly obvious, but doesn't really help finding the cause (until we get crash reports).
Which older supported branches are affected by this flaw?
All.
Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?
No, but this code hasn't changed much, should be easy.
How likely is this patch to cause regressions; how much testing does it need?
Unlikely, just adding some new assertions.
Attachment #8924049 -
Flags: sec-approval?
Comment 31•7 years ago
|
||
Comment on attachment 8924049 [details] [diff] [review]
more-assertions
sec-approval+. Land away.
Attachment #8924049 -
Flags: sec-approval? → sec-approval+
Updated•7 years ago
|
Keywords: checkin-needed
Comment 32•7 years ago
|
||
Keywords: checkin-needed → leave-open
Target Milestone: mozilla58 → ---
Comment 33•7 years ago
|
||
Comment 34•7 years ago
|
||
Crash report with the newer assertions:
https://crash-stats.mozilla.com/report/index/be49f893-9f70-4d7e-a265-d3a3e0171105
It's failing the mSentinel check, that's pretty bad. The only explanation I can think of is heap corruption/buffer overflow. Someone is overwriting memory belonging to a display item.
I'm not sure what to do next to track down where this is coming from, will think more about it.
Comment 35•7 years ago
|
||
Crash report from the other assertion, showing the arena blocks:
https://crash-stats.mozilla.com/report/index/25293941-d437-484f-8195-6a5b60171108
Comment 36•7 years ago
|
||
Nathan: Looks like one of the arena chunks there has 0 for the 'offset' value. That's pretty bizarre.
The arena is 8192, so the failing pointer is probably in that chunk, if it had the expected end value.
I can't see any way we could get a 0 there though, excluding heap corruption.
Flags: needinfo?(nfroyd)
Comment 37•7 years ago
|
||
The 'offset' value is part of the header, which is in the same allocation as the actual arena memory that we give out. So this could be roughly the same as the other crash report, something is corrupting memory within our arena blocks.
It looks like the corrupt one doesn't have another block right before it, so it would seem possible that we've got a buffer overflow from something else overflowing into an arena block.
This is a fairly common crash (more than I think would be expected for generic heap corruption), so the overflowing memory is likely something else we allocate during painting.
Comment 38•7 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #36)
> Nathan: Looks like one of the arena chunks there has 0 for the 'offset'
> value. That's pretty bizarre.
>
> The arena is 8192, so the failing pointer is probably in that chunk, if it
> had the expected end value.
>
> I can't see any way we could get a 0 there though, excluding heap corruption.
That *is* pretty weird! ArenaAllocator::mHead is initialized with offset=0, tail=0, but no other ArenaChunk is. And mHead isn't even included in the list of chunks; it's just there as a placeholder.
The only ways I can think of a zero getting there are:
1. Some kind of weird race. I can't really construct a plausible race for this, though.
2. Overflow in ArenaChunk::Allocate:
header.offset += aSize;
That'd have to be a *very* specific aSize, though; comment 35's crash, for instance, would require an allocation of ~2.5GB or so. I guess that could arise from allocation sizes overflowing before coming into ArenaAllocator...we could add some overflow checks in ArenaAllocator to determine whether that's what's happening. But IIUC, we can't get an overflowing size there, since we're coming from nsPresArena's pooled allocator, and we don't allocate arbitrary (i.e. calculated) sizes there...do we?
We could try inserting (several!) CorruptionCanary members (from mfbt/Poison.h) into ArenaHeader and checking them at various points. That would provide solid evidence of heap corruption, but wouldn't move us much closer to figuring out what was going on. :(
Maybe we should try printing the tail member, too?
Checking for overflow would have the virtue--if the overflow check(s) ever failed--of telling us exactly where things were going wrong, rather than "somebody scribbled over this memory at some point in the past".
Flags: needinfo?(nfroyd)
Comment 39•7 years ago
|
||
I really can't see how it could be overflow, shouldn't the size check prevent us from allocating a size that would go past the tail pointer?
If we did overflow, then I think further allocation attempts would succeed, and we'd never move on to a new chunk.
As you said, we also don't do arbitrary allocations from the display list builder, so I can't see how we'd ever get something big enough.
I'll have a go at the canary option.
Comment 40•7 years ago
|
||
Attachment #8927693 -
Flags: review?(nfroyd)
Comment 41•7 years ago
|
||
Comment on attachment 8927693 [details] [diff] [review]
canary-check
Review of attachment 8927693 [details] [diff] [review]:
-----------------------------------------------------------------
I'm really hesitant to remove the logging in favor of the canary checking. I'd prefer to keep both, if possible. WDYT?
::: xpcom/ds/ArenaAllocator.h
@@ -134,5 @@
> }
> - std::stringstream log;
> - log << "Failed to find pointer " << aPtr << " within arena blocks: ";
> - for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) {
> - log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), ";
I haven't looked through the crash reports to tell if there have been multiple reports with a 0 offset...do we really want to lose the logging here and/or do we want to print header.tail into this to try and understand better what's going on? Couldn't we, say, keep the logging and canary.Check() in the DebugContains() loop above? We'd crash on the check if there was heap corruption, but we'd still get useful information from the logging if there *wasn't* heap corruption.
Just looking at the crash report summaries, we're still hitting read/exec breakpoints on sometimes-freed memory with the signatures from this bug and not hitting the crash here...maybe several different things are being conflated.
Attachment #8927693 -
Flags: review?(nfroyd)
Comment 42•7 years ago
|
||
The primary issue is that this logging is really expensive, we've got quite a few open regression bugs because of it.
Retained display lists landed, and showed a regression on the display list mutation talos test, just because this logging adds overhead to display list operations (we show a 30% win without the logging).
I realize that the logging is restricted to dev builds, but it's interfering with our ability to track performance regressions accurately, and that's not great.
I do think you're right that there's more than one way that this can crash, I'm just not sure this logging is helping us narrow it down enough to justify the cost.
Comment 43•7 years ago
|
||
Hm, that's fair. But is the canary going to be much cheaper?
The string logging is only activated if there's ever a problem, so we don't care about how expensive that is. The cost the benchmarks are detecting is really the loop + DebugContains check. The DebugContains check is two comparisons, whereas the canary check is a single comparison. So that will be a little cheaper, but we're still paying a bunch of cache misses to wander through memory, which I'd bet is the performance problem. We could make the DebugContains check a single unsigned comparison (I think) *and* do the Canary check and it wouldn't be much more expensive than what we're doing now...
Have you measured the cost of the current checks vs. canary checking?
Another alternative would be to make the individual arenas bigger, which would waste some memory, but give us fewer things to loop over when we're doing the debug checks.
Comment 44•7 years ago
|
||
The big difference is the canary check (as currently written) only happens 3 times during display list building, whereas DebugContains runs per-display-item (can easily be 20k times per-paint).
Comment 45•7 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #44)
> The big difference is the canary check (as currently written) only happens 3
> times during display list building, whereas DebugContains runs
> per-display-item (can easily be 20k times per-paint).
Aha, I totally missed that. My mistake.
I'm still not excited about re-doing data review if we have to add annotations for the crash, but maybe the canary check will reveal new information. Worth a shot.
Comment 46•7 years ago
|
||
Comment on attachment 8927693 [details] [diff] [review]
canary-check
Review of attachment 8927693 [details] [diff] [review]:
-----------------------------------------------------------------
Let's try this, then.
Attachment #8927693 -
Flags: review+
Comment 47•7 years ago
|
||
Comment on attachment 8927693 [details] [diff] [review]
canary-check
[Security approval request comment]
How easily could an exploit be constructed based on the patch?
Same as above, more attempts to try track down the cause of this bug.
Attachment #8927693 -
Flags: sec-approval?
Comment 48•7 years ago
|
||
sec-approval+ for trunk.
status-firefox59:
--- → affected
tracking-firefox59:
--- → +
Updated•7 years ago
|
Attachment #8927693 -
Flags: sec-approval? → sec-approval+
Comment 49•7 years ago
|
||
Comment 50•7 years ago
|
||
Hi Matt, did your latest patch shed any light on the situation?
Flags: needinfo?(matt.woodrow)
Comment 51•7 years ago
|
||
Looks like there's one crash so far [1], and it's crashing at the first check point [2].
That suggests that the bug is in display list building itself, not FrameLayerBuilder.
Need to think of something to narrow it down further.
[1] https://crash-stats.mozilla.com/report/index/3ee5702a-db44-4411-805c-fa1840171128
[2] https://searchfox.org/mozilla-central/source/layout/base/nsLayoutUtils.cpp#3922
Updated•7 years ago
|
tracking-firefox58:
--- → +
Comment 52•7 years ago
|
||
Comment on attachment 8927693 [details] [diff] [review]
canary-check
Approval Request Comment
[Feature/Bug causing the regression]: Unknown.
[User impact if declined]: Needed as part of trying to debug 1421345.
[Is this code covered by automated tests?]: No.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue.
[String changes made/needed]: None.
Flags: needinfo?(matt.woodrow)
Attachment #8927693 -
Flags: approval-mozilla-beta?
Comment 53•7 years ago
|
||
Comment on attachment 8927693 [details] [diff] [review]
canary-check
After discussing with dev in mail thread, we can live with this and let's fix this in 59.
Attachment #8927693 -
Flags: approval-mozilla-beta? → approval-mozilla-beta-
Comment 54•7 years ago
|
||
Still seeing exploitable looking crashes in 59 beta. Matt, anything else to try here?
Flags: needinfo?(matt.woodrow)
Comment 55•7 years ago
|
||
There's some more interesting crash reports here: https://crash-stats.mozilla.com/signature/?moz_crash_reason=~Canary%20check%20failed&signature=nsPresArena%3A%3AAllocate&date=%3E%3D2018-01-07T03%3A26%3A09.000Z&date=%3C2018-02-07T03%3A26%3A09.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports
That's a canary check every time we try to allocate something new from the arena.
I was hoping the stacks would have something in common that would point to the issue, but I haven't found it yet. nsFlexContainerFrame shows up fairly frequently (more than I'd expect), but auditing the display list code there hasn't given me any ideas.
Adding a canary check each time we enter/exit a BuildDisplayList variant should hopefully narrow this down even further.
Flags: needinfo?(matt.woodrow)
Attachment #8948548 -
Flags: review?(nfroyd)
Updated•7 years ago
|
status-firefox60:
--- → ?
tracking-firefox60:
--- → +
Comment 56•7 years ago
|
||
Comment on attachment 8948548 [details] [diff] [review]
debug-dl-canary
Review of attachment 8948548 [details] [diff] [review]:
-----------------------------------------------------------------
rs=me More diagnostics is more better.
Attachment #8948548 -
Flags: review?(nfroyd) → review+
Comment 57•7 years ago
|
||
Comment on attachment 8948548 [details] [diff] [review]
debug-dl-canary
[Security approval request comment]
Yet another diagnostic patch to try track this down. Not sure how hard it is to exploit since we haven't been able to figure out how to trigger it.
Approval Request Comment
[Feature/Bug causing the regression]: Unknown.
[User impact if declined]: Needed as part of trying to debug 1421345.
[Is this code covered by automated tests?]: No.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue.
[String changes made/needed]: None.
Attachment #8948548 -
Flags: sec-approval?
Attachment #8948548 -
Flags: approval-mozilla-beta?
Comment 58•7 years ago
|
||
Comment on attachment 8948548 [details] [diff] [review]
debug-dl-canary
approvals given!
Attachment #8948548 -
Flags: sec-approval?
Attachment #8948548 -
Flags: sec-approval+
Attachment #8948548 -
Flags: approval-mozilla-beta?
Attachment #8948548 -
Flags: approval-mozilla-beta+
Comment 59•7 years ago
|
||
Comment 60•7 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #57)
> [Is this code covered by automated tests?]: No.
> [Has the fix been verified in Nightly?]: No.
> [Needs manual test from QE? If yes, steps to reproduce]: No.
Setting qe-verify- based on Matt's assessment on manual testing needs.
Flags: qe-verify-
Updated•7 years ago
|
Comment 61•7 years ago
|
||
This keeps getting more confusing.
There's a bunch of crashes with the latest diagnostic patch, and this one is particularly interesting:
https://crash-stats.mozilla.com/report/index/e5aad7dd-483c-4653-ada5-d55600180214
The caller of nsTextFrame::BuildDisplayList called nsDisplayListBuilder::Check() right before we called BuildDisplayList.
I checked the minidump and isSelected is still Nothing(), so we didn't take the branch there, and basically didn't execute any code before creating the display item and crashing.
That really sounds like someone else is corrupting our memory and we just crash at random times.
As with bug 1359569, these crashes are all with D2DEnabled:false, which almost certainly is meaningful, but I have no idea how not having d2d means that another thread corrupts our memory.
Comment 62•7 years ago
|
||
I also went through all the other threads in the crash report, nothing seems to be running relevant code.
Updated•7 years ago
|
Comment 63•7 years ago
|
||
I can leave this tracked for 59 for now, but it sounds like this may not be actionable before the 59 release.
Updated•7 years ago
|
status-firefox61:
--- → affected
Comment 64•7 years ago
|
||
Matt, I understand this is a tricky issue, but maybe there are some newer crashes with your diagnostic patch applied that might be more conclusive? Would you please take a look?
Flags: needinfo?(matt.woodrow)
Comment 65•7 years ago
|
||
(In reply to Frederik Braun [:freddyb] from comment #64)
> Matt, I understand this is a tricky issue, but maybe there are some newer
> crashes with your diagnostic patch applied that might be more conclusive?
> Would you please take a look?
Thanks for reminding me to look at this.
Most of the previous diagnostic path was calls to nsDisplayListBuilder::Check, which is behind a MOZ_DIAGNOSTIC_ASSERT ifdef. My understanding is that this won't do anything in beta builds (though it should for dev edition?).
All the crashes I see from beta look to be true beta (based on telemetry environment settings:update:channel:"beta"), so I think the conclusion in comment 61 is incorrect.
I'm going to write some patches to try correct this, and hopefully get more interesting result.
I still see a strong correlation between the crashes and text, flexbox and tables. No progress on figuring out why that might be though.
Flags: needinfo?(matt.woodrow)
Comment 66•7 years ago
|
||
Fixed patch, one that should actually do something useful on beta. This is identical to the previous patch, but without the MOZ_DIAGNOSTIC_ASSERT_ENABLED, so it should be ok to carry the review and approval.
Attachment #8969134 -
Flags: review+
Comment 67•7 years ago
|
||
Comment on attachment 8969134 [details] [diff] [review]
debug-dl-canary v2
Same patch as before, just without the ifdef that made it not work on beta.
[Security approval request comment]
Yet another diagnostic patch to try track this down. Not sure how hard it is to exploit since we haven't been able to figure out how to trigger it.
Approval Request Comment
[Feature/Bug causing the regression]: Unknown.
[User impact if declined]: Needed as part of trying to debug 1421345.
[Is this code covered by automated tests?]: No.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue.
[String changes made/needed]: None.
Attachment #8969134 -
Flags: sec-approval?
Attachment #8969134 -
Flags: approval-mozilla-beta?
Comment 68•7 years ago
|
||
Comment on attachment 8969134 [details] [diff] [review]
debug-dl-canary v2
Approvals given
Attachment #8969134 -
Flags: sec-approval?
Attachment #8969134 -
Flags: sec-approval+
Attachment #8969134 -
Flags: approval-mozilla-beta?
Attachment #8969134 -
Flags: approval-mozilla-beta+
Comment 69•7 years ago
|
||
Comment 70•7 years ago
|
||
uplift |
tracking-firefox61:
--- → +
Comment 71•7 years ago
|
||
Comment 72•7 years ago
|
||
Few reports starting to show up with the new checks:
https://crash-stats.mozilla.com/report/index/b0e0ea6e-8201-42ad-a19c-502c60180423
https://crash-stats.mozilla.com/report/index/651cea1f-0055-48fe-9a47-071390180422
ScrollFrameHelper::BuildDisplayList looks very suspicious at this point, still can't find anything obviously wrong with it though.
Comment 73•7 years ago
|
||
Same as before, just adding some more checks specific to the code that looks close.
Attachment #8970400 -
Flags: sec-approval?
Attachment #8970400 -
Flags: review?(mstange)
Attachment #8970400 -
Flags: approval-mozilla-beta?
Comment 74•7 years ago
|
||
Al, I'm ok with this for beta 16 if you want to give it sec-approval.
Flags: needinfo?(abillings)
Comment 75•7 years ago
|
||
Comment on attachment 8970400 [details] [diff] [review]
more-checks
sec-approval+ given and I marked Beta as well.
Flags: needinfo?(abillings)
Attachment #8970400 -
Flags: sec-approval?
Attachment #8970400 -
Flags: sec-approval+
Attachment #8970400 -
Flags: approval-mozilla-beta?
Attachment #8970400 -
Flags: approval-mozilla-beta+
Updated•7 years ago
|
Attachment #8970400 -
Flags: review?(mstange) → review+
Comment 76•7 years ago
|
||
uplift |
Comment 77•7 years ago
|
||
Hi there, I have a user experiencing the this crash if you need more information. [Apologies, I don't mean to spam the bug] this is the thread we are working with the user in.https://support.mozilla.org/en-US/questions/1216050
I will ask them to try beta.
Comment 78•7 years ago
|
||
(In reply to guigs [guigs] PST (please needinfo me!) from comment #77)
> Hi there, I have a user experiencing the this crash if you need more
> information. [Apologies, I don't mean to spam the bug] this is the thread we
> are working with the user
> in.https://support.mozilla.org/en-US/questions/1216050
>
> I will ask them to try beta.
That's really interesting! I did not expect this to be consistently reproducible.
Any chance you could try get them to test out a Nightly ASAN build?
The latest one is available here: https://queue.taskcluster.net/v1/task/TByH83HhRCunzxqwqq774A/runs/0/artifacts/public/build/target.zip
Flags: needinfo?(rmcguigan)
Comment 79•7 years ago
|
||
This is wontfix for 60 at this point.
status-firefox62:
--- → affected
status-firefox-esr60:
--- → affected
tracking-firefox62:
--- → +
tracking-firefox-esr60:
--- → ?
Comment 80•7 years ago
|
||
I went through all the relevant release-60 crash reports that I could find and tried to analyze them as best I could:
https://pastebin.mozilla.org/9085739
Some of the callstack information is confusing in crash-stats thanks to inlined code, and sometimes we seem to miss the BuildDisplayListForChild stack frame that calls into CheckCanary. The BuildDisplayListForChild line numbers in the pastebin are obtained by looking at the minidump, looking at stack memory, pulling the return address and then looking up what code that corresponds to.
Mats, as I said in bug 1436505, this seems like it might actually be similar.
I'd really appreciate if you could have a look over this when you get time. Thanks!
Flags: needinfo?(mats)
Updated•7 years ago
|
Flags: needinfo?(rmcguigan)
Comment 82•6 years ago
|
||
While I'm still very-much interested in fixing this in 61 if at all possible, I don't think tracking it is going to help at this point. Please don't hesitate to nominate a patch for uplift should the root cause be found and fixed in time.
Comment 83•6 years ago
|
||
I'd also happily take a fix here in 62 beta.
Note, bug 1198710, bug 1406727, bug 1436505, and bug 1467526 may all be related.
status-firefox63:
--- → ?
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Keywords: sec-critical → sec-high
Updated•5 years ago
|
status-firefox68:
--- → fix-optional
status-firefox69:
--- → fix-optional
Updated•5 years ago
|
Comment 84•5 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:mattwoodrow, maybe it's time to close this bug?
Flags: needinfo?(matt.woodrow)
Comment 85•5 years ago
|
||
The crash rate is super low, but I guess we can leave this open, since it still happens sometimes.
Flags: needinfo?(matt.woodrow)
Comment 86•3 years ago
|
||
The bug assignee didn't login in Bugzilla in the last 7 months and this bug has priority 'P2'/severity 'critical'.
:miko, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee: matt.woodrow → nobody
Flags: needinfo?(mikokm)
Comment 87•3 years ago
|
||
We have had a total of 10 crashes in the last six months, last crashing version is 91.6.0esr.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 3 years ago
Flags: needinfo?(mikokm)
Resolution: --- → WORKSFORME
Updated•3 years ago
|
Keywords: leave-open
Comment 88•3 years ago
|
||
Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.
Keywords: stalled
Updated•1 year ago
|
Group: core-security-release
You need to log in
before you can comment on or make changes to this bug.
Description
•