Closed Bug 1406727 Opened 7 years ago Closed 2 years ago

Crash in nsDisplayWrapList::Destroy

Categories

(Core :: Web Painting, defect, P2)

57 Branch
All
Windows
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox56 --- unaffected
firefox57 + wontfix
firefox58 + wontfix
firefox59 + wontfix
firefox60 + wontfix
firefox61 - wontfix
firefox62 - wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix

People

(Reporter: philipp, Unassigned)

References

Details

(5 keywords)

Crash Data

Attachments

(6 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-a501024d-a454-4a33-b035-c06fe0171008.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsDisplayWrapList::Destroy(nsDisplayListBuilder*) 	layout/painting/nsDisplayList.h:4015
1 	xul.dll 	nsDisplayWrapList::Destroy(nsDisplayListBuilder*) 	layout/painting/nsDisplayList.h:4015
2 	xul.dll 	nsDisplayWrapList::Destroy(nsDisplayListBuilder*) 	layout/painting/nsDisplayList.h:4015
3 	xul.dll 	nsDisplayList::DeleteAll(nsDisplayListBuilder*) 	layout/painting/nsDisplayList.cpp:2413
4 	xul.dll 	nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) 	layout/base/nsLayoutUtils.cpp:3888
5 	xul.dll 	mozilla::PresShell::Paint(nsView*, nsRegion const&, unsigned int) 	layout/base/PresShell.cpp:6454
6 	xul.dll 	nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) 	view/nsViewManager.cpp:480
7 	xul.dll 	nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) 	view/nsViewManager.cpp:412
8 	xul.dll 	nsViewManager::ProcessPendingUpdates() 	view/nsViewManager.cpp:1102
9 	xul.dll 	nsRefreshDriver::Tick(__int64, mozilla::TimeStamp) 	layout/base/nsRefreshDriver.cpp:2082
10 	xul.dll 	nsRefreshDriver::DoTick() 	layout/base/nsRefreshDriver.cpp:1526
11 	xul.dll 	nsRefreshDriver::FinishedWaitingForTransaction() 	layout/base/nsRefreshDriver.cpp:2189
12 	xul.dll 	nsRefreshDriver::NotifyTransactionCompleted(unsigned __int64) 	layout/base/nsRefreshDriver.cpp:2256
13 	xul.dll 	mozilla::layers::ClientLayerManager::DidComposite(unsigned __int64, mozilla::TimeStamp const&, mozilla::TimeStamp const&) 	gfx/layers/client/ClientLayerManager.cpp:529
14 	xul.dll 	mozilla::dom::TabChild::DidComposite(unsigned __int64, mozilla::TimeStamp const&, mozilla::TimeStamp const&) 	dom/ipc/TabChild.cpp:3079
15 	xul.dll 	mozilla::layers::CompositorBridgeChild::RecvDidComposite(unsigned __int64 const&, unsigned __int64 const&, mozilla::TimeStamp const&, mozilla::TimeStamp const&) 	gfx/layers/ipc/CompositorBridgeChild.cpp:540
16 	xul.dll 	mozilla::layers::PCompositorBridgeChild::OnMessageReceived(IPC::Message const&) 	ipc/ipdl/PCompositorBridgeChild.cpp:1473
17 	xul.dll 	mozilla::layers::PCompositorManagerChild::OnMessageReceived(IPC::Message const&) 	ipc/ipdl/PCompositorManagerChild.cpp:161
18 	xul.dll 	mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) 	ipc/glue/MessageChannel.cpp:2119
19 	xul.dll 	mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message&&) 	ipc/glue/MessageChannel.cpp:2049
20 	xul.dll 	mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) 	ipc/glue/MessageChannel.cpp:1895
21 	xul.dll 	mozilla::ipc::MessageChannel::MessageTask::Run() 	ipc/glue/MessageChannel.cpp:1928
22 	xul.dll 	mozilla::SchedulerGroup::Runnable::Run() 	xpcom/threads/SchedulerGroup.cpp:396
23 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1039
24 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:97
25 	xul.dll 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:301
26 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
27 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
28 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp:158
29 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp:230
30 	xul.dll 	XRE_RunAppShell() 	toolkit/xre/nsEmbedFunctions.cpp:880
31 	xul.dll 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:269
32 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
33 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
34 	xul.dll 	XRE_InitChildProcess(int, char** const, XREChildData const*) 	toolkit/xre/nsEmbedFunctions.cpp:705
35 	xul.dll 	mozilla::BootstrapImpl::XRE_InitChildProcess(int, char** const, XREChildData const*) 	toolkit/xre/Bootstrap.cpp:65
36 	firefox.exe 	content_process_main(mozilla::Bootstrap*, int, char** const) 	ipc/contentproc/plugin-container.cpp:63
37 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:115
38 	firefox.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
39 	kernel32.dll 	BaseThreadInitThunk 	
40 	ntdll.dll 	__RtlUserThreadStart 	
41 	ntdll.dll 	_RtlUserThreadStart

crash reports with this signature are newly showing up in firefox 57 - so far most of the time (86%) in the content process, from builds on windows & apparently in a codepath added in bug 1388162.
Flags: needinfo?(matt.woodrow)
Priority: -- → P2
How frequent is this crash?

I suspect this is a duplicate of bug 1198710, just with the new signature.
Flags: needinfo?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #1)
> How frequent is this crash?
> 
> I suspect this is a duplicate of bug 1198710, just with the new signature.

A bit higher than bug 1198710 within the period of 57 beta.
It seems the display item address returned from RemoveBottm() could be invalid or contained invalid child items.
Do you have any idea how to progress the investigation?
[Tracking Requested - why for this release]:
UAF sec-critical regression in 57

Crashes are all wildptr and/or clear UAFs.  Worse, they're READ, WRITE and EXEC crashes.

This may be a dup of bug 1198710 with a new signature -- that shows similar problems and probably should be marked as sec.
Group: core-security
Flags: needinfo?(milan)
Tracking 57+ based on Randell's assessment in Comment 3. There are a few crashes in 58 as well.

The other signature nsDisplayList::DeleteAll in Bug 1198710 has crashes in 57 as well, but also has crashes dating back several releases.

Although this isn't high volume in beta, would be good to understand more about why these appear more severe in 57.
(In reply to Marcia Knous [:marcia - use ni] from comment #4)

> The other signature nsDisplayList::DeleteAll in Bug 1198710 has crashes in
> 57 as well, but also has crashes dating back several releases.
> 
> Although this isn't high volume in beta, would be good to understand more
> about why these appear more severe in 57.

The other ones on bug 1198710 are sec crashes too, just with a different signature.  Might be different bugs, but I'm betting on the same bug.  We probably should mark that one as sec as well, though it's been open a long time

NI dveditz to decide on the other bug
Flags: needinfo?(dveditz)
It looks like we have a display item in the list that contains garbage memory. Trying to call Destroy() on it crashes.

Display items are all allocated in an arena, which we free once we've finished painting (and after this call).

If we had already free'd this display item, I'd still expect the memory contents to be valid since we don't do any sort of recycling of the arena memory, and the dtor shouldn't be writing garbage into our vtable pointer.

A few of the crashes (though not many) even look like they have the poison value (0xE5) written into them. I can't think of anywhere where we'd write the poison value into the still-valid arena.

So it seems like either have corruption, something else is clobbering the contents of this display item. Possibly an invalid static_cast? That doesn't explain where the poison value comes from though.

Another possibility is that the display item was allocated into a different arena, one that has been actually released to the OS at this point (and then can be clobbered/poisoned by another user of the same memory).

I can't see how this would happen though. We definitely have cases where we can recurse into nsLayoutUtils::PaintFrame which creates a new builder/arena, but this doesn't pass through the display list from the outer caller.

There would have to be some way of a display item, display list, or display list builder crossing this boundary, and then us adding an item to the wrong list.

I did a quick audit and can't find anywhere that we cache any of these items in a way that the wrong could access them, but it's possible I missed something.

Not sure how to make further progress without STR.
Attached patch assert-display-item (obsolete) — Splinter Review
We could try this, it should at least narrow the options down.

Might not be ideal for performance, so I think we'd want to back this out once we get results (ideally in Nightly, but Beta at the latest).

Nathan, would you be ok with adding something like this? I could add lots of comments telling people they aren't allowed to use the new functions except for this bug, but we also don't want to draw too much attention.
Assignee: nobody → matt.woodrow
Attachment #8917655 - Flags: feedback?
Attachment #8917655 - Flags: feedback? → feedback?(nfroyd)
Group: core-security → layout-core-security
Could we do some ifdefs to restrict to nightly or to early-beta-and-earlier?

And/or add some code to check arena-it-was-allocated-from when elements are added?  Or before releasing the list?
Flags: needinfo?(matt.woodrow)
On the other bug, I see this crash: https://crash-stats.mozilla.com/report/index/551eb5bf-6b3b-4e58-a5d1-187e20171007
That's an EXEC crash with an e5e5 address in 57b7 (and there are quite a few wildptr EXEC crashes there).

I'm going to mark it as sec also.  I strongly suspect these two signatures are part of the same bug, but will leave separate for now.
Comment on attachment 8917655 [details] [diff] [review]
assert-display-item

Review of attachment 8917655 [details] [diff] [review]:
-----------------------------------------------------------------

I think we could do something like this, sure.  Definitely says that something fishy is going on with this code, though. =/

I like jesup's idea about trying to restrict this to nightly/early beta.

::: xpcom/ds/ArenaAllocator.h
@@ +178,5 @@
> +
> +    bool DebugContains(void* aPtr)
> +    {
> +      return aPtr >= reinterpret_cast<void*>(header.offset) &&
> +             aPtr < reinterpret_cast<void*>(header.tail);

This should be written as:

uintptr_t p = reinterpret_cast<uintptr_t>(aPtr);
return p >= header.offset && p < header.tail;

due to rules in the standard about comparing pointers: aPtr doesn't necessarily point into the same object bounded by header.{offset,tail}, so it can't be compared to them.
Attachment #8917655 - Flags: feedback?(nfroyd) → feedback+
(In reply to Nathan Froyd [:froydnj] from comment #10)
> Comment on attachment 8917655 [details] [diff] [review]
> assert-display-item
> ...
> 
> I like jesup's idea about trying to restrict this to nightly/early beta.

MOZ_DIAGNOSTIC_ASSERT instead of MOZ_RELEASE_ASSERT?  It'd hit nightly and dev edition, but not beta and release.
Flags: needinfo?(milan)
(In reply to Randell Jesup [:jesup] from comment #8) 
> And/or add some code to check arena-it-was-allocated-from when elements are
> added?  Or before releasing the list?

Checking the arena when elements are added is what we really want I think, but it's considerably bigger patch to write, so I thought this would be a good start to see if we're on the right track.

I can bite the bullet and write the bigger patch if you think it's worth it.

(In reply to Nathan Froyd [:froydnj] from comment #10)
> I think we could do something like this, sure.  Definitely says that
> something fishy is going on with this code, though. =/

It does :( I'd want to remove this again once we figure out exactly what that is.
Flags: needinfo?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #12)
> Checking the arena when elements are added is what we really want I think,
> but it's considerably bigger patch to write, so I thought this would be a
> good start to see if we're on the right track.
> 
> I can bite the bullet and write the bigger patch if you think it's worth it.

I'm far from the final arbiter here, but I think it is, since this is a LONG-standing sec-critical (see bug 1198710).
Attachment #8917655 - Attachment is obsolete: true
Attachment #8918686 - Flags: review?(nfroyd)
Attachment #8918686 - Flags: review?(mstange)
Comment on attachment 8918686 [details] [diff] [review]
assert-display-item

Review of attachment 8918686 [details] [diff] [review]:
-----------------------------------------------------------------

r+ on everything outside ArenaAllocator.h
Attachment #8918686 - Flags: review?(mstange) → review+
Comment on attachment 8918686 [details] [diff] [review]
assert-display-item

Review of attachment 8918686 [details] [diff] [review]:
-----------------------------------------------------------------

r=me on the ArenaAllocator.h part, but we should talk about what the right test below is.

::: xpcom/ds/ArenaAllocator.h
@@ +183,5 @@
> +#ifdef MOZ_DIAGNOSTIC_ASSERT_ENABLED
> +    bool DebugContains(void* aPtr)
> +    {
> +      return aPtr >= (this + 1) &&
> +             reinterpret_cast<uintptr_t>(aPtr) < header.tail;

Wait, what?  Oh, I see, because `this` is actually embedded in the arena, and testing:

header.offset <= aPtr && aPtr < header.tail

is not the correct test, because we'd be checking to see whether the pointer lives in the *unallocated* part of the arena.

So is the right test something like

(this + 1) <= aPtr && aPtr < header.offset

since that would be checking for things actually allocated?  Or do we want to use the test written here?

Either way, I think we want to write the first part of the test as:

reinterpret_cast<uintptr_t>(aPtr) >= reinterpret_cast<uintptr_t>(this + 1)

for the aforementioned pointer comparison issues.
Attachment #8918686 - Flags: review?(nfroyd) → review+
Comment on attachment 8918686 [details] [diff] [review]
assert-display-item

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

Not easily, we still haven't figured out what the issue is.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

It makes the outcome fairly obvious, but doesn't really help finding the cause (until we get crash reports).

Which older supported branches are affected by this flaw?

All.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

No, but this code hasn't changed much, should be easy.

How likely is this patch to cause regressions; how much testing does it need?

Unlikely, just adding some new assertions.
Attachment #8918686 - Flags: sec-approval?
Comment on attachment 8918686 [details] [diff] [review]
assert-display-item

sec-approval+ for trunk.
Is there any point in backporting this to beta if we're just adding assertions?
Flags: needinfo?(matt.woodrow)
Attachment #8918686 - Flags: sec-approval? → sec-approval+
(In reply to Al Billings [:abillings] from comment #19)
> Comment on attachment 8918686 [details] [diff] [review]
> assert-display-item
> 
> sec-approval+ for trunk.
> Is there any point in backporting this to beta if we're just adding
> assertions?

Beta just gets a much higher crash volume, so it'll make it quicker to track down. Looks like this happens a few times a week on Nightly though, so maybe there's no point.
Flags: needinfo?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #20)
> (In reply to Al Billings [:abillings] from comment #19)
> > Comment on attachment 8918686 [details] [diff] [review]
> > assert-display-item
> > 
> > sec-approval+ for trunk.
> > Is there any point in backporting this to beta if we're just adding
> > assertions?
> 
> Beta just gets a much higher crash volume, so it'll make it quicker to track
> down. Looks like this happens a few times a week on Nightly though, so maybe
> there's no point.

Are you sure?  An assertion crash is a LOT better than a UAF
Flags: needinfo?(matt.woodrow)
(In reply to Randell Jesup [:jesup] from comment #21)
> (In reply to Matt Woodrow (:mattwoodrow) from comment #20)
> > (In reply to Al Billings [:abillings] from comment #19)
> > > Comment on attachment 8918686 [details] [diff] [review]
> > > assert-display-item
> > > 
> > > sec-approval+ for trunk.
> > > Is there any point in backporting this to beta if we're just adding
> > > assertions?
> > 
> > Beta just gets a much higher crash volume, so it'll make it quicker to track
> > down. Looks like this happens a few times a week on Nightly though, so maybe
> > there's no point.
> 
> Are you sure?  An assertion crash is a LOT better than a UAF

Well the changes are restricted to development builds since it's both a guess and a performance regression.

Protecting our beta users might be worth it, but we really just need to figure out the real issue ASAP and get it fixed so we can ship something to release.
Flags: needinfo?(matt.woodrow)
https://hg.mozilla.org/mozilla-central/rev/940a0c98e46c
https://hg.mozilla.org/mozilla-central/rev/2c77560532c9
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Group: layout-core-security → core-security-release
Calling this 57:wontfix given comment 22. Feel free to set it back to affected and nominate for Beta approval if I've misunderstood, though.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Crash reports:

bp-fec67bcc-475e-442a-9a1e-2912d0171028
bp-e6d6de36-221f-4b01-a759-e9e780171029

These are really weird, we're hitting the assertion when tearing down the list, or when sorting the list (which removes and then adds items from the same list).

At no point do we ever hit an assertion when actually adding a new item to the list.

I did another audit to try find locations where we might edit the list that aren't covered by assertions and couldn't find any.

It's possible that this is just a corruption issue.
Attached patch more-assertionsSplinter Review
Trying some extra logging to figure out exactly what the broken pointer is (if it's close to arena memory, or entirely unrelated), and also trying to see if we can catch the 'mAbove' linked list pointer being corrupt.
Attachment #8924049 - Flags: review?(nfroyd)
Comment on attachment 8924049 [details] [diff] [review]
more-assertions

Review of attachment 8924049 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, but you need to ask for data review:

https://wiki.mozilla.org/Firefox/Data_Collection

I think this falls into the "technical data" category, so review should be pretty painless.

::: layout/painting/nsDisplayList.h
@@ +1823,5 @@
>    // destructors.
>  protected:
>    nsDisplayItemLink() : mAbove(nullptr) {}
>    nsDisplayItemLink(const nsDisplayItemLink&) : mAbove(nullptr) {}
> +  uint64_t mSentinel = 0xDEADBEEFDEADBEEF;

You may want to guard this with MOZ_DIAGNOSTIC_ASSERT_ENABLED, just in case we don't back this out when diagnostic asserts are expiring?

::: xpcom/ds/ArenaAllocator.h
@@ +134,5 @@
>      }
> +    std::stringstream log;
> +    log << "Failed to find pointer " << aPtr << " within arena blocks: ";
> +    for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) {
> +      log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), ";

Dumping the pointer that we're looking for and the addresses of the arena in which we looked for it, so we can potentially tell why we missed it...OK, makes sense.

@@ +136,5 @@
> +    log << "Failed to find pointer " << aPtr << " within arena blocks: ";
> +    for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) {
> +      log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), ";
> +    }
> +    MOZ_CRASH_UNSAFE_OOL(log.str().c_str());

Usages of this macro need a data-review to ensure we're not sending private data.
Attachment #8924049 - Flags: review?(nfroyd) → review+
Comment on attachment 8924049 [details] [diff] [review]
more-assertions

Review of attachment 8924049 [details] [diff] [review]:
-----------------------------------------------------------------

(Sorry, I should have just done this as part of my review, redirecting your review request)

Requesting data-review for sending runtime-defined strings with MOZ_CRASH_UNSAFE_OOL: we're sending information about memory addresses of internal data structures.  We believe this falls into Category 1 ("technical data").  This data is being collected to understand crashes we're seeing in the wild better and is being collected for nightly builds and early beta builds.
Attachment #8924049 - Flags: review?(francois)
Comment on attachment 8924049 [details] [diff] [review]
more-assertions

Review of attachment 8924049 [details] [diff] [review]:
-----------------------------------------------------------------

datareview+ based on the description from comment 28.
Attachment #8924049 - Flags: review?(francois) → review+
Comment on attachment 8924049 [details] [diff] [review]
more-assertions

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

Not easily, we still haven't figured out what the issue is.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

It makes the outcome fairly obvious, but doesn't really help finding the cause (until we get crash reports).

Which older supported branches are affected by this flaw?

All.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

No, but this code hasn't changed much, should be easy.

How likely is this patch to cause regressions; how much testing does it need?

Unlikely, just adding some new assertions.
Attachment #8924049 - Flags: sec-approval?
Comment on attachment 8924049 [details] [diff] [review]
more-assertions

sec-approval+. Land away.
Attachment #8924049 - Flags: sec-approval? → sec-approval+
Crash report with the newer assertions:

https://crash-stats.mozilla.com/report/index/be49f893-9f70-4d7e-a265-d3a3e0171105

It's failing the mSentinel check, that's pretty bad. The only explanation I can think of is heap corruption/buffer overflow. Someone is overwriting memory belonging to a display item.

I'm not sure what to do next to track down where this is coming from, will think more about it.
Crash report from the other assertion, showing the arena blocks:

https://crash-stats.mozilla.com/report/index/25293941-d437-484f-8195-6a5b60171108
Nathan: Looks like one of the arena chunks there has 0 for the 'offset' value. That's pretty bizarre.

The arena is 8192, so the failing pointer is probably in that chunk, if it had the expected end value.

I can't see any way we could get a 0 there though, excluding heap corruption.
Flags: needinfo?(nfroyd)
The 'offset' value is part of the header, which is in the same allocation as the actual arena memory that we give out. So this could be roughly the same as the other crash report, something is corrupting memory within our arena blocks.

It looks like the corrupt one doesn't have another block right before it, so it would seem possible that we've got a buffer overflow from something else overflowing into an arena block.

This is a fairly common crash (more than I think would be expected for generic heap corruption), so the overflowing memory is likely something else we allocate during painting.
(In reply to Matt Woodrow (:mattwoodrow) from comment #36)
> Nathan: Looks like one of the arena chunks there has 0 for the 'offset'
> value. That's pretty bizarre.
> 
> The arena is 8192, so the failing pointer is probably in that chunk, if it
> had the expected end value.
> 
> I can't see any way we could get a 0 there though, excluding heap corruption.

That *is* pretty weird!  ArenaAllocator::mHead is initialized with offset=0, tail=0, but no other ArenaChunk is.  And mHead isn't even included in the list of chunks; it's just there as a placeholder.

The only ways I can think of a zero getting there are:

1. Some kind of weird race.  I can't really construct a plausible race for this, though.
2. Overflow in ArenaChunk::Allocate:

  header.offset += aSize;

That'd have to be a *very* specific aSize, though; comment 35's crash, for instance, would require an allocation of ~2.5GB or so.  I guess that could arise from allocation sizes overflowing before coming into ArenaAllocator...we could add some overflow checks in ArenaAllocator to determine whether that's what's happening.  But IIUC, we can't get an overflowing size there, since we're coming from nsPresArena's pooled allocator, and we don't allocate arbitrary (i.e. calculated) sizes there...do we?

We could try inserting (several!) CorruptionCanary members (from mfbt/Poison.h) into ArenaHeader and checking them at various points.  That would provide solid evidence of heap corruption, but wouldn't move us much closer to figuring out what was going on. :(

Maybe we should try printing the tail member, too?

Checking for overflow would have the virtue--if the overflow check(s) ever failed--of telling us exactly where things were going wrong, rather than "somebody scribbled over this memory at some point in the past".
Flags: needinfo?(nfroyd)
I really can't see how it could be overflow, shouldn't the size check prevent us from allocating a size that would go past the tail pointer?

If we did overflow, then I think further allocation attempts would succeed, and we'd never move on to a new chunk.

As you said, we also don't do arbitrary allocations from the display list builder, so I can't see how we'd ever get something big enough.

I'll have a go at the canary option.
Attached patch canary-checkSplinter Review
Attachment #8927693 - Flags: review?(nfroyd)
Blocks: 1415496
Comment on attachment 8927693 [details] [diff] [review]
canary-check

Review of attachment 8927693 [details] [diff] [review]:
-----------------------------------------------------------------

I'm really hesitant to remove the logging in favor of the canary checking.  I'd prefer to keep both, if possible.  WDYT?

::: xpcom/ds/ArenaAllocator.h
@@ -134,5 @@
>      }
> -    std::stringstream log;
> -    log << "Failed to find pointer " << aPtr << " within arena blocks: ";
> -    for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) {
> -      log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), ";

I haven't looked through the crash reports to tell if there have been multiple reports with a 0 offset...do we really want to lose the logging here and/or do we want to print header.tail into this to try and understand better what's going on?  Couldn't we, say, keep the logging and canary.Check() in the DebugContains() loop above?  We'd crash on the check if there was heap corruption, but we'd still get useful information from the logging if there *wasn't* heap corruption.

Just looking at the crash report summaries, we're still hitting read/exec breakpoints on sometimes-freed memory with the signatures from this bug and not hitting the crash here...maybe several different things are being conflated.
Attachment #8927693 - Flags: review?(nfroyd)
The primary issue is that this logging is really expensive, we've got quite a few open regression bugs because of it.

Retained display lists landed, and showed a regression on the display list mutation talos test, just because this logging adds overhead to display list operations (we show a 30% win without the logging).

I realize that the logging is restricted to dev builds, but it's interfering with our ability to track performance regressions accurately, and that's not great.

I do think you're right that there's more than one way that this can crash, I'm just not sure this logging is helping us narrow it down enough to justify the cost.
Hm, that's fair.  But is the canary going to be much cheaper?

The string logging is only activated if there's ever a problem, so we don't care about how expensive that is.  The cost the benchmarks are detecting is really the loop + DebugContains check.  The DebugContains check is two comparisons, whereas the canary check is a single comparison.  So that will be a little cheaper, but we're still paying a bunch of cache misses to wander through memory, which I'd bet is the performance problem.  We could make the DebugContains check a single unsigned comparison (I think) *and* do the Canary check and it wouldn't be much more expensive than what we're doing now...

Have you measured the cost of the current checks vs. canary checking?

Another alternative would be to make the individual arenas bigger, which would waste some memory, but give us fewer things to loop over when we're doing the debug checks.
The big difference is the canary check (as currently written) only happens 3 times during display list building, whereas DebugContains runs per-display-item (can easily be 20k times per-paint).
(In reply to Matt Woodrow (:mattwoodrow) from comment #44)
> The big difference is the canary check (as currently written) only happens 3
> times during display list building, whereas DebugContains runs
> per-display-item (can easily be 20k times per-paint).

Aha, I totally missed that.  My mistake.

I'm still not excited about re-doing data review if we have to add annotations for the crash, but maybe the canary check will reveal new information.  Worth a shot.
Comment on attachment 8927693 [details] [diff] [review]
canary-check

Review of attachment 8927693 [details] [diff] [review]:
-----------------------------------------------------------------

Let's try this, then.
Attachment #8927693 - Flags: review+
Comment on attachment 8927693 [details] [diff] [review]
canary-check

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

Same as above, more attempts to try track down the cause of this bug.
Attachment #8927693 - Flags: sec-approval?
sec-approval+ for trunk.
Attachment #8927693 - Flags: sec-approval? → sec-approval+
Blocks: 1411988
Hi Matt, did your latest patch shed any light on the situation?
Flags: needinfo?(matt.woodrow)
Looks like there's one crash so far [1], and it's crashing at the first check point [2].

That suggests that the bug is in display list building itself, not FrameLayerBuilder.

Need to think of something to narrow it down further.

[1] https://crash-stats.mozilla.com/report/index/3ee5702a-db44-4411-805c-fa1840171128
[2] https://searchfox.org/mozilla-central/source/layout/base/nsLayoutUtils.cpp#3922
Comment on attachment 8927693 [details] [diff] [review]
canary-check

Approval Request Comment
[Feature/Bug causing the regression]: Unknown.
[User impact if declined]: Needed as part of trying to debug 1421345.
[Is this code covered by automated tests?]: No.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue.
[String changes made/needed]: None.
Flags: needinfo?(matt.woodrow)
Attachment #8927693 - Flags: approval-mozilla-beta?
Comment on attachment 8927693 [details] [diff] [review]
canary-check

After discussing with dev in mail thread, we can live with this and let's fix this in 59.
Attachment #8927693 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Still seeing exploitable looking crashes in 59 beta. Matt, anything else to try here?
Flags: needinfo?(matt.woodrow)
Attached patch debug-dl-canarySplinter Review
There's some more interesting crash reports here: https://crash-stats.mozilla.com/signature/?moz_crash_reason=~Canary%20check%20failed&signature=nsPresArena%3A%3AAllocate&date=%3E%3D2018-01-07T03%3A26%3A09.000Z&date=%3C2018-02-07T03%3A26%3A09.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports

That's a canary check every time we try to allocate something new from the arena.

I was hoping the stacks would have something in common that would point to the issue, but I haven't found it yet. nsFlexContainerFrame shows up fairly frequently (more than I'd expect), but auditing the display list code there hasn't given me any ideas.

Adding a canary check each time we enter/exit a BuildDisplayList variant should hopefully narrow this down even further.
Flags: needinfo?(matt.woodrow)
Attachment #8948548 - Flags: review?(nfroyd)
Comment on attachment 8948548 [details] [diff] [review]
debug-dl-canary

Review of attachment 8948548 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me  More diagnostics is more better.
Attachment #8948548 - Flags: review?(nfroyd) → review+
Comment on attachment 8948548 [details] [diff] [review]
debug-dl-canary

[Security approval request comment]
Yet another diagnostic patch to try track this down. Not sure how hard it is to exploit since we haven't been able to figure out how to trigger it.

Approval Request Comment
[Feature/Bug causing the regression]: Unknown.
[User impact if declined]: Needed as part of trying to debug 1421345.
[Is this code covered by automated tests?]: No.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue.
[String changes made/needed]: None.
Attachment #8948548 - Flags: sec-approval?
Attachment #8948548 - Flags: approval-mozilla-beta?
Comment on attachment 8948548 [details] [diff] [review]
debug-dl-canary

approvals given!
Attachment #8948548 - Flags: sec-approval?
Attachment #8948548 - Flags: sec-approval+
Attachment #8948548 - Flags: approval-mozilla-beta?
Attachment #8948548 - Flags: approval-mozilla-beta+
(In reply to Matt Woodrow (:mattwoodrow) from comment #57)
> [Is this code covered by automated tests?]: No.
> [Has the fix been verified in Nightly?]: No.
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Setting qe-verify- based on Matt's assessment on manual testing needs.
Flags: qe-verify-
This keeps getting more confusing.

There's a bunch of crashes with the latest diagnostic patch, and this one is particularly interesting:

https://crash-stats.mozilla.com/report/index/e5aad7dd-483c-4653-ada5-d55600180214

The caller of nsTextFrame::BuildDisplayList called nsDisplayListBuilder::Check() right before we called BuildDisplayList.

I checked the minidump and isSelected is still Nothing(), so we didn't take the branch there, and basically didn't execute any code before creating the display item and crashing.

That really sounds like someone else is corrupting our memory and we just crash at random times.

As with bug 1359569, these crashes are all with D2DEnabled:false, which almost certainly is meaningful, but I have no idea how not having d2d means that another thread corrupts our memory.
I also went through all the other threads in the crash report, nothing seems to be running relevant code.
I can leave this tracked for 59 for now, but it sounds like this may not be actionable before the 59 release.
Matt, I understand this is a tricky issue, but maybe there are some newer crashes with your diagnostic patch applied that might be more conclusive? Would you please take a look?
Flags: needinfo?(matt.woodrow)
(In reply to Frederik Braun [:freddyb] from comment #64)
> Matt, I understand this is a tricky issue, but maybe there are some newer
> crashes with your diagnostic patch applied that might be more conclusive?
> Would you please take a look?

Thanks for reminding me to look at this.

Most of the previous diagnostic path was calls to nsDisplayListBuilder::Check, which is behind a MOZ_DIAGNOSTIC_ASSERT ifdef. My understanding is that this won't do anything in beta builds (though it should for dev edition?).

All the crashes I see from beta look to be true beta (based on telemetry environment settings:update:channel:"beta"), so I think the conclusion in comment 61 is incorrect.

I'm going to write some patches to try correct this, and hopefully get more interesting result.

I still see a strong correlation between the crashes and text, flexbox and tables. No progress on figuring out why that might be though.
Flags: needinfo?(matt.woodrow)
Fixed patch, one that should actually do something useful on beta. This is identical to the previous patch, but without the MOZ_DIAGNOSTIC_ASSERT_ENABLED, so it should be ok to carry the review and approval.
Attachment #8969134 - Flags: review+
Comment on attachment 8969134 [details] [diff] [review]
debug-dl-canary v2

Same patch as before, just without the ifdef that made it not work on beta.

[Security approval request comment]
Yet another diagnostic patch to try track this down. Not sure how hard it is to exploit since we haven't been able to figure out how to trigger it.

Approval Request Comment
[Feature/Bug causing the regression]: Unknown.
[User impact if declined]: Needed as part of trying to debug 1421345.
[Is this code covered by automated tests?]: No.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No
[Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue.
[String changes made/needed]: None.
Attachment #8969134 - Flags: sec-approval?
Attachment #8969134 - Flags: approval-mozilla-beta?
Comment on attachment 8969134 [details] [diff] [review]
debug-dl-canary v2

Approvals given
Attachment #8969134 - Flags: sec-approval?
Attachment #8969134 - Flags: sec-approval+
Attachment #8969134 - Flags: approval-mozilla-beta?
Attachment #8969134 - Flags: approval-mozilla-beta+
Few reports starting to show up with the new checks:

https://crash-stats.mozilla.com/report/index/b0e0ea6e-8201-42ad-a19c-502c60180423
https://crash-stats.mozilla.com/report/index/651cea1f-0055-48fe-9a47-071390180422

ScrollFrameHelper::BuildDisplayList looks very suspicious at this point, still can't find anything obviously wrong with it though.
Attached patch more-checksSplinter Review
Same as before, just adding some more checks specific to the code that looks close.
Attachment #8970400 - Flags: sec-approval?
Attachment #8970400 - Flags: review?(mstange)
Attachment #8970400 - Flags: approval-mozilla-beta?
Al, I'm ok with this for beta 16 if you want to give it sec-approval.
Flags: needinfo?(abillings)
Comment on attachment 8970400 [details] [diff] [review]
more-checks

sec-approval+ given and I marked Beta as well.
Flags: needinfo?(abillings)
Attachment #8970400 - Flags: sec-approval?
Attachment #8970400 - Flags: sec-approval+
Attachment #8970400 - Flags: approval-mozilla-beta?
Attachment #8970400 - Flags: approval-mozilla-beta+
Attachment #8970400 - Flags: review?(mstange) → review+
Hi there, I have a user experiencing the this crash if you need more information. [Apologies, I don't mean to spam the bug] this is the thread we are working with the user in.https://support.mozilla.org/en-US/questions/1216050

I will ask them to try beta.
No longer blocks: RDLbugs
(In reply to guigs [guigs] PST (please needinfo me!) from comment #77)
> Hi there, I have a user experiencing the this crash if you need more
> information. [Apologies, I don't mean to spam the bug] this is the thread we
> are working with the user
> in.https://support.mozilla.org/en-US/questions/1216050
> 
> I will ask them to try beta.

That's really interesting! I did not expect this to be consistently reproducible.

Any chance you could try get them to test out a Nightly ASAN build?

The latest one is available here: https://queue.taskcluster.net/v1/task/TByH83HhRCunzxqwqq774A/runs/0/artifacts/public/build/target.zip
Flags: needinfo?(rmcguigan)
I went through all the relevant release-60 crash reports that I could find and tried to analyze them as best I could:

https://pastebin.mozilla.org/9085739

Some of the callstack information is confusing in crash-stats thanks to inlined code, and sometimes we seem to miss the BuildDisplayListForChild stack frame that calls into CheckCanary. The BuildDisplayListForChild line numbers in the pastebin are obtained by looking at the minidump, looking at stack memory, pulling the return address and then looking up what code that corresponds to.

Mats, as I said in bug 1436505, this seems like it might actually be similar.

I'd really appreciate if you could have a look over this when you get time. Thanks!
Flags: needinfo?(mats)
Flags: needinfo?(rmcguigan)
(replied in bug 1436505)
Flags: needinfo?(mats)
While I'm still very-much interested in fixing this in 61 if at all possible, I don't think tracking it is going to help at this point. Please don't hesitate to nominate a patch for uplift should the root cause be found and fixed in time.
I'd also happily take a fix here in 62 beta. 
Note, bug 1198710, bug 1406727, bug 1436505, and bug 1467526 may all be related.
Keywords: stalled

The leave-open keyword is there and there is no activity for 6 months.
:mattwoodrow, maybe it's time to close this bug?

Flags: needinfo?(matt.woodrow)

The crash rate is super low, but I guess we can leave this open, since it still happens sometimes.

Flags: needinfo?(matt.woodrow)

The bug assignee didn't login in Bugzilla in the last 7 months and this bug has priority 'P2'/severity 'critical'.
:miko, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: matt.woodrow → nobody
Flags: needinfo?(mikokm)

We have had a total of 10 crashes in the last six months, last crashing version is 91.6.0esr.

Status: REOPENED → RESOLVED
Closed: 7 years ago2 years ago
Flags: needinfo?(mikokm)
Resolution: --- → WORKSFORME
Keywords: leave-open

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.