1359569 - Crash in nsDisplayList::GetClippedBoundsWithRespectToASR

Reporter

Description

•

8 years ago

This bug was filed from the Socorro interface and is 
report bp-e6ae93c0-4862-4512-9e6c-449dc0170425.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsDisplayList::GetClippedBoundsWithRespectToASR(nsDisplayListBuilder*, mozilla::ActiveScrolledRoot const*) 	layout/painting/nsDisplayList.cpp:1860
1 	xul.dll 	mozilla::FrameLayerBuilder::BuildContainerLayerFor(nsDisplayListBuilder*, mozilla::layers::LayerManager*, nsIFrame*, nsDisplayItem*, nsDisplayList*, mozilla::ContainerLayerParameters const&, mozilla::gfx::Matrix4x4Typed<mozilla::gfx::UnknownUnits, mozilla::gfx::UnknownUnits> const*, unsigned int) 	layout/painting/FrameLayerBuilder.cpp:5618
2 	xul.dll 	nsDisplayList::PaintRoot(nsDisplayListBuilder*, nsRenderingContext*, unsigned int) 	layout/painting/nsDisplayList.cpp:2120
3 	xul.dll 	nsLayoutUtils::PaintFrame(nsRenderingContext*, nsIFrame*, nsRegion const&, unsigned int, nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) 	layout/base/nsLayoutUtils.cpp:3683
4 	xul.dll 	mozilla::PresShell::Paint(nsView*, nsRegion const&, unsigned int) 	layout/base/PresShell.cpp:6490
5 	xul.dll 	nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) 	view/nsViewManager.cpp:483
6 	xul.dll 	nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) 	view/nsViewManager.cpp:415
7 	xul.dll 	nsViewManager::ProcessPendingUpdates() 	view/nsViewManager.cpp:1104

this crash signature on windows is regressing in firefox 54 builds and upwards which looks related to the landing of bug 1298218.

Gerry Chang [:gchang]

Comment 1

•

8 years ago

Hi :mstange, 
Can you help take a look at this one?

Flags: needinfo?(mstange)

Mats Palmgren (inactive)

Updated

•

8 years ago

Component: Layout → Layout: Web Painting

Peter Chang[:pchang]

Comment 2

•

8 years ago

Matt, could you take a look at this issue?

Flags: needinfo?(mstange) → needinfo?(matt.woodrow)

Matt Woodrow (:mattwoodrow)

Comment 3

•

8 years ago

Had a quick look. The crash is on youtube.

It looks like mClip for the affected display item is 0x8, and we're crashing trying to deference that (within ApplyNonRoundedIntersection, within GetClippedBounds).

Not sure how we'd end up with a bad clip pointer like that though. Markus is still probably the best person to look into this.

Flags: needinfo?(matt.woodrow) → needinfo?(mstange)

Markus Stange [:mstange]

Assignee

Comment 4

•

8 years ago

I'm not sure either.

mClip is written two in a few places, but there are only three non-null values that get assigned to it:
 - &mClipChain->mClip
 - DisplayItemClipChain::ClipForASR(mClipChain, aActiveScrolledRoot)
 - DisplayItemClipChain::ClipForASR(aClipChain, mActiveScrolledRoot)

In the first case, in order for &mClipChain->mClip to be 0x8, mClipChain needs to be 0x8 (because mClip is DisplayItemClipChain's first field), but mClipChain is allocated in the previous line and aBuilder->AllocateDisplayItemClipChain cannot return 0x8.

In the second case, in order for DisplayItemClipChain::ClipForASR(mClipChain, aActiveScrolledRoot) to be 0x8, either mClipChain needs to be 0x8 or the mParent pointer on one of its ancestor clip chains needs to be 0x8.
In order for mClipChain to be 0x8, aBuilder->ClipState().GetCurrentCombinedClipChain(aBuilder) needs to have returned 0x8.

That's as far as I've gotten now, more to follow later.

Assignee: nobody → mstange

Status: NEW → ASSIGNED

Flags: needinfo?(mstange)

Randell Jesup [:jesup] (needinfo me)

Comment 5

•

7 years ago

UAF signature in many cases.  Perhaps fallout from Bug 1298218?

Markus?

Group: core-security

Flags: needinfo?(mstange)

Keywords: csectype-uaf, sec-high

Daniel Veditz [:dveditz]

Updated

•

7 years ago

Group: core-security → layout-core-security

Keywords: testcase-wanted

Jet Villegas (inactive)

Comment 6

•

7 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #3)
> Had a quick look. The crash is on youtube.

Got a specific URL?

Flags: needinfo?(matt.woodrow)

Matt Woodrow (:mattwoodrow)

Comment 7

•

7 years ago

I don't think I'm supposed to post those unfortunately, even in sec bugs.

It's just a normal YouTube video url, can send it to you privately if you wish.

I had a look at the other reports and it's all over the place, not youtube specific.

Flags: needinfo?(matt.woodrow)

Astley Chen (inactive)

Comment 8

•

7 years ago

Markus, do you have any update on this bug?

u459114

Comment 9

•

7 years ago

(In reply to Markus Stange [:mstange] from comment #4)
> I'm not sure either.
> 
> mClip is written two in a few places, but there are only three non-null
> values that get assigned to it:
>  - &mClipChain->mClip
>  - DisplayItemClipChain::ClipForASR(mClipChain, aActiveScrolledRoot)
>  - DisplayItemClipChain::ClipForASR(aClipChain, mActiveScrolledRoot)
Or memory overrun in nsDisplayListBuilder::mPool? otherwise, it's seem not possible that nsDisplayItem::mClip can be 0x8. But I can not prove it since I can't repro this crash locally. It might help to add a dog tag in mozilla::ArenaAllocator?

Markus Stange [:mstange]

Assignee

Comment 10

•

7 years ago

(In reply to Astley Chen [:astley] (UTC+8) from comment #8)
> Markus, do you have any update on this bug?

Unfortunately not. I'm going to try to back out bug 1298218 from Beta.

(In reply to C.J. Ku[:cjku](UTC+8) from comment #9)
> Or memory overrun in nsDisplayListBuilder::mPool?

Not sure what you mean. If we're out of memory, Allocate aborts the process.

Flags: needinfo?(mstange)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 11

•

7 years ago

(In reply to Markus Stange [:mstange] from comment #10)
> Unfortunately not. I'm going to try to back out bug 1298218 from Beta.

That seems.. scary. Specially considering we're on RC builds now. I don't know if the RelMan will accept such a large change so late.

Markus Stange [:mstange]

Assignee

Comment 12

•

7 years ago

You're right. At this point I don't really know what options we have.

Markus Stange [:mstange]

Assignee

Comment 13

•

7 years ago

I've found one DisplayItemClipChain pointer that might conceivably be a dangling pointer, i.e. it might be used across paints when the display list bulider's arena has already gone away: http://searchfox.org/mozilla-central/rev/d441cb24482c2e5448accaf07379445059937080/layout/generic/ViewportFrame.cpp#120

I'll investigate more tomorrow.

Randell Jesup [:jesup] (needinfo me)

Comment 14

•

7 years ago

Let's get a fix; then we can decide if it will be a point release, a ride-along, or wait until 55 (and when in 55 (beta) it would land)

Markus Stange [:mstange]

Assignee

Comment 15

•

7 years ago

(In reply to Markus Stange [:mstange] from comment #13)
> I've found one DisplayItemClipChain pointer that might conceivably be a
> dangling pointer, i.e. it might be used across paints when the display list
> bulider's arena has already gone away:
> http://searchfox.org/mozilla-central/rev/
> d441cb24482c2e5448accaf07379445059937080/layout/generic/ViewportFrame.cpp#120
> 
> I'll investigate more tomorrow.

Unfortunately this idea went nowhere.
The frame property is added by nsDisplayListBuilder::MarkFramesForDisplayList and removed by nsDisplayListBuilder::ResetMarkedFramesForDisplayList in a way that's quite easy to reason about and I don't see how it might go wrong. The only way I could image it going wrong is if mFirstFrameMarkedForDisplay (which is a uint32_t) overflows, but in order for that to happen, we'd need to iterate over more than 2^32 frames during a single paint, and most of these crashes are happening in 32bit builds so that's impossible.

Nathan Froyd [:froydnj]

Updated

•

7 years ago

status-firefox54: affected → wontfix

Nathan Froyd [:froydnj]

Updated

•

7 years ago

status-firefox54: wontfix → fix-optional

Markus Stange [:mstange]

Assignee

Comment 16

•

7 years ago

There seem to be two different crashes in this bug. In some of these crashes, GetClippedBoundsWithRespectToASR is called from FrameLayerBuilder::BuildContainerLayerFor, and in others it's called from nsDisplayWrapList::UpdateBounds.

The former is what comment 3 and comment 4 analyzed. But the latter is almost as frequent.

E.g. among the last 21 crashes on Beta, the distribution is as follows:
  13 BuildContainerLayerFor
  8 UpdateBounds

In the UpdateBounds crashes, it looks like there's an nsDisplayItem in the nsDisplayList whose vtable pointer is garbage. This seems like an existing problem and is unlikely to be a regression from bug 1298218. (But this is still a really bad problem to have!)

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox54: fix-optional → wontfix

status-firefox56: --- → ?

Stephanie Ouillon [:arroway] (needinfo me)

Comment 17

•

7 years ago

Markus, what's the next step here? Do you need help to investigate this?

Flags: needinfo?(mstange)

Markus Stange [:mstange]

Assignee

Comment 18

•

7 years ago

I've run out of ideas on what to do here. We could add more instrumentation, but doing that without performance overhead is tricky, and at the rate this is crashing on Nightly, it'd probably take at least a week between iterations.

Flags: needinfo?(mstange)

Mike Taylor [:miketaylr]

Updated

•

7 years ago

status-firefox55: affected → wontfix

status-firefox56: ? → wontfix

status-firefox57: --- → affected

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox58: --- → affected

Jet Villegas (inactive)

Updated

•

7 years ago

Priority: -- → P2

Jim Mathies [:jimm]

Updated

•

7 years ago

status-firefox57: affected → fix-optional

Stephanie Ouillon [:arroway] (needinfo me)

Updated

•

7 years ago

Whiteboard: [sec-triage-backlog]

Randell Jesup [:jesup] (needinfo me)

Comment 19

•

7 years ago

Still crashing at about 50/day (there was a huge spike, but I presume that's unrelated to the baseline crash rate)

We should consider adding some sort of instrumentation.  Matt?  or other ideas how to find this?

Flags: needinfo?(matt.woodrow)

Matt Woodrow (:mattwoodrow)

Comment 20

•

7 years ago

I'm struggling for ideas on this one. It appears to be corruption, but figuring out what is corrupting it is hard.

Looking at the stats, it seems that Windows 7 is really high (71.4% over the last 6 months).

I tried narrowing the search to include the telemetry environment not containing "ContentBackend":"Skia", and 2354 results got filtered down to 8. Of those 8, some were reporting Cairo, and others Direct2D 1.1 (but D2DEnabled:false, seems like a reporting error).

I also searched for D2DEnabled:true, and got literally 0 results (I checked the same search on all reports and got plenty of results).

That strikes me as likely meaningful, and I suspect the content backend distribution is causing the skewed platform distribution (we never get d2d on win7, and usually get it on Win10).

I have no idea how the content backend could be causing corruption during display list building though.

Crashes for bug 1406727 seem to have very similar correlations too, so it's possibly the same underlying bug.

Jeff, Markus, anybody got ideas here?

Flags: needinfo?(mstange)

Flags: needinfo?(matt.woodrow)

Flags: needinfo?(jmuizelaar)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 21

•

7 years ago

There would be the timing difference between the backends, but that's a shot in the dark.

Continuing on Markus' idea from comment 18, maybe it is time to regress performance on nightly for a week to see if extra instrumentation helps us.

Jeff Muizelaar [:jrmuizel]

Comment 22

•

7 years ago

I have none.

Flags: needinfo?(jmuizelaar)

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox57: fix-optional → wontfix

status-firefox58: affected → wontfix

status-firefox59: --- → wontfix

status-firefox60: --- → affected

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox60: affected → wontfix

status-firefox61: --- → fix-optional

status-firefox62: --- → affected

status-firefox-esr60: --- → affected

Matt Woodrow (:mattwoodrow)

Updated

•

6 years ago

Keywords: stalled

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 23

•

6 years ago

So I just took a look at bp-73b4b368-3e6e-4581-be91-90a920180831, and it looks like it could be a bit flip, or could be some other form of memory corruption.  It's crashing trying to make a virtual function call (looks like nsDisplayItem::GetBounds, I think) where the vtable pointer (in RAX) is 0x20000007FEEF7E5D.  Except for the the 3rd-highest bit being set, this looks like it could be a valid pointer...

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 24

•

6 years ago

(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #23)
> So I just took a look at bp-73b4b368-3e6e-4581-be91-90a920180831, and it
> looks like it could be a bit flip, or could be some other form of memory
> corruption.  It's crashing trying to make a virtual function call (looks
> like nsDisplayItem::GetBounds, I think) where the vtable pointer (in RAX) is
> 0x20000007FEEF7E5D.  Except for the the 3rd-highest bit being set, this
> looks like it could be a valid pointer...

Actually, given what the instruction pointers are, that one looks more like the vtable pointer itself was read one byte off, since the 0x20 at the top doesn't belong, and shifting everything left by 8 bits would make it look a lot more like a code address.  This makes sense since the this pointer (in RDI and RCX) is 0x000000000e5164a1... which is clearly not a valid pointer, but could be if you consider the low bit flipped.


I also looked at bp-f6c6f504-02d3-47d0-b88e-32fce0180831, which is crashing at the same virtual function call (again, presumably nsDisplayItem::GetBounds inside the inlined nsDisplayItem::GetClippedBounds), and this time RAX looks a lot more like a code address (0x000007fe70db2d00) but it's not a correct address within the binary.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 25

•

6 years ago

(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #24)
> (In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #23)
> > So I just took a look at bp-73b4b368-3e6e-4581-be91-90a920180831, and it
> > looks like it could be a bit flip, or could be some other form of memory
> > corruption.  It's crashing trying to make a virtual function call (looks
> > like nsDisplayItem::GetBounds, I think) where the vtable pointer (in RAX) is
> > 0x20000007FEEF7E5D.  Except for the the 3rd-highest bit being set, this
> > looks like it could be a valid pointer...
> 
> Actually, given what the instruction pointers are, that one looks more like
> the vtable pointer itself was read one byte off, since the 0x20 at the top
> doesn't belong, and shifting everything left by 8 bits would make it look a
> lot more like a code address.  This makes sense since the this pointer (in
> RDI and RCX) is 0x000000000e5164a1... which is clearly not a valid pointer,
> but could be if you consider the low bit flipped.

And after further poking around at the memory in visual studio (dereferencing things in that memory to see what functions they are... which leads to some confusion due to identical function coalescing), since stuff looked like a vtable, I determined that Visual Studio says that nsDisplayCompositorHitTestInfo::`vftable' is 0x000007FEEF7E5DD8, which means that the this pointer in RDX/RCX, with the 1-bit unflipped, looks like it points to a valid nsDisplayCompositorHitTestInfo.  (I'm reasonably confident that's the only start-of-vtable in that 256-byte range, since the start of the range is in the middle of nsDisplayBlendContainer::`vftable' which starts at 0x000007FEEF7E5C08, and nsDisplayItem has 56 virtual functions (counting the destructor once) in 61.0.2, which seems to match the 58-word gap between those vtables pretty well.)

So I'm pretty confident in the bit-flip analysis for bp-73b4b368-3e6e-4581-be91-90a920180831, though still confused about the other one.

Matt Woodrow (:mattwoodrow)

Comment 26

•

6 years ago

Thanks David, that's really useful analysis! There's quite a few open Web Painting bugs that all appear to be some form of corruption, and I suspect that they might all be related (or the same underlying bug).

I have a vague suspicion that a single bit being flipped would be a symptom of us setting a frame state bit on an invalid nsIFrame*, which then corrupts whoever now owns that memory. Display list building not only sets state bits, but also allocates big chunks of memory (using the same arena chunk size as the pres shell), so would be a likely candidate to have taken ownership.

https://bugzilla.mozilla.org/show_bug.cgi?id=1436505#c19 has some discussion on what appears to be a different outcome from maybe the same problem, where we just crash accessing an invalid frame in the frame tree.

Is it possible to look at the memory layout of nsIFrame and figure out which frame state bits (or inline bool:1s) could possibly line up with the bit that changed here?

Randell Jesup [:jesup] (needinfo me)

Comment 27

•

6 years ago

Remember that single-bit-flips is the dominant failure mode for RAM -- especially overclocked, but with multi-GB of non-ECC ram, and with 100's of MB (or a few GB) of ram in use by the browser, you will see "random" bit-flips.  This happens enough that people bitsquat DNS addresses (http://dinaburg.org/bitsquatting.html)

That doesn't mean we can't set a 1-bit value in a previously-freed address...but my default assumption is HW flip unless there's a reason to suspect a pattern in the flips or an inordinate number hitting a specific allocation or set

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 28

•

6 years ago

That bit would be NS_FRAME_IN_REFLOW... which doesn't seem all that likely?

Liz Henry (:lizzard) (relman/hg->git project)

Comment 29

•

6 years ago

Marking wontfix for 62/63 since we are about to release 63.

status-firefox61: fix-optional → wontfix

status-firefox62: affected → wontfix

status-firefox63: --- → wontfix

status-firefox64: --- → affected

Daniel Veditz [:dveditz]

Updated

•

6 years ago

Group: layout-core-security → gfx-core-security

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

status-firefox64: affected → wontfix

status-firefox-esr60: affected → wontfix

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

3 years ago

Crash Signature: [@ nsDisplayList::GetClippedBoundsWithRespectToASR] → [@ mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR] [@ nsDisplayList::GetClippedBoundsWithRespectToASR]

BMO Automation

Updated

•

2 years ago

Severity: critical → S2

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

2 years ago

Crash Signature: [@ mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR] [@ nsDisplayList::GetClippedBoundsWithRespectToASR] → [@ mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR] [@ nsDisplayList::GetClippedBoundsWithRespectToASR] [@ nsRect::SaturatingUnionEdges] [@ nsRect::SaturatingUnion]

Jeff Muizelaar [:jrmuizel]

Updated

•

2 years ago

Severity: S2 → S3

BugBot [:suhaib / :marco/ :calixte]

Comment 30

•

2 years ago

The severity field for this bug is set to S3. However, the bug is flagged with the sec-high keyword.
:mstange, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mstange.moz)

Markus Stange [:mstange]

Assignee

Comment 31

•

2 years ago

Oh wow, this is still happening, it seems. Timothy, can you take over?

Status: ASSIGNED → NEW

Flags: needinfo?(tnikkel)

Flags: needinfo?(mstange.moz)

BugBot [:suhaib / :marco/ :calixte]

Comment 32

•

2 years ago

The bug is linked to topcrash signatures, which match the following criterion:

Top 10 content process crashes on beta

:mstange, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mstange.moz)

Keywords: topcrash

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 33

•

2 years ago

See bug 1796126 for the recent crashes.

Comment 34

•

2 years ago

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Markus Stange [:mstange]

Assignee

Updated

•

2 years ago

Flags: needinfo?(mstange.moz)

Andrew McCreight [:mccr8]

Comment 35

•

7 months ago

I don't see any recent crashes that are obvious crashes on the jemalloc poison value. A fair number of sig ill kind of crashes, but I don't think leaving a bug open that is this old is useful.

Status: NEW → RESOLVED

Closed: 7 months ago

Flags: needinfo?(tnikkel)

Resolution: --- → INCOMPLETE

BugBot (nomail) [:suhaib / :marco/ :calixte]

Comment 36

•

7 months ago

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled

Daniel Veditz [:dveditz]

Updated

•

7 months ago

Group: gfx-core-security