Closed Bug 1359569 Opened 8 years ago Closed 7 months ago

Crash in nsDisplayList::GetClippedBoundsWithRespectToASR

Categories

(Core :: Web Painting, defect, P2)

54 Branch
All
Windows
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox-esr45 --- unaffected
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox53 --- unaffected
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix

People

(Reporter: philipp, Assigned: mstange)

References

Details

(5 keywords, Whiteboard: [sec-triage-backlog])

Crash Data

This bug was filed from the Socorro interface and is 
report bp-e6ae93c0-4862-4512-9e6c-449dc0170425.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsDisplayList::GetClippedBoundsWithRespectToASR(nsDisplayListBuilder*, mozilla::ActiveScrolledRoot const*) 	layout/painting/nsDisplayList.cpp:1860
1 	xul.dll 	mozilla::FrameLayerBuilder::BuildContainerLayerFor(nsDisplayListBuilder*, mozilla::layers::LayerManager*, nsIFrame*, nsDisplayItem*, nsDisplayList*, mozilla::ContainerLayerParameters const&, mozilla::gfx::Matrix4x4Typed<mozilla::gfx::UnknownUnits, mozilla::gfx::UnknownUnits> const*, unsigned int) 	layout/painting/FrameLayerBuilder.cpp:5618
2 	xul.dll 	nsDisplayList::PaintRoot(nsDisplayListBuilder*, nsRenderingContext*, unsigned int) 	layout/painting/nsDisplayList.cpp:2120
3 	xul.dll 	nsLayoutUtils::PaintFrame(nsRenderingContext*, nsIFrame*, nsRegion const&, unsigned int, nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) 	layout/base/nsLayoutUtils.cpp:3683
4 	xul.dll 	mozilla::PresShell::Paint(nsView*, nsRegion const&, unsigned int) 	layout/base/PresShell.cpp:6490
5 	xul.dll 	nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) 	view/nsViewManager.cpp:483
6 	xul.dll 	nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) 	view/nsViewManager.cpp:415
7 	xul.dll 	nsViewManager::ProcessPendingUpdates() 	view/nsViewManager.cpp:1104

this crash signature on windows is regressing in firefox 54 builds and upwards which looks related to the landing of bug 1298218.
Hi :mstange, 
Can you help take a look at this one?
Flags: needinfo?(mstange)
Component: Layout → Layout: Web Painting
Matt, could you take a look at this issue?
Flags: needinfo?(mstange) → needinfo?(matt.woodrow)
Had a quick look. The crash is on youtube.

It looks like mClip for the affected display item is 0x8, and we're crashing trying to deference that (within ApplyNonRoundedIntersection, within GetClippedBounds).

Not sure how we'd end up with a bad clip pointer like that though. Markus is still probably the best person to look into this.
Flags: needinfo?(matt.woodrow) → needinfo?(mstange)
I'm not sure either.

mClip is written two in a few places, but there are only three non-null values that get assigned to it:
 - &mClipChain->mClip
 - DisplayItemClipChain::ClipForASR(mClipChain, aActiveScrolledRoot)
 - DisplayItemClipChain::ClipForASR(aClipChain, mActiveScrolledRoot)

In the first case, in order for &mClipChain->mClip to be 0x8, mClipChain needs to be 0x8 (because mClip is DisplayItemClipChain's first field), but mClipChain is allocated in the previous line and aBuilder->AllocateDisplayItemClipChain cannot return 0x8.

In the second case, in order for DisplayItemClipChain::ClipForASR(mClipChain, aActiveScrolledRoot) to be 0x8, either mClipChain needs to be 0x8 or the mParent pointer on one of its ancestor clip chains needs to be 0x8.
In order for mClipChain to be 0x8, aBuilder->ClipState().GetCurrentCombinedClipChain(aBuilder) needs to have returned 0x8.

That's as far as I've gotten now, more to follow later.
Assignee: nobody → mstange
Status: NEW → ASSIGNED
Flags: needinfo?(mstange)
UAF signature in many cases.  Perhaps fallout from Bug 1298218?

Markus?
Group: core-security
Flags: needinfo?(mstange)
Group: core-security → layout-core-security
Keywords: testcase-wanted
(In reply to Matt Woodrow (:mattwoodrow) from comment #3)
> Had a quick look. The crash is on youtube.

Got a specific URL?
Flags: needinfo?(matt.woodrow)
I don't think I'm supposed to post those unfortunately, even in sec bugs.

It's just a normal YouTube video url, can send it to you privately if you wish.

I had a look at the other reports and it's all over the place, not youtube specific.
Flags: needinfo?(matt.woodrow)
Markus, do you have any update on this bug?
(In reply to Markus Stange [:mstange] from comment #4)
> I'm not sure either.
> 
> mClip is written two in a few places, but there are only three non-null
> values that get assigned to it:
>  - &mClipChain->mClip
>  - DisplayItemClipChain::ClipForASR(mClipChain, aActiveScrolledRoot)
>  - DisplayItemClipChain::ClipForASR(aClipChain, mActiveScrolledRoot)
Or memory overrun in nsDisplayListBuilder::mPool? otherwise, it's seem not possible that nsDisplayItem::mClip can be 0x8. But I can not prove it since I can't repro this crash locally. It might help to add a dog tag in mozilla::ArenaAllocator?
(In reply to Astley Chen [:astley] (UTC+8) from comment #8)
> Markus, do you have any update on this bug?

Unfortunately not. I'm going to try to back out bug 1298218 from Beta.

(In reply to C.J. Ku[:cjku](UTC+8) from comment #9)
> Or memory overrun in nsDisplayListBuilder::mPool?

Not sure what you mean. If we're out of memory, Allocate aborts the process.
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #10)
> Unfortunately not. I'm going to try to back out bug 1298218 from Beta.

That seems.. scary. Specially considering we're on RC builds now. I don't know if the RelMan will accept such a large change so late.
You're right. At this point I don't really know what options we have.
I've found one DisplayItemClipChain pointer that might conceivably be a dangling pointer, i.e. it might be used across paints when the display list bulider's arena has already gone away: http://searchfox.org/mozilla-central/rev/d441cb24482c2e5448accaf07379445059937080/layout/generic/ViewportFrame.cpp#120

I'll investigate more tomorrow.
Let's get a fix; then we can decide if it will be a point release, a ride-along, or wait until 55 (and when in 55 (beta) it would land)
(In reply to Markus Stange [:mstange] from comment #13)
> I've found one DisplayItemClipChain pointer that might conceivably be a
> dangling pointer, i.e. it might be used across paints when the display list
> bulider's arena has already gone away:
> http://searchfox.org/mozilla-central/rev/
> d441cb24482c2e5448accaf07379445059937080/layout/generic/ViewportFrame.cpp#120
> 
> I'll investigate more tomorrow.

Unfortunately this idea went nowhere.
The frame property is added by nsDisplayListBuilder::MarkFramesForDisplayList and removed by nsDisplayListBuilder::ResetMarkedFramesForDisplayList in a way that's quite easy to reason about and I don't see how it might go wrong. The only way I could image it going wrong is if mFirstFrameMarkedForDisplay (which is a uint32_t) overflows, but in order for that to happen, we'd need to iterate over more than 2^32 frames during a single paint, and most of these crashes are happening in 32bit builds so that's impossible.
There seem to be two different crashes in this bug. In some of these crashes, GetClippedBoundsWithRespectToASR is called from FrameLayerBuilder::BuildContainerLayerFor, and in others it's called from nsDisplayWrapList::UpdateBounds.

The former is what comment 3 and comment 4 analyzed. But the latter is almost as frequent.

E.g. among the last 21 crashes on Beta, the distribution is as follows:
  13 BuildContainerLayerFor
  8 UpdateBounds

In the UpdateBounds crashes, it looks like there's an nsDisplayItem in the nsDisplayList whose vtable pointer is garbage. This seems like an existing problem and is unlikely to be a regression from bug 1298218. (But this is still a really bad problem to have!)
Markus, what's the next step here? Do you need help to investigate this?
Flags: needinfo?(mstange)
I've run out of ideas on what to do here. We could add more instrumentation, but doing that without performance overhead is tricky, and at the rate this is crashing on Nightly, it'd probably take at least a week between iterations.
Flags: needinfo?(mstange)
Priority: -- → P2
Whiteboard: [sec-triage-backlog]
Still crashing at about 50/day (there was a huge spike, but I presume that's unrelated to the baseline crash rate)

We should consider adding some sort of instrumentation.  Matt?  or other ideas how to find this?
Flags: needinfo?(matt.woodrow)
I'm struggling for ideas on this one. It appears to be corruption, but figuring out what is corrupting it is hard.

Looking at the stats, it seems that Windows 7 is really high (71.4% over the last 6 months).

I tried narrowing the search to include the telemetry environment not containing "ContentBackend":"Skia", and 2354 results got filtered down to 8. Of those 8, some were reporting Cairo, and others Direct2D 1.1 (but D2DEnabled:false, seems like a reporting error).

I also searched for D2DEnabled:true, and got literally 0 results (I checked the same search on all reports and got plenty of results).

That strikes me as likely meaningful, and I suspect the content backend distribution is causing the skewed platform distribution (we never get d2d on win7, and usually get it on Win10).

I have no idea how the content backend could be causing corruption during display list building though.

Crashes for bug 1406727 seem to have very similar correlations too, so it's possibly the same underlying bug.

Jeff, Markus, anybody got ideas here?
Flags: needinfo?(mstange)
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmuizelaar)
There would be the timing difference between the backends, but that's a shot in the dark.

Continuing on Markus' idea from comment 18, maybe it is time to regress performance on nightly for a week to see if extra instrumentation helps us.
I have none.
Flags: needinfo?(jmuizelaar)
Keywords: stalled
So I just took a look at bp-73b4b368-3e6e-4581-be91-90a920180831, and it looks like it could be a bit flip, or could be some other form of memory corruption.  It's crashing trying to make a virtual function call (looks like nsDisplayItem::GetBounds, I think) where the vtable pointer (in RAX) is 0x20000007FEEF7E5D.  Except for the the 3rd-highest bit being set, this looks like it could be a valid pointer...
(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #23)
> So I just took a look at bp-73b4b368-3e6e-4581-be91-90a920180831, and it
> looks like it could be a bit flip, or could be some other form of memory
> corruption.  It's crashing trying to make a virtual function call (looks
> like nsDisplayItem::GetBounds, I think) where the vtable pointer (in RAX) is
> 0x20000007FEEF7E5D.  Except for the the 3rd-highest bit being set, this
> looks like it could be a valid pointer...

Actually, given what the instruction pointers are, that one looks more like the vtable pointer itself was read one byte off, since the 0x20 at the top doesn't belong, and shifting everything left by 8 bits would make it look a lot more like a code address.  This makes sense since the this pointer (in RDI and RCX) is 0x000000000e5164a1... which is clearly not a valid pointer, but could be if you consider the low bit flipped.


I also looked at bp-f6c6f504-02d3-47d0-b88e-32fce0180831, which is crashing at the same virtual function call (again, presumably nsDisplayItem::GetBounds inside the inlined nsDisplayItem::GetClippedBounds), and this time RAX looks a lot more like a code address (0x000007fe70db2d00) but it's not a correct address within the binary.
(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #24)
> (In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #23)
> > So I just took a look at bp-73b4b368-3e6e-4581-be91-90a920180831, and it
> > looks like it could be a bit flip, or could be some other form of memory
> > corruption.  It's crashing trying to make a virtual function call (looks
> > like nsDisplayItem::GetBounds, I think) where the vtable pointer (in RAX) is
> > 0x20000007FEEF7E5D.  Except for the the 3rd-highest bit being set, this
> > looks like it could be a valid pointer...
> 
> Actually, given what the instruction pointers are, that one looks more like
> the vtable pointer itself was read one byte off, since the 0x20 at the top
> doesn't belong, and shifting everything left by 8 bits would make it look a
> lot more like a code address.  This makes sense since the this pointer (in
> RDI and RCX) is 0x000000000e5164a1... which is clearly not a valid pointer,
> but could be if you consider the low bit flipped.

And after further poking around at the memory in visual studio (dereferencing things in that memory to see what functions they are... which leads to some confusion due to identical function coalescing), since stuff looked like a vtable, I determined that Visual Studio says that nsDisplayCompositorHitTestInfo::`vftable' is 0x000007FEEF7E5DD8, which means that the this pointer in RDX/RCX, with the 1-bit unflipped, looks like it points to a valid nsDisplayCompositorHitTestInfo.  (I'm reasonably confident that's the only start-of-vtable in that 256-byte range, since the start of the range is in the middle of nsDisplayBlendContainer::`vftable' which starts at 0x000007FEEF7E5C08, and nsDisplayItem has 56 virtual functions (counting the destructor once) in 61.0.2, which seems to match the 58-word gap between those vtables pretty well.)

So I'm pretty confident in the bit-flip analysis for bp-73b4b368-3e6e-4581-be91-90a920180831, though still confused about the other one.
Thanks David, that's really useful analysis! There's quite a few open Web Painting bugs that all appear to be some form of corruption, and I suspect that they might all be related (or the same underlying bug).

I have a vague suspicion that a single bit being flipped would be a symptom of us setting a frame state bit on an invalid nsIFrame*, which then corrupts whoever now owns that memory. Display list building not only sets state bits, but also allocates big chunks of memory (using the same arena chunk size as the pres shell), so would be a likely candidate to have taken ownership.

https://bugzilla.mozilla.org/show_bug.cgi?id=1436505#c19 has some discussion on what appears to be a different outcome from maybe the same problem, where we just crash accessing an invalid frame in the frame tree.

Is it possible to look at the memory layout of nsIFrame and figure out which frame state bits (or inline bool:1s) could possibly line up with the bit that changed here?
Remember that single-bit-flips is the dominant failure mode for RAM -- especially overclocked, but with multi-GB of non-ECC ram, and with 100's of MB (or a few GB) of ram in use by the browser, you will see "random" bit-flips.  This happens enough that people bitsquat DNS addresses (http://dinaburg.org/bitsquatting.html)

That doesn't mean we can't set a 1-bit value in a previously-freed address...but my default assumption is HW flip unless there's a reason to suspect a pattern in the flips or an inordinate number hitting a specific allocation or set
That bit would be NS_FRAME_IN_REFLOW... which doesn't seem all that likely?
Marking wontfix for 62/63 since we are about to release 63.
Group: layout-core-security → gfx-core-security
Crash Signature: [@ nsDisplayList::GetClippedBoundsWithRespectToASR] → [@ mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR] [@ nsDisplayList::GetClippedBoundsWithRespectToASR]
Severity: critical → S2
Crash Signature: [@ mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR] [@ nsDisplayList::GetClippedBoundsWithRespectToASR] → [@ mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR] [@ nsDisplayList::GetClippedBoundsWithRespectToASR] [@ nsRect::SaturatingUnionEdges] [@ nsRect::SaturatingUnion]
Severity: S2 → S3

The severity field for this bug is set to S3. However, the bug is flagged with the sec-high keyword.
:mstange, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mstange.moz)

Oh wow, this is still happening, it seems. Timothy, can you take over?

Status: ASSIGNED → NEW
Flags: needinfo?(tnikkel)
Flags: needinfo?(mstange.moz)

The bug is linked to topcrash signatures, which match the following criterion:

  • Top 10 content process crashes on beta

:mstange, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mstange.moz)
Keywords: topcrash

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Flags: needinfo?(mstange.moz)

I don't see any recent crashes that are obvious crashes on the jemalloc poison value. A fair number of sig ill kind of crashes, but I don't think leaving a bug open that is this old is useful.

Status: NEW → RESOLVED
Closed: 7 months ago
Flags: needinfo?(tnikkel)
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: gfx-core-security
You need to log in before you can comment on or make changes to this bug.