Closed Bug 1328129 Opened 7 years ago Closed 4 months ago

[e10s] Crash in nsLayoutUtils::GetDisplayRootFrame

Categories

(Core :: Layout, defect, P2)

51 Branch
All
Windows
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox50 --- unaffected
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- wontfix
firefox-esr60 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix

People

(Reporter: philipp, Unassigned)

References

Details

(5 keywords, Whiteboard: [sec-triage-backlog])

Crash Data

Attachments

(2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-6705fc6e-1fad-43b4-8acd-919662161227.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	nsLayoutUtils::GetReferenceFrame(nsIFrame*) 	layout/base/nsLayoutUtils.cpp:7149
1 	xul.dll 	mozilla::ScrollFrameHelper::GetScrolledRect() 	layout/generic/nsGfxScrollFrame.cpp:5764
2 	xul.dll 	mozilla::ScrollFrameHelper::GetScrollRange(int, int) 	layout/generic/nsGfxScrollFrame.cpp:3841
3 	xul.dll 	mozilla::ScrollFrameHelper::WantAsyncScroll() 	layout/generic/nsGfxScrollFrame.cpp:1312
4 	xul.dll 	nsLayoutUtils::GetNearestScrollableFrame(nsIFrame*, unsigned int) 	layout/base/nsLayoutUtils.cpp:2089
5 	xul.dll 	nsLayoutUtils::ExpireDisplayPortOnAsyncScrollableAncestor(nsIFrame*) 	layout/base/nsLayoutUtils.cpp:3355
6 	xul.dll 	RemoveDisplayPortCallback(nsITimer*, void*) 	layout/generic/nsGfxScrollFrame.cpp:2537
7 	xul.dll 	nsTimerImpl::Fire() 	xpcom/threads/nsTimerImpl.cpp:521
8 	xul.dll 	nsTimerEvent::Run() 	xpcom/threads/TimerThread.cpp:286
9 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1067
10 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:96
11 	xul.dll 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:301
12 	xul.dll 	_SEH_epilog4

crashes in the content process on all versions of windows are occurring with greater frequency on the beta channel since 51.0b6 - overall it's still rather low volume though. i'm not sure if there is really something in our code that caused the regression then (as there were also a handful of crashes on 51.0a1 & 51.0a2 before that), but here would be the pushlog between beta 4 and beta 6: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_51_0b4_RELEASE&tochange=FIREFOX_51_0b6_RELEASE

(marking the bug as security sensitive as a precaution, since many reports have a crashing address indicating a uaf situation)
Dan, I see that Jet is on PTO until next week. Any chance you could get the ball rolling in investigating this (or pass it off to someone else)?
Flags: needinfo?(dholbert)
This is in Async Panning & Zooming code (stacklevels 1-6), with a backtrace going up to a timer that was added back in bug 990916 (mDisplayPortExpiryTimer, which is what calls RemoveDisplayPortCallback).  kats, looks like you added this timer -- could you take a look?

Spitballing, my initial guess is that we're somehow firing this timer after frames have been destroyed -- we're missing a case where we need to clean it up (or we're cleaning it up too late). It looks like right now, we cancel the timer in ~ScrollFrameHelper (which will get invoked when the nsHTMLScrollFrame / nsXULScrollFrame destructors are invoked, since the ScrollFrameHelper is a member-variable).  Maybe that's not soon enough, though -- maybe all of the ~ScrollFrameHelper cleanup should really move to ScrollFrameHelper::Destroy?  Not sure.

(I'm also skeptical that the pushlog linked in comment 0 is related to this crash, given that we had a few instances of this crash before then -- nothing there looks particularly tied to this backtrace, IMO.)
Component: Layout → Panning and Zooming
Flags: needinfo?(dholbert) → needinfo?(bugmail)
I agree with what Daniel said in comment 2. Timothy, do you know offhand if there are scenarios where a ScrollFrameHelper's mOuter pointer might be pointing to garbage before the ScrollFrameHelper's destructor is run?

If not, then I'll move the cleanup to ScrollFrameHelper::Destroy and see if that fixes the problem.
Assignee: nobody → bugmail
Flags: needinfo?(bugmail) → needinfo?(tnikkel)
Group: core-security → layout-core-security
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #3)
> I agree with what Daniel said in comment 2. Timothy, do you know offhand if
> there are scenarios where a ScrollFrameHelper's mOuter pointer might be
> pointing to garbage before the ScrollFrameHelper's destructor is run?

Nope.

> If not, then I'll move the cleanup to ScrollFrameHelper::Destroy and see if
> that fixes the problem.

We can try this (and it seems like a good idea anyway) but I'm not convinced the problem is the mOuter pointer. In RemoveDisplayPortCallback we have

  nsLayoutUtils::RemoveDisplayPort(helper->mOuter->GetContent());
  nsLayoutUtils::ExpireDisplayPortOnAsyncScrollableAncestor(helper->mOuter);

We crash inside the ExpireDisplayPortOnAsyncScrollableAncestor call. If mOuter was a bad pointer we should crash on the previous line (I searched crashstats for crashes in RemoveDisplayPort, GetContent, DeleteProperty, all don't exist are too low volume to be this crash). Further, the first thing we do in ExpireDisplayPortOnAsyncScrollableAncestor is call nsLayoutUtils::GetCrossDocParentFrame on the passed in frame. So when we crash later on we are looking at a proper ancestor of the scroll frame that is getting its displayport removed.

If the frame tree inside one document was messed up to cause this I would expect we'd have a lot more crashes. So a guess would be that the frames for a parent document have gone away but that didn't clear the parent view pointer on the root view in the child document. So when we try to get to the parent document we walk a bad pointer.
Flags: needinfo?(tnikkel)
Good points. I looked for other crashes with "nsLayoutUtils" in the signature and 0xffffffffe5e5e5ed as the crash address (this showed up as the address for a large portion of the crashes associated with this bug). I found two other matching signatures, nsLayoutUtils::GetDisplayRootFrame (e.g. [1]) and nsLayoutUtils::GetCrossDocParentFrame (e.g. [2]). The former seems to *also* be triggered from the displayport expiry timer. The latter seems to have a few different crash stacks including stuff like imagelib (e.g. [3]).

So from all this I'm assuming the root cause is somewhere else but the APZ code seems to hit it most frequently because it often exercises codepaths that find the scrollable ancestor or walk up the frame tree across documents.

[1] https://crash-stats.mozilla.com/report/index/9fa9d46c-b149-49c7-ad75-29b5f2170105
[2] https://crash-stats.mozilla.com/report/index/e3088b73-db7c-4eef-895f-43f372161222
[3] https://crash-stats.mozilla.com/report/index/7320adac-0e36-40e0-bfcc-b2da82161231
Crash Signature: [@ nsLayoutUtils::GetReferenceFrame] → [@ nsLayoutUtils::GetReferenceFrame] [@ nsLayoutUtils::GetDisplayRootFrame]
Seems unlikely that this will fix the bug but we can land it. Although maybe I should land it in a separate, non-security bug that this one depends on?
Attachment #8824434 - Flags: review?(tnikkel)
Comment on attachment 8824434 [details] [diff] [review]
Move cancellation of mDisplayPortExpiryTimer to ScrollFrameHelper::Destroy()

If we think that we should move the mDisplayPortExpiryTimer cancel from the destructor to destroy, then shouldn't we also move everything else in the destructor? If one is a bad idea, then the others are too.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #6)
> Seems unlikely that this will fix the bug but we can land it. Although maybe
> I should land it in a separate, non-security bug that this one depends on?

Yeah, seems like a good idea.
Attachment #8824434 - Flags: review?(tnikkel) → review+
I moved the patch to bug 1329663. I didn't link it to this bug in case malicious actors are looking for patches related to security bugs, and also because I'm unconvinced that will even help with this bug.
Kats - can we close this one now?  The bug the patch landed on is marked fixed on 52 and 53.  Does this affect 51, as noted here?  If so, we should mark it fix-optional for 51 (and try to uplift on a point release), make an argument to block the release on this, or simply wontfix for 51.  It is a sec-high.
Flags: needinfo?(bugmail)
I'm not convinced that the patch in bug 1329663 will actually fix this. We don't have enough crash-stats data to confirm or refute it right now. I think this bug does affect 51 (because there are crash reports on that version) but I don't think we should block the release on it because we don't really have any idea what's causing this or how to fix it. Any attempt to fix it "for reals" will probably involve a multi-step process where we land some sort of corruption canary to investigate what's going on, and then we can fix it properly. I'm going to mark this fix-optional for 51 for now.
Flags: needinfo?(bugmail)
Thus far there are zero reports on 52 and 53, and considering it's nearing the end of the cycle I'm inclined to think that it doesn't affect those branches. Marking them as unaffected, but we can update that if we start seeing crashes on those branches.
It could just be low frequency that it doesn't appear anywhere before beta, or the people on those channels don't do the things that trigger this crash. For similar crashes where it looks like the frame tree is messed up at cross document boundaries they also didn't show up on aurora/nightly (or were very low frequency).
Agreed that it might just be low frequency. However 51 had 4 crashes on nightly and 4 crashes on aurora - so I would expect 52 and 53 to have comparable numbers. Also 51's crash rate went up in b6 - if there was something that was uplifted to 51 in b6 that caused the spike, that means that uplifted patch should already have been in 52 or 53, so I would definitely expect to see crashes there. But yeah, we should wait until 52 has some beta time to see if the crashes show up there.
No crashes so far on 52 beta builds.
There are still no crashes for the GetReferenceFrame signature on 52 beta, but there are crashes with the GetDisplayRootFrame signature, so morphing this by bug to cover those. The stack still originates in the display port removal callback.

Example crash report on 52b3: https://crash-stats.mozilla.com/report/index/db3bb95f-b96e-4801-b94b-6e6772170204
Summary: [e10s] Crash in nsLayoutUtils::GetReferenceFrame → [e10s] Crash in nsLayoutUtils::GetDisplayRootFrame
I think the AsyncPanZoomEnabled check at [1] is not really needed, we can probably remove it. That might help, since that call is part of the crash stack. Also we can add a guard against frame being null there - there is an assert already but maybe in release builds we can encounter that condition? I'll write a patch on Monday.

[1] http://searchfox.org/mozilla-central/rev/b1aadb3572eaf7d2c70e19a2ba5413809d9ac698/layout/base/nsLayoutUtils.cpp#3399
Attached patch Patch (obsolete) — Splinter Review
Attachment #8834170 - Flags: review?(tnikkel)
Comment on attachment 8834170 [details] [diff] [review]
Patch

Looks good.

But is this actually going to fix the bug? Or are we just going to crash later on in processing the RemoveDisplayPortCallback? If we're not sure we should land this in a separate bug to prevent this bug from getting confusing (not resolved but with patches landed).
Attachment #8834170 - Flags: review?(tnikkel) → review+
(In reply to Timothy Nikkel (:tnikkel) from comment #19)
> But is this actually going to fix the bug?

I don't know.

> Or are we just going to crash
> later on in processing the RemoveDisplayPortCallback?

I don't think that should happen. The rest of RemoveDisplayPortCallback (after the call to ExpireDisplayPortOnAsyncScrollableAncestor) doesn't do anything really, just uses helper->mOuter which we've already determined should be non-null.

> If we're not sure we
> should land this in a separate bug to prevent this bug from getting
> confusing (not resolved but with patches landed).

Yeah that's fair. I'll split it into another bug.
Comment on attachment 8834170 [details] [diff] [review]
Patch

I moved the patch to bug 1337388.
Attachment #8834170 - Attachment is obsolete: true
kats, it looks like we hoped your bug 1337388 might fix this issue.
Can you take a look how things worked out?
There hasn't yet been a beta build with the fix. Need to wait more.
52.0b6 has the fix and so far hasn't shown any of the GetDisplayRootFrame crashes. So that's looking good. The previous betas had significant numbers of the crash, so if b6 has zero I think we can safely say it is fixed.

52.0b1 	70 	34.83 %
52.0b2 	45 	22.39 %
52.0b3 	40 	19.90 %
52.0b4 	29 	14.43 %
52.0b5 	17 	8.46 %
52.0b6  (zero so far)
b6 and b7 do still have some crashes with the @GetDisplayRootFrame signature, but they no longer go through the APZ code, and they have a different crash address. For example:

https://crash-stats.mozilla.com/report/index/9534c54b-e4c0-49dd-a86d-569742170221
https://crash-stats.mozilla.com/report/index/ddebb794-10ef-477b-be6c-2dde62170215

It looks like the underlying root problem (the cross-doc parent pointer is bad) is still there, but since we've avoided the displayport-expiry code from hitting it, it's getting hit from elsewhere instead (albeit at a much lower frequency).

Consider the root problem is that the cross-doc parent pointer is bad, this seems like a layout problem.
Assignee: bugmail → nobody
Component: Panning and Zooming → Layout
Jet, can you find someone to continue investigating?
Flags: needinfo?(bugs)
(In reply to Frederik Braun [:freddyb] from comment #26)
> Jet, can you find someone to continue investigating?

This smells like raciness at document and/or window destruction, and we're traversing Layout frames that have gone away. 

Matt: do you have ideas for bulletproofing this?
Flags: needinfo?(bugs) → needinfo?(matt.woodrow)
This is probably easiest for Timothy to keep looking at, as he knows View lifetimes well.

I had a look at some of the crash dumps though:

In the initial crash from comment 0, it appears that nsIFrame::GetView (within an inlined GetCrossDocParentFrame call) is returning a deleted pointer.

nsIFrame should be poisoned, so if we're returning a valid pointer from GetView(), then I assume the root frame must still be alive.

We call GetParent() on the returned view, which returns 0xe5e5e5e5 (the jemalloc poison value), and then we crash trying to call GetParent() on that.

This is similar to what tnikkel suggested in comment 4, but it seems that we destroyed the root view and didn't remove it from the frame?

For the first crash in comment 25:

We're crashing on the f->PresContext()->FrameManager()->GetRootFrame() call, looks like the PresShell has been deleted (and poisoned) so the return of FrameManager() is 0xe5e5e500.

I think we skip PresShell arena poisoning when we tear down the entire PresShell? So I guess 'f' is dead too here (but not poisoned) along with the PresContext/PresShell/FrameManager, but it takes a few derefs before we hit memory that's been poisoned.

I guess this could be similar to above, except we manage to deref our way through the dead nsViews and return a valid-looking nsIFrame from GetCrossDocParentFrame.

For the second crash in comment 25:

Crashing on f->HasAnyStateBits(NS_FRAME_IS_POPUP) because f is garbage (0xb4b4b2b3).

Again, looks like we returned something invalid from GetCrossDocParentFrame.


Any ideas Tim?
Flags: needinfo?(matt.woodrow) → needinfo?(tnikkel)
Also, should nsView be in the PresShell arena and be poisoned when we release it?
(In reply to Matt Woodrow (:mattwoodrow) from comment #29)
> Also, should nsView be in the PresShell arena and be poisoned when we
> release it?

We do poison views when we destroy them

https://dxr.mozilla.org/mozilla-central/rev/e03e0c60462c775c7558a1dc9d5cf2076c3cd1f9/view/nsView.cpp#184

but it can only be so effective because they aren't allocated in an arena so the allocator can do whatever with that memory after we release it.

We could allocate views in an arena but it may not help because all the views that are a problem here seem to be root views, and hence they would likely only be going away with the presshell, so if we skip poisoning like you say when tearing down the whole presshell that wouldn't do any good.


I'm working on a few patches that make it impossible to hit certain kinds of problems with nsViews. I don't know if we even hit these problems in practice and if we do if it would help with these crashes though. My hope is that eventually one of the speculative patches will fix these crashes.
Flags: needinfo?(tnikkel)
Too late for firefox 52, mass-wontfix.
Crashes in 53b8
Given status, fix-optional fro 53 (and likely won't be).  10 53bN crashes in the last week
(In reply to Timothy Nikkel (:tnikkel) from comment #30)
> ...
> I'm working on a few patches that make it impossible to hit certain kinds of
> problems with nsViews. I don't know if we even hit these problems in
> practice and if we do if it would help with these crashes though. My hope is
> that eventually one of the speculative patches will fix these crashes.

Are these patches happening here or in another bug?
I'm asking because this bug has been inactive for almost 2 weeks.
Flags: needinfo?(tnikkel)
(In reply to Frederik Braun [:freddyb] from comment #33)
> (In reply to Timothy Nikkel (:tnikkel) from comment #30)
> > ...
> > I'm working on a few patches that make it impossible to hit certain kinds of
> > problems with nsViews. I don't know if we even hit these problems in
> > practice and if we do if it would help with these crashes though. My hope is
> > that eventually one of the speculative patches will fix these crashes.
> 
> Are these patches happening here or in another bug?
> I'm asking because this bug has been inactive for almost 2 weeks.

They don't have a bug yet, they are very early. I've been swamped, and am going to be continue to be swamped. I have had zero time to work on them.
Flags: needinfo?(tnikkel)
Matt, kats, can one of you take over this sec-high bug from Tim?
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(bugmail)
At this point I don't have any ideas on how to fix this or even what to do here.
Flags: needinfo?(bugmail)
Neither do I sorry.
Flags: needinfo?(matt.woodrow)
Whiteboard: [sec-triage-backlog]
:jesup - verified disabled for 58 was an off by one error?
Flags: needinfo?(rjesup)
(In reply to Milan Sreckovic [:milan] from comment #39)
> :jesup - verified disabled for 58 was an off by one error?

Dangers of the "page down" button on a dropdown...
Flags: needinfo?(rjesup)
Priority: -- → P2
Hi Jet:

I have assigned these security bugs to you to reassign them to appropriate developers in your team to investigate and fix them.

Thanks!

Wennie
Assignee: nobody → bugs
Jet - can you reassign this?  Thanks
Flags: needinfo?(bugs)
(In reply to Randell Jesup [:jesup] from comment #42)
> Jet - can you reassign this?  Thanks

I no longer see crashes for nsLayoutUtils::GetReferenceFrame, but still see low-volume crashiness for nsLayoutUtils::GetDisplayRootFrame

We need STR or a URL here.
Flags: needinfo?(bugs)
A few possible URLs, though I suspect it's more "what ads were there" or "timing", or the crash itself isn't given to the actual url that caused it.

5 from http://aray.derpacif.com/contable/main.php but all on 50.0, comment "NO PUEDO TRABAJAR"; code is  nsIFrame* parent = GetCrossDocParentFrame(f);

https://www.youtube.com/watch?v=eACohWVwTOc  (seen twice on the same system a few minutes apart, 59b3)

pretty much all over the map on URLs, even a couple of localhost:8080 or 7070's.  We even have about:newtab listed a few times (including 59b10, on  f = f->PresContext()->FrameManager()->GetRootFrame();), which tells me this is one of the many reports where URL is not very useful, if at all.
:mattwoodrow – I still see crashes for 61 in nsLayoutUtils::GetDisplayRootFrame. But given the history of investigation here, is it fair to mark this as stalled for now?
Flags: needinfo?(matt.woodrow)
Keywords: stalled
I agree that stalled is the right status for this bug.
Flags: needinfo?(matt.woodrow)
So I debugged bp-641c7e92-e5ae-41b1-b841-2e7be0180824 a little bit.  It looks like f (and aFrame, if I believe the debugger) are 0x0000000007BA2020 while aFrame->mComputedStyle.mRawPtr is 0x000000BD0BAD9C08, and dereferencing the latter is crashing.  That makes it seem (at least assuming I believe f == aFrame) like there's a bad frame pointer in the pres shell's mDirtyRoots list.
I also debugged bp-1be7f209-3d8e-4801-a823-6ba550180826.  The debugger again showed f == aFrame (and it again seemed plausible, in that the frame coming from the stack wasn't recently touched in ways that would be sure to crash, like calling virtual functions).  So I looked at the disassembly in more detail.  The argument aFrame is passed to the function in RCX, but then maintained during the function in RSI.  Then f is maintained in RBX.  The return value is returned in RAX.  I read much of the function (though not the full inlining of IsPopup() and GetCrossDocParentFrame()) and everything seems consistent with that.  So I'm reasonably confident that seeing RBX and RSI as the same value means we're crashing the first time through the loop, and means GetDisplayRootFrame was passed a bad input rather than reaching that bad frame by traversing across GetCrossDocParentFrame.
Anyway, the crashes I looked at feel like "this is the first time we're touching this data structure in a bit, and it's corrupt" type crashes.  Such crashes could be real memory corruption bugs, or could be bad RAM.  It's hard to know, though when they're low frequency like this, bad RAM is plausible.

I suspect (given the earlier analysis) also that the crashes we're seeing at these signatures aren't the same as what we were seeing when these bugs were filed -- that's why it's really important to categorize crash bugs by more than just the signature.
Assignee: bugs → svoisen
Assignee: sean → nobody
Severity: critical → S2

Low crash volume, decreasing severity, -> S3.

Severity: S2 → S3

The severity field for this bug is set to S3. However, the bug is flagged with the sec-high keyword.
:dholbert, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dholbert)

(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #52)

:dholbert, could you consider increasing the severity of this security bug?

Comment 49 has the conclusions from the most thorough investigation that we've got, where the working theory was that this could easily just be cases of bad RAM.

So: given that (and given that earlier investigations here stalled out as well), I don't think we should consider this high-severity, and we might want to drop the security rating, unless we get something to disprove the "bad RAM" theory (e.g. a substantial increase in crash volume, or reliable STR, or a testcase that can reliably reproduce this).

Also, side note: some of these are likely similar to what tnikkel described in bug 1708808 comment 6; i.e. it looks like we've got some corruption in the frame tree from something that went wrong earlier (or possibly just from bad RAM, as noted above) and we're doing a whole-tree walk which makes us trip over it. e.g. bp-32b3dbe2-66f8-4133-9cf4-562300220802 seems to be a case of that, where we're in a recursive call to MaybeCreateDisplayPortInFirstScrollFrameEncountered (which walks the frame's child list at the end, which potentially results in a whole-frametree walk).

Flags: needinfo?(dholbert)

While this is still happening at a low rate (56 in 6 months, mostly on old ESR versions) we're working to close unactionable stalled bugs and until we have a reproducer this isn't actionable and may not be valid

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: layout-core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: