Closed Bug 1411689 Opened 6 years ago Closed 2 months ago

Crash near null in [@ nsLayoutUtils::MaybeCreateDisplayPortInFirstScrollFrameEncountered]

Categories

(Core :: Web Painting, defect, P2)

defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 - wontfix
firefox62 - wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix

People

(Reporter: RyanVM, Unassigned)

References

Details

(6 keywords)

Crash Data

Attachments

(2 files, 2 obsolete files)

Attached file index.html (obsolete) —
As originally discussed in bug 1234701 comment 5 and onward. Crashes opt builds and hits the below asserts in debug builds:

ASSERTION: frame tree not empty, but caller reported complete status: 'aSubtreeRoot->GetPrevInFlow()', file z:/build/build/src/layout/base/nsLayoutUtils.cpp, line 7881
ASSERTION: frame tree not empty, but caller reported complete status: 'aSubtreeRoot->GetPrevInFlow()', file z:/build/build/src/layout/base/nsLayoutUtils.cpp, line 7881
ASSERTION: Placeholder relationship should have been torn down already; this might mean we have a stray placeholder in the tree.: '!placeholder || nsLayoutUtils::IsProperAncestorFrame(aDestructRoot, placeholder)', file z:/build/build/src/layout/generic/nsFrame.cpp, line 757
ASSERTION: Null out-of-flow for placeholder?: 'outOfFlow', file z:\build\build\src\layout\generic\nsPlaceholderFrame.h, line 182
Has Regression Range: --- → yes
Attached file grizzly-testcase.html (obsolete) —
Another testcase found by grizzly.
See Also: → 1428892
See Also: → 1429172
Looks like I can't reproduce this now, mind confirming?
Flags: needinfo?(jschwartzentruber)
I can't repro either.
Flags: needinfo?(jschwartzentruber)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
This is reproducing consistently on m-c:
BuildID=20180529095249
SourceStamp=f01bb6245db1ea2a87e5360104a4110571265137
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Attached file testcase.html
Attachment #8922024 - Attachment is obsolete: true
Attachment #8932249 - Attachment is obsolete: true
Priority: P3 → --
Timothy, looks like you touched the code in question in bug 1364295? Can you please take a look?
Flags: needinfo?(tnikkel)
I think what is happening here is that after reflow we have a broken frame tree. The code from bug 1364295 (MaybeCreateDisplayPortInFirstScrollFrameEncountered) just walks the frame tree so it trips on a deleted frame. I get a bunch of assertions during reflow about the list of floats and the float cache not matching. So the real problem is probably in reflow somewhere.
Flags: needinfo?(tnikkel) → needinfo?(mats)
Mats -- I'd like to assign this to you for further investigation since it's a sec-crit that is now reproducing consistently.  If the problem is not reflow and you're not the right owner to fix this, please needinfo me back.  And thanks!
Assignee: nobody → mats
Priority: -- → P1
Oh, it's dereferencing a deleted frame, which is poisoned, so probably not as severe as sec crit.
I can't reproduce the crash, nor any assertions in a debug build,
using any of the three testcases attached to this bug so far.
I tried both a local ASAN Opt build and a debug build, on Linux.
(Using a profile w. the DOMFuzzHelper addon in all cases.)

Please ping me again once there is a testcase that is reproducible.
Assignee: mats → nobody
Flags: needinfo?(mats)
(In reply to Timothy Nikkel (:tnikkel) from comment #11)
> Oh, it's dereferencing a deleted frame, which is poisoned, so probably not
> as severe as sec crit.

Right, that's a non-exploitable crash.  Lowering to P3 based on that
until it's reproducible again.
(In reply to Timothy Nikkel (:tnikkel) from comment #11)
> Oh, it's dereferencing a deleted frame, which is poisoned

After looking at the crash-stats reports I don't think that's true.
The Crash Address fields there indicates UAF, e.g. 0xffffffffe5e5e5e5 in:
bp-174fb95f-c201-4868-80aa-e8eaa0180529
Severity: normal → critical
Component: Layout → Layout: Web Painting
Priority: P3 → P1
(In reply to Mats Palmgren (:mats) from comment #12)
> I can't reproduce the crash, nor any assertions in a debug build,
> using any of the three testcases attached to this bug so far.
> I tried both a local ASAN Opt build and a debug build, on Linux.
> (Using a profile w. the DOMFuzzHelper addon in all cases.)
> 
> Please ping me again once there is a testcase that is reproducible.

I can consistently reproduce this with the current testcase on an ASan opt build with a clean profile and no addons.
I am running Ubuntu 16.04.
I verified with m-c:
BuildID=20180531100449
SourceStamp=763f30c3421233a45ef9e67a695c5c241a2c8a3a

Ping me on IRC if you are still having issues.
Flags: needinfo?(mats)
Try changing the full zoom (ctrl +/-) to reproduce. That worked consistently for me.
Attached file frame dump + stack
Thanks, I was able to get a crash it in rr now.

The crash I get is a known null-pointer crash where we destroy a column
that contains a float for which the placeholder is in an earlier column
that we keep. (bug 913162, bug 913162 etc)

This should always lead to a null-pointer crash at worst though,
so it doesn't explain the UAF crashes we're seeing in the wild.
Flags: needinfo?(mats)
I guess we can mark it as dependent on bug 913162 until we solve
the Layout issue there.  I'd be interested in debugging this
again though, if we have a testcase that *isn't* a null-pointer
crash.
(In reply to Mats Palmgren (:mats) from comment #17)
> The crash I get is a known null-pointer crash where we destroy a column
> that contains a float for which the placeholder is in an earlier column
> that we keep. (bug 913162, bug 913162 etc)

Is bug 1452277 a dupe of one of these?
Looking at the "analysis.txt" attachment on that bug, I'd say yes,
it's the same underlying problem in all of those: we're destroying
the last column that contains an OOF which has a placeholder in
an earlier column that we keep.  The bug is that we should either:
a) not have returned COMPLETE status for that column, or
b) the OOF should have been "pulled up" when reflowing
   the preceding column

Such a crash should be safe though, since when we delete the OOF
we null out the pointer to it on the placeholder.  So, as I said
in comment 17, this doesn't explain the reported UAF crashes for
this signature (there are lot's of "Crash Address" values that
aren't near zero).  I'd prefer to make this bug focus on the UAF
issue, which I believe is a separate bug from bug 1452277 et al.

It'd be great if someone could fix the null crash in the meantime
though, just to rule out that as a possible cause.
(In reply to Mats Palmgren (:mats) (on vacation) from comment #20)
> Such a crash should be safe though, since when we delete the OOF
> we null out the pointer to it on the placeholder.  So, as I said
> in comment 17, this doesn't explain the reported UAF crashes for
> this signature (there are lot's of "Crash Address" values that
> aren't near zero).  I'd prefer to make this bug focus on the UAF
> issue, which I believe is a separate bug from bug 1452277 et al.

It seems like we are crashing in the do_QueryFrame call. If we called do_QueryFrame on a deleted and poisoned frame is it possible to get these UAF looking crashes? I don't claim to understand QueryFrame but it looks like we can use the mClass variable as an index into an array, so if the frame and mClass contained poison we would access an index in an array based on that poison.

> It'd be great if someone could fix the null crash in the meantime
> though, just to rule out that as a possible cause.

-> bug 1474479
Flags: needinfo?(mats)
(In reply to Timothy Nikkel (:tnikkel) from comment #21)
> If we called do_QueryFrame on a deleted and poisoned frame
> is it possible to get these UAF looking crashes?

If the PresShell is deleted and thus its arena deallocated,
then yes.

> I don't claim to understand QueryFrame but it looks
> like we can use the mClass variable as an index into an array, so if the
> frame and mClass contained poison we would access an index in an array based
> on that poison.

I don't think we use mClass as an array index in QueryFrame.
(That's only in the methods Is*Frame(), IsLeaf etc.)

There are two cases of do_QueryFrame...

In bp-174fb95f-c201-4868-80aa-e8eaa0180529 we crash in:
>   nsIScrollableFrame* sf = do_QueryFrame(aFrame);
https://hg.mozilla.org/releases/mozilla-release/annotate/03d4f76300bedeffd47c726ce7fee0221873da11/layout/base/nsLayoutUtils.cpp#l3509

Since nsIScrollableFrame isn't a concrete frame class, we'll
always take the slow path there, i.e. we'll do a virtual call to
nsIFrame::QueryFrame which crashes (and might be exploitable).
(so mClass isn't really used in this case)

A do_QueryFrame call to a nsIFrame subclass might read a mClass
value that equals kFrameIID of the target type, in which case we
simply cast to that type (unlikely, but possible). Otherwise, we
call QueryFrame here too.

In case of a target nsIFrame type that is final, then the compiler
will likely de-virtualize the QueryFrame call and we'll compare
the bogus mClass to the kFrameIIDs in that class hierarchy (in
switch statements) and we may do a bogus cast there if we're
unlucky, otherwise we'll return null.

Whether or not the non-virtual QueryFrame result is exploitable
depends on how that frame pointer is used next.  I'd guess it's
likely exploitable in most cases.
Flags: needinfo?(mats)
> In case of a target nsIFrame type that is final, ...

Actually, that's not true since we call mRawPtr->QueryFrame
and mRawPtr is of type Source, so this case won't happen
in practice since if Source is final you wouldn't need to
call do_QueryFrame in the first place.
Mats, do you think there's anything else we can do here for now?
Flags: needinfo?(mats)
I think it's hard to fix this bug without STR, but maybe
we can make it more reproducible with diagnostic asserts?

We could beef up PresShell assertions that are related to
lifetime issues into diagnostic asserts, e.g.
https://hg.mozilla.org/releases/mozilla-release/annotate/03d4f76300bedeffd47c726ce7fee0221873da11/layout/base/PresShell.cpp#l6304
The nsLayoutUtils::PaintFrame call is fairly deep into
PresShell::Paint so it seems unlikely that we enter it
with a destroyed shell.  But I don't think it would hurt
to make such assertions crash in nightly/beta.
Flags: needinfo?(mats)
Mats, or Matt, what do you think of trying the idea in comment 25? This could still happen for 65 but it is too late for 64.
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(mats)
Group: layout-core-security → gfx-core-security
Keywords: stalled
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(mats)
Crash Signature: [@ nsLayoutUtils::MaybeCreateDisplayPortInFirstScrollFrameEncountered ] → [@ mozilla::DisplayPortUtils::MaybeCreateDisplayPortInFirstScrollFrameEncountered] [@ nsLayoutUtils::MaybeCreateDisplayPortInFirstScrollFrameEncountered ]

P1 seems too high for the volume we are seeing.

Priority: P1 → P2
Severity: critical → S2
See Also: → 1488781

nsLayoutUtils::MaybeCreateDisplayPortInFirstScrollFrameEncountered is a dead signature. DisplayPortUtils::MaybeCreateDisplayPortInFirstScrollFrameEncountered is live, with 15-30 coming in for each release.

70% are EXCEPTION_ACCESS_VIOLATION_READ, and the majority of addresses are offsets of NULL. The prior diagnosis seems accurate, so until we can get a reproducer, this doesn't seem actionable.

Status: REOPENED → RESOLVED
Closed: 6 years ago2 months ago
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: gfx-core-security
You need to log in before you can comment on or make changes to this bug.