Closed Bug 1261351 Opened 4 years ago Closed 3 years ago

B2G debug segfault on MOZ_ASSERT(!mOldRootNode, "Should have GCed old root node"), because undisplayed map (for handling display:none) doesn't work right for children of shadow roots

Categories

(Core :: CSS Parsing and Computation, defect, critical)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox50 --- fixed

People

(Reporter: gerard-majax, Unassigned)

References

Details

(Keywords: assertion, crash, Whiteboard: fixed-in-pine)

Attachments

(4 files, 2 obsolete files)

Hitting this when loading gaia on mulet debug. More to come ...
Attached file gdb stacktrace
Bobby, does it looks like something you can easily find why out of the stack?
Flags: needinfo?(bobbyholley)
Removing the assert, I hit:
> Assertion failure: !geckoStyleSet || geckoStyleSet->GetRuleTree() == mSource.AsGeckoRuleNode()->RuleTree() || geckoStyleSet->IsInRuleTreeReconstruct() (destroying style context from old rule tree too late), at /home/alex/codaz/Mozilla/gecko-cinnabar/layout/style/nsStyleContext.cpp:198

At shutdown.
Severity: normal → critical
Component: String → CSS Parsing and Computation
Keywords: assertion, crash
Both assertions indicate to me that somebody is holding a reference to an nsStyleContext such that it stays alive past a style reconstruction. I think this could mean either that some nodes in the tree aren't getting properly restyled (and thus keeping their old style context), or that somebody is holding an external reference to it.

The reason there are two assertions is that there are two trees, the nsStyleContext tree and the nsRuleNode tree. The former holds strong references to the latter. So after a reconstruction, we first assert that all the nsRuleNodes have been destroyed (culminating in destroying the root, since children hold strong refs to parents). If this isn't true, this means that somebody is holding something in the nsRuleNode tree alive, most likely an nsStyleContext. The second assertion confirms this hypothesis: an nsStyleContext finally gets destroyed, and when it does we notice that it references an nsRuleNode in an out-of-date tree.

And the second assertion stacktrace _also_ tells us who was holding a style context alive past the reconstruction: a mozilla::UndisplayedNode, which does indeed seem to hold a strong reference to an nsStyleContext.

I don't know enough about undisplayed nodes and the invariants of reconstruction to tell exactly what the bug is here, but presumably reconstruction should traverse the hash table of undisplayed nodes.
Flags: needinfo?(bobbyholley) → needinfo?(dbaron)
Right, thanks for this explanation. Out of curiosity, is it possible there is some relationship with what got fixed recently in bug 1253788 ?
Flags: needinfo?(bobbyholley)
(In reply to Alexandre LISSY :gerard-majax from comment #6)
> Right, thanks for this explanation. Out of curiosity, is it possible there
> is some relationship with what got fixed recently in bug 1253788 ?

This is almost certainly related to bug 1258017. But I'm not convinced it's a regression, since the new asserts may just be uncovering latent bugs.
Flags: needinfo?(bobbyholley)
Not sure if it is of any help, but tapping into AddRef/Release method, I can see the MOZ_ASSERT() gets triggered while mRefCnt is 5:
> nsRuleNode::Release(): this=0x7f2b51cd2020 mRefCnt=5
> nsStyleSet::GCRuleTrees(): This is going to be a problem, mOldRootNode=0x7f2b51cd2020
Attached file nodes_crash.log (obsolete) —
including restyle logs
I am not sure if that huge log file brings anything useful, maybe you can find the undisplayed items inside ? Anyway it's huge once uncompressed (437M)
Flags: needinfo?(bobbyholley)
The next step here is to get input from dbaron (or another style system peer) on how this is supposed to work.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (busy) from comment #5)
> I don't know enough about undisplayed nodes and the invariants of
> reconstruction to tell exactly what the bug is here, but presumably
> reconstruction should traverse the hash table of undisplayed nodes.

It does, via ElementRestyler::RestyleUndisplayedNodes.

Presumably fuzzers should be able to find a simple testcase for this (or perhaps fuzzing-related tools could reduce gaia to a simple testcase); I don't think it's worth debugging gaia directly.
Flags: needinfo?(dbaron)
(In reply to David Baron [:dbaron] ⌚️UTC-7 from comment #12)
> (In reply to Bobby Holley (busy) from comment #5)
> > I don't know enough about undisplayed nodes and the invariants of
> > reconstruction to tell exactly what the bug is here, but presumably
> > reconstruction should traverse the hash table of undisplayed nodes.
> 
> It does, via ElementRestyler::RestyleUndisplayedNodes.

I see.

> Presumably fuzzers should be able to find a simple testcase for this (or
> perhaps fuzzing-related tools could reduce gaia to a simple testcase); I
> don't think it's worth debugging gaia directly.

Well, I assume Alexandre wants to debug this.

Alexandre, the way to debug this is to figure out why we don't call ElementRestyler::RestyleUndisplayedNodes between and StyleSet::BeginReconstruct and StyleSet::EndReconstruct.
(In reply to Bobby Holley (busy) from comment #13)
> (In reply to David Baron [:dbaron] ⌚️UTC-7 from comment #12)
> > (In reply to Bobby Holley (busy) from comment #5)
> > > I don't know enough about undisplayed nodes and the invariants of
> > > reconstruction to tell exactly what the bug is here, but presumably
> > > reconstruction should traverse the hash table of undisplayed nodes.
> > 
> > It does, via ElementRestyler::RestyleUndisplayedNodes.
> 
> I see.
> 
> > Presumably fuzzers should be able to find a simple testcase for this (or
> > perhaps fuzzing-related tools could reduce gaia to a simple testcase); I
> > don't think it's worth debugging gaia directly.
> 
> Well, I assume Alexandre wants to debug this.

Yes we do, this is breaking any B2G transition debug build (mulet or device, as Alberto reported today).

> 
> Alexandre, the way to debug this is to figure out why we don't call
> ElementRestyler::RestyleUndisplayedNodes between and
> StyleSet::BeginReconstruct and StyleSet::EndReconstruct.

Perfect, I will investigate around this then!
I'm having a hard time finding a way to link calls to ElementRestyler::RestyleUndisplayedNodes with the MOZ_ASSERT()
Changing title since Alberto reported this also happens on device.
Summary: Mulet debug segfault on MOZ_ASSERT(!mOldRootNode, "Should have GCed old root node"); → B2G debug segfault on MOZ_ASSERT(!mOldRootNode, "Should have GCed old root node");
A tool like https://github.com/MozillaSecurity/lithium/ can help you make a reduced testcase, as long as you know which CSS/HTML files are involved.
We can help you more if we know how to reproduce this on our side. Is there an easy way to get mulet running?

I downloaded and ran the firefox-*.dmg (actually firefox within FirefoxNightly.app/Contents/MacOS/) in:

https://archive.mozilla.org/pub/b2g/nightly/2016/04/2016-04-05-15-02-59-mozilla-central/

and went to:

chrome://b2g/content/shell.html

but didn't seem to hit anything. Perhaps I'm doing it wrong?
(In reply to Jesse Ruderman from comment #17)
> A tool like https://github.com/MozillaSecurity/lithium/ can help you make a
> reduced testcase, as long as you know which CSS/HTML files are involved.

Good to know but as of now, I still don't know which files are involved :)
(In reply to Gary Kwong [:gkw] [:nth10sd] from comment #18)
> We can help you more if we know how to reproduce this on our side. Is there
> an easy way to get mulet running?
> 
> I downloaded and ran the firefox-*.dmg (actually firefox within
> FirefoxNightly.app/Contents/MacOS/) in:
> 
> https://archive.mozilla.org/pub/b2g/nightly/2016/04/2016-04-05-15-02-59-
> mozilla-central/
> 
> and went to:
> 
> chrome://b2g/content/shell.html
> 
> but didn't seem to hit anything. Perhaps I'm doing it wrong?

Yes, you are doing it wrong :). You need to buid a debug B2G or Mulet from pine tree https://hg.mozilla.org/projects/pine/ and a Gaia from kanikani branch.

We don't have (yet) debug builds out of pine on taskcluster :(
Crash is bypassed if I remove *both* <gaia-container> elements from homescreen/index.html:
 - https://github.com/mozilla-b2g/gaia/blob/eb053a9a544a3cd011a9cf8d6500fb96ec75a84f/apps/homescreen/index.html#L74
 - https://github.com/mozilla-b2g/gaia/blob/eb053a9a544a3cd011a9cf8d6500fb96ec75a84f/apps/homescreen/index.html#L105

So far, I could find a LOT of calls to RestyleUndisplayedNodes() between BeginReconstruct() and EndReconstruct()
(In reply to Alexandre LISSY :gerard-majax from comment #21)
> Crash is bypassed if I remove *both* <gaia-container> elements from
> homescreen/index.html:
>  -
> https://github.com/mozilla-b2g/gaia/blob/
> eb053a9a544a3cd011a9cf8d6500fb96ec75a84f/apps/homescreen/index.html#L74
>  -
> https://github.com/mozilla-b2g/gaia/blob/
> eb053a9a544a3cd011a9cf8d6500fb96ec75a84f/apps/homescreen/index.html#L105
> 
> So far, I could find a LOT of calls to RestyleUndisplayedNodes() between
> BeginReconstruct() and EndReconstruct()

Keeping index.html unchanged EXCEPT removing <link> elements pointing to window.css *and* grid.css, crash is not happening anymore.
Hacking into gaia-container's component, I can safely assert:

crash:
> --- a/apps/homescreen/bower_components/gaia-container/script.js
> +++ b/apps/homescreen/bower_components/gaia-container/script.js
> @@ -761,9 +761,7 @@ window.GaiaContainer = (function(exports) {
>  
>   var template = document.createElement('template');
>  
> -  template.innerHTML =
> -    `<style>gaia-container { position: relative; display: block; }</style>` +
> -    `<content select='*'></content>`;
> +  template.innerHTML = `<style></style><content select='*'></content>`;
>  
>    function GaiaContainerChild(element) {
>      this._element = element;

no crash:
> --- a/apps/homescreen/bower_components/gaia-container/script.js
> +++ b/apps/homescreen/bower_components/gaia-container/script.js
> @@ -761,9 +761,7 @@ window.GaiaContainer = (function(exports) {
>  
>    var template = document.createElement('template');
>  
> -  template.innerHTML =
> -    `<style>gaia-container { position: relative; display: block; }</style>` +
> -    `<content select='*'></content>`;
> +  template.innerHTML = `<content select='*'></content>`;
>  
>    function GaiaContainerChild(element) {
>      this._element = element;
Ok, so this is related to html <template> elements and their undisplayed children.

I'm helping Alexandre debug this in rr over IRC. Hopefully he should have it figured out soon.
So, from debugging this with Alexandre, it looks like the culprit is Shadow DOM content.

In particular, we end up adding a <style> element (whose parent is a shadow root) to the undisplayed content map like so:

https://pastebin.mozilla.org/8866640
https://pastebin.mozilla.org/8866641

And then we never end up traversing that with ElementRestyler, which is supposed to call RestyleUndisplayedNodes.

wchen, how is this supposed to work?
Flags: needinfo?(wchen)
(The one thing I haven't been able to figure out yet is exactly which caller adds the shadow content to the list of undisplayed items, which we use at [1]. Alexandre is going to figure that out with rr)

[1] https://hg.mozilla.org/mozilla-central/file/494289c72ba3/layout/base/nsCSSFrameConstructor.h#l800
(In reply to Bobby Holley (busy) from comment #26)
> [1]. Alexandre is going to figure that out with rr)

Ok, the backtrace is here: https://pastebin.mozilla.org/8866647

Which means that this is the code that adds the content to the list of undisplayed items:

https://hg.mozilla.org/mozilla-central/file/494289c72ba3/layout/base/nsCSSFrameConstructor.cpp#l5767

So apparently the parent frame of this shadow <template> content is a colgroup frame? The plot thickens, but I don't understand shadow DOM stuff well enough to go further.
I'm not sure why we do the call to SetAsUndisplayedContent, given at least one condition of the if() seems not true. Am I missing something obvious?

(rr) f 1
#1  0x00007fd3f6407835 in nsCSSFrameConstructor::AddFrameConstructionItemsInternal (this=this@entry=0x7fd3b7463d00, aState=..., aContent=aContent@entry=0x7fd3e6a13cc0, aParentFrame=aParentFrame@entry=0x7fd3cfefa788, aTag=0x7fd3e9676790, aNameSpaceID=3, 
    aSuppressWhiteSpaceOptimizations=true, aStyleContext=0x7fd3cfefac80, aFlags=3, aAnonChildren=0x0, aItems=...) at /home/alex/codaz/Mozilla/gecko-cinnabar/layout/base/nsCSSFrameConstructor.cpp:5767
5767	    SetAsUndisplayedContent(aState, aItems, aContent, styleContext, isGeneratedContent);
(rr) l
5762	  // Inside colgroups, suppress everything except columns.
5763	  if (aParentFrame &&
5764	      aParentFrame->GetType() == nsGkAtoms::tableColGroupFrame &&
5765	      (!(bits & FCDATA_IS_TABLE_PART) ||
5766	       display->mDisplay != NS_STYLE_DISPLAY_TABLE_COLUMN)) {
5767	    SetAsUndisplayedContent(aState, aItems, aContent, styleContext, isGeneratedContent);
5768	    return;
5769	  }
5770	
5771	  bool canHavePageBreak =
(rr) print aParentFrame
$45 = (nsContainerFrame *) 0x7fd3cfefa788
(rr) print aParentFrame->GetType()
$46 = (nsIAtom *) 0x7fd3e9672a30
(rr) print *(aParentFrame->GetType())
$47 = {<nsISupports> = {_vptr.nsISupports = 0x7fd3f9aaf090 <vtable for PermanentAtomImpl+16>}, mLength = 18, mHash = 3713396032, mString = 0x7fd3f9f58988 <flexContainerFrame_buffer+8> u"FlexContainerFrame"}
(rr) print nsGkAtoms::tableColGroupFrame 
$48 = (nsIAtom *) 0x7fd3e9673280
(rr) print *nsGkAtoms::tableColGroupFrame 
$49 = {<nsISupports> = {_vptr.nsISupports = 0x7fd3f9aaf090 <vtable for PermanentAtomImpl+16>}, mLength = 18, mHash = 416995712, mString = 0x7fd3f9f580a8 <tableColGroupFrame_buffer+8> u"TableColGroupFrame"}
(rr) 

Are nsGkAtoms::tableColGroupFrame and nsGkAtoms::flexContainerFrame equals?
I'm also hitting this when trying to print the following on Windows with latest m-c:
https://html.spec.whatwg.org/multipage/iana.html#iana
> Are nsGkAtoms::tableColGroupFrame and nsGkAtoms::flexContainerFrame equals?

No, they are not.
Right, so maybe the stuff about nsGkAtoms was bogus, I did a new rr session and I'm getting a little bit different backtrace. The nsStyleContext triggering assert at shutdown has been added to the map of undisplayed elements at https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/layout/base/nsCSSFrameConstructor.cpp#5740
Attachment #8737794 - Attachment is obsolete: true
Attachment #8737811 - Attachment is obsolete: true
William, last time we spoke you were investigating this because it was not just a leaking issue but something more serious you said. Any news? Thanks!
Flags: needinfo?(wchen)
Attached patch PatchSplinter Review
I rediscovered this bug when noticing that the testcase in bug 1264380 would crash the browser when opened in e10s and then resized.  After minimization, searching for a programmatic substitute for resizing the browser, and some debugging with btseng and dbaron we located the root cause.
Attachment #8741187 - Flags: review?(dbaron)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #34)
> Created attachment 8741187 [details] [diff] [review]
> Patch
> 
> I rediscovered this bug when noticing that the testcase in bug 1264380 would
> crash the browser when opened in e10s and then resized.  After minimization,
> searching for a programmatic substitute for resizing the browser, and some
> debugging with btseng and dbaron we located the root cause.

Still hits the assertion when trying to print as in comment 30 with this patch applied.
Comment on attachment 8741187 [details] [diff] [review]
Patch

Fixes all the crashes of my debug mulet :=)
Attachment #8741187 - Flags: feedback+
Also works on device but does not fix bug 1261612
(In reply to Alexandre LISSY :gerard-majax from comment #36)
> Comment on attachment 8741187 [details] [diff] [review]
> Patch
> 
> Fixes all the crashes of my debug mulet :=)

Nice!

(In reply to Alexandre LISSY :gerard-majax from comment #37)
> Also works on device but does not fix bug 1261612

It's not clear to me why it would.


(In reply to Bob Owen (:bobowen) from comment #35)
> Still hits the assertion when trying to print as in comment 30 with this
> patch applied.

That's more worrisome. Kyle, is it related or something entirely different?

Would still be nice to get bug 1258017 uplifted to fix the style crashes, but it's getting pretty late in the cycle...
Flags: needinfo?(khuey)
(In reply to Bobby Holley (busy) from comment #38)
> (In reply to Alexandre LISSY :gerard-majax from comment #37)
> > Also works on device but does not fix bug 1261612
> 
> It's not clear to me why it would.
> 

Well I hoped that we could be lucky and that the white was because of this bug as the previous web component bug that got fixed was breaking all styles related to them :). But it's not the case and anyway kanru found why we have that other bug, so it's okay :)
(In reply to Bobby Holley (busy) from comment #38)
> (In reply to Bob Owen (:bobowen) from comment #35)
> > Still hits the assertion when trying to print as in comment 30 with this
> > patch applied.
> 
> That's more worrisome. Kyle, is it related or something entirely different?

I don't see why crashes when printing would be related to the shadow DOM ... that should probably go in another bug.
Flags: needinfo?(khuey)
One other thought:  maybe the assertion should be changed back to an NS_ASSERTION (like the assertion it replaced) rather than a MOZ_ASSERT so that it's safer to backport?  (Isn't that why these are looking like "regressions"?)
Flags: needinfo?(bobbyholley)
Er, let's discuss that in bug 1258017.
Flags: needinfo?(bobbyholley)
Comment on attachment 8741187 [details] [diff] [review]
Patch

>+  MOZ_ASSERT_IF(!parent, !aContent->GetParent()); // No non-Elements!

Please don't use MOZ_ASSERT_IF in layout code.  (Its parameters are named backwards.  If it's called MOZ_ASSERT_IF, it should assert param1 if param2, since the name has assert before if.  Standard smalltalk naming convention, which is really the only sane way to name functions where each word in the function's name corresponds to a parameter of the function.)

r=dbaron with that unwound into a single MOZ_ASSERT
Attachment #8741187 - Flags: review?(dbaron) → review+
landed in this shape on pine to unblock other work: https://hg.mozilla.org/projects/pine/rev/e417ce4b94e4
Whiteboard: fixed-in-pine
Ok, this is above my pay grade.

The attached patch doesn't work because we're not being consistent about always skipping the ShadowRoot whenever we use the UndisplayedMap.  I tried to fix that, but ran into another problem, which is that GetParentElementCrossingShadowRoot behaves weirdly with nested ShadowRoots.  Specifically, in

<my-custom-element>
  *ShadowRoot*
     <div id=1>
       *ShadowRoot*
          <div id=2>

Calling GetParentElementCrossingShadowRoot on div2 will return my-custom-element.  Which is not what we want here.  Similarly, GetFlattenedTreeParent appears to do the same thing, and also screws around with XBL.

Since GetParentElementCrossingShadowRoot doesn't actually do what it says (what is really does is GetParentElementUnlessParentIsShadowRootThenGetHostElement) I don't know what to do here.  I also don't know whether or not the other uses of GetParentElementCrossingShadowRoot are correct or whether they are also misbehaving in the case of nested ShadowRoots.
Assignee: khuey → nobody
Flags: needinfo?(wchen)
Flags: needinfo?(bzbarsky)
Is there a standalone testcase showing the problem around somewhere?  I'm not seeing one in the bug offhand.  :(
Flags: needinfo?(khuey)
There's one in the patch.
Flags: needinfo?(khuey)
Summary: B2G debug segfault on MOZ_ASSERT(!mOldRootNode, "Should have GCed old root node"); → B2G debug segfault on MOZ_ASSERT(!mOldRootNode, "Should have GCed old root node"), because undisplayed map (for handling display:none) doesn't work right for children of shadow roots
William told me this is happening because of multiple shadow roots and that feature is gone from the spec. William will likely land a patch to remove that from our current implementation in the near future.
Flags: needinfo?(wchen)
(In reply to Andrew Overholt [:overholt] from comment #48)
> William told me this is happening because of multiple shadow roots and that
> feature is gone from the spec. William will likely land a patch to remove
> that from our current implementation in the near future.

So it means we should update gaia web components to not make use of this feature anymore I guess?
Flags: needinfo?(overholt)
(In reply to Alexandre LISSY :gerard-majax from comment #49)
> (In reply to Andrew Overholt [:overholt] from comment #48)
> > William told me this is happening because of multiple shadow roots and that
> > feature is gone from the spec. William will likely land a patch to remove
> > that from our current implementation in the near future.
> 
> So it means we should update gaia web components to not make use of this
> feature anymore I guess?

I believe the problem that Kyle was seeing was related to having multiple shadow roots on the same element with the use of a shadow insertion point. Those features have been removed in the latest spec. As far as I know, Gaia only uses nested shadow DOM (which is fine) and does not use multiple shadow roots on the same element or shadow insertion points.
Flags: needinfo?(overholt)
Depends on: 1274539
If we don't care about nested shadowroot then maybe comment 45 from kyle does not stand anymore and this is okay?
Flags: needinfo?(wchen)
Pushed by wchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f31f71abd551
Look past ShadowRoots when looking for the parent for the UndisplayedMap. r=dbaron
https://hg.mozilla.org/mozilla-central/rev/f31f71abd551
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
The patch in bug 1274539 fixes the inconsistency with using GetParentElementCrossingShadowRoot().
Flags: needinfo?(wchen)
Flags: needinfo?(bzbarsky)
You need to log in before you can comment on or make changes to this bug.