Querying a XBL child in a Custom Element connectedCallback function confuses the prototype walk and breaks things (was: "<radiogroup> inside a <wizard> breaks the wizard")

NEW
Unassigned

Status

()

defect
P3
normal
10 months ago
3 months ago

People

(Reporter: darktrojan, Unassigned, NeedInfo)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(7 attachments, 1 obsolete attachment)

Since bug 1481949, I've got a weird interaction between two custom elements.

In [1] we have some <radiogroup>s inside a <deck>, which is anonymous content of a <wizard>. It appears that only the <wizardpage> elements that contain the radio groups become children of the deck. The others can't be displayed by setting selectedIndex or selectedPanel on the deck, nor do they appear in the devtools inspector.

If there are no radiogroups, this doesn't happen.

On Mac and Windows, this doesn't happen.

Thunderbird Daily 2018-10-02 has this bug, go to File > New > Calendar to see the wizard.

[1] https://searchfox.org/comm-central/source/calendar/resources/content/calendarCreation.xul#14
There should be four <wizardpage>s here, but there are only two.
Posted file test.xul
Further investigation reveals:

* This doesn't happen if there's only one radiogroup.
* Only wizardpages after the one containing the second radiogroup are "missing".

In this simplified test case, pages 1, 2, and 3 work fine, but 4 and 5 do not.
Hm, that's concerning. We've seen issues where CE don't get upgraded (bug 1470242), but we haven't seen missing DOM elements due to a CE migration before. Especially odd that it's platform-specific. This may be related to XBL slotting in the CE. Though I'm curious, does the problem go away if you:

(1) Empty out the radio group
(2) Make the inner radio children not self closing (<radio></radio> instead of <radio />)

Also, do you see any errors in the Browser Console?
Flags: needinfo?(geoff)
Emptying the radiogroups worked but the others didn't.
Flags: needinfo?(geoff)
Blocks: 1495859
A couple notes:

- document.querySelectorAll("wizardpage") returns the right number of pages (5)
- if I comment out all the code in the radiogroup connectedCallback it works. It appears to be related to the call to this._getRadioChildren: https://searchfox.org/mozilla-central/rev/a11c128b90ea85d7bb68f00c5de52c0794e0b215/toolkit/content/widgets/radio.js#120
- For some reason the call to `[...this.querySelectorAll("radio")]` seems related. If I restore the previous createTreeWalker behavior for collecting radio children the problem seems to get fixed: https://hg.mozilla.org/mozilla-central/file/85c0ed4ef9c0/toolkit/content/widgets/radio.xml#l247
(In reply to Brian Grinstead [:bgrins] from comment #7)
> - For some reason the call to `[...this.querySelectorAll("radio")]` seems
> related. If I restore the previous createTreeWalker behavior for collecting
> radio children the problem seems to get fixed:
> https://hg.mozilla.org/mozilla-central/file/85c0ed4ef9c0/toolkit/content/
> widgets/radio.xml#l247

Nevermind about the createTreeWalker change fixing it. After double checking the broken behavior still persists even when collecting radio children that way.
I was wondering if converting <radio> to a CE (bug 1495861) would fix this, but that seems to make it worse - with that patch applied, only one wizardpage shows up. We'll need to figure out what's going on here before moving forward on that.
Blocks: 1495861
I don't yet know why, but deferring connectedCallback logic from running until DOMContentLoaded happens fixes the problem. This is something I've thought about doing anyway. I'd still like to know what's going wrong though.
I also noticed that in the connectedCallback during parse, if the <radio>s are Custom Elements then `this.innerHTML == ""`, which is the same behavior I see on a normal HTML page. But if they aren't a Custom Element (XBL or no binding), then the children do appear in the connectedCallback i.e. `this.innerHTML == "<radio value=\"foo\" label=\"foo\" selected=\"true\"/>"`
I've got as reduced a case as I could at https://phabricator.services.mozilla.com/D7619. It doesn't rely on the builtin <wizard> or <radiogroup> elements. As best as I can tell, this only reproduces if the following are true:

1) The child of the custom element must be a <radio> tag and must have a XBL binding with a <constructor> (no <content> needed). If I change the tag name to something else (even with the same XBL binding) it doesn't reproduce.
2) We have to access the radio children from JS in the connectedCallback i.e. `[...this.querySelectorAll("radio")]`.
3) The connectedCallback has to run before DOMContentLoaded happens.

I have a patch up that changes our built in CE base class to defer connectedCallback from running until after DOMContentLoaded to work around this, but I'm not confident this will fix all cases since I'm not sure what the root cause is. Smaug, given the reduced case above could you help figure out why this is failing and recommend if we should land a platform fix for it?
Flags: needinfo?(bugs)
Priority: -- → P3
Can you confirm this bug report is fixed now as of Bug 1496425? Since we aren't accessing child nodes in the radiogroup connectedCallback anymore I suspect it will be. Even if so, I'd like to leave this open so we can figure out best way to avoid it from happening in the future.
Flags: needinfo?(geoff)
Writeup of what's going on incoming.
So I took a look at this, the reason it happens is because the XUL prototype walking code is not really sound (yay).

In particular, the third kid is bound to the document tree here:

  https://searchfox.org/mozilla-central/rev/28fe656de6a022d1fc01bbd3811023fca99cf223/dom/xul/XULDocument.cpp#1601

By then, due to the previous connectedCallback, the parent already has its XBL binding loaded:

(rr) p aParent->NodeInfo()->NameAtom()->GetUTF16String()
$16 = 0x7f615e70e130 u"wiz"
(rr) p aParent->GetXBLBinding()
$17 = (nsXBLBinding *) 0x7f61627ec970

But of course we're not notifying of the insertion (bug 370111, or something), so the binding manager and such has no chance to update the insertion point of the new kid, which means it remains uninserted.

Now, why have we loaded the XBL binding even? See the stack I just attached, which corresponds to the following JS stack:

(rr) p DumpJSStack()
0 next() ["self-hosted":564]
    this = [object Array Iterator]
1 connectedCallback() ["chrome://mochitests/content/chrome/toolkit/content/tests/chrome/test_custom_element_in_xbl.xul":15]
    this = [object XULElement]

This is just our connectedCallback stuff! So that explains why querySelector() is needed and such, which is that it causes the reflector for the <radio> to be created, and we try to load the binding synchronously.

So we're in two layers of sync binding loads here... The querySelector() call causes the binding for the <radio> to be loaded via WrapObject.

But the attached handler of that same binding causes the constructor to run. And since there is a constructor, we need to construct the scope chain for it to run script. This explains why the empty <constructor> is needed. The <wiz> element is part of the scope chain, so we create a wrapper for it as well, which causes us to load its own binding, which causes us to run all the children assignment stuff which ends up causing the bug.
See Also: → 370111
So, comment 19 is the explanation of how this ends up happening. Thanks for the reduced test-case Brian, was incredibly helpful to figure out what's going on!

Anyway, now about how to fix it... I'm not sure I'm the best to make that call. As a workaround not creating any reflectors for elements with XBL bindings from connected callbacks works, I guess...

I suspect we don't really want to look into fixing up the XULDocument stuff / bug 370111? Though ICBW...

Another possible hack, which would allow us to get rid of GetBindingURL and such is to not do that GetBindingURL stuff is the document is not interactive, for example...

But that still wouldn't fix the issue of the .children caching you CCd me on today.
Boris, maybe you have some thoughts on all this? This is a pile of action-at-a-distance stuff, see comment 19 and the various attachments I posted before.

Feel free to just cancel the ni? request if you don't have much time to look at this :)
Flags: needinfo?(bzbarsky)
Given this, I think it's now safe to say that we should pretty much always use delayConnectedCallback (https://searchfox.org/mozilla-central/rev/28fe656de6a022d1fc01bbd3811023fca99cf223/toolkit/content/customElements.js#38-57) unless if we know for sure that the element won't have children or if the connectedCallback doesn't do anything with children.

That's probably best practice anyway since in HTML docs we can't expect childNodes to be there during parse. I do worry a bit that there could be some weird issues where a xbl binding gets constructed due to initial layout but we haven't processed delayed connectedCallbacks yet, and then the binding expects a child (CE) to be in the connected state. We could potentially process delayed connectedCallbacks before initial layout instead, but I think that could chip away at some of the perf gains in Bug 1496425 (specifically with hidden nodes where we really don't need to be in a rush to initialize them).
(In reply to Brian Grinstead [:bgrins] from comment #15)
> Can you confirm this bug report is fixed now as of Bug 1496425? Since we
> aren't accessing child nodes in the radiogroup connectedCallback anymore I
> suspect it will be. Even if so, I'd like to leave this open so we can figure
> out best way to avoid it from happening in the future.

I can. Thank you.
Flags: needinfo?(geoff)
> Boris, maybe you have some thoughts on all this?

Hmm.   The prototype walk can run scripts directly when it hits a nsXULPrototypeNode::eType_Script thing, right?  So in theory it should be ok with random script like connected callbacks running.   :(

If we just had a <script> that tried to touch the elements in that testcase, with no custom element stuff in  sight (but still with the relevant xbl bindings), would that have the same problem?
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #24)
> > Boris, maybe you have some thoughts on all this?
> 
> Hmm.   The prototype walk can run scripts directly when it hits a
> nsXULPrototypeNode::eType_Script thing, right?  So in theory it should be ok
> with random script like connected callbacks running.   :(
> 
> If we just had a <script> that tried to touch the elements in that testcase,
> with no custom element stuff in  sight (but still with the relevant xbl
> bindings), would that have the same problem?

Yes, the following patch on top of the test also fails, for example, and commenting out the scripts makes it pass.
Assignee: nobody → emilio
Assignee: emilio → nobody
> Yes, the following patch on top of the test also fails

OK.  So it's just all broken, good.

So I think our basic options are: 

1) Notify when appending during the prototype walk.  This is actually my preferred fix if we can get away with it without other functionality regressions or performance problems.  Just get rid of the weird and have XUL act more like HTML here.   Yes, this means fixing bug 370111.  Importantly, this option does not incur new technical debt.

2) Avoid force-attaching bindings when creating reflectors during the prototype walk.  Once the prototype walk is done we'll try frame construction, which _should_ instantiate all the bindings.  One obvious problem is that things in display:none subtrees will stop getting their bindings attached, which is not great.  Once XBL is gone we can remove all this stuff, but in the meantime it's more complexity and an extra possible footgun.

3) Work around in the connected callbacks to avoid touching the DOM nodes in question, then back that out once we get rid of XBL.  Again, more technical debt and complexity, but may be quickest.

If we need a quick-fix, I guess we do some variant of #3.  If we're willing to invest a bit of work, I think we should do #1 (possibly after doing #3 as a quick-fix).
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #26)
> > Yes, the following patch on top of the test also fails
> 
> OK.  So it's just all broken, good.
> 
> So I think our basic options are: 
> 
> [...]

Yeah, I agree that #1 looks nicer. Though given how broken incremental updates to XBL insertion points generally are, I wouldn't be surprised if we ended up having to do #3 regardless...
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #26)
> If we need a quick-fix, I guess we do some variant of #3.  If we're willing
> to invest a bit of work, I think we should do #1 (possibly after doing #3 as
> a quick-fix).

Agreed that we should do #3 immediately, given this and other bugs like `.children` getting out of date (https://bugzilla.mozilla.org/show_bug.cgi?id=1476769#c5). I've already started to spread this around to the team working on CE conversions, but we can / should improve documentation around https://searchfox.org/mozilla-central/rev/eef79962ba73f7759fd74da658f6e5ceae0fc730/toolkit/content/customElements.js#39-43 to explain the additional reasoning here.

As for (1), I don't know how much work that is, but just to give some context on timing I'm hoping we can ship browser.xhtml soon (say, a few months). We're getting relatively close to green tests and are starting to dig into perf now (which initially appears mostly affected by layout happening earlier in XHTML vs XUL). There's obviously a long-tail of XUL documents in tree, but if we get the main browser window off of it I'm assuming others would be relatively easier to convert.
> As for (1), I don't know how much work that is

Changing "false" to "true" in the places that append the kids (2 or 3 of them in XULDocument.cpp), then fixing the fallout, if any.  That last bit is hard to predict.
I wonder why Tp5 is affected...

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #26)

Yes, the following patch on top of the test also fails

OK. So it's just all broken, good.

So I think our basic options are:

  1. Notify when appending during the prototype walk. This is actually my
    preferred fix if we can get away with it without other functionality
    regressions or performance problems. Just get rid of the weird and have XUL
    act more like HTML here. Yes, this means fixing bug 370111. Importantly,
    this option does not incur new technical debt.

Just to add some extra context here, we have a patch up in https://bugzilla.mozilla.org/show_bug.cgi?id=1527977 that will also allow chrome XHTML to use the prototype cache so the parsing / walking would move out of XULDocument and into a new class. So browser.xhtml would match the behavior we have with XUL doc. We may (probably?) want to still make this change if we can performance-wise just so we more closely align with the web-exposed behavior, but we do have a workaround for now.

  1. Work around in the connected callbacks to avoid touching the DOM nodes in
    question, then back that out once we get rid of XBL. Again, more technical
    debt and complexity, but may be quickest.

This is done now - if a Custom Element is going to touch the rest of the DOM we are using delayConnectedCallback() to wait until DOMContentLoaded to run the function: https://searchfox.org/mozilla-central/search?q=delayConnectedCallback&path=.

No longer blocks: 1495622
See Also: → 1495622
Summary: <radiogroup> inside a <wizard> breaks the wizard → Querying a XBL child in a Custom Element connectedCallback function confuses the prototype walk and breaks things (was: "<radiogroup> inside a <wizard> breaks the wizard")
No longer blocks: 1495861
See Also: → 1495861
Attachment #9014073 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.