Closed Bug 1426797 Opened 2 years ago Closed 2 years ago

image-size.xul is going to permafail on Windows when Gecko 59 merges to Beta on 2018-01-11 (fails in presence of -moz-window-inactive selector w/ release milestone.txt)

Categories

(Core :: Layout, defect, P1, critical)

All
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 --- unaffected
firefox57 --- unaffected
firefox58 --- unaffected
firefox59 + wontfix
firefox60 --- fixed

People

(Reporter: RyanVM, Assigned: emilio)

References

Details

Attachments

(1 file)

Due to a lot of bustage issues, I can't easily bisect what caused this. I think it's going to need some good old fashioned debugging :(. It appears to be Windows-only.

https://treeherder.mozilla.org/logviewer.html#?job_id=153065715&repo=try

The patch below should suffice for reproducing it:
https://hg.mozilla.org/try/rev/0154852baa88e427a1957e67ec1fd9d3e0acd80b
Though maybe the Windows-only crashtest asserts I'm also seeing are shedding some light on it?

https://treeherder.mozilla.org/logviewer.html#?job_id=153082682&repo=try

WARNING: stylo: requesting a Gecko declaration block?: file z:/build/build/src/layout/style/ServoBindings.cpp, line 453
WARNING: stylo: requesting a Gecko declaration block?: file z:/build/build/src/layout/style/ServoBindings.cpp, line 469
ASSERTION: Placeholder relationship should have been torn down already; this might mean we have a stray placeholder in the tree.: '!placeholder || nsLayoutUtils::IsProperAncestorFrame(aDestructRoot, placeholder)', file z:/build/build/src/layout/generic/nsFrame.cpp, line 767
Flags: needinfo?(xidorn+moz)
I checked locally with 366583-1.xul on m-c (for the failure in comment 1), and it raises the same stylo warning but no assertion. I had a look at why the warning is raised, and it seems that is because of <scrollbar> having gecko declaration likely due to xul prototype element model, and it is unrelated to the failure (the additional assertion). That warning should go away when we enable stylo-chrome.

And for the failure of 366583-1.xul itself, I have no idea. Looking at the code logic, I cannot think of anything going wrong there... Maybe someone would need to do build locally and attach a debugger to it to see what's happening.


And, I don't think the assertion failure of 366583-1.xul is anything related to the failure of image-size.xul.

As for the failures in image-size.xul, it seems to be related to the sizing of <xul:image>. Somehow when width or height is not specified explicitly, the image completely collapses in that direction. This is likely an issue in nsImageBoxFrame, but can also be an issue in style system if it provides some weird value for corresponding properties.

Does the issue happen in stylo-disabled task as well? If not, it could be an issue in style system, then...

(cc dholbert, since I'm seeing the sizing code of nsImageBoxFrame is mainly written by him, so he may have some other idea about why that's happening.)
Flags: needinfo?(xidorn+moz)
stylo-disabled tests only run on Linux AFAICT
dholbert, is this something you can take on in the next week?
Flags: needinfo?(dholbert)
Yeah, I'll take a look.
I needed one more tweak to get that Try run to get past "./mach build check" -- updated Try run:
 https://treeherder.mozilla.org/#/jobs?repo=try&revision=e24ed15199a35e179a454a2e84542c69fce86cb6&selectedJob=153935571

That try run does still show the issue. I can reproduce it locally, too, with...
  ./mach reftest layout/xul/reftest/image-size.xul

If I load the XUL file directly in a local nightly-as-beta build (after setting dom.allow_XUL_XBL_for_file = true), I can see a *brief* flash of the broken rendering that's captured in the reftest snapshot, and then it very quickly snaps to the correct[1] rendering.  So I think this is a case where we get the rendering correct[1] but we paint an incorrect partial rendering on the way, and the reftest harness is capturing that incorrect rendering for some reason.

Tentatively, I want to call this a WebRender bug (or possibly a reftest-harness/webrender interaction bug)...

[1] (The final "correct" rendering that we settle on actually has a slight brokenness at the top-right -- there's a tiny green rectangle with a blue rectangle inside of it, and the green rectangle is a bit wider in the reference case. But I see that difference even in unmodified Nightly builds on Windows -- I think it's a rounding issue specific to webrender, and it's what the test's "fuzzy-if(webrender,...)" annotation is meant to cover.)
CC'ing kats and aosmond who each touched the reftest.list line for this testcase during the scope of some webrender work (in bug 1389000 and bug 1183378 / bug 1420279, respectively).
Flags: needinfo?(dholbert)
(In reply to Daniel Holbert [:dholbert] from comment #8)
> [1] (The final "correct" rendering that we settle on actually has a slight
> brokenness at the top-right -- there's a tiny green rectangle with a blue
> rectangle inside of it, and the green rectangle is a bit wider in the
> reference case. But I see that difference even in unmodified Nightly builds
> on Windows -- I think it's a rounding issue specific to webrender, and it's
> what the test's "fuzzy-if(webrender,...)" annotation is meant to cover.)

(Actually -- it looks like this tiny misrendering on my local machine was due to Win10's default 125% HiDPI scaling. If I reset that to 100%, then I don't see this issue.  And it also doesn't make sense to blame this on webrender because it seems webrender is still disabled by default -- I'd misremembered it being enabled on Nightly on some platform.)
"reftest-wait" plus a small setTimeout does seem to fix the issue, FWIW:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=0cede1653008e2018b857b4f230025f9ba187cc5
OK, I was able to "hg bisect" this locally on Win10, and I determined that it broke starting with this commit for bug 1416493:
 https://hg.mozilla.org/mozilla-central/rev/5f57907f645b


Its parent 87368e9f8bad (plus the same config/milestone.txt tweak) does not show the problem.
That commit 5f57907f645b (plus a config/milestone.txt tweak to drop "a1") shows the problem.

Additionally, there's a performance regression for this specific reftest associated with that commit.  Before bug 1416493, this reftest takes 140ms-150ms.  After that bug, it takes 190ms-210ms.  Measurements taken on my laptop using this command:
  ./mach reftest layout/xul/reftest/image-size.xul  | grep "REFTEST"

(Occasionally -- less than 20% of the time -- a post-regression build will pass the reftest, but still take 190-210ms. So the perf regression seems to be reliable, even if the rendering regression isn't 100% reliable.)

The connection to bug 1416493 is somewhat believable, because that bug does make some tweaks to platform-specific XUL files (adding an @import for a platform-specific "menu.css" file).  My current theory is that that new "@import" makes XUL loading take a little bit longer (hence the perf regression), and we get a flash of unstyled/partially-styled content as a result (due to losing a race with some piece of the reftest harness, or something).  I'm not yet sure why this only happens in release builds (with the config/milestone.txt tweak), though.
Blocks: 1416493
FYI I'm not sure if this would change anything from Comment 12, but I'm working on importing menu.css differently in Bug 1420229, such that the @import will happen inside a new UA sheet loaded via nsDocumentViewer.cpp / nsLayoutStylesheetCache.cpp instead of in global.css (https://hg.mozilla.org/try/rev/1337c7dcf641a1cfd35201e1b3a0cb09f1d99a7d)
Thanks! That might help with the test failure.

But regardless, there's a legit bug here. Specifically, the reftest snapshot captures the wrong thing if this testcase has a "-moz-window-inactive" selector, *and* we have a release-build milestone.  (Those specific two things. Bug 1416493 gave us the second thing on Windows, because the menu.css file for Windows happens to have a -moz-window-inactive selector.)

Here's a Try run which makes those two tweaks, on last night's mozilla-inbound, which hits this reftest failure on all platforms:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a34d6ff18b91ab2271901574b80e4540a31af8d

And for comparison, here's a Try run with *just* the "-moz-window-inactive" tweak, and without the milestone tweak:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=c5542191dd5dc09559ba3593c9363764ad6d4823
(No instances of this reftest failure there).

Also -- I don't have a Try run to prove it, but local testing shows that this is only a problem when stylo is enabled. So this is something wrong with the relatively-new stylo support for "-moz-window-inactive", I think.  (This is also *not* bug 1428164 (about the same selector), because my Try runs that I linked in this comment *do* have that bug's patches, and yet they still fail.)

--> Kicking this back over to xidorn.  (or maybe emilio, who's got this selector code fresh in his mind from bug 1428164)

Worst case, if we can't get this fixed before the merge on Thursday [I think], I'm sure we could annotate the test as random for beta (since it doesn't fail 100% of the time) and fix this
Flags: needinfo?(xidorn+moz)
Summary: image-size.xul is going to permafail on Windows when Gecko 59 merges to Beta on 2018-01-11 → image-size.xul is going to permafail on Windows when Gecko 59 merges to Beta on 2018-01-11 (fails in presence of -moz-window-inactive selector w/ release milestone.txt)
For convenience, here's a single patch that should make this reproduce on all platforms (e.g. for local debugging), when you run this reftest command:
 ./mach reftest layout/xul/reftest/image-size.xul

(Note that it doesn't repro 100% of the time, so you might need to run that command a few times to see the failure.)
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #2)
> I checked locally with 366583-1.xul on m-c (for the failure in comment 1),
> and it raises the same stylo warning but no assertion. [...]
> 
> And for the failure of 366583-1.xul itself, I have no idea. Looking at the
> code logic, I cannot think of anything going wrong there... Maybe someone
> would need to do build locally and attach a debugger to it to see what's
> happening.

Following up on this:
 - RyanVM spun off bug 1428087 for this 366583-1.xul failure
 - ...but I think it's actually the same issue as this bug -- and the assertion can be made to reproduce on all platforms (not just Windows) if you drop "a1" from config/milestone.txt and also add a -moz-window-inactive selector to the testcase, as noted in bug 1428087 comment 2 (see Try run linked there).

Noting this here in case it's easier to debug that testcase (now that we have a way of making it reproduce) rather than debugging image-size.xul.
See Also: → 1428087
Just in case it's helpful, it seems that :-moz-window-inactive makes us restyle the whole document, so it should be reliably reproducible with something like:

let s = document.createElement('style');
s.innerHTML = '* { color: red; }';
document.body.appendChild(s);
document.body.offsetTop;
s.remove();

Or something of the sort.
Daniel, if you have a build handy (I'm doing a build right now, but it'll take a bit), mind trying whether the assertion is reproducible with Stylo disabled and code like the one in comment 17 instead of :-moz-window-inactive?

If so, it may mean this is a longstanding issue that happens to be uncovered by the moz-window-inactive bit, and I'd just wallpaper it. If not and it's a stylo regression, it may be worth digging into a bit more.
Flags: needinfo?(dholbert)
(You might be onto something, but no, I wasn't able to get this to reproduce with variations on the comment 17 snippet.)
Flags: needinfo?(dholbert)
(In reply to Daniel Holbert [:dholbert] from comment #12)
> Additionally, there's a performance regression for this specific reftest
> associated with that commit.  Before bug 1416493, this reftest takes
> 140ms-150ms.  After that bug, it takes 190ms-210ms.  Measurements taken on
> my laptop using this command:
>   ./mach reftest layout/xul/reftest/image-size.xul  | grep "REFTEST"
> 
> (Occasionally -- less than 20% of the time -- a post-regression build will
> pass the reftest, but still take 190-210ms. So the perf regression seems to
> be reliable, even if the rendering regression isn't 100% reliable.)

I'm not too surprised that there's a regression (although that's bigger than I would have guessed), since we are loading the new CSS file. In practice this doesn't affect Firefox / Talos though since menu.css used to get loaded immediately by XBL at startup anyway due to the presence of <menu> tags in browser.xul.
(In reply to Daniel Holbert [:dholbert] from comment #12)
> The connection to bug 1416493 is somewhat believable, because that bug does
> make some tweaks to platform-specific XUL files (adding an @import for a
> platform-specific "menu.css" file).  My current theory is that that new
> "@import" makes XUL loading take a little bit longer (hence the perf
> regression), and we get a flash of unstyled/partially-styled content as a
> result (due to losing a race with some piece of the reftest harness, or
> something).  I'm not yet sure why this only happens in release builds (with
> the config/milestone.txt tweak), though.

By the way, I'm planning to have many more CSS files to get loaded in this same manner as we continue to replace XBL. Basically moving XBL <resources> stylesheets into UA styles that get loaded in all XUL docs. So if this is indeed a race that's getting triggered due to slightly slower CSS loading in the browser chrome I would expect this to get much worse as we go from 1 file loaded this way to tens of files. But I would have expected that reftest results wouldn't be affected by changes in the browser chrome styles.
(In reply to Brian Grinstead [:bgrins] from comment #20)
> (In reply to Daniel Holbert [:dholbert] from comment #12)
> > Additionally, there's a performance regression for this specific reftest
> > associated with that commit.
[...]
> 
> I'm not too surprised that there's a regression (although that's bigger than
> I would have guessed), since we are loading the new CSS file.

I'd bet that the observed slowdown (like the test failure itself) is caused specifically by the -moz-window-inactive selector itself (which happened to be in the stylesheet but would probably cause the same slowdown anywhere).
Depends on: 1409672
I'm going to land the wallpaper for now to avoid orange on tomorrow's uplift. Please remember to revert this change once the underlying issue is resolved.
Keywords: leave-open
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d8d5e1376664
Mark image-size.xul as random on Windows. a=test-only
(In reply to Emilio Cobos Álvarez [:emilio] from comment #17)
> Just in case it's helpful, it seems that :-moz-window-inactive makes us
> restyle the whole document, so it should be reliably reproducible with
> something like:
> 
> let s = document.createElement('style');
> s.innerHTML = '* { color: red; }';
> document.body.appendChild(s);
> document.body.offsetTop;
> s.remove();
> 
> Or something of the sort.

There is one thing different from doing a whole document restyle and doing this, that in a whole document restyle, the restyle starts from the root frame (which is usually the viewport frame), while with this change, the restyle would at most from the primary frame of the root element (which is probably the root element frame).

Restyling the viewport frame could be a key factor for this kind of issue.

Anyway, I suspect this would also go away once bug 1409672 lands.
Flags: needinfo?(xidorn+moz)
NI myself to see if this still reproduces.
Flags: needinfo?(ryanvm)
Indeed, this appears to pass in current simulation runs going off the logs. Aryx, can you please verify that comment 25 can be reverted in your next simulation run?
Flags: needinfo?(ryanvm) → needinfo?(aryx.bugmail)
Going to call this fixed by bug 1409672 and re-enable the test then. Thanks for testing it, Aryx!
Assignee: nobody → emilio
Keywords: leave-open
Backout by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/mozilla-inbound/rev/28d25664945a
Backed out changeset d8d5e1376664 because it doesn't fail anymore on Windows
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
You need to log in before you can comment on or make changes to this bug.