Closed Bug 1490969 Opened 2 years ago Closed 2 months ago

We run a lot of duplicated reftests

Categories

(Testing :: Reftest, enhancement, P4)

Tracking

(firefox83 fixed)

RESOLVED FIXED
83 Branch

People

(Reporter: mats, Assigned: jgraham)

References

(Blocks 1 open bug)

Details

Attachments

(9 files, 2 obsolete files)

5.19 KB, patch
33.71 KB, text/html
33.68 KB, text/html
47 bytes, text/x-phabricator-request
47 bytes, text/x-phabricator-request
47 bytes, text/x-phabricator-request
47 bytes, text/x-phabricator-request
47 bytes, text/x-phabricator-request
47 bytes, text/x-phabricator-request
It appears we're running tests both in layout/reftests/w3c-css/submitted/
and testing/web-platform/tests/css/vendor-imports/mozilla/mozilla-central-reftests/ which are duplicates of each other.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=56aeabe8b84a3874b2ec1d06389894e22fdb9f5c

Shouldn't we exclude the latter to conserve computing resources?

Also, layout/reftests/w3c-css/submitted/README claims to be
the original and says that changes should be made there.
But then, if I make a code change that requires a test change,
am I supposed to mark the temporarily-obsolete copy in
the other location as failing, and then update the metadata
again once the updated test is vendored back in?
This seems rather tedious...
IIRC, the main blocker to this is that wpt doesn't run on all supported platforms yet (i.e. Android).
(In reply to Ryan VanderMeulen [:RyanVM] from comment #1)
> IIRC, the main blocker to this is that wpt don't run on all supported
> platforms yet (i.e. Android).

See also bug 1423971, which this might be a dupe of.
Bug 1423971 would be the ideal fix, but can we do something in the meantime?
I don't see the point of running the WPT copy until that bug is fixed.
Per bug 1481723 we don't even run them on Windows and Mac -- they're run only on Linux.

Given the various things that our reftest harness does that the wpt reftest harness doesn't do, I'm inclined to say that for now we shouldn't run the wpt copy.  The reason to add things to w3c-css/submitted rather than directly to the wpt directories is that we get the various features in our main reftest harness.  (Or, I guess, that they get run on platforms other than Linux...)
I think the argument from computing resources is fairly weak; reftests run fast in CI and the number of duplicate tests is small enough that we're talking about a single-digit number of minutes per run, or a negligible fraction of the overall runtime.

The argument that updating the internal copy of the tests is difficult does make sense, and having duplicate copies occasionally causes me a lot of work.

Having said that, I would prefer to frame this discussion around how we can work together to reach a point where the layout team treats wpt tests as first class, rather than trying to figure out what's acceptable to disable in order to avoid the pain that duplicating tests causes.
I think everyone on the Layout team would be very happy to treat the WPT
files as the originals (as I said in comment 3).  But that seems to be
blocked on a number of things outside of our control (WPT not working
on Android was mentioned in comment 1).
I don't think that all the blocking issues are out of your control.

Several of the issues aren't really about the harness but about unexplained behaviour in Gecko, for example the fact that we see lots of antialiasing-related failures on Windows.

Android is perhaps something of an exception; the recent backstory on that is that maja_zf was working on it before she went on parental leave, and finding cover to continue that work has proven more difficult than we would have liked. However there is the prospect of resuming work there in the near future.

I'm sure I haven't communicated the current state of affairs, or the progress that's being made, as well as I should. But if we have agreement on where we want to end up then there are certainly ways to work together to get there faster.
(In reply to James Graham [:jgraham] from comment #7)
> Several of the issues aren't really about the harness but about unexplained
> behaviour in Gecko, for example the fact that we see lots of
> antialiasing-related failures on Windows.

Hmm, are you saying that we get antialiasing failures when running a WPT reftest
but not when running *the same test* in our traditional reftest framework?
If so, is that due to differences in the frameworks, or simply that WPT
lacks the relevant "fuzzy" annotations in the manifest?
What I know at the moment is that the try runs in bug 1481723 show failures on Windows on tests that pass on Linux, with substantial numbers of differences being apparently due to antialiasing. It is *possible* that all these tests need to be marked as fuzzy on Windows, but that seems like a pretty unfortunate situation for wpt since even when fuzzy support in the harness lands [1] we will continually add new tests from upstream without any automated way to update the fuzzy annotations. Unless people are paying very close attention, we're likely to end up with a lot of unnecessary EXPECTED-FAIL on Windows in those imported tests. Therefore I've been hoping there's some set of prefs I can set, or similar, that will give more reliable results on that platform. But so far I haven't managed to dig into this very deeply.

[1] https://github.com/web-platform-tests/wpt/pull/12187
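For concreteness, the fuzzy annotation syntax proposed in [1] looks roughly like this in the wpt expectation metadata (illustrative only; the test name and numbers are made up, and the exact syntax is whatever finally lands in that PR):

```ini
[display-change-001.xht]
  fuzzy:
    if os == "win": maxDifference=0-2;totalPixels=0-300
```

The ranges mean "up to this much per-channel difference on up to this many pixels is still a pass", which is exactly the kind of annotation that would need manual updating every time new tests are imported from upstream.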
> Therefore I've been hoping there's some set of prefs I can set, or similar,
> that will give more reliable results on that platform.

I think we've discussed that in the past and decided that we explicitly
don't want to do that because then we don't test the actual code path
the test will take in a release build.


> we're likely to end up with a lot of unnecessary EXPECTED-FAIL on Windows in
> those imported tests

I don't see that as a problem for WPT tests that we're currently not running
on these platforms.  BTW, I looked at some of the failures in bug 1481723 and
some of them appear to be bugs in the tests themselves, not anti-aliasing
(dependencies on font metrics and whatnot).
I don't see any failures in any of the
testing/web-platform/tests/css/vendor-imports/mozilla/mozilla-central-reftests/
tests though.  If the failures are only for tests that we currently don't
run on these platforms we should just mark them as failing and enable
the suite IMO.  It will create an initial backlog of tests for us to
investigate but that seems better to me than not running them at all
on these platforms.

BTW, are the WPT machines from the same pool as other types of tests
on that platform?
> If the failures are only for tests that we currently don't
> run on these platforms we should just mark them as failing and enable
> the suite IMO.  It will create an initial backlog of tests for us to
> investigate but that seems better to me than not running them at all
> on these platforms.

If people are happy with hundreds of extra "pass-looking" failures then I can enable the tests on Windows 10 with very little effort. Windows 7 has something that looks like a scrollbar issue that I'm investigating, and Mac has some compareCanvases issue that needs work.

> BTW, are the WPT machines from the same pool as other types of tests on that platform?

On AWS wpt uses xlarge instances by default whereas reftest seems to use default large instances. I don't know what pooling exists outside of the AWS-backed infrastructure.
> On AWS wpt uses xlarge instances by default whereas reftest seems
> to use default large instances.

Interesting. I don't know what the differences are between
xlarge/large; could there be differences in graphics cards/drivers,
Windows settings, etc.?

Would it be possible to do a WPT test run on a "large" machine,
just to see if you get a /different/ set of failures?
Oh, actually I just found some different configuration related to Windows (the xlarge vs. large distinction seems to be Linux-only). AFAICT Windows 10 reftests are run on hardware and others are not. jmaher seems to have set that up, so I bet he knows the context here.
Flags: needinfo?(jmaher)
We are unable to run reftests on Windows 10 in AWS due to issues with the image hanging/crashing (these show up as blue jobs in Treeherder).  I don't know the details, but we run code coverage tests on Windows 10 on hardware instead of AWS for the same reason.

I would imagine that in November we could switch reftests on Windows 10 to AWS, as the driver and driver-installation issues were recently fixed.
Flags: needinfo?(jmaher)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=78994cdedf7517619e8dc67987f17255ded7ac6a&selectedJob=199332028 is a wpt run using the same Windows machines as for reftests; it still seems to be failing tests in the same way.

Are tests like https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/CSS2/box-display/display-change-001.xht really buggy (linking to the reftest analyzer is really hard, but it's one of the tests in [1])? That test seems to fail due to blurring of the borders of the rendering of the Ahem glyph; are tests supposed to account for that somehow? If they are, it seems like we really, really want a lint.

[1] https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/aCOopG5FTRSOa8BdrZF5Ug/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
display-change-001.xht makes the assumption that text using the Ahem font
isn't anti-aliased.  (I think that's a pretty common assumption for tests
using Ahem, BTW).  Apparently it fails on this platform.

https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/CSS2/box-display/display-change-001.xht

Maybe there's a config knob somewhere to suppress anti-aliasing for
Ahem specifically?  Maybe Jonathan knows.
Flags: needinfo?(jfkthame)
(In reply to Mats Palmgren (:mats) from comment #16)
> display-change-001.xht makes the assumption that text using the Ahem font
> isn't anti-aliased.  (I think that's a pretty common assumption for tests
> using Ahem, BTW).  Apparently it fails on this platform.
> 
> https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/
> CSS2/box-display/display-change-001.xht
> 
> Maybe there's a config knob somewhere to suppress anti-aliasing for
> Ahem specifically?  Maybe Jonathan knows.

I don't know of a config setting that would do this. In theory, it should be possible to use an appropriate 'gasp' table[1] in the Ahem font to request non-AA rendering; but I don't know whether the various rasterizers across all platforms actually respect this or not.

[1] https://docs.microsoft.com/en-us/typography/opentype/spec/gasp
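To make the suggestion concrete, the 'gasp' table itself is tiny; a sketch (in C++, with field names taken from the OpenType spec; this is illustration, not proposed Gecko code) of packing a table that requests grid-fit-only rendering, i.e. no grayscale smoothing, at all sizes:

```cpp
#include <cstdint>
#include <vector>

// OpenType data is big-endian, so append each uint16 high byte first.
static void AppendU16(std::vector<uint8_t>& aBuf, uint16_t aValue) {
  aBuf.push_back(aValue >> 8);
  aBuf.push_back(aValue & 0xFF);
}

// Pack a minimal 'gasp' table: one range covering all ppem sizes,
// with GASP_GRIDFIT set and GASP_DOGRAY clear (i.e. no antialiasing).
std::vector<uint8_t> PackNoAAGasp() {
  std::vector<uint8_t> table;
  AppendU16(table, 1);       // version
  AppendU16(table, 1);       // numRanges
  AppendU16(table, 0xFFFF);  // rangeMaxPPEM: this range applies to all sizes
  AppendU16(table, 0x0001);  // rangeGaspBehavior: GASP_GRIDFIT only
  return table;
}
```

Whether the various platform rasterizers honor this is exactly the open question above.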
Flags: needinfo?(jfkthame)
Hmm, I'm surprised there isn't a @font-face feature to control
anti-aliasing per font.

Maybe we could set gfxFont::mAntialiasOption to kAntialiasNone
for Ahem, based on a pref? (enabled in testing profiles)

I'm puzzled how/if other UAs pass these tests...
I think that blink and WebKit disable font antialiasing for tests.

gsnedders tried altering the gasp table for the Ahem font but it didn't seem to work; maybe we got the settings wrong [1]

[1] https://github.com/web-platform-tests/wpt/pull/13010
jfkthame: Do you have some time to look at https://github.com/web-platform-tests/wpt/pull/13010 and see if the changes to Ahem look like they ought to be doing the right thing?

If the consensus is that we should disable antialiasing for Ahem specifically in the test configs, I can look at implementing that but I'm totally unfamiliar with the code so would need some pointers to get going. Otherwise I'd be very happy to test a patch :)
Flags: needinfo?(jfkthame)
(In reply to James Graham [:jgraham] from comment #20)
> jfkthame: Do you have some time to look at
> https://github.com/web-platform-tests/wpt/pull/13010 and see if the changes
> to Ahem look like they ought to be doing the right thing?

Looks like it to me, in principle. I'm a bit doubtful, though, whether the font rasterizer always respects the 'gasp' table; e.g. in the case of Linux, I suspect this might be overridden by stuff in the fontconfig configuration.

Or I guess we might be overriding it somewhere in Moz2d-land or Skia or something.... that's getting pretty far into the gfx side of things, dealing with details of rasterization etc. Maybe Lee would know where to look for more detail.
Flags: needinfo?(jfkthame) → needinfo?(lsalzman)
(In reply to Jonathan Kew (:jfkthame) from comment #21)
> (In reply to James Graham [:jgraham] from comment #20)
> > jfkthame: Do you have some time to look at
> > https://github.com/web-platform-tests/wpt/pull/13010 and see if the changes
> > to Ahem look like they ought to be doing the right thing?
> 
> Looks like it to me, in principle. I'm a bit doubtful, though, whether the
> font rasterizer always respects the 'gasp' table; e.g. in the case of Linux,
> I suspect this might be overridden by stuff in the fontconfig configuration.
> 
> Or I guess we might be overriding it somewhere in Moz2d-land or Skia or
> something.... that's getting pretty far into the gfx side of things, dealing
> with details of rasterization etc. Maybe Lee would know where to look for
> more detail.

I would think it would be better to manipulate font AA via a pref. Trying to manipulate the font itself seems sketchy.
Flags: needinfo?(lsalzman)
I'd argue the opposite:  if we can make the reftests in web-platform-tests more portable between browsers and configurations, that's a pretty substantial win for the usability of those tests (and their accessibility to contributors, etc.).
Attached patch kAntialiasNone (obsolete) — Splinter Review
jgraham: could you test this patch please?

(It intentionally disables AA for *all* fonts currently,
because I wanted to see how many other tests we have that
depend on AA (around 30), but hopefully it should fix
the failing Ahem reftests.)
Flags: needinfo?(james)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=31fefee5116c44e07f67897a7a5e03ff7e7a8bb8&selectedJob=200154874

No change on Linux, fixes the vast majority of issues on Windows 10 (for wpt). Do you also want non-wpt reftests?
Flags: needinfo?(james)
Just WPT is fine, thanks.  I suspect that it would fix most
issues on Win10 if we limit this hack to Ahem only.  There
seems to be some remaining issue with the viewport though:
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/D1FU5FkFRs6vy2jov8Xo0w/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
(then click "Circle differences")

I'm assuming that the OSX failures are still that the Ahem
font fails to install properly there.
Yeah, Windows 7 has some strange issues getting the correct viewport size, which I'm investigating in bug 1490096 (it seems to be a gfx issue, since I see different behaviour in headless and non-headless mode and the layer tree looks sensible).
OK, so we have several suggested solutions for the Ahem issue:

1. Change the Ahem font so it's no longer antialiased
2. Update all the tests so they only ever do glyph-to-glyph comparisons, so that the antialiasing is always the same in test and ref.
3. Disable antialiasing for wpt reftests
4. Disable antialiasing for Ahem specifically in wpt reftests.

I agree with dbaron that 1 would be ideal if we could make it work, but my sense from all the font experts is that it's hard to achieve in practice. 2 seems like a theoretically pure solution that is difficult to implement because it would involve rewriting hundreds of existing tests and probably not be easy to maintain for new tests.

If we agree that 1 and 2 aren't practical, then that leaves 3 and 4. 3 has a considerable disadvantage that it makes the rendering different to a real user session, and may miss bugs where AA affects other layout. But it has the considerable advantage that it's the same as Blink and WebKit do, so we're less likely to run into tests that work in those browsers but fail with AA issues in gecko. 4 is the compromise option where we disable AA for Ahem only. I don't have a sense for how big the advantages/disadvantages of this are compared to option 3; do we have a lot of tests where layout depends on font metrics, but Ahem isn't the font being used in the test?

p.s. the Windows 7 issue seems to be that a window with the dialog flag doesn't necessarily open at the specified size on that platform.
Attached patch kAntialiasNoneSplinter Review
Here's a little more fleshed out patch.  It puts the feature behind
a pref and only enables it for WPT and reftests.  Also, it disables
AA for Ahem only.

There are some failures on OSX I don't understand though:
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/cnB_dXsBTliKSrohP8GIwg/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
It appears that disabling AA affected the Ahem baseline?
Attachment #9010072 - Attachment is obsolete: true
Another disadvantage of the w3c-css/submitted tests is that they make contributing to or changing the tests really difficult for developers from other engines.

Rego just mentioned that he wanted to adjust one of the contain/ tests for a spec change, and right now he would need to either copy the tests into wpt, or submit the patch to Gecko (marking the tests as failing here), which is probably fine but somewhat unfortunate.

I suggested that he submit a patch moving the reftests to testing/web-platform.
mats: I made a similar patch in bug 1494715, but I think your one already addresses most of the review issues. So I hope you don't mind that I replaced my patch with yours, set you as author, and updated the review [1] (I'll obviously check that the author information is preserved on landing; no idea what Phabricator/Lando do when the commit author changes).

I think in the interests of making progress we shouldn't set the pref on non-wpt reftests for now so that we can handle any changes separately?

[1] https://phabricator.services.mozilla.com/D7200
Depends on: 1481723, 1495430
Flags: needinfo?(mats)
I'm puzzled by the new failures on OSX though (comment 29).
I think we need to understand those first.  Any help with that
would be appreciated.

Also, there are a couple of @font-face reftests that legitimately
need to have AA for Ahem (they load Ahem.ttf under a different
name and compare it to Ahem).  I tried to fix that by just
adding a pref() in the manifest to disable this feature for
those tests, but I discovered that it doesn't work.  I suspect
that Ahem was already loaded in that content process, and once
it's in the font cache it's too late.  So we need to add a pref
callback to clear the font cache or something.
Alternatively, we could change those few tests to use
a different name than Ahem; that's probably easier.
Flags: needinfo?(mats)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=931935f9781eb22a92c4b739f018266d62c9f93e&selectedJob=202937460 shows 

Assertion failure: IsAtomic<bool>::value || NS_IsMainThread() (Non-atomic static pref 'gfx.font_ahem_antialias_none' being accessed on background thread), at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/StaticPrefList.h:439

amongst other problems.
Could you try declaring it RelaxedAtomicBool instead?
(assuming it's for prefs like this)
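For illustration, RelaxedAtomicBool is essentially a std::atomic<bool> accessed with relaxed ordering; a hedged sketch of the pattern (the real StaticPrefs machinery generates this, and the function and variable names here are made up):

```cpp
#include <atomic>

// The pref cache needs atomicity but no ordering guarantees relative
// to other memory, so relaxed loads are cheap and safe to read from
// background (e.g. font) threads.
static std::atomic<bool> sAhemAntialiasNone{false};

// Main thread: called from the pref-change callback.
void SetAhemAntialiasNone(bool aValue) {
  sAhemAntialiasNone.store(aValue, std::memory_order_relaxed);
}

// Any thread, e.g. while setting up a font for drawing.
bool GetAhemAntialiasNone() {
  return sAhemAntialiasNone.load(std::memory_order_relaxed);
}
```

This avoids the "Non-atomic static pref being accessed on background thread" assertion above, at the cost of a reader possibly seeing the old value briefly after a pref flip, which is fine for a test-only pref.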
The OSX failures are still that the Ahem font wasn't installed correctly, right?

The Win7 failures:
 [Exception... "Failure" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: chrome://marionette/content/reftest.js :: toBase64 :: line 224" data: no]
seem to be an error with the snapshotting feature on that platform?
(The debug build says: Assertion failure: [GFX1]: Failed to create DrawTarget, Type: 3 Size: Size(616,638)
so maybe we just ran out of memory?)

The Win10 failures (meshgradient*.svg) make no sense to me.  The test
snapshot is blank, so I'm guessing they test an SVG feature we don't support?
We should probably just mark them as failing and file a bug in Core/SVG.

Other than that, it looks pretty good to me. ;-)
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=203425483&revision=6c8d8af7621c834fb62fbeb156634c073aeb136f looks more healthy. Note that this includes updated metadata so we aren't actually passing all tests (and in particular we haven't fixed the Ahem-on-OSX issue), but we don't see any more errors.
I think we need to figure out the OSX issue I noted in comment 29
before landing that patch though.
Jonathan, maybe you can take a look?
Flags: needinfo?(jfkthame)
(In reply to Mats Palmgren (:mats) from comment #37)
> I think we need to figure out the OSX issue I noted in comment 29
> before landing that patch though.

Yes, I'm puzzled by that too.

As a first step, I confirmed in a local build that the font metrics we get (as computed by gfxMacFont::InitMetrics()) are identical in the unpatched and patched cases.

The frame tree we end up with for an affected testcase is also identical. So there's no actual change in layout resulting from the patch; the issue is happening during painting only.

Next step is to see what's different about the display list, perhaps...
Flags: needinfo?(jfkthame)
(In reply to Mats Palmgren (:mats) from comment #29)
> It appears that disabling AA affected the Ahem baseline?

I'm not sure this is a baseline issue at all, actually; it just appears like one at first glance. But the failures I looked at involve Ahem glyphs in an element with vertical writing mode, so when we see a discrepancy at the top and/or bottom edges that corresponds to an issue in the *inline* direction.

Moreover, the examples I checked turn out to be annotated as fuzzy already. The trouble is that when we disable AA for Ahem, the fuzz numbers tend (in vertical writing-mode examples) to get larger. Specifically, we seem to end up consistently painting the Ahem glyph 1 device pixel lower -- i.e. with an inline-position that is 1 dev px advanced -- than would be "correct" according to the element's rect. We *sometimes* also have it offset 1 dev px to the right (i.e. in the block direction), but that's variable depending on the font size.
Here's a testcase that (with a patched build and the gfx.font_ahem_antialias_none pref set to true) can be used to compare the rendering we get with and without AA, in both horizontal and vertical writing modes.

With antialiasing enabled, there's frequently some fuzziness on one or more edges of the block, depending on size and directionality, but there doesn't seem to be any consistent error, just antialias blur. With it disabled, however, the vertical writing-mode example (4th line) clearly shows a consistent 1 pixel offset in the vertical direction.
Updated to also test vertical writing-mode with text-orientation:upright. In this case the problem doesn't occur; the non-AA glyphs are rendered at the expected vertical position (though depending on size, there's sometimes a 1-pixel offset horizontally).

The other thing I notice is that with WebRender enabled, the issue here doesn't seem to occur; it's specific to the non-WR code path for drawing rotated ("sideways") text in vertical modes.

This suggests perhaps we need to pixel-snap the y-coordinate of the center of rotation in the code at https://searchfox.org/mozilla-central/rev/5786b9be9f887ed371c804e786081b476a403104/gfx/thebes/gfxFont.cpp#2306-2319, or something like that.
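Something like the following is the kind of snapping being suggested (purely illustrative; the function name and the user/device-space conversion are assumptions, not the actual gfxFont.cpp code):

```cpp
#include <cmath>

// Snap a coordinate of the rotation origin for sideways text to a
// whole device pixel before building the rotation matrix, so the
// rotated glyph run doesn't land on a fractional-pixel boundary.
double SnapToDevPixel(double aUserSpace, double aUserToDevScale) {
  double dev = aUserSpace * aUserToDevScale;  // user units -> device px
  dev = std::floor(dev + 0.5);                // round to nearest pixel
  return dev / aUserToDevScale;               // back to user units
}
```

Applied to the y-coordinate of the center of rotation, this would turn the consistent 1 dev px inline-direction offset into an exact pixel-aligned position.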
Attachment #9015021 - Attachment is obsolete: true
A further demonstration that the problem arises due to the rotation origin being at a fractional-device-pixel vertical position. Here, we just repeat the exact same testcase several times. The problem occurs only on lines 2 and 3 out of the 6 examples here; on the other lines, the vertical positioning is OK.
(In reply to Jonathan Kew (:jfkthame) from comment #41)
> This suggests perhaps we need to pixel-snap the y-coordinate of the center
> of rotation in the code at
> https://searchfox.org/mozilla-central/rev/
> 5786b9be9f887ed371c804e786081b476a403104/gfx/thebes/gfxFont.cpp#2306-2319,
> or something like that.

Maybe SetMatrixDouble -> SetMatrix -> ChangeTransform:
https://searchfox.org/mozilla-central/source/gfx/thebes/gfxContext.cpp#954
  mRect.NudgeToIntegers();
So, given bug 1425698 is still unsolved, it seems like blocking the Ahem antialiasing changes (scoped to wpt-only) on a Mac-only issue prevents us from turning on Windows tests for not much gain. Would there be objections to landing a set of fixes to enable the wpt CSS tests on Windows only? I think that is currently blocked on https://phabricator.services.mozilla.com/D7200 getting a second round of review from jfkthame, and on me splitting the relevant patches into Windows-specific and Mac-specific parts.
Sounds good to me.

Also, once bug 1425698 is resolved, I think we should also enable it
on OSX even if the "rounding issue" discussed above is unresolved.
I'm assuming that we aim to make WebRender the default soon-ish,
so it seems OK to mark those tests as failing in a non-WR OSX build.
All CSS tests now run on Mac and Windows (both with Ahem)

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&selectedJob=213118961&revision=9c736fa46ae77b97f6d3cbb73e9b32f975c8dba0

is a try run on Android; there's some bustage that I think is related to dpi/scaling. I set devicePixelRatio to 1, but some of the failures e.g. in https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/dTSbpRpoSvSMW_9a5Gea5g/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1 look like the kind of slight misalignment that you get running reftests on high dpi devices. So I'm not sure whether there is some other setting I should tweak, or similar.
(In reply to James Graham [:jgraham] from comment #47)
> All CSS tests now run on Mac and Windows (both with Ahem)

Yay \o/

> is a try run on Android; there's some bustage

It looks to me like AA differences are causing some of these errors...
For example, css/CSS2/css1/c412-hz-box-000.xht uses Ahem text
in the test and compares it to <img> blocks in the reference.
The Ahem text is now NOT doing AA, as we intended, but the image
rendering applies AA at the edges (which seems a bit odd to me).
We might be able to make some of these tests pass with an .ini file
that unsets our Ahem pref on Android only?

Some other failures seem to be unrelated to AA though, e.g.
css/CSS2/tables/fixed-table-layout-003f01.xht, which has some
text (the "p" in "200px") protruding below the block size.
It seems weird to me that part of a descender can cause visual
overflow like this when using a regular font.  It might be
technically allowed for a font to do that, though; I don't know.
Perhaps we should just mark these tests as failing on Android?
Depends on: 1509983
After figuring out what settings are required, Android reftests are close to landing. It seems like we have a lot of intermittents related to what I think is intermittent rendering of the scrollbar, e.g.
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/aZ-5y73ER7Ki_EOFS7jhjg/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
If anyone knows how to address that, I would be very interested to know.
The best fix for cases like this is to split the test in two parts
so that it doesn't overflow the viewport.  Using a smaller font-size
might work too if the font-size isn't relevant to the test.
OK; there's more general work going on to ensure that wpt reftests don't overflow the bounds of the viewport (principally by making the viewport larger), so if that's the only thing to do here then I guess we can go ahead and disable these for now and re-enable them once we have addressed the underlying issue.
Although I'm unclear how to handle cases like the position:sticky test in https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/aHJ22_QqTaS9UfVP6Uhcbw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1 which uses scrolling child elements. It looks like there are prefs to either disable overlay scrollbars or change the time for them to disappear to 0, so I'm going to try that, but I don't know if there's a preferred option here?
For those two specific tests in comment 52 (tall--auto-32px--omitted-width-nonpercent-height-viewbox.html and position-sticky-grid.html), we could probably adjust the test's CSS to use "overflow:hidden" rather than "scroll", right?  (In the former test, putting "overflow:hidden" on the <body> or <html> element.)

Those tests don't seem to be intending to test scrollbar rendering -- they're just testing rendering in a scrollable-via-JS area, and overflow:hidden is still scrollable-via-JS.  If scrollbars cause screenshotting-problems from fading etc., then that's a pretty good argument for avoiding overflow:scroll in this case.

But: having said that, I also expect there are *other* testcases that *do* intend to test scrollbar rendering (e.g. to exercise edge cases relating to the space that scrollbars may occupy if they're not overlaid). So we probably do need a more general solution beyond just fixing the tests in comment 52.
Perhaps we could set the fade time to "infinity"?
Then we still have general test coverage for overlay scrollbars
but with stable rendering, and we only need to reset that
setting for tests that check the fade-away behavior.
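For reference, prefs along these lines already exist for taming overlay scrollbars in test profiles; a sketch of what the reftest profile could set (pref names from memory and values illustrative, so verify against mozilla-central before relying on them):

```
// Keep overlay scrollbars from fading during snapshots.
user_pref("layout.testing.overlay-scrollbars.always-visible", true);
// Alternatively, push the fade far enough out that it never starts.
user_pref("ui.scrollbarFadeBeginDelay", 2147483647);
```

Tests that specifically exercise fade-away behavior would then reset these via per-test pref annotations.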

All the recorded blockers here are fixed. Are there other issues that people feel should block deduping the reftests?

No, I don't see any blockers. We can fix any remaining tests
that fail when run as WPT but not as a reftest as we go.

Are we running the WPT reftests on Android now, and if not, should we be blocking on that? Looking at a m-c push, I see Tier 2 Wr jobs on Android 7.0 x86 opt, but no other Android configurations.

Flags: needinfo?(james)

I see the dependent bug 1509983 which enabled them for that configuration, in Tier 2. What prevents those from moving to Tier 1, and should we be running those tests in debug or other Android version configurations?

OK, so the Android configuration has been a little bit in flux, but the situation is clearer now.

We are currently running tests in Fennec as Tier-2 and (just landed, and not yet very green) GeckoView as Tier-3.

The intent is to green up the GeckoView version and then eventually promote that to Tier-1. I expect that Fennec tests will be dropped at some point, although that could be subject to product decisions. After the opt tests are Tier-1 (or at least Tier-2) we can look at debug, but I'm not confident we will be able to run the full suite in that configuration given the limited resources and relatively poor performance. But hopefully I'm wrong.

I don't think there's any plan to run the tests in other Android configurations; given the relative performance it's hard to justify running things in ARM Android emulators on x86 hardware rather than in x86 Android emulators on x86 hardware unless there's a strong justification for doing so. For generic Gecko code we've been taking the approach that Android vs !Android is a more significant distinction than ARM vs x86, although there are obviously exceptions for things like the SpiderMonkey JIT. If that's something we need to do with (a subset of) wpt, we should discuss that.

Flags: needinfo?(james)

Are there any remaining issues that people feel are blocking this change? We now have Tier 1 support for wpt reftests on the Android emulators.

The confusion caused by the vendor-imports directory just came up talking to people working on other browser engines as a pain point for interop testing layout features.

dbaron: Do you have feedback on whether there are remaining blockers here?

Flags: needinfo?(dbaron)

I'm still a bit worried about whether the developer experience for writing reftests for WPT is on par with that for our original reftest harness.

Here's an example of something that seems unnecessarily difficult with WPT but was easy with the reftest harness (when I did it just a few hours earlier, but I was asked by my reviewers to make the test a WPT instead, probably for good reason):

I'm adding a new failing test for bug 1584018, and I want to run just that test to check that I wrote the manifest correctly (since I'm going to remove that manifest in the next commit when I fix the failure, so nobody else will ever actually run it). So I type ./mach web-platform-tests testing/web-platform/tests/css/css-flexbox/child-percent-basis-resize-1.html, and instead of the normal process of running the tests given, closing, and printing output... since I ran just one test, the WPT harness decides that I magically wanted to leave the window open. And when I close the one of the two windows involved that has UI... that still doesn't close the other one, so I have to Ctrl+C the process and I don't get the printf'd output at the end that says whether the results were expected or not. Then I vaguely recall having this conversation with jgraham before, and recall that there was perhaps a --close-when-done argument... but there's nothing that looks like that in ./mach help web-platform-tests, and --close-when-done is rejected as an error. So I give up, and just add a second test that I didn't need to run to the command line arguments in order to get the output that I wanted:

web-platform-test
~~~~~~~~~~~~~~~~~
Ran 2 checks (2 tests)
Expected results: 2
Unexpected results: 0
OK

(This was literally my first interaction with WPT since you needinfo'd me... and the fact that I hit something with subpar developer experience seems entirely normal.)

Flags: needinfo?(dbaron)

Oh, I guess the option I wanted would have been --no-pause-after-test. (I still think the pause-after-test default doesn't make sense for reftests.)

(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-7 from comment #63)

Oh, I guess the option I wanted would have been --no-pause-after-test. (I still think the pause-after-test default doesn't make sense for reftests.)

That seems reasonable, and hopefully a straightforward change to https://searchfox.org/mozilla-central/rev/2a355e56c490768be676689de18629485c9d699b/testing/web-platform/tests/tools/wptrunner/wptrunner/wptrunner.py#123 to check that test_loader.tests["reftest"] is the array that's got the one item in it?

I filed and fixed bug 1591349 for the --pause-after-test defaults to depend on the test type; now only testharness tests will default to pausing when only a single test is run, matching the mochitest behaviour.
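The resulting default can be sketched as follows. This is a simplified illustration, not the actual wptrunner code; the function name and data shape are hypothetical, but it captures the behaviour described above: pause only when a single testharness test is run, never for reftests.

```python
# Sketch (hypothetical helper, not real wptrunner code) of the
# post-bug-1591349 default: pause after the test only when exactly one
# test was requested AND it is a testharness test.
def should_pause_after_test(pause_flag, tests_by_type):
    # pause_flag reflects an explicit --pause-after-test /
    # --no-pause-after-test setting; None means "use the default".
    if pause_flag is not None:
        return pause_flag
    total = sum(len(tests) for tests in tests_by_type.values())
    return total == 1 and len(tests_by_type.get("testharness", [])) == 1

# A single reftest no longer pauses; a single testharness test still does.
assert should_pause_after_test(None, {"reftest": ["r1"]}) is False
assert should_pause_after_test(None, {"testharness": ["t1"]}) is True
assert should_pause_after_test(True, {"reftest": ["r1"]}) is True
```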

It's not clear to me that blocking this change on wpt harness UX is the best way forward. The current situation is already causing problems for gecko developers (per comment 0) and for developers of other engines who are trying to improve interop with gecko by using our tests (c.f. comment 60). I would expect that layout developers have to interact with the wpt harness anyway, so the downsides of increasing the number of tests for which that is required seem limited, and there are clear upsides in terms of reducing duplication, making editing the tests easier, and making them more accessible to other engines.

That said, I'm of course keen to work with people to improve the wpt harness experience. It does of course have some constraints from the need to work cross-browser, and has some design decisions that are different from the gecko harness, but I think it's made a lot of progress in terms of performance and featureset over the last few years. Where there are still issues with missing features or poor UX for wpt reftests I'd really like to work with layout developers to get them fixed.

I'm ok with going ahead with this as long as you check that the fail/pass state matches as you deduplicate. If you find cases where tests are failing in one harness and passing in another, those need to be investigated and understood.

One other question: do the harnesses use different canvas sizes? (e.g., reftest harness uses 800x1000) If so, it might be worth a little bit more checking that the tests are testing what they're supposed to be testing at both sizes.

(In reply to David Baron :dbaron: 🏴󠁵󠁳󠁣󠁡󠁿 ⌚UTC-8 from comment #67)

One other question: do the harnesses use different canvas sizes? (e.g., reftest harness uses 800x1000) If so, it might be worth a little bit more checking that the tests are testing what they're supposed to be testing at both sizes.

Yes, they do. WPT uses 800x600. See also https://github.com/web-platform-tests/wpt/issues/7135 about making that configurable.

Depends on: 1641803

Status update:

The following tests have failures as wpt reftests (paths relative to testing/web-platform/tests/css/vendor-imports/mozilla/mozilla-central-reftests):

  • flexbox/flexbox-align-self-horiz-004.xhtml - Intermittent on Linux.
  • flexbox/flexbox-mbp-horiz-003-reverse.xhtml - Intermittent on Windows.
  • flexbox/flexbox-min-height-auto-002b.html - Failing on Linux.
  • masking/clip-path-geometryBox-2.html - Failing in WebRender.
  • masking/mask-image-2.html, masking/mask-image-3a.html, masking/mask-image-3b.html, masking/mask-image-3c.html, masking/mask-image-3d.html, masking/mask-image-3e.html, masking/mask-image-3f.html, masking/mask-image-3g.html, masking/mask-image-3h.html, masking/mask-image-3i.html - Intermittent timeout on Android opt (Bug 1624089).
  • masking/mask-origin-2.html - Failing everywhere.
  • shapes1/shape-outside-border-box-border-radius-002.html, shapes1/shape-outside-border-box-border-radius-004.html, shapes1/shape-outside-margin-box-border-radius-002.html, shapes1/shape-outside-margin-box-border-radius-004.html - Failing everywhere.
  • sizing/vert-block-size-small-or-larger-than-container-with-min-or-max-content-2b.html - Failing everywhere.
  • text3/segment-break-transformation-removable-1.html, text3/segment-break-transformation-removable-2.html, text3/segment-break-transformation-removable-3.html, text3/segment-break-transformation-removable-4.html - Failing on (Linux) 32-bit.

If the intermittent or failing tests are "new" (i.e. from the last 6 months) they may need fuzzy annotations. We previously added logging for tests with a document larger than the viewport and eliminated most/all of those, but it needs to be rechecked given there have been changes here in that time.
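For reference, a fuzzy annotation lives in the test's wpt expectation-metadata file and looks roughly like the following (the filename and values here are hypothetical; the ranges bound the allowed per-channel colour difference and the number of differing pixels):

```ini
[flexbox-align-self-horiz-004.xhtml]
  fuzzy: maxDifference=0-2;totalPixels=0-300
```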

I spoke with :jgraham and :jmaher (separately) about this issue and the approach to stop running duplicate tests (especially so if they are intermittents).

As we know in the current climate of things, cost savings is a top priority. While not everything is about ROI, I think for things I'm involved in there's a heavy skew towards finding ways to cut costs, improve efficiency and ship a better product.

Intermittents cause sheriffs a lot of workload, as they must annotate every Tier-1 and Tier-2 failure on autoland and mozilla-central.
If these tests are duplicated across Reftest and WPT(reftest), and a test intermittently times out in one or the other, then I see two problems:

  1. we are causing additional workload to sheriffs for little to no gain, impacting their work.
  2. we are incurring additional costs in the CI system when duplicated tests are run as they add overhead, execution time and artifact storage space costs.

The duplicated tests are moderate in number and won't have a thousand-dollar impact immediately, but this is also probably one of the lower-hanging fruits that we can pick off without adversely impacting the quality of tests or the shipped product in any way.

With the above stated, I am proposing that we either move the WPT(reftest) variant to backlog status, or mainline the WPT(reftest) variant and drop the Reftest variant.

I know there are a lot of duplicated tests between reftest and wpt-reftest, but wpt-reftest also has many unique reftests; I am not sure we want to put them all in backlog or turn them off.

I assume the proposal was just for these specific reftests, not all wpt reftests.

In any case I think that backlogging the wpt versions is the wrong solution here. Moving more things onto the wpt reftest harness and away from vendor-specific reftests is important for Gecko to maintain web compatibility. Note that a lot of the issues we have already fixed related to this bug were basically the tests testing the wrong thing in the cross-browser case (because of different viewport sizes, or incorrect fuzziness), meaning they weren't providing useful coverage for Blink or WebKit. Fixing those issues is good and doesn't happen if we continue to view internal reftests as the default and cross-browser reftests as the edge case.

AFAIK comment #69 identifies all the remaining tests that don't behave identically in the two harnesses. With the exception of the masking tests, where there's a real issue on Android, it's most likely that the tests just need some additional fuzzy annotations or that the platform-specific failures aren't problematic. In any case, with the exception of those named tests, I'd hope it's uncontroversial to remove the reftest harness copies of the remaining tests.

That is 24 tests. What would be the work required to fix these, or the risk of not running them properly?

Do we support all modes in wpt-reftest (webrender, fission, linux/windows/osx/android, gpu, no-accel) ?

shapes1/shape-outside-border-box-border-radius-002.html, shapes1/shape-outside-border-box-border-radius-004.html, shapes1/shape-outside-margin-box-border-radius-002.html, shapes1/shape-outside-margin-box-border-radius-004.html - Failing everywhere

FWIW, these are already failing under our own reftest framework annotated in this reftest.list. I filed bug 1652112 to merge shapes tests into wpt to move this bug a bit forward.

That is 24 tests. What would be the work required to fix these, or the risk of not running them properly?

Apart from the masking tests I suspect it's only verifying that these tests have the same behaviour as the reftest harness, or modifying something like a fuzzy annotation i.e. almost no work. For the masking tests there's a suspected bug somewhere where on android we can end up waiting for a layout that never happens or something. That's going to require a bit more work.

Do we support all modes in wpt-reftest (webrender, fission, linux/windows/osx/android, gpu, no-accel) ?

I think we never enabled the no-accel/gpu configurations? I don't know of any reason we couldn't do that, though. But note that's largely an orthogonal issue to this one; if having full coverage on those configurations is important, we already have a problem, because we already have things that are only covered by wpt reftests (and enabling those is probably going to dwarf whatever cost savings we make here).

Assignee: nobody → james
Status: NEW → ASSIGNED

I couldn't reproduce this bug with a try push containing 20
retriggers, so I think it's safe to remove the annotations. If this
comes up again as an intermittent, we can revert this patch and
investigate further.

These tests are duplicated across the reftest harness and the wpt
reftest harness. This has a number of problems:

  • We run every test twice, taking unnecessary time and resources.

  • A bespoke sync process is required for this directory.

  • We often get metadata out of sync between the two copies, so they
    seem to have different results.

  • Other vendors often don't realise these tests exist, so they're less
    useful for interop.

  • When others do realise the tests exist, they don't feel able to fix
    issues in them because of the complex sync.

This patch removes the reftest harness copy of the tests. It seems
like the wpt reftest harness is sufficiently aligned with the reftest
harness that this shouldn't lose much in the way of coverage (and any
remaining differences are obviously a problem for all other wpt
reftests as well so should be fixed in general if there's a problem).

A notable remaining difference is that many object-fit tests use
image-rendering:-moz-crisp-edges. The export script changed this to
image-rendering:pixelated, which is not actually supported by Gecko.

This better matches the original mozilla-central copies of these tests.

I just noticed we have some reftest-paged reftests under layout/reftests/w3c-css/submitted: https://searchfox.org/mozilla-central/search?q=reftest-paged&path=layout%2Freftests%2Fw3c-css%2Fsubmitted&case=true&regexp=false

They are not running as wpt print reftests on wpt. I discovered this because my patch breaks layout/reftests/w3c-css/submitted/break3/moz-block-fragmentation-001.html on my try run, but it doesn't fail in wpt jobs.

Could you convert them to proper wpt print reftest, please? Bug 1645272 contains some examples of our previous attempt.
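For anyone doing a similar conversion, a minimal wpt print reftest might look like the sketch below (filenames are hypothetical). wpt identifies print reftests by a -print suffix in the filename, and the reference link works as in an ordinary reftest, with the comparison done on paged-media output:

```html
<!DOCTYPE html>
<!-- my-fragmentation-test-print.html: the -print filename suffix makes
     wpt run this as a print reftest, rendering to paged media before
     comparing against the reference. -->
<title>Block fragments across page boundaries</title>
<link rel="match" href="my-fragmentation-test-print-ref.html">
<div style="break-after: page">page one content</div>
<div>page two content</div>
```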

Flags: needinfo?(james)

Done.

Flags: needinfo?(james)
Pushed by james@hoppipolla.co.uk:
https://hg.mozilla.org/integration/autoland/rev/b6502643bedc
Copy some missing annotations from submitted reftests to vendor-import, r=mats,TYLin,emilio
https://hg.mozilla.org/integration/autoland/rev/3cb57a090de2
Remove intermittent annotations on mask-image tests, r=heycam,mats,TYLin,emilio
https://hg.mozilla.org/integration/autoland/rev/f11d38d8ed40
Move conditional3 tests for moz-document to reftest harness, r=emilio
https://hg.mozilla.org/integration/autoland/rev/1a10423235ce
Remove duplicate reftest harness tests, r=mats,TYLin,emilio
https://hg.mozilla.org/integration/autoland/rev/25436958000f
Convert object-fit tests to use image-rendering:crisp-edges, r=mats,TYLin,emilio
https://hg.mozilla.org/integration/autoland/rev/69ffd9d94e53
Convert reftest-paged tests to print reftests, r=TYLin
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/af2cd2d1ca4e
set multicol-height-002-print.xht as failing on macOS. DONTBUILD CLOSED TREE
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/26218 for changes under testing/web-platform/tests
Upstream PR merged by moz-wptsync-bot
Duplicate of this bug: 1423971
You need to log in before you can comment on or make changes to this bug.