Compare stylo against non-stylo ref page in reftest

RESOLVED FIXED in Firefox 51

Status

defect
P1
normal
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: shinglyu, Assigned: shinglyu)

Tracking

Version 3
mozilla51
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox51 fixed)

Details

Attachments

(5 attachments, 11 obsolete attachments)

4.16 MB, text/x-log
Details
3.38 KB, application/x-shellscript
Details
662 bytes, application/x-shellscript
Details
1.26 MB, patch
heycam
: review+
Details | Diff | Splinter Review
58 bytes, text/x-review-board-request
manishearth
: review+
heycam
: review+
Details
When running reftests in stylo, we want to compare the stylo-rendered one with a non-stylo ref page. 

e.g. foo.html => Stylo
     foo-ref.html => Non-stylo
Blocks: stylo
Depends on: 1288349
Posted patch patch (obsolete) — Splinter Review
Oh wait, It only works in the first test case.
Posted patch patch v2 (obsolete) — Splinter Review
Attachment #8778731 - Attachment is obsolete: true
Attachment #8778763 - Flags: review?(cam)
Fixed. We can land this. and I'm currently working on a new reftest.list that runs stuff like

== test.html test.html 

The first will be run with stylo and the second will run with stylo disabled.
Comment on attachment 8778763 [details] [diff] [review]
patch v2

Review of attachment 8778763 [details] [diff] [review]:
-----------------------------------------------------------------

I think we only want to do this if (a) we have a MOZ_STYLO build, and (b) we run the reftest suite in a specific mode that means "compare stylo to non-stylo", since that's not how we'll want reftests to run long term.  For (a), I think we can just #ifdef MOZ_STYLO, since I see other #ifdefs in reftest.jsm.  For (b), we can just check another pref reftest.compareStyloToGecko (read it in InitAndStartRefTests and set a gCompareStyloToGecko variable) to determine if we're in this mode.  Then when we run reftests locally, we can do |./mach reftest --setpref=reftest.compareStyloToGecko=true|.  Does that sound reasonable?

::: layout/tools/reftest/reftest.jsm
@@ +1294,5 @@
>      gCurrentURL = gURLs[0]["url" + aState].spec;
>  
> +    var prefs = Components.classes["@mozilla.org/preferences-service;1"].
> +        getService(Components.interfaces.nsIPrefBranch);
> +    logger.info("Want to enable stylo? " + gState);

"Want to enable stylo? 2" is probably hard to understand if you read it in the logs.  How about we log "Enabling Servo-backed style system" or "Disabling Servo-backed style system" in the two if statement branches.

Also, let's do this after the RestoreChangedPreferences() call.  That will let use use pref() expressions in reftest manifests setting layout.css.servo.enabled, without it clobbering the following test run.

Then wrap this whole section in |if (gCompareStyloToGecko)|.

@@ +1296,5 @@
> +    var prefs = Components.classes["@mozilla.org/preferences-service;1"].
> +        getService(Components.interfaces.nsIPrefBranch);
> +    logger.info("Want to enable stylo? " + gState);
> +    if (gState == 2){
> +        prefs.setBoolPref('layout.stylo.enabled', false);

The pref got renamed to layout.css.servo.enabled.

@@ +1298,5 @@
> +    logger.info("Want to enable stylo? " + gState);
> +    if (gState == 2){
> +        prefs.setBoolPref('layout.stylo.enabled', false);
> +    }
> +    else {

Nit: "else {" on the previous line.
Attachment #8778763 - Flags: review?(cam) → review-
Thanks, I'll update the patch. 

An initial run shows that we have 654 passes and 1664 failure, but the total number looks weird. I probably missed some test while creating the reftest.list with shell scripts.
Posted patch patch v3 (obsolete) — Splinter Review
Attachment #8778763 - Attachment is obsolete: true
Attachment #8778811 - Flags: review?(cam)
Posted patch patch v3.1 (obsolete) — Splinter Review
Attachment #8778811 - Attachment is obsolete: true
Attachment #8778811 - Flags: review?(cam)
Attachment #8778812 - Flags: review?(cam)
Posted patch patch v3.1 (obsolete) — Splinter Review
Attachment #8778812 - Attachment is obsolete: true
Attachment #8778812 - Flags: review?(cam)
Attachment #8778813 - Flags: review?(cam)
Comment on attachment 8778813 [details] [diff] [review]
patch v3.1

Review of attachment 8778813 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with these things addressed.

::: layout/tools/reftest/reftest.jsm
@@ +54,4 @@
>  const FOCUS_FILTER_NEEDS_FOCUS_TESTS = "needs-focus";
>  const FOCUS_FILTER_NON_NEEDS_FOCUS_TESTS = "non-needs-focus";
>  var gFocusFilterMode = FOCUS_FILTER_ALL_TESTS;
> +#ifdef MOZ_STYLO

Can drop this #ifdef if you like, since it'll work fine without it.  Up to you.

@@ +1307,5 @@
> +#ifdef MOZ_STYLO
> +    if (gCompareStyloToGecko) {
> +        var prefs = Components.classes["@mozilla.org/preferences-service;1"].
> +            getService(Components.interfaces.nsIPrefBranch);
> +        if (gState == 2){

Nit: space before "{".

@@ +1308,5 @@
> +    if (gCompareStyloToGecko) {
> +        var prefs = Components.classes["@mozilla.org/preferences-service;1"].
> +            getService(Components.interfaces.nsIPrefBranch);
> +        if (gState == 2){
> +            logger.info("Disalbing Servo-backed style system");

*Disabling

@@ +1309,5 @@
> +        var prefs = Components.classes["@mozilla.org/preferences-service;1"].
> +            getService(Components.interfaces.nsIPrefBranch);
> +        if (gState == 2){
> +            logger.info("Disalbing Servo-backed style system");
> +            prefs.setBoolPref('layout.stylo.enabled', false);

Don't forget to update the pref names to layout.css.servo.enabled.
Attachment #8778813 - Flags: review?(cam) → review+
Assignee: nobody → slyu
Priority: -- → P1
Posted file reftest.log
This is the reftest result, skipping the crashes. 

A brief summary:

8944 FAIL
3785 PASS
26   CRASH
Posted patch crashes.patch (obsolete) — Splinter Review
Here are the crashes, the lines commented out are crashing reftests. Do we want to open a separate bug for this?
Flags: needinfo?(bobbyholley)
This is the custom reftest list that can compare stylo vs gecko style system. It may look ugly because it's generated from some shell scripting. 

I'm not sure how we should land this, probably only on servo/gecko-dev:stylo instead of m-c?
Attachment #8781390 - Flags: review?(cam)
Comment on attachment 8781390 [details] [diff] [review]
stylo vs gecko reftest list patch

Review of attachment 8781390 [details] [diff] [review]:
-----------------------------------------------------------------

Looking at the output, I'm guessing that your script removes any attributes like pref(), fuzzy-if(), etc. on the test lines.  These are usually required for correctness of the test, so I think we should be preserving them, unless we want to skip any tests that have these attributes.  But hopefully it is easy to preserve them.

Can you attach the script you used?  It should be easier for me to review than to go through the generated manifests in detail.

In terms of where this lands, since we're close to having no patches in servo/gecko-dev maybe it is better to get this landed in mozilla-central.  Since this will be used only in the short term, I assume we don't want people to manually update these files when the original reftest.list is updated.  Can you generate a comment at the top of each of these generated files that points out these are auto-generated, that there is no need to update them, and that they are temporary until we have CI for stylo, so that non-stylo contributors don't think they need to touch them?

Nit: Can you name the files reftest-stylo.list (it feels a little better to me to maintain the ".list" extension).

::: dom/canvas/test/reftest/filters/reftest.list.stylo-gecko
@@ +1,1 @@
> +== default-preferences default-preferences

This line doesn't look right.  Does your script handle url-prefix and default-preferences correctly?  https://dxr.mozilla.org/mozilla-central/source/layout/tools/reftest/README.txt

::: layout/reftests/font-inflation/reftest.list.stylo-gecko
@@ +25,5 @@
> +== textarea-3.html textarea-3.html
> +== css-transform-1.html css-transform-1.html
> +== css-transform-2.html css-transform-2.html
> +== container-with-clamping.html container-with-clamping.html
> +== test-pref(font.size.inflation.emPerLine,15) test-pref(font.size.inflation.emPerLine,15)

This line doesn't look right, probably because it was a "load" rather than a comparison test.
Attachment #8781390 - Flags: review?(cam) → review-
I only have part of the scripts written down, combined with a lot of manual steps. The reftest.lists are more heterogeneous then I expected, it's pretty hard to have a script that does it in one run.

What I tried to do is to extract the first (test) html and make it equal to itself. And I update all the reftest.list that contains a `inlclude subfolder/reftest.list` line to the modified path.

I'll updated the list as you suggested and manually check for bad lines.
Thinking about this a bit, I think the most ideal setup would be as follows:
* Generate a list of failing tests (which it sounds like we have already), and another list of crashing tests. Check both lists into mozilla-central.
* Modify the reftest runner to parse this list in stylo mode. In that mode, it runs the failing tests with expect-fail, and generates an UNEXPECTED-PASS if the test passes.
* We drive that list to zero as we implement features and fix bugs, ideally aided by try pushes (which it sounds like we'll have soon).
* Assuming the list of crashing tests isn't enormous, we can just have the test runner skip tests on the crash list, and we can drive that list to zero with manual testing.

The nice thing about this setup is that it converges towards a regular run as we fix more and more failures. Thoughts?
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> * Modify the reftest runner to parse this list in stylo mode. In that mode,
> it runs the failing tests with expect-fail, and generates an UNEXPECTED-PASS
> if the test passes.

Does "Stylo mode" here mean comparing Stylo-vs-Stylo ref images (like Gecko does) or comparing Stylo vs known-good ref images generated by Gecko?
(In reply to Chris Peterson [:cpeterson] from comment #17)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> > * Modify the reftest runner to parse this list in stylo mode. In that mode,
> > it runs the failing tests with expect-fail, and generates an UNEXPECTED-PASS
> > if the test passes.
> 
> Does "Stylo mode" here mean comparing Stylo-vs-Stylo ref images (like Gecko
> does) or comparing Stylo vs known-good ref images generated by Gecko?

I would prefer the latter.
In particular, assuming we continue to run Gecko-vs-Gecko reftests, we transitively get to compare Stylo-vs-Stylo by comparing Stylo-Gecko.
Posted file create_stylo_reftest.sh (obsolete) —
This is the script I use to create the reftest-stylo.lists, it also contains the list of crashing tests. Some of them seems to be intermittent, but I still disable them anyway, because I believe getting a sense of the pass/fail rate is more important then focusing on a few tricky reftests.
Attachment #8781383 - Attachment is obsolete: true
Posted patch reftest_lists.patch (obsolete) — Splinter Review
This patch is generated by the create_stylo_reftest.sh script. I tried to preserve as much of the original reftest.list information as possible. Some of the fuzzy-if markers may not make sense, but having some false positive won't hurt, because when Stylo reaches maturity we might want to switch back to the original reftest.list instead.
Attachment #8778813 - Attachment is obsolete: true
Attachment #8781390 - Attachment is obsolete: true
Attachment #8782782 - Flags: review?(cam)
Here is the latest test result:

   9282 FAIL
   3350 PASS
   49   CRASH
Attachment #8782781 - Attachment mime type: application/x-shellscript → text/plain
Per comment 16, it seems like the things that would be useful to check into the tree are:
(1) The list of failing tests
(2) A script that runs all the tests (presumably by generating a temporary unified reftest.list like you're doing), and then updates the expectation file in (1)
(3) A list of the crashing tests, which are skipped by the script in (2) (we can curate this list manually and hopefully drive it to zero soon)
Posted patch reftest_log.patch v2 (obsolete) — Splinter Review
Mark crash tests as `skip` and failing tests as `fails`.
Attachment #8782782 - Attachment is obsolete: true
Attachment #8782782 - Flags: review?(cam)
Attachment #8783834 - Flags: review?(cam)
You can find the latest script I used in https://github.com/shinglyu/gecko-dev/commits/reftest-new
Posted file mark_expected_fails.sh (obsolete) —
This script is used to mark `fails`
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> Per comment 16, it seems like the things that would be useful to check into
> the tree are:
> (1) The list of failing tests
> (2) A script that runs all the tests (presumably by generating a temporary
> unified reftest.list like you're doing), and then updates the expectation
> file in (1)
> (3) A list of the crashing tests, which are skipped by the script in (2) (we
> can curate this list manually and hopefully drive it to zero soon)

Bobby, I want to clarify what you're looking for here.  Are you intending this script to be run by developers locally when wanting to run tests under stylo, or (additionally) for it to be run on the build machines under CI?  The script probably needs a few changes to make it runnable on non-Linux platforms (the |grep -P| has to change, at least).

For (1) Shing probably has the output reftest log, which the comment 26 script reads to mark the currently failing tests as "fails".  But (2) would need a bit more work.  (3) is just for human consumption, yes?


I think there are three things we could be doing here:

A. Check in what Shing has in his patches here, which is the parallel set of reftest manifests that is the output of the script.  We run these reftests by |./mach reftest layout/reftests/reftest-stylo.list|.  When we have newly passing tests, we update these reftest-stylo.list files manually.  At some point, we will feel that the list of marked "fails" tests for stylo is small enough that we can just migrate them into actual fails-if(stylo) annotations in the real reftest.list manifests, and then we delete the reftest-stylo.list manifests.

B. What I think is your comment 23 suggestion.  We don't check in the output of Shing's script, but instead modify the script to generate the parallel reftest-stylo.list manifests (which then would probably better in the OBJDIR) and check in that script.  We also check in the list of UNEXPECTED-FAILs running these tests, which the script would use.  We add functionality to the script to update the expectation file.  When we have newly passing tests, we remove entries from this list of UNEXPECTED-FAILs, with the help of the script updating the expectations file.  Again, when we feel that we've driven down the stylo failures sufficiently, we migrate them to real fails-if(stylo) annotations in the primary reftest.list files, then get rid of the script and run reftests normally.

C. We tweak Shing's script to modify the primary reftest.list manifests directly by adding fails-if(stylo), and check in those changes.  Reftests are run normally.  When we have newly passing tests, we remove the fails-if(stylo) annotations manually.


If we don't want to spin our wheel too much making this available for stylo developers to run, I think we should do A or C.  I am leaning towards C, since we will need to support fails-if(stylo) at some point, and I'm not too bothered about the primary reftest.list being littered with fails-if(stylo) annotations.  The work needed to update the script to modify the reftest.list files themselves (which we run as a once off) should be minimal.  And as I think we mentioned in the meeting last week, it shouldn't be a big deal for developers to remove the fails-if() annotations manually.

WDYT?
Flags: needinfo?(slyu)
Flags: needinfo?(bobbyholley)
One of the difference between reftest-stylo.list and reftest.list is the former tests `== test.html test.html` instead of `== test.html ref.html`, and all the tests are expected to be `==`, no `!=` exists.  If we want to use the original reftest.list, we have to modify the test runner to read only the test.html, and force all the condition to be `==`. That is to say, a line: `fails-if(stylo) != A.html A-ref.html` will actually run as `fails-if(stylo) A.html A.html` when we turn on the stylo flag.

This has two issues: 1) It's consider a bad practice to have too many conditional branches in a test runner. You won't even be sure which test branch was exectuted, so you have to debug both the test infra and the production code at the same time. 2) The reftest.list is not explicit enough, people need to know the magic happening in the test runner to see why a `!= A.html A-ref.html` is expected to equal itself.

So in conclusion, I support A. Since we still have thousands of test failures, I think is too early to think about integrating with the original reftest.list.
Flags: needinfo?(slyu)
Yeah, IIRC in last week's meeting we settled on A. Given that we fail more tests than we pass, I think it's probably too early to pollute the master reftest lists with annotations.

> We run these reftests by |./mach reftest layout/reftests/reftest-stylo.list|

As shing mentioned, we'll still need some kind of stylo mode flag to the reftest command to indicate that we should be toggling the stylo pref between the two renderings.

I do think that it would be useful to periodically pull in newly-added tests (especially ones that _we_ might add). But that can probably be done manually, i.e. shing could re-run his script once per week and add any new tests that appeared in the tree that aren't marked as either skip or fail.

We should try to get these running on CI for our branch as soon as possible.
Flags: needinfo?(bobbyholley)
The most up to date command is:

./mach reftest --disable-e10s layout/reftests/reftest-stylo.list --setpref=reftest.compareStyloToGecko=true 

Optionally add --run-tests-in-parallel to speed it up
In https://github.com/shinglyu/gecko-dev/commit/dc1732ab1bce3007e6b4dcac59d12e598be6bafc:

+while grep -R "fails fails" --include *-stylo.list -q
+do
+  echo "Clean up multiple fails"
+  find . -name *-stylo.list | xargs sed -i 's/fails fails/fails/g'
+done

This could turn some "fails fail-if(...)" lines into "fails-if(...)", but if it failed under stylo we'd want to keep the "fails".  Maybe all these tests we were passing with stylo, so we didn't run into this problem?
FWIW, if we have lines like this in the original manifests:

  test-pref(a.b.c,10) ref-pref(d.e.f.,20) == x.html y.html

then I think we really should be generating this in the stylo manifests:

  test-pref(a.b.c,10) ref-pref(a.b.c.,10) == x.html x.html

If you can fix this easily, let's do it, but there are only about 150 reftest manifest lines that use test-pref so we can worry about it later too.
+  # Wrap inline comment to a new line for easier awk processing
+  find . -name ${1}-stylo.list | xargs sed -i "s/ # /\n# /g"

It may be confusing for people reading the generated manifests for the comments being after the line they apply to.  Can you make them appear before the line?  Or feel free to just drop the comments.
Comment on attachment 8783834 [details] [diff] [review]
reftest_log.patch v2

Quick skim through the generated files looks good.  r=me with those comments above addressed.
Attachment #8783834 - Flags: review?(cam) → review+
(In reply to Cameron McCormack (:heycam) from comment #33)
> +  # Wrap inline comment to a new line for easier awk processing
> +  find . -name ${1}-stylo.list | xargs sed -i "s/ # /\n# /g"

perl -i -pe 's/^(.*?) (#.*)/$2\n$1/'

(Not sure how to do ".*?" with sed.)
(In reply to Cameron McCormack (:heycam) from comment #31)
> In
> https://github.com/shinglyu/gecko-dev/commit/
> dc1732ab1bce3007e6b4dcac59d12e598be6bafc:
> 
> +while grep -R "fails fails" --include *-stylo.list -q
> +do
> +  echo "Clean up multiple fails"
> +  find . -name *-stylo.list | xargs sed -i 's/fails fails/fails/g'
> +done
> 
> This could turn some "fails fail-if(...)" lines into "fails-if(...)", but if
> it failed under stylo we'd want to keep the "fails".  Maybe all these tests
> we were passing with stylo, so we didn't run into this problem?

Let me change this to `/fails fails /` (with a space) so we don't match `fails-if`
(In reply to Cameron McCormack (:heycam) from comment #32)
> FWIW, if we have lines like this in the original manifests:
> 
>   test-pref(a.b.c,10) ref-pref(d.e.f.,20) == x.html y.html
> 
> then I think we really should be generating this in the stylo manifests:
> 
>   test-pref(a.b.c,10) ref-pref(a.b.c.,10) == x.html x.html
> 
> If you can fix this easily, let's do it, but there are only about 150
> reftest manifest lines that use test-pref so we can worry about it later too.

I only find 3 lines with this pattern in the reftest sanity test. I'll add a comment in the script and handle it later.
Attachment #8782781 - Attachment is obsolete: true
Attachment #8785119 - Attachment is obsolete: true
I updated the script. But the fail and crashing tests seem to change a lot. I'll rerun the test on the latest master and re-generate the list next week (PTO tomorrow).
(In reply to Shing Lyu [:shinglyu] from comment #37)
> I only find 3 lines with this pattern in the reftest sanity test. I'll add a
> comment in the script and handle it later.

I think it's also for this pattern:

  test-pref(a.b.c,10) == a.html b.html

i.e. without a ref-pref(), which is more common than with both a test-pref() and ref-pref() specified.
This is the latest reftest I got. I comment out some of the crashing tests because skip doesn't work well with other flags.

We might need to run that multiple times in TaskCluster to know which ones are intermittents.
Attachment #8783834 - Attachment is obsolete: true
Attachment #8789191 - Flags: review?(cam)
Comment on attachment 8789191 [details] [diff] [review]
reftest_list.patch v3

Great, so rs=me on that.  Let's land this in the stylo repo.
Attachment #8789191 - Flags: review?(cam) → review+
Blocks: 1301316
Attachment #8789267 - Flags: review?(manishearth)
This is the command: ./mach try -p linux64 -b o -u reftest-stylo[x64] 

Try result: https://treeherder.mozilla.org/#/jobs?repo=try&revision=26296d10f8da
Attachment #8789267 - Flags: review?(cam)
There are a bunch of TEST_UNEXPECTED_PASSes: is this supposed to happen?
Comment on attachment 8789267 [details]
Bug 1288350 - Temporary reftest list for Stylo vs Gecko test

https://reviewboard.mozilla.org/r/77566/#review76108
Attachment #8789267 - Flags: review?(manishearth) → review+
Comment on attachment 8789267 [details]
Bug 1288350 - Temporary reftest list for Stylo vs Gecko test

https://reviewboard.mozilla.org/r/77566/#review76112

rs=me to land this on mozilla-central.
Attachment #8789267 - Flags: review?(cam) → review+
Pushed by cmccormack@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/93c6fdc10d2a
Temporary reftest list for Stylo vs Gecko test r=heycam,manishearth
https://hg.mozilla.org/mozilla-central/rev/93c6fdc10d2a
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Re-run a try, addressed the previous test failures:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4136b38b6f23
Looks like the number of unexpected behavior are reduced, but still didn't reach 0. This is the patch: https://hg.mozilla.org/try/rev/cf293a04c023a6d2a5554a1e310bd443544501b9

Here is how to do it:

1. Download all the `log_info.log` or `live_backing.log` from Treeherder
2. Run `bash mark_expected_fails.sh log_info.log` on all of them. (You can find the script here: https://github.com/shinglyu/gecko-dev/blob/reftest-new/mark_expected_fails.sh)
3. Commit the changes made by the script, push to try with:

./mach try -p linux64 -b o -u reftest-stylo[x64] --rebuild 5 --tag reftest-stylo

IIUC, `--rebuild 5` make it repeat 5 times, `--tag reftest-stylo` make sure it only repeat the reftest-stylo test
Flags: needinfo?(manishearth)
Fixed all the things in bug 1302993
Flags: needinfo?(manishearth)
You need to log in before you can comment on or make changes to this bug.