Add subtests support for talos base vs ref pageloader tests

RESOLVED FIXED in Firefox 58

Status

enhancement
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: rwood, Assigned: rwood)

Tracking

Version 3
mozilla57
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox58 fixed)

Details

(Whiteboard: [PI:September])

Attachments

(3 attachments)

Currently in talos we support having base vs ref pageloader tests, for example bloom_basic. In this scenario, the pageloader test is run on the base page (i.e. bloom-basic) followed by the reference page (bloom-basic-ref), and ultimately the test result is reported as the difference between those two. This talos test type is specified in test.py with the "base_vs_ref=True" flag.

This is done by having the base url listed in the test manifest first, followed by the reference URL. This is passed into the pageloader add-on and pageloader sees it as two subtests; then back in talos we process the results and compare them,  and change the test output such that it reports a single test being run - with the results replicates actually being the difference between the two.

This works great for single tests, but doesn't support subtests. Add support to talos so that we can run base_vs_ref tests but have multiple of those in one manifest; so they are run and reported as subtests. Each subtest result will be the difference of the corresponding base vs ref for that subtest.

This requires test manifest changes to be able to list multiple base vs ref test pages, pageloader add-on changes to process and run tests accordingly, and talos changes to process and report corresponding results.
For the test manifest I would suggest something like this:

% http://localhost/tests/perf-reftest/bloom-basic.html vs http://localhost/tests/perf-reftest/bloom-basic-ref.html

And the ability to have multiple lines of those; each of those being one subtest. The pageloader addon can detect the 'vs' keyword and know how to run the tests accordingly (no need to set another tp pageloader command line arg).
I think:
& http://localhost/tests/perf-reftest/bloom-basic.html, http://localhost/tests/perf-reftest/bloom-basic-ref.html

this would assume the '&' character will compare the two files split by a ','.


as for the output we just need to have pageloader output the diff in this case and the resulting parsing will not need to be adjusted.

I can imagine the work to load two pages in pageloader, then aggregate the times will be a slight bit of work, is this something that you think would take a couple days to write code for?

Before doing this, we should maybe mock up what some results would look like
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #2)
> I think:
> & http://localhost/tests/perf-reftest/bloom-basic.html,
> http://localhost/tests/perf-reftest/bloom-basic-ref.html
> 
> this would assume the '&' character will compare the two files split by a
> ','.

Ok, as long as all the tests will always provide their own measurements i.e. current manifest entries start with '%' vs without.

> as for the output we just need to have pageloader output the diff in this
> case and the resulting parsing will not need to be adjusted.
> 
> I can imagine the work to load two pages in pageloader, then aggregate the
> times will be a slight bit of work, is this something that you think would
> take a couple days to write code for?

Yep I would think so.

> Before doing this, we should maybe mock up what some results would look like

Ok, will do.
Mock-up of what talos output would look like for perf-reftest suite if there is just one main test (perf_reftest) with all the tests as subtests; and each subtest is actually a base vs ref test. The mock-up just has 3 subtests just to make it shorter.

This is what the current output looks like for per-reftest (i.e. with 'base_replicates', 'ref_replicates', and 'replicates' being the difference) but added as subtests.
Attachment #8905640 - Flags: feedback?(jmaher)
Comment on attachment 8905640 [details]
perf-reftest-mock-local.json

this looks right, hard to tell without real data, luckily I just spent a while this morning looking at raw perfherder_data json blobs and comparing them in detail, so this was easy to 'visualize'
Attachment #8905640 - Flags: feedback?(jmaher) → feedback+
Whiteboard: [PI:September]
No longer blocks: 1374750
Blocks: 1398193
Posted file actual_local.json
local.json example as output from the attached patch
Comment on attachment 8909497 [details]
Bug 1397438 - Add subtests support for talos base vs ref pageloader tests;

https://reviewboard.mozilla.org/r/180972/#review186500

overall this is really nice to see and I think close to an r+, a list of small things I have called out.

::: testing/talos/talos/pageloader/chrome/pageloader.js:973
(Diff revision 3)
> +        // be the comparison values of those two pages; more than one line will result in base vs ref subtests
> +        if (items[0].indexOf("&") != -1) {
> +          baseVsRef = true;
> +          flags |= TEST_DOES_OWN_TIMING;
> +          var urlspecBase = items[1].slice(0, -1);
> +          var urlspecRef = items[2];

why do we slice the base and not the ref?  I assume the slide is for the , ?  If so, please document that in a brief comment.

::: testing/talos/talos/pageloader/chrome/pageloader.js:975
(Diff revision 3)
> +          baseVsRef = true;
> +          flags |= TEST_DOES_OWN_TIMING;
> +          var urlspecBase = items[1].slice(0, -1);
> +          var urlspecRef = items[2];
> +        } else {
> +          dumpLine("tp: Error - unknown manifest format!");

I would like to output the invalid line as well so the error message makes the failure more actionable.

::: testing/talos/talos/pageloader/chrome/pageloader.js:1003
(Diff revision 3)
> +
> +        url = gIOS.newURI(urlspecRef, null, manifestUri);
> +        if (pageFilterRegexp && !pageFilterRegexp.test(url.spec))
> +          continue;
> +        var pre = 'ref_page_'+ baseVsRefIndex + '_';
> +        d.push({ url, flags, pre });

we do d.push(...), but what is d and how does the 3rd value in the json get handled?

::: testing/talos/talos/run_tests.py:358
(Diff revision 3)
> +    # each set of two results is actually a base test followed by the
> +    # reference test; we want to go through each set of base vs reference
> +    for x in range(0, len(base_and_reference_results.results[0].results), 2):
> +
> +        # separate the 'base' and 'reference' result run values
> +        base_result_runs = base_and_reference_results.results[0].results[x]['runs']

this would be easier to read if we had:
results = base_and_reference_results.results[0].results

then:
base_result_runs = results[x]['runs']

::: testing/talos/talos/run_tests.py:362
(Diff revision 3)
> +        # separate the 'base' and 'reference' result run values
> +        base_result_runs = base_and_reference_results.results[0].results[x]['runs']
> +        ref_result_runs = base_and_reference_results.results[0].results[x + 1]['runs']
> +
> +        # for the subtest name, use the name of the base page
> +        sub_test_name = base_and_reference_results.results[0].results[x]['page']

could we verify that base_and_reference_results.results[0].results[x]['page'] == base_and_reference_results.results[0].results[x+1]['page']

::: testing/talos/talos/run_tests.py:373
(Diff revision 3)
> -                                                 'base_runs': base_result_runs,
> +                                                     'base_runs': base_result_runs,
> -                                                 'ref_runs': ref_result_runs})
> +                                                     'ref_runs': ref_result_runs})
>  
> -    # now step thru each result, compare 'base' vs 'ref', and store the difference in 'runs'
> +        # now step thru each result, compare 'base' vs 'ref', and store the difference in 'runs'
> -    _index = 0
> +        _index = 0
> -    for next_ref in comparison_result.results[0].results[0]['ref_runs']:
> +        for next_ref in comparison_result.results[0].results[subtest_index]['ref_runs']:

similar for this section as above:
results = comparison_result.results[0].results
Attachment #8909497 - Flags: review?(jmaher) → review-
Comment on attachment 8909497 [details]
Bug 1397438 - Add subtests support for talos base vs ref pageloader tests;

https://reviewboard.mozilla.org/r/180972/#review186500

> why do we slice the base and not the ref?  I assume the slide is for the , ?  If so, please document that in a brief comment.

yes exactly, ok done

> we do d.push(...), but what is d and how does the 3rd value in the json get handled?

'd' is a carry-over from existing pageloader code, defined at the top of function plLoadURLsFromURI; I'll change the name and update the comments more. The third value ('pre') is used in function plRecordTime; it is needed to differentiate the pagename for base vs ref type tests. When recording results / time values, we are basing it all on the pagename to be the unique key in the results. In the case of base vs ref then I add a 'pre' value so the pagename is unique, even when using the same test page as a reference page more than once in the same suite. I'll add more comments.

> could we verify that base_and_reference_results.results[0].results[x]['page'] == base_and_reference_results.results[0].results[x+1]['page']

No, because those are two different pages; the first is the base test page i.e. 'bloom-basic.html' and the second (x+1) is the reference test page i.e. 'bloom-basic-ref.html'. For the results (difference between both) subtest name, we want to use the base page name.
Comment on attachment 8909497 [details]
Bug 1397438 - Add subtests support for talos base vs ref pageloader tests;

https://reviewboard.mozilla.org/r/180972/#review186648

great stuff!
Attachment #8909497 - Flags: review?(jmaher) → review+
Thanks for the review!

Along with the base vs ref subtest support this patch includes bloom-basic in this new subtest format, and the addition of bloom-basic-2 and coalesce-1. Bug 1398193 will be used to land the rest of the tests (and ensure the tests are green, fast enough, etc. first).

Updated try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ed1fe957f94c1df6a8179bf041aaef97e6c639c

:bholley, do the base vs ref style-perf-tests require 25 iterations each, or is it ok to reduce them to 15 cycles like we did for the perf-reftest-singletons?
Flags: needinfo?(bobbyholley)
(In reply to Robert Wood [:rwood] from comment #15)
> Thanks for the review!
> 
> Along with the base vs ref subtest support this patch includes bloom-basic
> in this new subtest format, and the addition of bloom-basic-2 and
> coalesce-1. Bug 1398193 will be used to land the rest of the tests (and
> ensure the tests are green, fast enough, etc. first).
> 
> Updated try run:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=6ed1fe957f94c1df6a8179bf041aaef97e6c639c
> 
> :bholley, do the base vs ref style-perf-tests require 25 iterations each, or
> is it ok to reduce them to 15 cycles like we did for the
> perf-reftest-singletons?

That's fine. Fewer iterations may also be fine, realistically. With the base-vs-ref tests, we're looking for big differences, not small ones.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
...
> 
> That's fine. Fewer iterations may also be fine, realistically. With the
> base-vs-ref tests, we're looking for big differences, not small ones.

Ok great, I'll do some try runs at 10 iterations. The current alert threshold right now is 5% - do you want to increase that to only alert on larger regressions, or is the 5% threshold fine?
Flags: needinfo?(bobbyholley)
(In reply to Robert Wood [:rwood] from comment #17)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> ...
> > 
> > That's fine. Fewer iterations may also be fine, realistically. With the
> > base-vs-ref tests, we're looking for big differences, not small ones.
> 
> Ok great, I'll do some try runs at 10 iterations. The current alert
> threshold right now is 5% - do you want to increase that to only alert on
> larger regressions, or is the 5% threshold fine?

Is that 5% difference between the testcases, or 5% change in the difference?

In general keeping the threshold precise where possible seems like a good starting point, we can always bump it if it generates spurious alerts.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #18)
...
> 
> Is that 5% difference between the testcases, or 5% change in the difference?
> 
> In general keeping the threshold precise where possible seems like a good
> starting point, we can always bump it if it generates spurious alerts.

5% change in the difference itself (i.e. 5% change in the overall perf-reftest suite result). I'll leave it at 5%.
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fc352698b470
Add subtests support for talos base vs ref pageloader tests; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/fc352698b470
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Backed out for permafailing talos perf-reftest-stylo-disabled-e10s as shown in the logs below.
https://hg.mozilla.org/mozilla-central/rev/47f7b6c64265bc7bdd22eef7ab71abc97cf3f8bf

https://treeherder.mozilla.org/logviewer.html#?job_id=132317866&repo=mozilla-central
Status: RESOLVED → REOPENED
Flags: needinfo?(rwood)
Resolution: FIXED → ---
Target Milestone: mozilla57 → ---
Ugh, apologies, thanks for the backout
Flags: needinfo?(rwood)
Attachment #8909497 - Attachment is obsolete: true
Comment on attachment 8909497 [details]
Bug 1397438 - Add subtests support for talos base vs ref pageloader tests;

Updated patch after backout (missed an entry for the no longer existing 'bloom-basic' in talos.json for sytlo-disabled). Also rebased so there's a couple changes from grabbing the latest.
Attachment #8909497 - Flags: review+ → review?
Attachment #8909497 - Flags: review? → review?(jmaher)
Comment on attachment 8909497 [details]
Bug 1397438 - Add subtests support for talos base vs ref pageloader tests;

https://reviewboard.mozilla.org/r/180972/#review187584

still looks great
Attachment #8909497 - Flags: review?(jmaher) → review+
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/87ffa54a5436
Add subtests support for talos base vs ref pageloader tests; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/87ffa54a5436
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.