Fix perf-reftest to compare perf numbers of basic vs ref pages.

RESOLVED FIXED in Firefox 56

Status

defect
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: jmaher, Assigned: rwood)

Tracking

53 Branch
mozilla56
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox56 fixed)

Details

(Whiteboard: [PI:June])

Attachments

(2 attachments)

Talos has detected a Firefox performance regression from push d4de1e480c2e5a00a93f25469676bd19d9a353f2. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

  9%  bloom_basic http: osx-10-10 opt e10s     696.43 -> 755.86
  8%  bloom_basic_ref http: osx-10-10 opt e10s 697.00 -> 754.84


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=6455

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:emilio, this regression started when your change landed from bug 1351339, can you take a look at your code and come to a decision for resolving this bug as either: fixing the issue, backing out the code, or accepting the regression?
Component: Untriaged → CSS Parsing and Computation
Flags: needinfo?(emilio+bugs)
Product: Firefox → Core
Is this a stylo regression? Or a normal Gecko one?
Flags: needinfo?(emilio+bugs) → needinfo?(jmaher)
osx, so gecko.
Flags: needinfo?(jmaher)
Then it's pretty much impossible that it's due to my changes, because they're in ServoRestyleManager (aren't executed at all).

I was going through a talos push of mine not long ago and saw super noisy data on this test, are you sure it's not an instance of this?
it could be noisy, the test was accidentally disabled right after this push, so there is not future data, but I saw on inbound (the same regression, so I assumed it was merged and we have a cleaner looking sustained regression).  let me look at the inbound data and see what is going on there.
on inbound there is a different patch responsible for the change, you can see the pattern here:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bautoland,1d83527507421790c3e598b6fde76955bce2467d,1,1%5D&series=%5Bmozilla-inbound,1d83527507421790c3e598b6fde76955bce2467d,1,1%5D&zoom=1494196579873.1257,1494265950000,665.3183612037687,800.149821877926

We had adjusted this test to only alert on 5% changes, but maybe that isn't enough.

:bholley, this test is very bimodal on some platforms (windows 8 specifically) and still very noisy.  this is one bug that I didn't see was an obvious invalid change.  You can see other changes we mark as invalid:
https://treeherder.mozilla.org/perf.html#/alerts?status=3&framework=1&filter=bloom&page=1

so far we are not getting much value from this test, instead a lot of noise- can you work on making this test more reliable or consider disabling it?
Flags: needinfo?(bobbyholley)
How often is the 'test' changing without a corresponding change in the 'ref'? The main point of these tests is to make sure that 'test' and 'ref' do not diverge significantly (since that would indicate that the specific optimization being tested is broken).

There is also some secondary value in the raw performance measured by the absolute number, since it tends to track raw throughput. Sometimes changes in that throughput might be triggered by things outside of our control. In the case of this test, I think it's mainly related to data structure size and malloc patterns elsewhere in gecko, which will affect cache locality in selector matching, but there isn't much we can realistically do about it. In other situations though, it's the result of a real bug, and would be nice to catch. But I don't want to cause a bunch of extra work for the perf sheriffs.

Are there some platforms where the numbers are more stable? Might it be possible to only do the test/ref comparison for windows 8, and track the absolute number on other platforms?
Flags: needinfo?(bobbyholley) → needinfo?(jmaher)
Also, this testcase gets (literally) ten times faster with stylo. So the numbers should be more stable then (we eliminate most of the cache-dependence), and it would certainly be nice to have the perfherder infrastructure note that improvement. :-)
Currently the test case is not comparing bloom -> bloom-ref; it is just reporting numbers for each data series.  Should we make this only report a comparison between bloom and bloom-ref?
Flags: needinfo?(jmaher) → needinfo?(bobbyholley)
(In reply to Joel Maher ( :jmaher) from comment #9)
> Currently the test case is not comparing bloom -> bloom-ref; it is just
> reporting numbers for each data series.  Should we make this only report a
> comparison between bloom and bloom-ref?

That is the most important metric, so making that more visible would be useful. Cameron and I (mostly Cameron) have been developing a whole bunch more of these tests out-of-tree, so that we only have to go through the trouble of landing them into Talos once. You can see what we've got (along with a reftest-like manifest) here: https://github.com/heycam/style-perf-tests/tree/master/perf-reftest

That said, as noted above, it would still be nice to track absolute values, at least for the (test, platform) tuples that don't turn out to be noisy. Hence comment 7 and comment 8.
Flags: needinfo?(bobbyholley)
right now we get many alerts on osx, a few on linux and almost none on win7/win8.  Win8 is very bi-modal, osx is as well, but goes up/down in chunks (which accounts for the alerts we see).

here is an example of the graphs for different platforms:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,9ea7681c1a99cdb586358585618618c4c838d410,0,1%5D&series=%5Bmozilla-inbound,0ea13b9b6cc509b2bd5372f59504f4151d147802,0,1%5D&series=%5Bmozilla-inbound,2d7b02949d1cd7587b4bc546b19f3f4af2928873,0,1%5D&series=%5Bmozilla-inbound,c6ec2be895db9ba5d9657ac8c1ea74de3837ca89,0,1%5D

It sounds like the priority is comparing base to ref- we should figure out a better way to do that and ensure that is solid.

Secondarily it would be nice to see absolute values- off the top of my head, I see that as loading extra pages, or alternatively displaying the raw results, not the comparison.

Possibly we can figure out a solution in the manifest file to determine which two files to compare.  Maybe a new feature in the pageloader addon, or some mode in talos that will compare X.html vs X-ref.html
Component: CSS Parsing and Computation → Talos
Product: Core → Testing
Summary: 8.3 - 8.53% bloom_basic http: / bloom_basic_ref http: (osx-10-10) regression on push d4de1e480c2e5a00a93f25469676bd19d9a353f2 (Mon May 8 2017) → Fix perf-reftest to compare perf numbers of basic vs ref pages.
Assignee: nobody → rwood
Note that heycam built a standalone harness and manifest format for the tests that we've been developing out-of-tree, but eventually would like to check into the tree: https://github.com/heycam/style-perf-tests

It would be good to align those to the extent that is practical.
Whiteboard: [PI:June]
Assignee

Comment 13

2 years ago
With this patch:

I've added a 'base_vs_ref' flag in test.py for the perf-reftest. The manifest has both pages listed (no special syntax). There are no changes to the pageloader add-on itself.

The pageloader add-on runs both of the pages as it normally would, reporting back the results in one results object, with each page being reported as if they were talos subtests.

With the 'base_vs_ref' flag set, that results object (that has results for both the 'reference' and 'base' pages) is used to create a new result object that:

- has the 'bloom_basic_ref' results saved as 'reference_runs'
- has the 'bloom_basic' results saved as 'base_runs'
- compares both of those (takes the absolute value of reference minus base) and saves as 'runs'
- talos then sees this results object like any other single test and calculates the median / reports the results based on 'runs'
- the 'reference_runs' are reported in the talos output json structure as 'reference_replicates'
- the 'base_runs' are reported in the talos output json structure as 'base_replicates'
- the 'replicates' in the talos output json are actually the abs difference between the 'ref' and 'base'
- in treeherder/perfherder the result shown will be the difference value; so that will be used to trigger alerts
- the talos results json is already archived by treeherder and available as a 'perfherder-data.json' link on the treeherder 'details' tab for the job; that file will now also contain the 'reference_replicates' and 'base_replicates' along side the difference 'replicates', so if you wish to see the original replicate results you can

I will attach a sample talos output json file.
Assignee

Comment 14

2 years ago
Posted file local.json
Example talos output json for perf-reftest with this patch.
great progress here- the output.json looks good on a quick glance.
Comment hidden (mozreview-request)
Assignee

Updated

2 years ago
Status: NEW → ASSIGNED
Assignee

Comment 17

2 years ago
(In reply to Joel Maher ( :jmaher) from comment #15)
> great progress here- the output.json looks good on a quick glance.

Ok cool, thanks!
Reporter

Comment 18

2 years ago
mozreview-review
Comment on attachment 8882378 [details]
Bug 1363104 - Fix perf-reftest to compare perf numbers of basic vs ref pages;

https://reviewboard.mozilla.org/r/153482/#review158684

a couple of nits

::: testing/talos/talos/run_tests.py:265
(Diff revision 1)
> +            # and store the base and reference test replicates in results.json for upload
> +            elif test.get('base_vs_ref', False):
> +                # run the test, results will be reported for each page like two tests in the suite
> +                base_and_reference_results = mytest.runTest(browser_config, test)
> +                # now compare each test, and create a new test object for the comparison
> +                talos_results.add(make_comparison_result(base_and_reference_results))

shouldn't this line be indented?

::: testing/talos/talos/test.py:798
(Diff revision 1)
>  @register_test()
>  class bloom_basic(PageloaderTest):
>      """
> -    Stylo bloom_basic test
> +    Stylo bloom_basic: runs bloom_basic and bloom_basic_ref and reports difference
>      """
> +    base_vs_ref = True # compare the two test pages with eachother and report comparison

given this will we only report the _ref data, not the original?  I think that is what we want, but want to clarify this.
Attachment #8882378 - Flags: review?(jmaher) → review+
Assignee

Comment 19

2 years ago
(In reply to Joel Maher ( :jmaher) from comment #18)
> Comment on attachment 8882378 [details]
...
> 
> ::: testing/talos/talos/test.py:798
> (Diff revision 1)
> >  @register_test()
> >  class bloom_basic(PageloaderTest):
> >      """
> > -    Stylo bloom_basic test
> > +    Stylo bloom_basic: runs bloom_basic and bloom_basic_ref and reports difference
> >      """
> > +    base_vs_ref = True # compare the two test pages with eachother and report comparison
> 
> given this will we only report the _ref data, not the original?  I think
> that is what we want, but want to clarify this.

We will report the difference between the reference (bloom_basic_ref) and base (bloom_basic). The original data will be made available as 'ref_replicates' and 'base_replicates' inside the output json, but perfherder will only see/report the difference.
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Assignee

Comment 22

2 years ago
(In reply to Robert Wood [:rwood] from comment #19)
> (In reply to Joel Maher ( :jmaher) from comment #18)
> > Comment on attachment 8882378 [details]
> ...
> > 
> > ::: testing/talos/talos/test.py:798
> > (Diff revision 1)
> > >  @register_test()
> > >  class bloom_basic(PageloaderTest):
> > >      """
> > > -    Stylo bloom_basic test
> > > +    Stylo bloom_basic: runs bloom_basic and bloom_basic_ref and reports difference
> > >      """
> > > +    base_vs_ref = True # compare the two test pages with eachother and report comparison
> > 
> > given this will we only report the _ref data, not the original?  I think
> > that is what we want, but want to clarify this.
> 
> We will report the difference between the reference (bloom_basic_ref) and
> base (bloom_basic). The original data will be made available as
> 'ref_replicates' and 'base_replicates' inside the output json, but
> perfherder will only see/report the difference.

That is what we want, correct?
Flags: needinfo?(jmaher)
Flags: needinfo?(bobbyholley)
I think this is good- only displaying the ref, but making the other data available in the logs.  Lets see what :bholley thinks.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #24)
> I think this is good- only displaying the ref, but making the other data
> available in the logs.  Lets see what :bholley thinks.

In general that sounds good. If possible, it would be nice to opt into absolute numbers for certain tests. In particular, I've put a lot of work into making bloom-basic extremely fast on stylo, and the numbers are much more stable (on my machine) than on Gecko (pretty reliably 40ms, whereas Gecko fluctuates between 500ms and 650ms).

Given how carefully-optimized it is, I'd really like to make sure we stay fast (in an absolute sense) on bloom-basic, even though that's secondary to the overall goal of the perf-reftests. Would such an option be practical?
Flags: needinfo?(bobbyholley)
Assignee

Comment 26

2 years ago
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #25)
> (In reply to Joel Maher ( :jmaher) from comment #24)
> > I think this is good- only displaying the ref, but making the other data
> > available in the logs.  Lets see what :bholley thinks.
> 
> In general that sounds good. If possible, it would be nice to opt into
> absolute numbers for certain tests. In particular, I've put a lot of work
> into making bloom-basic extremely fast on stylo, and the numbers are much
> more stable (on my machine) than on Gecko (pretty reliably 40ms, whereas
> Gecko fluctuates between 500ms and 650ms).
> 
> Given how carefully-optimized it is, I'd really like to make sure we stay
> fast (in an absolute sense) on bloom-basic, even though that's secondary to
> the overall goal of the perf-reftests. Would such an option be practical?

Thanks for the feedback. IMO we can't have it both ways because of the nature of our perf regression alerting. If we have the perf-reftest suite reporting and alerting on the *dif* between bloom_basic and bloom_basic ref on everything except stylo, but on stylo have the perf-reftest reporting and alerting on the absolute value of bloom_basic, that will be too confusing in perfherder comparisons, alerting, sheriffing regressions, etc.

If bloom_basic and bloom_basic ref are too noisy on non-stylo, should we just turn them off? And instead of perf-reftest, have a new test suite that only runs on stylo, and only runs, reports, and alerts on bloom_basic alone (and not bloom_basic_ref)?
Flags: needinfo?(jmaher)
Flags: needinfo?(bobbyholley)
I am fine with two unique tests, and we run the bloom_basic on stylo only.  Lately using the 5% threshold we have had little to no alerts, so that might be a solution to keep around- if that is the case, then we would just add bloom_basic_ref as a new test in addition to bloom_basic.
Flags: needinfo?(jmaher)
Yes, that sounds great. Let's go ahead with the delta-based perf-reftest framework as planned (so that we can eventually integrate the tests in [1]), and then set up a separate suite for singletons whose absolute value is interesting, and track those values on tests/platforms where the results are not noisy.

[1] https://github.com/bholley/style-perf-tests
Flags: needinfo?(bobbyholley)

Comment 29

2 years ago
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a7d99ed85d5e
Fix perf-reftest to compare perf numbers of basic vs ref pages; r=jmaher
Assignee

Comment 30

2 years ago
Landed the delta-based perf-reftest that compares bloom_basic and bloom_basic ref. For the second bloom_basic suite do you want bloom_basic.html or bloom_basic_ref.html? Also what do you want the suite name to be called that will run that single test on stylo? This will also require new buildbot / tc configs and a new treeherder symbol name. I am going to file new bugs for this second bloom_basic suite.
Flags: needinfo?(bobbyholley)
Assignee

Comment 31

2 years ago
:igoldan, FYI, this patch that just landed changes the values reported in the 'perf-reftest' talos test suite (treeherder job symbol 'p'). Instead of raw values for 'bloom_basic' and 'bloom_basic_ref', with this patch we now are reporting comparison of the two pages, so the reported values are way lower. This will cause perfherder regression alerts for perf-reftest / bloom_basic - so FYI any alerts that result of this patch you can just mark as invalid (maybe link them to this bug?). Thanks!
Flags: needinfo?(ionut.goldan)
(In reply to Robert Wood [:rwood] from comment #30)
> Landed the delta-based perf-reftest that compares bloom_basic and
> bloom_basic ref. For the second bloom_basic suite do you want
> bloom_basic.html or bloom_basic_ref.html?

bloom_basic is good. Can you use the copy from [1], which we've tweaked a bit to make more accurate? Stylo is too smart for the version that's currently in mozilla-central, and will optimize the whole thing away. :-)

We'll update the version in perf-reftest when we import the rest of the style-perf-tests.

> Also what do you want the suite
> name to be called that will run that single test on stylo? This will also
> require new buildbot / tc configs and a new treeherder symbol name. I am
> going to file new bugs for this second bloom_basic suite.

A few proposals for the name of the suite:
* simple-benchmarks
* perf-singletons
* perf-reftest-singletons
* web-microbench

Do any of those sound ok?

[1] https://github.com/heycam/style-perf-tests/tree/master/perf-reftest
Flags: needinfo?(bobbyholley)

Comment 33

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/a7d99ed85d5e
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Thanks Robert, for the notification. Looking forward into investigating this new alert.
Flags: needinfo?(ionut.goldan)
You need to log in before you can comment on or make changes to this bug.