1363104 - Fix perf-reftest to compare perf numbers of basic vs ref pages.

Reporter

Description

•

8 years ago

Talos has detected a Firefox performance regression from push d4de1e480c2e5a00a93f25469676bd19d9a353f2. As author of one of the patches included in that push, we need your help to address this regression. Regressions: 9% bloom_basic http: osx-10-10 opt e10s 696.43 -> 755.86 8% bloom_basic_ref http: osx-10-10 opt e10s 697.00 -> 754.84 You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=6455 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running *** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! *** Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 1

•

8 years ago

:emilio, this regression started when your change landed from bug 1351339, can you take a look at your code and come to a decision for resolving this bug as either: fixing the issue, backing out the code, or accepting the regression?

Component: Untriaged → CSS Parsing and Computation

Flags: needinfo?(emilio+bugs)

Product: Firefox → Core

Emilio Cobos Álvarez (:emilio)

Comment 2

•

8 years ago

Is this a stylo regression? Or a normal Gecko one?

Flags: needinfo?(emilio+bugs) → needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 3

•

8 years ago

osx, so gecko.

Flags: needinfo?(jmaher)

Emilio Cobos Álvarez (:emilio)

Comment 4

•

8 years ago

Then it's pretty much impossible that it's due to my changes, because they're in ServoRestyleManager (aren't executed at all). I was going through a talos push of mine not long ago and saw super noisy data on this test, are you sure it's not an instance of this?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 5

•

8 years ago

it could be noisy, the test was accidentally disabled right after this push, so there is not future data, but I saw on inbound (the same regression, so I assumed it was merged and we have a cleaner looking sustained regression). let me look at the inbound data and see what is going on there.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 6

•

8 years ago

on inbound there is a different patch responsible for the change, you can see the pattern here: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bautoland,1d83527507421790c3e598b6fde76955bce2467d,1,1%5D&series=%5Bmozilla-inbound,1d83527507421790c3e598b6fde76955bce2467d,1,1%5D&zoom=1494196579873.1257,1494265950000,665.3183612037687,800.149821877926 We had adjusted this test to only alert on 5% changes, but maybe that isn't enough. :bholley, this test is very bimodal on some platforms (windows 8 specifically) and still very noisy. this is one bug that I didn't see was an obvious invalid change. You can see other changes we mark as invalid: https://treeherder.mozilla.org/perf.html#/alerts?status=3&framework=1&filter=bloom&page=1 so far we are not getting much value from this test, instead a lot of noise- can you work on making this test more reliable or consider disabling it?

Flags: needinfo?(bobbyholley)

Bobby Holley (:bholley)

Comment 7

•

8 years ago

How often is the 'test' changing without a corresponding change in the 'ref'? The main point of these tests is to make sure that 'test' and 'ref' do not diverge significantly (since that would indicate that the specific optimization being tested is broken). There is also some secondary value in the raw performance measured by the absolute number, since it tends to track raw throughput. Sometimes changes in that throughput might be triggered by things outside of our control. In the case of this test, I think it's mainly related to data structure size and malloc patterns elsewhere in gecko, which will affect cache locality in selector matching, but there isn't much we can realistically do about it. In other situations though, it's the result of a real bug, and would be nice to catch. But I don't want to cause a bunch of extra work for the perf sheriffs. Are there some platforms where the numbers are more stable? Might it be possible to only do the test/ref comparison for windows 8, and track the absolute number on other platforms?

Flags: needinfo?(bobbyholley) → needinfo?(jmaher)

Bobby Holley (:bholley)

Comment 8

•

8 years ago

Also, this testcase gets (literally) ten times faster with stylo. So the numbers should be more stable then (we eliminate most of the cache-dependence), and it would certainly be nice to have the perfherder infrastructure note that improvement. :-)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 9

•

8 years ago

Currently the test case is not comparing bloom -> bloom-ref; it is just reporting numbers for each data series. Should we make this only report a comparison between bloom and bloom-ref?

Flags: needinfo?(jmaher) → needinfo?(bobbyholley)

Bobby Holley (:bholley)

Comment 10

•

8 years ago

(In reply to Joel Maher ( :jmaher) from comment #9) > Currently the test case is not comparing bloom -> bloom-ref; it is just > reporting numbers for each data series. Should we make this only report a > comparison between bloom and bloom-ref? That is the most important metric, so making that more visible would be useful. Cameron and I (mostly Cameron) have been developing a whole bunch more of these tests out-of-tree, so that we only have to go through the trouble of landing them into Talos once. You can see what we've got (along with a reftest-like manifest) here: https://github.com/heycam/style-perf-tests/tree/master/perf-reftest That said, as noted above, it would still be nice to track absolute values, at least for the (test, platform) tuples that don't turn out to be noisy. Hence comment 7 and comment 8.

Flags: needinfo?(bobbyholley)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 11

•

8 years ago

right now we get many alerts on osx, a few on linux and almost none on win7/win8. Win8 is very bi-modal, osx is as well, but goes up/down in chunks (which accounts for the alerts we see). here is an example of the graphs for different platforms: https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,9ea7681c1a99cdb586358585618618c4c838d410,0,1%5D&series=%5Bmozilla-inbound,0ea13b9b6cc509b2bd5372f59504f4151d147802,0,1%5D&series=%5Bmozilla-inbound,2d7b02949d1cd7587b4bc546b19f3f4af2928873,0,1%5D&series=%5Bmozilla-inbound,c6ec2be895db9ba5d9657ac8c1ea74de3837ca89,0,1%5D It sounds like the priority is comparing base to ref- we should figure out a better way to do that and ensure that is solid. Secondarily it would be nice to see absolute values- off the top of my head, I see that as loading extra pages, or alternatively displaying the raw results, not the comparison. Possibly we can figure out a solution in the manifest file to determine which two files to compare. Maybe a new feature in the pageloader addon, or some mode in talos that will compare X.html vs X-ref.html

Component: CSS Parsing and Computation → Talos

Keywords: perf, regression, talos-regression

Product: Core → Testing

Summary: 8.3 - 8.53% bloom_basic http: / bloom_basic_ref http: (osx-10-10) regression on push d4de1e480c2e5a00a93f25469676bd19d9a353f2 (Mon May 8 2017) → Fix perf-reftest to compare perf numbers of basic vs ref pages.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

8 years ago

Assignee: nobody → rwood

Bobby Holley (:bholley)

Comment 12

•

8 years ago

Note that heycam built a standalone harness and manifest format for the tests that we've been developing out-of-tree, but eventually would like to check into the tree: https://github.com/heycam/style-perf-tests It would be good to align those to the extent that is practical.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

8 years ago

Whiteboard: [PI:June]

Robert Wood [:rwood]

Assignee

Comment 13

•

8 years ago

With this patch: I've added a 'base_vs_ref' flag in test.py for the perf-reftest. The manifest has both pages listed (no special syntax). There are no changes to the pageloader add-on itself. The pageloader add-on runs both of the pages as it normally would, reporting back the results in one results object, with each page being reported as if they were talos subtests. With the 'base_vs_ref' flag set, that results object (that has results for both the 'reference' and 'base' pages) is used to create a new result object that: - has the 'bloom_basic_ref' results saved as 'reference_runs' - has the 'bloom_basic' results saved as 'base_runs' - compares both of those (takes the absolute value of reference minus base) and saves as 'runs' - talos then sees this results object like any other single test and calculates the median / reports the results based on 'runs' - the 'reference_runs' are reported in the talos output json structure as 'reference_replicates' - the 'base_runs' are reported in the talos output json structure as 'base_replicates' - the 'replicates' in the talos output json are actually the abs difference between the 'ref' and 'base' - in treeherder/perfherder the result shown will be the difference value; so that will be used to trigger alerts - the talos results json is already archived by treeherder and available as a 'perfherder-data.json' link on the treeherder 'details' tab for the job; that file will now also contain the 'reference_replicates' and 'base_replicates' along side the difference 'replicates', so if you wish to see the original replicate results you can I will attach a sample talos output json file.

Robert Wood [:rwood]

Assignee

Comment 14

•

8 years ago

Attached file local.json — Details

Example talos output json for perf-reftest with this patch.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 15

•

8 years ago

great progress here- the output.json looks good on a quick glance.

Comment hidden (mozreview-request)

Robert Wood [:rwood]

Assignee

Updated

•

8 years ago

Status: NEW → ASSIGNED

Robert Wood [:rwood]

Assignee

Comment 17

•

8 years ago

(In reply to Joel Maher ( :jmaher) from comment #15) > great progress here- the output.json looks good on a quick glance. Ok cool, thanks!

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 18

•

8 years ago

mozreview-review

Comment on attachment 8882378 [details] Bug 1363104 - Fix perf-reftest to compare perf numbers of basic vs ref pages; https://reviewboard.mozilla.org/r/153482/#review158684 a couple of nits ::: testing/talos/talos/run_tests.py:265 (Diff revision 1) > + # and store the base and reference test replicates in results.json for upload > + elif test.get('base_vs_ref', False): > + # run the test, results will be reported for each page like two tests in the suite > + base_and_reference_results = mytest.runTest(browser_config, test) > + # now compare each test, and create a new test object for the comparison > + talos_results.add(make_comparison_result(base_and_reference_results)) shouldn't this line be indented? ::: testing/talos/talos/test.py:798 (Diff revision 1) > @register_test() > class bloom_basic(PageloaderTest): > """ > - Stylo bloom_basic test > + Stylo bloom_basic: runs bloom_basic and bloom_basic_ref and reports difference > """ > + base_vs_ref = True # compare the two test pages with eachother and report comparison given this will we only report the _ref data, not the original? I think that is what we want, but want to clarify this.

Attachment #8882378 - Flags: review?(jmaher) → review+

Robert Wood [:rwood]

Assignee

Comment 19

•

8 years ago

(In reply to Joel Maher ( :jmaher) from comment #18) > Comment on attachment 8882378 [details] ... > > ::: testing/talos/talos/test.py:798 > (Diff revision 1) > > @register_test() > > class bloom_basic(PageloaderTest): > > """ > > - Stylo bloom_basic test > > + Stylo bloom_basic: runs bloom_basic and bloom_basic_ref and reports difference > > """ > > + base_vs_ref = True # compare the two test pages with eachother and report comparison > > given this will we only report the _ref data, not the original? I think > that is what we want, but want to clarify this. We will report the difference between the reference (bloom_basic_ref) and base (bloom_basic). The original data will be made available as 'ref_replicates' and 'base_replicates' inside the output json, but perfherder will only see/report the difference.

Comment hidden (mozreview-request)

Robert Wood [:rwood]

Assignee

Comment 22

•

8 years ago

(In reply to Robert Wood [:rwood] from comment #19) > (In reply to Joel Maher ( :jmaher) from comment #18) > > Comment on attachment 8882378 [details] > ... > > > > ::: testing/talos/talos/test.py:798 > > (Diff revision 1) > > > @register_test() > > > class bloom_basic(PageloaderTest): > > > """ > > > - Stylo bloom_basic test > > > + Stylo bloom_basic: runs bloom_basic and bloom_basic_ref and reports difference > > > """ > > > + base_vs_ref = True # compare the two test pages with eachother and report comparison > > > > given this will we only report the _ref data, not the original? I think > > that is what we want, but want to clarify this. > > We will report the difference between the reference (bloom_basic_ref) and > base (bloom_basic). The original data will be made available as > 'ref_replicates' and 'base_replicates' inside the output json, but > perfherder will only see/report the difference. That is what we want, correct?

Flags: needinfo?(jmaher)

Flags: needinfo?(bobbyholley)

Robert Wood [:rwood]

Assignee

Comment 23

•

8 years ago

Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fabc1d54822b4a0fff33412cb084f583e6d5ef66&selectedJob=111095805 Resulting perfherder-data.json from the link on the 'details' tab for the 'p' job: http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/9b6b51b4f12c2e1a698b60afa8e177f94740c75af34b345a275fbd328bb5c58a131ed2f1ff32c669d5e3a4bd18d0ab67303a777c096777f86c23703e37fb36ec

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 24

•

8 years ago

I think this is good- only displaying the ref, but making the other data available in the logs. Lets see what :bholley thinks.

Flags: needinfo?(jmaher)

Bobby Holley (:bholley)

Comment 25

•

8 years ago

(In reply to Joel Maher ( :jmaher) from comment #24) > I think this is good- only displaying the ref, but making the other data > available in the logs. Lets see what :bholley thinks. In general that sounds good. If possible, it would be nice to opt into absolute numbers for certain tests. In particular, I've put a lot of work into making bloom-basic extremely fast on stylo, and the numbers are much more stable (on my machine) than on Gecko (pretty reliably 40ms, whereas Gecko fluctuates between 500ms and 650ms). Given how carefully-optimized it is, I'd really like to make sure we stay fast (in an absolute sense) on bloom-basic, even though that's secondary to the overall goal of the perf-reftests. Would such an option be practical?

Flags: needinfo?(bobbyholley)

Robert Wood [:rwood]

Assignee

Comment 26

•

8 years ago

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #25) > (In reply to Joel Maher ( :jmaher) from comment #24) > > I think this is good- only displaying the ref, but making the other data > > available in the logs. Lets see what :bholley thinks. > > In general that sounds good. If possible, it would be nice to opt into > absolute numbers for certain tests. In particular, I've put a lot of work > into making bloom-basic extremely fast on stylo, and the numbers are much > more stable (on my machine) than on Gecko (pretty reliably 40ms, whereas > Gecko fluctuates between 500ms and 650ms). > > Given how carefully-optimized it is, I'd really like to make sure we stay > fast (in an absolute sense) on bloom-basic, even though that's secondary to > the overall goal of the perf-reftests. Would such an option be practical? Thanks for the feedback. IMO we can't have it both ways because of the nature of our perf regression alerting. If we have the perf-reftest suite reporting and alerting on the *dif* between bloom_basic and bloom_basic ref on everything except stylo, but on stylo have the perf-reftest reporting and alerting on the absolute value of bloom_basic, that will be too confusing in perfherder comparisons, alerting, sheriffing regressions, etc. If bloom_basic and bloom_basic ref are too noisy on non-stylo, should we just turn them off? And instead of perf-reftest, have a new test suite that only runs on stylo, and only runs, reports, and alerts on bloom_basic alone (and not bloom_basic_ref)?

Flags: needinfo?(jmaher)

Flags: needinfo?(bobbyholley)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 27

•

8 years ago

I am fine with two unique tests, and we run the bloom_basic on stylo only. Lately using the 5% threshold we have had little to no alerts, so that might be a solution to keep around- if that is the case, then we would just add bloom_basic_ref as a new test in addition to bloom_basic.

Flags: needinfo?(jmaher)

Bobby Holley (:bholley)

Comment 28

•

8 years ago

Yes, that sounds great. Let's go ahead with the delta-based perf-reftest framework as planned (so that we can eventually integrate the tests in [1]), and then set up a separate suite for singletons whose absolute value is interesting, and track those values on tests/platforms where the results are not noisy. [1] https://github.com/bholley/style-perf-tests

Flags: needinfo?(bobbyholley)

Pulsebot

Comment 29

•

8 years ago

Pushed by rwood@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a7d99ed85d5e Fix perf-reftest to compare perf numbers of basic vs ref pages; r=jmaher

Robert Wood [:rwood]

Assignee

Comment 30

•

8 years ago

Landed the delta-based perf-reftest that compares bloom_basic and bloom_basic ref. For the second bloom_basic suite do you want bloom_basic.html or bloom_basic_ref.html? Also what do you want the suite name to be called that will run that single test on stylo? This will also require new buildbot / tc configs and a new treeherder symbol name. I am going to file new bugs for this second bloom_basic suite.

Flags: needinfo?(bobbyholley)

Robert Wood [:rwood]

Assignee

Comment 31

•

8 years ago

:igoldan, FYI, this patch that just landed changes the values reported in the 'perf-reftest' talos test suite (treeherder job symbol 'p'). Instead of raw values for 'bloom_basic' and 'bloom_basic_ref', with this patch we now are reporting comparison of the two pages, so the reported values are way lower. This will cause perfherder regression alerts for perf-reftest / bloom_basic - so FYI any alerts that result of this patch you can just mark as invalid (maybe link them to this bug?). Thanks!

Flags: needinfo?(ionut.goldan)

Bobby Holley (:bholley)

Comment 32

•

8 years ago

(In reply to Robert Wood [:rwood] from comment #30) > Landed the delta-based perf-reftest that compares bloom_basic and > bloom_basic ref. For the second bloom_basic suite do you want > bloom_basic.html or bloom_basic_ref.html? bloom_basic is good. Can you use the copy from [1], which we've tweaked a bit to make more accurate? Stylo is too smart for the version that's currently in mozilla-central, and will optimize the whole thing away. :-) We'll update the version in perf-reftest when we import the rest of the style-perf-tests. > Also what do you want the suite > name to be called that will run that single test on stylo? This will also > require new buildbot / tc configs and a new treeherder symbol name. I am > going to file new bugs for this second bloom_basic suite. A few proposals for the name of the suite: * simple-benchmarks * perf-singletons * perf-reftest-singletons * web-microbench Do any of those sound ok? [1] https://github.com/heycam/style-perf-tests/tree/master/perf-reftest

Flags: needinfo?(bobbyholley)

Phil Ringnalda (:philor)

Comment 33

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/a7d99ed85d5e

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox56: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla56

Ionuț Goldan [:igoldan]

Comment 34

•

8 years ago

Thanks Robert, for the notification. Looking forward into investigating this new alert.

Flags: needinfo?(ionut.goldan)

local.json 8 years ago Robert Wood [:rwood] 2.16 KB, application/json		Details
Bug 1363104 - Fix perf-reftest to compare perf numbers of basic vs ref pages; 8 years ago Robert Wood [:rwood] 59 bytes, text/x-review-board-request	jmaher : review+	Details