Closed Bug 1374750 Opened 3 years ago Closed 2 years ago

Port stand-alone style-perf-tests into talos as singleton tests

Categories

(Testing :: Talos, enhancement, P4)

enhancement

Tracking

(firefox57 fixed)

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: bholley, Assigned: rwood)

References

Details

(Whiteboard: [PI:September])

Attachments

(3 files)

Out of tree, we've developed an extensive collection of perf-reftests [1] to verify the existence of various important incremental restyle optimizations.

At some point, we plan to run these in Talos, using the machinery that rwood built in bug 1353900. There's no need to do it immediately though, and probably best to wait, since it's easier to tweak the tests and add new ones when they're not yet running in automation.

[1] https://github.com/heycam/style-perf-tests
Whiteboard: [PI:July]
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Component: CSS Parsing and Computation → Talos
Product: Core → Testing
Just FYI, neither Gecko nor Stylo currently pass all the tests in the repository, and some of the results might be noisy as a result. I'd been sort of assuming that we'd wait until stylo was prefed on by default to enable them on Talos, since otherwise we may have a bunch of spurious alerts to deal with.

Turning them on now certainly isn't a problem if we deal with the above. But it might be a better use of time to focus on making sure that the two harness setups (the in-tree Talos one and the out-of-tree style-perf-tests one) are roughly compatible so that we can just drop the entire set of tests in at some point in August without much hassle, and drop in new tests whenever we write them.
(And if the process is unavoidably manual, the waiting until mid-August is probably better, so that we can be sure the suite is finalized before dropping it in).
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #1)

> Just FYI, neither Gecko nor Stylo currently pass all the tests in the
> repository, and some of the results might be noisy as a result. I'd been
> sort of assuming that we'd wait until stylo was prefed on by default to
> enable them on Talos, since otherwise we may have a bunch of spurious alerts
> to deal with.

Thanks for the update. Yes let's wait until the tests are stable before integrating them into talos.

> Turning them on now certainly isn't a problem if we deal with the above. But
> it might be a better use of time to focus on making sure that the two
> harness setups (the in-tree Talos one and the out-of-tree style-perf-tests
> one) are roughly compatible so that we can just drop the entire set of tests
> in at some point in August without much hassle, and drop in new tests
> whenever we write them.

I'm going to investigate this in Bug 1354291.
Whiteboard: [PI:July] → [PI:August]
Priority: P3 → P4
are there further steps we could take here to get these tests running?
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #1)
> Just FYI, neither Gecko nor Stylo currently pass all the tests in the
> repository, and some of the results might be noisy as a result. I'd been
> sort of assuming that we'd wait until stylo was prefed on by default to
> enable them on Talos, since otherwise we may have a bunch of spurious alerts
> to deal with.

Hi Bobby, it's my understanding that stylo is going to be on by default very soon. What is the status of the remaining stylo-perf-tests, do you feel they are stable enough now? If so, should we start and work on porting them/turning them on in talos? Or do you need more time once stylo is on by default to get more data before they're ported over? Thanks!
Flags: needinfo?(bobbyholley)
(In reply to Robert Wood [:rwood] from comment #5)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #1)
> > Just FYI, neither Gecko nor Stylo currently pass all the tests in the
> > repository, and some of the results might be noisy as a result. I'd been
> > sort of assuming that we'd wait until stylo was prefed on by default to
> > enable them on Talos, since otherwise we may have a bunch of spurious alerts
> > to deal with.
> 
> Hi Bobby, it's my understanding that stylo is going to be on by default very
> soon. What is the status of the remaining stylo-perf-tests, do you feel they
> are stable enough now? If so, should we start and work on porting
> them/turning them on in talos? Or do you need more time once stylo is on by
> default to get more data before they're ported over? Thanks!

Sorry for the delay on my end, just really swamped with email. I think it's the right time to put them into the tree. There will be some failures on Gecko, and a smaller number of failures still on stylo (I think the nth-index cache tests are the only ones that will fail on stylo at this point). But I think we can just expect that.

If we can minimize the amount of Talos-specific stuff that needs to go into the testcases and keep them roughly in the same form as in the github repo, that would be nice. At the very least we should keep them easy to run standalone with alert() output.

Thanks for doing this!
Flags: needinfo?(bobbyholley)
Ok, thanks Bobby
Whiteboard: [PI:August] → [PI:September]
"perf" key word?
Hi Bobby, just to clarify, is this list [1] correct/up to date? Do you want all of them in that list added? I'm assuming we want to have each test listed on the first column compared with the test on the second as we did for bloom-basic vs bloom-basic-ref.

Also, do you want to have all of the tests on the left side (i.e. NON-ref tests) also added to the perf-reftest-singletons suite so they are run individually? I'm assuming so. Thanks.

[1] https://github.com/heycam/style-perf-tests/blob/master/perf-reftest/perf-reftest.list
Flags: needinfo?(bobbyholley)
(In reply to Robert Wood [:rwood] from comment #9)
> Hi Bobby, just to clarify, is this list [1] correct/up to date? Do you want
> all of them in that list added? I'm assuming we want to have each test
> listed on the first column compared with the test on the second as we did
> for bloom-basic vs bloom-basic-ref.

Yes.

> 
> Also, do you want to have all of the tests on the left side (i.e. NON-ref
> tests) also added to the perf-reftest-singletons suite so they are run
> individually? I'm assuming so. Thanks.

Ideally yes. If any of them turn out to be noisy we can re-evaluate.

Also, we should add the -singleton tests that are in the repo (they aren't listed in the manifest).

Thanks for doing this!

> 
> [1]
> https://github.com/heycam/style-perf-tests/blob/master/perf-reftest/perf-
> reftest.list
Flags: needinfo?(bobbyholley)
De-linting the tests in the style-perf-tests git repo, so they can be ported into talos in-tree and have zero lint errors when they land.
Attachment #8903757 - Flags: review?(bobbyholley)
Comment on attachment 8903757 [details] [diff] [review]
https://github.com/heycam/style-perf-tests/pull/7

Review of attachment 8903757 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #8903757 - Attachment is patch: true
Attachment #8903757 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8903757 - Flags: review?(bobbyholley) → review+
Comment on attachment 8904724 [details]
Bug 1374750 - Port stand-alone style-perf-tests into talos as singleton tests,

https://reviewboard.mozilla.org/r/176520/#review181514

unless there is a very compelling reason to keep these as individual tests, I would like to make them bubble up into a few groups with subtests (or even 1 group).  The biggest driver for this is runtime (fresh profile, browser, etc.) and sheriff time.  the biggest loss here is that we could miss out on detecting specific test regressions if they are small, although we could catch most of them if they are larger or the regression is spread across multiple subtests.

Is this fairly easy for developers to add/edit/remove tests?  I want to make sure we consider that in what we end up checking in.

::: testing/talos/talos.json:155
(Diff revision 1)
> +                      "stop_cascade_2_singleton",
> +                      "stop_cascade_3_singleton",
> +                      "style_attr_1_singleton",
> +                      "style_sharing_style_attr_singleton",
> +                      "style_sharing_singleton",
> +                      "tiny_traversal_singleton"],

this is effectively doubling our sheriffing overhead, is there some way to summarize these subtests into a single number (or group them into a few numbers)?

::: testing/talos/talos/test.py:266
(Diff revision 1)
> +class StylePerfTest(PageloaderTest):
> +    """
> +    Base class for the style-perf tests
> +    """
> +    tpcycles = 1
> +    tppagecycles = 25

we should determine if we need 25 cycles or if we can do fewer.
Attachment #8904724 - Flags: review?(jmaher) → review-
Comment on attachment 8904724 [details]
Bug 1374750 - Port stand-alone style-perf-tests into talos as singleton tests,

https://reviewboard.mozilla.org/r/176520/#review181514

Yes there are alot of new tests here. So, for typical talos tests it would be easy to add these as subtests and have one result - i.e. add each URL/test html into one existing manifest.  However, since the perf-reftest suite isn't tpyical - it requires two entries in the manifest, each test html is run separately and then results compared against eachother, and the difference becomes the replicates and ultimate results. If we want to add subtest support to that, we can do it of course but it will be a bigger job - changing the manifest to support multiple sets of two (base vs ref) and talos support for that, etc. Can be done of course but won't be quick.

In the current setup it's not hard to add/edit/remove tests, basically comes down to just adding the .html(s) in the perf-reftest or perf-reftest-singetons folder, and creating the manifest, and adding to talos.json and talos test.py.

> this is effectively doubling our sheriffing overhead, is there some way to summarize these subtests into a single number (or group them into a few numbers)?

Yeah so with the perf-reftest-singletons they are typical talos tests with one test html creating the results, so yes for the singletons we could have one perf-reftest-singletons test with all the singletons as subtests if you like (or a couple groups). I guess really it's up to :bholley how he would want these groups to bring the best value.

> we should determine if we need 25 cycles or if we can do fewer.

Good point
Depends on: 1397438
Should we land the perf-reftest-singletons first anyway? I can put them all in one test, and each singleton can be a subtest. That may be a quick way to get some more of these tests in production.
Flags: needinfo?(jmaher)
Attachment #8904724 - Attachment is obsolete: true
yes, lets move forward with the non ref version now
Flags: needinfo?(jmaher)
No longer depends on: 1397438
Summary: stylo: Run full suite of style-perf-tests in Talos → Port stand-alone style-perf-tests into talos as singleton tests
Note: Up for review, but still need to determine if we can reduce tppagecycles from the default 25 to perhaps 10, using this try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=910a531495792ec9cd53339335e43087a710be3b
Comment on attachment 8904724 [details]
Bug 1374750 - Port stand-alone style-perf-tests into talos as singleton tests,

https://reviewboard.mozilla.org/r/176520/#review182882

overall this looks good- a lot of changes and it is hard to filter out rebasing/updating vs real changes- lets get started with this as I made two passes and nothing crazy stood out.
Attachment #8904724 - Flags: review?(jmaher) → review+
It looks like style-perf-test 'stop-cascade-1.html' is causing a crash on (at least) win7:

https://treeherder.mozilla.org/logviewer.html#?job_id=129608527&repo=try&lineNumber=2014

:bholley, can you please have a look? Have you seen this before? I can just pull it out of the manifest if need be, so we can land the others.
Flags: needinfo?(bobbyholley)
Looks like the testcase is exhausting available memory. Can we skip it on 32-bit?
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #27)
> Looks like the testcase is exhausting available memory. Can we skip it on
> 32-bit?

Thanks, no that would require taskcluster config changes, I prefer to pull that particular test for now.

There's a few other issues also showing up on the preliminary runs. I prefer not to land any of the tests that are failing or causing intermittents. I will try increasing the talos timeout for these tests to see if that makes any difference. Summary:


Timeout with slow-selector-2 on OSX [1]:

16:03:06     INFO -  PID 1615 | Cycle 1(14): loaded http://localhost:49217/tests/perf-reftest-singletons/slow-selector-2.html (next: http://localhost:49217/tests/perf-reftest-singletons/style-attr-1.html)
[taskcluster 2017-09-08T23:03:07.683Z] Aborting task - max run time exceeded!


Timeout (intermittent) with tiny-traversal-singleton on Linux x64 [2]:

[2] 5:13:29     INFO -  PID 24879 | Cycle 1(24): loaded http://localhost:47371/tests/perf-reftest-singletons/tiny-traversal-singleton.html (next: http://localhost:47371/tests/perf-reftest-singletons/bloom-basic.html)
15:13:30     INFO -  PID 24879 | RSS: Main: 113823744
15:13:30     INFO -  PID 24879 |
15:13:31     INFO -  Timeout waiting for test completion; killing browser...


Timeout in same-rule-node-1 on Win 10 [3]:

14:20:54     INFO -  PID 10172 | Cycle 1(21): loaded http://localhost:49777/tests/perf-reftest-singletons/same-rule-node-1.html (next: http://localhost:49777/tests/perf-reftest-singletons/slow-selector-1.html)
14:21:01     INFO -  Timeout waiting for test completion; killing browser...


[1] https://public-artifacts.taskcluster.net/TnV14deyQ0OvWVFbdNMDuQ/0/public/logs/live_backing.log

[2] https://treeherder.mozilla.org/logviewer.html#?job_id=129680912&repo=try&lineNumber=2772

[3] https://treeherder.mozilla.org/logviewer.html#?job_id=129684361&repo=try&lineNumber=2755
Flags: needinfo?(bobbyholley)
Try run with tptimeout of 60 seconds, this should be better:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a3753760e6b0b53a8ce5d23f4600cc8604b20ae
Ah, now it's the overall talos test timeout of 1800 seconds that's killing it, try run doubling that to 60 min:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eab8f1d36e7ded0f6d7152d8c31690b8168b9920
From comment 34 it sounds like we understand the timeouts. Let me know if you still need info.
Flags: needinfo?(bobbyholley)
Alright, so even a talos 60 minute timeout is not long enough for the entire perf-singletons test suite on Win 10. On linux it runs at 60 min even with the entire taskcluster job 62 minutes.

Hate to say it but this is too long... so I'm going to have to split up the perf-reftest-singletons into multiple tests/jobs. We may be able to get the number of cycles down from the default of 25 but I haven't been able to try that out yet as too many timeouts. I think that will just be a bonus if we can reduce the cycles after.

That means taskcluster configs and treeherder symbols etc. I'm thinking symbols like 'ps1', 'ps2', and 'ps3'. What do you guys think? Any objections?
Flags: needinfo?(jmaher)
Flags: needinfo?(bobbyholley)
please try to keep this to a 15 minute window if possible, ideally 10 minutes.
Flags: needinfo?(jmaher)
Which tests are taking a long time? Most of these should complete in under a second, at least for stylo. I could run the whole suite locally in under a minute last I checked.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #39)
> Which tests are taking a long time? Most of these should complete in under a
> second, at least for stylo. I could run the whole suite locally in under a
> minute last I checked.

How many iterations are you doing?
(In reply to Robert Wood [:rwood] from comment #40)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #39)
> > Which tests are taking a long time? Most of these should complete in under a
> > second, at least for stylo. I could run the whole suite locally in under a
> > minute last I checked.
> 
> How many iterations are you doing?

I was just doing one, but again this was a while ago.

Anyway, the point is that analyzing the breakdown of which tests take how much time would be helpful. Is this just one or two tests taking 90% of the time, or is the overhead pretty even?
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #41)
...
> 
> I was just doing one, but again this was a while ago.
> 
> Anyway, the point is that analyzing the breakdown of which tests take how
> much time would be helpful. Is this just one or two tests taking 90% of the
> time, or is the overhead pretty even?

From the last try run in comment 36, linux x64, default 25 tppagecycles:

bloom-basic.html => 05:33:46, 05:34:07 => 0:21
bloom-basic-2.html => 05:34:07, 05:34:28 => 0:21
style-sharing.html => 05:34:28, 05:34:45 => 0:17
style-sharing-style-attr.html => 05:34:46, 05:35:03 => 0:17
stop-cascade-1.html => 05:35:12, 05:43:23 => 8:11
stop-cascade-2.html => 05:43:33, 05:52:04 => 8:31
stop-cascade-3.html => 05:52:14, 06:01:18 => 9:04
display-none-1.html => 06:01:19, 06:01:48 => 0:29
nth-index-1.html => 06:01:52, 06:04:30 => 2:38
nth-index-2.html => 06:04:35, 06:06:55 => 2:20
some-descendants-1.html => 06:07:01, 06:11:57 => 4:56
only-children-1.html => 06:11:58, 06:13:02 => 1:04
dep-check-1.html => 06:13:13, 06:17:53 => 4:40
same-rule-node-1.html => 06:18:03, 06:26:31 => 8:28
slow-selector-1.html => 06:26:32, 06:27:15 => 0:43
slow-selector-2.html => 06:27:16, 06:27:56 => 0:40
style-attr-1.html => 06:27:59, 06:29:16 => 1:17
coalesce-1.html => 06:29:17, 06:30:08 => 0:51
coalesce-2.html => 06:30:09, 06:30:55 => 0:46
parent-basic-singleton.html => 06:30:56, 06:31:19 => 0:23
tiny-traversal-singleton.html => 06:31:24, 06:33:22 => 1:58
This set alone would take approx. 10 min (at least on linux x64 opt anyway):

bloom-basic.html
bloom-basic-2.html
style-sharing.html
style-sharing-style-attr.html
display-none-1.html
only-children-1.html
slow-selector-1.html
slow-selector-2.html
style-attr-1.html
coalesce-1.html
coalesce-2.html
parent-basic-singleton.html
tiny-traversal-singleton.html
(In reply to Robert Wood [:rwood] from comment #43)
> This set alone would take approx. 10 min (at least on linux x64 opt anyway):
> 
> bloom-basic.html
> bloom-basic-2.html
> style-sharing.html
> style-sharing-style-attr.html
> display-none-1.html
> only-children-1.html
> slow-selector-1.html
> slow-selector-2.html
> style-attr-1.html
> coalesce-1.html
> coalesce-2.html
> parent-basic-singleton.html
> tiny-traversal-singleton.html

Yes, this is exactly what I was going to suggest. The other tests are slow because they're not passing in the 'ref' configuration, which is why I filed bug 1399583.

So let's get this in for now, and then file a bug dependent on bug 1399583 to turn on any others that start running faster. We should probably run all the tests in the ref configuration, but that configuration doesn't need 25 iterations, since they're designed to fail by a significant margin if the associated optimization is missing.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #44)
...
Ok, sounds like a good plan, thanks!
I filed Bug 1399600 for porting over any remaining style-perf-tests that are fixed/optimized in the future.
This set (from comment 43) is green, awesome. It still takes up to 20 minutes on Win 7, so I did a try run with tppagecylces set to 15 [1] instead of 25 [2]. Here's the comparison [3], it looks like it should be safe to drop this down to 15 iterations. That reduces the job time on Win 7 to about 13 minutes.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=295bf320f722a49c9f1112ad57652063fed44347
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=675e9a1a4fc1e1f6773ea33ea6cd0bd990f00def
[3] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=675e9a1a4fc1e1f6773ea33ea6cd0bd990f00def&newProject=try&newRevision=295bf320f722a49c9f1112ad57652063fed44347&framework=1&showOnlyImportant=0
Updated style-perf-tests PR for delinting (rebase and added eslint config)
Attachment #8908146 - Flags: review?(cam)
The style-perf-tests have been updated (included fixes/optimizations for the remaining tests) in Bug 1399583, so I have pulled in the latest tests, and added the remaining ones since they've been fixed. Pushed to try to see how long the perf-reftest-singletons job takes with the entire suite (15 iterations):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a8dc5c53712322412e40b2e2c2d03d9592543b
(In reply to Robert Wood [:rwood] from comment #54)

> https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a8dc5c53712322412e40b2e2c2d03d9592543b

Entire suite with updated tests (15 tppagecyles) takes too long (around 30 min).

:bholley, :heycam, FYI - here's the individual test duration on Win 10 for the added tests. Note that talos doesn't use the runner in perf-style-reftest (it only uses utils.js) so maybe that's also why it's a bit slower (?)

dep-check-1.html => 09:20:33, 09:21:15 => 0:42
nth-index-1.html => 09:21:20, 09:22:45 => 1:25
nth-index-2.html => 09:22:51, 09:24:14 => 1:23
some-descendants-1.html => 09:24:15, 09:24:42 => 0:27
stop-cascade-1.html => 09:24:52, 09:30:38 => 5:46
stop-cascade-2.html => 09:30:50, 09:36:50 => 6:00
stop-cascade-3.html => 09:37:02, 09:43:00 => 5:58

Regardless I'm going back to the original plan: land the test set in comment 43, and look into adding the rest of the tests later in Bug 1399600 (in a second perf-reftest-singletons job).
Flags: needinfo?(cam)
Flags: needinfo?(bobbyholley)
Last try run, back to the original set of tests, when this is green I'm landing this puppy!

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b88a6ae1ce5e642483695c3d3c38ff45e381ff0c
(In reply to Robert Wood [:rwood] from comment #55)
> (In reply to Robert Wood [:rwood] from comment #54)
> 
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a8dc5c53712322412e40b2e2c2d03d9592543b
> 
> Entire suite with updated tests (15 tppagecyles) takes too long (around 30
> min).
> 
> :bholley, :heycam, FYI - here's the individual test duration on Win 10 for
> the added tests. Note that talos doesn't use the runner in
> perf-style-reftest (it only uses utils.js) so maybe that's also why it's a
> bit slower (?)
> 
> dep-check-1.html => 09:20:33, 09:21:15 => 0:42
> nth-index-1.html => 09:21:20, 09:22:45 => 1:25
> nth-index-2.html => 09:22:51, 09:24:14 => 1:23
> some-descendants-1.html => 09:24:15, 09:24:42 => 0:27
> stop-cascade-1.html => 09:24:52, 09:30:38 => 5:46
> stop-cascade-2.html => 09:30:50, 09:36:50 => 6:00
> stop-cascade-3.html => 09:37:02, 09:43:00 => 5:58
> 
> Regardless I'm going back to the original plan: land the test set in comment
> 43, and look into adding the rest of the tests later in Bug 1399600 (in a
> second perf-reftest-singletons job).

Yeah I think that's ok. The absolute number in those tests is not so important.
Flags: needinfo?(bobbyholley)
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/43bf2195c590
Port stand-alone style-perf-tests into talos as singleton tests, r=jmaher
https://hg.mozilla.org/mozilla-central/rev/43bf2195c590
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Thanks Robert!
Flags: needinfo?(cam)
You need to log in before you can comment on or make changes to this bug.