Closed Bug 1591466 Opened 2 years ago Closed 2 years ago

Use a reduced optimal TP6 pageset to reduce testing load

Categories

(Testing :: Raptor, task, P1)

Version 3
task

Tracking

(firefox79 fixed)

RESOLVED FIXED
mozilla79
Tracking Status
firefox79 --- fixed

People

(Reporter: sparky, Assigned: Bebe)

References

(Blocks 1 open bug)

Details

(Whiteboard: [ci-costs-2020:done])

Attachments

(2 files)

This bug is for adding an optimal, reduced TP6 pageset for desktop and mobile testing in an effort to reduce the test load.

See this document for background information: https://docs.google.com/document/d/1pMn77DzYIRQ8dB1hOjp0YDyNFkZtD51gM81uD9S8DF0/edit

The following graphs show the results of an analysis to find a reduced subset for desktop and mobile (tp6, and tp6m):

TP6: https://mozilla.slack.com/files/U9KF08E14/FPSUG1SH2/tp6_hist_with_uniques_dupes_removed.png

TP6M: https://mozilla.slack.com/files/U9KF08E14/FPTCZVCCV/tp6m_hist_with_uniques_dupes_removed.png

Using only the tests which uniquely caught a regression (in red), for tp6m, we find that we can catch 13/16 regressions/improvements ~= 81%. If we include those which caught improvements, then we could catch 15/16 regressions ~= 94%. Using warm and cold variations of those tests would allow us to catch 16/16 regressions = 100%.

For desktop, using the same method (only picking tests with red bars), we can catch ~85% of regressions. Including the ones which caught improvements, and using both warm and cold varieties of all of these tests, we can catch 100% of regressions.
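
The coverage figures above can be reproduced with a small calculation. The sketch below is illustrative only: the alert IDs and the mapping from alerts to tests are made-up placeholders, not data from the analysis, and `unique_catchers` and `coverage` are hypothetical helper names.

```python
from typing import Dict, Set


def unique_catchers(caught_by: Dict[str, Set[str]]) -> Set[str]:
    """Tests that were the only test to catch at least one alert."""
    return {next(iter(tests)) for tests in caught_by.values() if len(tests) == 1}


def coverage(selected: Set[str], caught_by: Dict[str, Set[str]]) -> float:
    """Fraction of alerts caught by at least one selected test."""
    if not caught_by:
        return 0.0
    hits = sum(1 for tests in caught_by.values() if tests & selected)
    return hits / len(caught_by)


# Hypothetical data: alert ID -> set of tests that detected it.
caught_by = {
    "alert-1": {"page-a"},
    "alert-2": {"page-a", "page-b"},
    "alert-3": {"page-c"},
    "alert-4": {"page-b", "page-d"},
}

reduced = unique_catchers(caught_by)
print(sorted(reduced))               # ['page-a', 'page-c']
print(coverage(reduced, caught_by))  # 0.75
```

With the numbers reported above, the same ratio gives 13/16 ≈ 81% and 15/16 ≈ 94%.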

Priority: -- → P2

using :sparky's tool chain:
https://github.com/gmierz/moz-current-tests/tree/master/high-value-tests

I found that, looking only at the bugs (53 out of 61) that were determined not to be test-only fixes or infra fixes, we have 19 tests that we consider high value:
['raptor-tp6m-espn-geckoview', 'raptor-motionmark-htmlsuite-firefox', 'raptor-tp6m-amazon-search-geckoview-cold', 'raptor-stylebench-firefox', 'raptor-motionmark-animometer-firefox', 'raptor-tp6-slides-firefox-cold', 'raptor-speedometer-firefox', 'raptor-tp6-twitch-firefox-cold', 'raptor-tp6-fandom-firefox', 'raptor-tp6-twitter-firefox', 'raptor-tp6-facebook-firefox-cold', 'raptor-wasm-misc-baseline-firefox', 'raptor-tp6-tumblr-firefox', 'raptor-tp6-yandex-firefox-cold', 'raptor-tp6-wikia-firefox', 'raptor-assorted-dom-firefox', 'raptor-tp6-wikipedia-firefox-cold', 'raptor-tp6-twitch-firefox', 'raptor-tp6-bing-firefox']

After we validate this and ensure we can update this data more easily, it will be realistic to adjust tier status in the taskcluster .yml files. We can also apply this to talos tests.

doing the same analysis on talos, here are the tests that are high value for Talos:
['tabswitch', 'tsvgx', 'displaylist_mutate', 'tscrollx', 'sessionrestore', 'tp5n', 'tart', 'perf_reftest_singletons', 'startup_about_home_paint_realworld_webextensions', 'tp5o', 'kraken', 'ts_paint_webext', 'tsvgr_opacity', 'startup_about_home_paint', 'tp5o_scroll']

:davehunt, is it ok to move forward with marking tests as tier1/2 as outlined above?

Flags: needinfo?(dave.hunt)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #3)

> :davehunt, is it ok to move forward with marking tests as tier1/2 as outlined above?

Yes, but can we limit this to Raptor for now? I would suggest filing a separate bug for Talos.

Flags: needinfo?(dave.hunt)

yeah, talos should be considered separate (I filed bug 1626045)

:bc, can you pick this up in the next week or two?

Flags: needinfo?(bob)

sure

Assignee: nobody → bob
Status: NEW → ASSIGNED
Flags: needinfo?(bob)

I spent some time sanitizing data and cross referencing it in detail. There are 24 tests (5 on android) to consider:
['raptor-tp6m-espn-geckoview-cold', 'raptor-speedometer-firefox', 'raptor-motionmark-animometer-firefox', 'raptor-tp6-slides-firefox-cold', 'raptor-tp6-slides-firefox', 'raptor-tp6-google-mail-firefox-cold', 'raptor-stylebench-firefox', 'raptor-tp6m-amazon-search-geckoview-cold', 'raptor-tp6-twitch-firefox-cold', 'raptor-tp6-google-firefox-cold', 'raptor-tp6-twitch-firefox', 'raptor-tp6-tumblr-firefox', 'raptor-tp6m-espn-geckoview', 'raptor-tp6-fandom-firefox', 'raptor-tp6-bing-firefox', 'raptor-tp6-tumblr-firefox-cold', 'raptor-tp6-wikipedia-firefox-cold', 'raptor-wasm-misc-firefox', 'raptor-tp6m-google-maps-geckoview-cold', 'raptor-tp6m-bing-geckoview', 'raptor-tp6-reddit-firefox-cold', 'raptor-tp6m-ebay-kleinanzeigen-search-geckoview', 'raptor-tp6-instagram-firefox', 'raptor-assorted-dom-firefox']

This is using a full year of data. Limiting this to 6 months of data, we have 9 tests (1 on android) to consider:
['raptor-tp6m-google-maps-geckoview-cold', 'raptor-tp6-tumblr-firefox-cold', 'raptor-tp6-slides-firefox', 'raptor-tp6-yandex-firefox-cold', 'raptor-tp6-wikipedia-firefox-cold', 'raptor-tp6-twitch-firefox-cold', 'raptor-motionmark-animometer-firefox', 'raptor-tp6-slides-firefox-cold', 'raptor-tp6-google-mail-firefox-cold']
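
A quick way to double-check the counts for a list like the one above is to classify each name by app. The sketch below assumes that names containing `geckoview` are the Android tests; `count_android` is a hypothetical helper, and the list is the 6-month list copied from this comment.

```python
from typing import List, Tuple

SIX_MONTH_TESTS: List[str] = [
    "raptor-tp6m-google-maps-geckoview-cold",
    "raptor-tp6-tumblr-firefox-cold",
    "raptor-tp6-slides-firefox",
    "raptor-tp6-yandex-firefox-cold",
    "raptor-tp6-wikipedia-firefox-cold",
    "raptor-tp6-twitch-firefox-cold",
    "raptor-motionmark-animometer-firefox",
    "raptor-tp6-slides-firefox-cold",
    "raptor-tp6-google-mail-firefox-cold",
]


def count_android(tests: List[str]) -> Tuple[int, int]:
    """Return (total tests, tests assumed to run on Android)."""
    # Assumption: a "geckoview" component in the name marks an Android test.
    android = sum(1 for name in tests if "geckoview" in name)
    return len(tests), android


print(count_android(SIX_MONTH_TESTS))  # (9, 1)
```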

Our goal is to keep all of these running and sheriffed full time; the tier-2 tests would just run less frequently, so their alerts would show up a day or two later. There is little risk to this.

:esmyth, do you have concerns or other thoughts?

Flags: needinfo?(esmyth)

as discussed in an email thread, we feel that 6 months is a more representative sample, which would be the smaller pageset. This would apply across the board.

The key here is that we would re-evaluate this work monthly to ensure that we adjust tier-1 tests as needed. All tests will still be sheriffed; the tier-2 tests will just be sheriffed up to a couple of days later.

Whiteboard: [ci-costs-2020:todo]

here are the tests we are going to run less frequently:
['raptor-tp6-outlook-firefox-cold', 'raptor-tp6-netflix-firefox-cold', 'raptor-tp6m-google-restaurants-geckoview-cold', 'raptor-tp6m-booking-geckoview-cold', 'raptor-tp6-yahoo-mail-firefox', 'raptor-tp6-microsoft-firefox-cold', 'raptor-tp6m-bing-restaurants-geckoview-cold', 'raptor-tp6m-wikipedia-geckoview-cold', 'raptor-tp6-yahoo-mail-firefox-cold', 'raptor-tp6m-bing-geckoview-cold', 'raptor-motionmark-htmlsuite-firefox', 'raptor-tp6m-amazon-search-geckoview-cold', 'raptor-tp6-facebook-firefox', 'raptor-tp6-yandex-firefox-cold', 'raptor-tp6m-instagram-geckoview-cold', 'raptor-tp6-pinterest-firefox', 'raptor-tp6-apple-firefox-cold', 'raptor-tp6-instagram-firefox-cold']

I've spoken with Eric and he doesn't have any concerns with the identified tests. We discussed some related issues, which I'll follow up with separately and do not block this effort.

Flags: needinfo?(esmyth) → needinfo?(fstrugariu)

Assignee: bob → fstrugariu

:bebe, can you provide an update here? I think this is waiting on the taskcluster split changes right?

This is waiting for Bug 1633874 - Update taskcluster settings to the new raptor file structure

After that we can generate a list of tests and split them into the discussed lists.

Flags: needinfo?(fstrugariu)

We can do the split. I think we should make a tp6 job and a tp6-tier2 job that are both scheduled. The difference is that the lower-value (tier-2) pages will be in the test-subtests of the tp6-tier2 job, while the regular tp6 job will have the higher-value test-subtests.

This should allow for easier moving of tests between tiers, as well as creating a test-sets.yml where we can schedule things or set a specific raptor-tp6-tier2 as push-interval-25 while raptor-tp6 is push-interval-10.
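
The proposed split can be sketched as a simple partition of pages into two jobs. This is only an illustration, not the actual taskcluster configuration: `HIGH_VALUE_PAGES`, `PUSH_INTERVAL`, and `split_subtests` are hypothetical names, the pages are chosen arbitrarily, and only the job names and intervals come from the comment above.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical selection; the real list would come from the analysis.
HIGH_VALUE_PAGES: Set[str] = {"slides", "twitch", "tumblr"}

# Job names and intervals as proposed in the comment above.
PUSH_INTERVAL: Dict[str, int] = {"raptor-tp6": 10, "raptor-tp6-tier2": 25}


def split_subtests(pages: Iterable[str], high_value: Set[str]) -> Dict[str, List[str]]:
    """Assign each page to the regular job or the tier-2 job."""
    jobs: Dict[str, List[str]] = {name: [] for name in PUSH_INTERVAL}
    for page in sorted(pages):
        job = "raptor-tp6" if page in high_value else "raptor-tp6-tier2"
        jobs[job].append(page)
    return jobs


jobs = split_subtests(
    ["netflix", "slides", "outlook", "twitch", "tumblr"], HIGH_VALUE_PAGES
)
print(jobs["raptor-tp6"])        # ['slides', 'tumblr', 'twitch']
print(jobs["raptor-tp6-tier2"])  # ['netflix', 'outlook']
```

With a layout like this, moving a page between tiers is a one-line change to the high-value set.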

:bebe, does this way of thinking make sense to you?

Flags: needinfo?(fstrugariu)

:davehunt, can you answer this or help get this moving?

Flags: needinfo?(dave.hunt)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)

> :davehunt, can you answer this or help get this moving?

Your suggestion in comment 15 sounds good to me. I'm going to leave the needinfo open for Bebe, who is the assignee.

Severity: normal → S3
Depends on: 1633874
Flags: needinfo?(dave.hunt)
Priority: P2 → P1

Queried all raptor tests since December 1, 2019; here is the latest set:

Chosen tests: ['raptor-tp6m-youtube-geckoview-cold', 'raptor-tp6-twitter-firefox-cold', 'raptor-tp6-twitch-firefox-cold', 'raptor-motionmark-animometer-firefox', 'raptor-tp6-google-mail-firefox-cold', 'raptor-tp6-amazon-firefox-cold', 'raptor-tp6-slides-firefox-cold', 'raptor-tp6-tumblr-firefox-cold', 'raptor-tp6m-google-maps-geckoview-cold', 'raptor-tp6-imgur-firefox-cold-mozproxy-replay', 'raptor-webaudio-firefox']

Rejected tests: ['raptor-tp6m-google-restaurants-geckoview-cold', 'raptor-tp6m-amazon-geckoview-cold', 'raptor-tp6m-wikipedia-geckoview-cold', 'raptor-tp6-paypal-firefox-cold', 'raptor-tp6-slides-firefox-cold-mozproxy-replay', 'raptor-tp6-fandom-firefox-cold', 'raptor-tp6-imgur-firefox-cold', 'raptor-tp6-pinterest-firefox-cold-mozproxy-replay', 'raptor-tp6-bing-firefox-cold', 'raptor-tp6m-cnn-ampstories-geckoview-cold', 'raptor-tp6m-cnn-geckoview-cold', 'raptor-tp6m-ebay-kleinanzeigen-search-geckoview-cold', 'raptor-tp6-docs-firefox-cold', 'raptor-tp6-binast-instagram-firefox-mozproxy-replay', 'raptor-tp6m-amazon-search-geckoview-cold', 'raptor-motionmark-htmlsuite-firefox', 'raptor-tp6m-google-geckoview-cold', 'raptor-tp6-linkedin-firefox-cold', 'raptor-tp6-sheets-firefox-cold', 'raptor-tp6m-aframeio-animation-geckoview-cold', 'raptor-tp6-twitter-firefox-cold-mozproxy-replay', 'raptor-tp6-imdb-firefox-cold', 'raptor-tp6m-facebook-cristiano-geckoview-cold', 'raptor-tp6m-facebook-geckoview-cold', 'raptor-tp6-outlook-firefox-cold', 'raptor-tp6-paypal-firefox-cold-mozproxy-replay', 'raptor-tp6-outlook-firefox-cold-mozproxy-replay', 'raptor-tp6m-youtube-watch-geckoview-cold', 'raptor-tp6-reddit-firefox-cold', 'raptor-tp6-amazon-firefox-cold-mozproxy-replay', 'raptor-tp6-tumblr-firefox-cold-mozproxy-replay', 'raptor-tp6-google-firefox-cold-mozproxy-replay', 'raptor-tp6-wikipedia-firefox-cold', 'raptor-tp6-netflix-firefox-cold', 'raptor-tp6-instagram-firefox-cold-mozproxy-replay', 'raptor-tp6m-ebay-kleinanzeigen-geckoview-cold', 'raptor-tp6-binast-instagram-firefox', 'raptor-tp6m-booking-geckoview-cold', 'raptor-tp6m-jianshu-geckoview-cold-mozproxy-replay', 'raptor-tp6-microsoft-firefox-cold', 'raptor-tp6-yahoo-news-firefox-cold', 'raptor-tp6m-espn-geckoview-cold', 'raptor-tp6-ebay-firefox-cold', 'raptor-tp6m-jianshu-geckoview-cold', 'raptor-tp6-apple-firefox-cold', 'raptor-tp6m-allrecipes-geckoview-cold', 'raptor-tp6-facebook-firefox-cold', 
'raptor-tp6m-stackoverflow-geckoview-cold', 'raptor-tp6-google-firefox-cold', 'raptor-tp6m-microsoft-support-geckoview-cold', 'raptor-tp6m-web-de-geckoview-cold', 'raptor-tp6m-bing-restaurants-geckoview-cold', 'raptor-tp6m-bbc-geckoview-cold', 'raptor-tp6-amazon-firefox-mitm5-cold-mozproxy-replay', 'raptor-tp6-office-firefox-cold', 'raptor-tp6-pinterest-firefox-cold', 'raptor-tp6-google-mail-firefox-cold-mozproxy-replay', 'raptor-tp6m-instagram-geckoview-cold', 'raptor-tp6m-reddit-geckoview-cold', 'raptor-tp6-yahoo-news-firefox-cold-mozproxy-replay', 'raptor-tp6-instagram-firefox-cold', 'raptor-tp6-yahoo-mail-firefox-cold', 'raptor-tp6-youtube-firefox-cold', 'raptor-tp6m-imdb-geckoview-cold', 'raptor-tp6m-bing-geckoview-cold', 'raptor-tp6-yandex-firefox-cold']

split raptor tests into tier-1 (high value) and tier-2 (lower value)

the only thing the above patch doesn't do is run the lower value tests every 25th push.

Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2ed99d13012d
split raptor tests into tier-1 (high value) and tier-2 (lower value). r=sparky

Backed out changeset 2ed99d13012d (bug 1591466) on ahal's request.

Backout link: https://hg.mozilla.org/integration/autoland/rev/0b769610f6c4a18725f8ea758b1515e189947bf0

As ahal noticed, the backed-out changes caused 300 Rap-t2 tasks to run on every autoland push.
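
For a sense of scale, the sketch below estimates the average number of these tasks per push under different intervals. It assumes the figure of 300 tasks quoted above and a simplified rule in which the task set runs on exactly every Nth push; the real scheduler's rule may differ, and `average_tasks_per_push` is a hypothetical helper.

```python
def average_tasks_per_push(tasks: int, push_interval: int) -> float:
    """Average tasks per push if `tasks` run once every `push_interval` pushes."""
    if push_interval < 1:
        raise ValueError("push_interval must be at least 1")
    return tasks / push_interval


for interval in (1, 10, 25):
    print(interval, average_tasks_per_push(300, interval))
# 1 300.0
# 10 30.0
# 25 12.0
```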

a few thoughts:

  1. I find a way to keep these under the same name, not tp6-t2. It went this way because it was simple and straightforward; I could be wrong though.
  2. I add push-interval-25 to the tier-2 tests to force them to run less frequently.

sparky, do you have thoughts on either of these?

Flags: needinfo?(gmierz2)

:jmaher, we could go with option (1) by using the by-raptor-subtest split:

tier:
    by-app:
        firefox:
            by-raptor-subtest:
                amazon: 1
                ...
                default: 2
        default: 2
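
To illustrate how a nested keyed-by value like the one above might resolve to a single tier, here is a minimal sketch. It is not the taskgraph implementation: `resolve_keyed` is a hypothetical helper, the subtest names in the usage lines are just examples, and only the `amazon` entry and the defaults shown above are included (the elided entries are left out).

```python
from typing import Any, Dict

# Mirrors the structure above; elided entries are intentionally omitted.
TIER: Dict[str, Any] = {
    "by-app": {
        "firefox": {
            "by-raptor-subtest": {
                "amazon": 1,
                "default": 2,
            },
        },
        "default": 2,
    },
}


def resolve_keyed(value: Any, attributes: Dict[str, str]) -> Any:
    """Walk nested single-key `by-<attr>` dicts until a plain value is reached."""
    while isinstance(value, dict) and len(value) == 1:
        key = next(iter(value))
        if not key.startswith("by-"):
            break
        options = value[key]
        attr = attributes.get(key[len("by-"):])
        if attr in options:
            value = options[attr]
        elif "default" in options:
            value = options["default"]
        else:
            raise KeyError(f"no match for {key} and no default")
    return value


print(resolve_keyed(TIER, {"app": "firefox", "raptor-subtest": "amazon"}))    # 1
print(resolve_keyed(TIER, {"app": "firefox", "raptor-subtest": "netflix"}))   # 2
print(resolve_keyed(TIER, {"app": "geckoview", "raptor-subtest": "amazon"}))  # 2
```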

I just noticed that you would have to make a change in the transform to do this; we missed this when we made the name changes. You'd have to split out the shorthand-name of the raptor-subtest entry and make some adjustments down the line: (1) https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/raptor.py#179 (2) https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/raptor.py#246-250

I'm fine with adding the push-interval settings to the new tests as well but option (1) would keep a minimal number of task definitions.

Flags: needinfo?(gmierz2)

I started down this path and realized that by doing that I would have a list of subtests, then repeat it for tiers, and repeat it again for push-interval. My latest patch will do push-interval-25 by default, and push-interval-10 for tier-1.

Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b7889537b4ff
split raptor tests into tier-1 (high value) and tier-2 (lower value). r=sparky
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79

do not adjust tier and optimization for mobile.

Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ac50886ec03f
do not adjust tier and optimization for mobile. r=sparky

Is there anything else to do here?

Flags: needinfo?(fstrugariu) → needinfo?(jmaher)

we are all done.

Flags: needinfo?(jmaher)

Looking into this, I think we will save ~4800 hours/week of computation time. This is a rough calculation, but the real number is probably within ±30% of that (roughly 3,360 to 6,240 hours/week).

Whiteboard: [ci-costs-2020:todo] → [ci-costs-2020:done]