Closed Bug 1194711 Opened 9 years ago Closed 7 years ago

Automatic scheduling to have talos jobs 6 times on try

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: vaibhav1994, Unassigned)

References

Details

Attachments

(1 file)

When a developer pushes to try with talos tests in try syntax, it is generally because one needs to compare perf numbers (via compare-talos etc). Right now in buildbot, we schedule the talos jobs only once, which is not enough to have a good idea for comparison and one needs to retrigger the tests again. Instead we should automatically schedule tests in buildbot to have 6 jobs by default.

Going through some buildbot code, it looks like try_parser.py is the place where we determine "talos" buildernames to be scheduled from try syntax[1][2], we should add it there, in my opinion.

[1] - https://hg.mozilla.org/build/buildbotcustom/file/f923b62fbc01/try_parser.py#l183
[2] - https://hg.mozilla.org/build/buildbotcustom/file/f923b62fbc01/try_parser.py#l411
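For illustration, the proposal amounts to something like the following. This is only a rough sketch, not the attached patch and not the real try_parser.py code: the function name repeat_talos_builders, the substring check on "talos", and the sample builder names are all assumptions made up for the example.

```python
def repeat_talos_builders(builder_names, repeat=6):
    """Return the builders to schedule, listing each talos builder `repeat` times.

    Hypothetical helper: the real try_parser.py works with its own data
    structures; here a builder is just a name string.
    """
    scheduled = []
    for name in builder_names:
        # Assumption: talos builders can be recognised by "talos" in the name.
        count = repeat if "talos" in name.lower() else 1
        scheduled.extend([name] * count)
    return scheduled


# Illustrative builder names only.
builders = [
    "Ubuntu HW 12.04 x64 try talos chromez",
    "Ubuntu VM 12.04 x64 try opt test mochitest-1",
]
scheduled = repeat_talos_builders(builders)
```

With the two sample builders, the talos builder is listed six times and the unit test builder once.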
Attached patch: buildbot.patch
Kim, does the above suggestion make sense to you? Can you have a look at this patch?
Attachment #8648141 - Flags: review?(kmoir)
Maybe we should have special try syntax for triggering 6 repeated talos jobs rather than just doing that every time. We are somewhat constrained with our talos hardware machines at the moment.
There is a --rebuild option in try syntax that will trigger jobs a certain number of times, but the point is this should be the default (since developers often forget, and there is really no use for triggering the talos jobs on try once as it does not produce enough data). Joel, what do you think?
We should have syntax for it rather than doing it by default: despite the folklore that you always need six runs whenever you run talos on try, that's just not true. As a wild guess (but one based on having looked at thousands of instances of try syntax and the patch that was actually being tested), I'd expect that around a quarter of all try pushes running talos are actually looking for numbers, another quarter are core, especially JS engine, patches from people who have been burned before by causing talos to stop running and just want to be sure it runs and thus only need one run each, and half are a total waste of resources from people who push everything -u all -t all.
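To put rough numbers on that guess, here is a back-of-the-envelope calculation. The shares are the wild-guess fractions from the comment above, not measurements, and it assumes that under an opt-in syntax everyone who does not need numbers still runs each talos job once.

```python
# Guessed shares of try pushes that request talos (from the comment, not measured).
shares = {
    "wants_numbers": 0.25,      # really comparing perf numbers
    "smoke_check": 0.25,        # only wants to know talos still runs
    "pushes_everything": 0.50,  # -u all -t all out of habit
}


def expected_runs(runs_by_group):
    """Average number of runs per talos job, weighted by the guessed shares."""
    return sum(shares[group] * runs_by_group[group] for group in shares)


# Default of 6 runs for every push that requests talos.
always_six = expected_runs({group: 6 for group in shares})

# Opt-in syntax: only the people who want numbers ask for 6 runs (assumption).
opt_in = expected_runs(
    {"wants_numbers": 6, "smoke_check": 1, "pushes_everything": 1}
)

ratio = always_six / opt_in
```

Under these assumptions, running 6x by default costs about 6 runs per talos job against about 2.25 with opt-in syntax, so roughly 2.7 times the load.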
We don't have enough capacity in our test pools right now to run talos jobs 6x on try. We are trying to reduce our current high wait times as it is, so I don't think that it's viable to run talos 6x on every push. We are really constrained on our bare metal hardware pools (not AWS), which talos needs to run on. Perhaps making the trychooser option that retriggers jobs more visible, or adding syntax to retrigger talos jobs associated with the same build, would address this concern.
This is a tough problem, and there are many valid points here. We do require a lot of retriggers to prove a talos regression, and without help from people in the infrastructure we will continue to require a lot of retriggers to account for the noise.

We could do a better job instructing people to use --rebuild 6, but this will increase our overall usage nonetheless.

Should this be a question of getting more hardware combined with making it next to impossible to run -p all -u all -t all ?

I think for now we should just make the --rebuild 6 option better: if people choose talos and don't select unittests, we should apply it for them. I'm not sure where to put that logic. Right now it takes too long to get enough data.
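The rule suggested here could look roughly like the sketch below. It is an illustration only and does not reflect how try_parser.py is actually written: the function name effective_talos_runs is made up, only the -u, -t and --rebuild flags are parsed, and --rebuild N is treated as "run N times", which is an assumption about its exact meaning.

```python
import argparse
import shlex


def effective_talos_runs(commit_message, talos_default=6):
    """Return how many times to run each talos job for a try commit message.

    Hypothetical helper. An explicit --rebuild always wins; otherwise a push
    that selects talos but no unit tests gets `talos_default` runs.
    """
    # Everything after "try:" is the try syntax; empty if the marker is absent.
    _, _, syntax = commit_message.partition("try:")

    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-u", dest="unittests", default="none")
    parser.add_argument("-t", dest="talos", default="none")
    parser.add_argument("--rebuild", type=int, default=None)
    # Other flags such as -b and -p are ignored by this sketch.
    args, _ignored = parser.parse_known_args(shlex.split(syntax))

    if args.rebuild is not None:
        return args.rebuild
    if args.talos != "none" and args.unittests == "none":
        return talos_default
    return 1


talos_only = effective_talos_runs("try: -b o -p linux -u none -t all")
everything = effective_talos_runs("try: -b do -p all -u all -t all")
explicit = effective_talos_runs("try: -b o -p linux -u all -t all --rebuild 3")
```

A talos-only push gets 6 runs, a push that selects everything stays at 1, and an explicit --rebuild value is left untouched.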
Joel, 

Do you know the timeframes for this work?
http://chmanchester.github.io/blog/2015/08/06/defining-semi-automatic-test-prioritization

This looks like a very promising way to reduce the overall load and make way for disabling
run -p all -u all -t all
Flags: needinfo?(jmaher)
This is a deliverable this quarter, although I view it more as a prototype than a replacement. Maybe by end of Q4 we could have a realistic workflow for using try outside of the existing try syntax.

Some of the issues with the approach mentioned:
* it selects specific tests/directories- those need to map to jobs and we run a subset of jobs
* how do you extend this, how do you know what you have already run/not run
* if we did this, would we need to turn off SETA since we would have a different testing profile on try
* it might require a few rounds of fine tuning as we deal with potential issues like higher backouts, etc.

I really like what chmanchester is making, and I think in a reasonable amount of time it will help.  For the record we could probably get by with --rebuild 6 until the end of the year.
Flags: needinfo?(jmaher)
Comment on attachment 8648141
buildbot.patch

OK, I think we can carry on with the --rebuild option on try for the time being, since we do not want to increase load on the test machines, and wait for chmanchester's approach to be implemented.
Attachment #8648141 - Flags: review?(kmoir)
We could also schedule the talos jobs with even lower priority.
This is what I do for Treeherder requests for "trigger all talos jobs on a given revision".
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General