Closed Bug 1163698 Opened 5 years ago Closed 5 years ago

Create a service which will automatically retrigger orange jobs on try up to N times

Categories

(Testing :: General, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgriffin, Assigned: chmanchester)

References

(Depends on 1 open bug)

Details

We want to create a service which will automatically retrigger failed jobs on try up to N times in order to eliminate the manual and time-consuming retrigger-and-wait cycles that developers currently are forced to use when evaluating ambiguous try results, and to make try results easier to parse.

TaskCluster already does this for b2gdesktop jobs on try; each failing job is retried (so it shows up as blue rather than orange or red) up to 2 times.  This masks intermittents and generally makes try results more meaningful, although there is a risk that a new intermittent which is introduced by a patch may be hidden in this manner.

Armen took some notes on possible approaches here:  https://etherpad.mozilla.org/less-orange-on-try

Currently, the pulse approach seems most viable and most extensible to non-buildbot systems.
I have a prototype of this (largely based on the autoland approach) at https://github.com/chmanchester/trigger-bot

It needs lots of battle testing, but I've tested for short periods of time and it works as expected on my own try pushes.
I'm not 100% convinced that hiding the orange status of a job is a good thing. It's smart, but it's also hiding valuable information. I think we should let the developer interpret the series of (O)ranges and (G)reens however they think appropriate.

For example, if I see:
OGG

I'll at least click the job to see the failure string. If I see a test failures that looks related to my patch I'll dig a bit deeper, otherwise hardly any time wasted. But if I see:
BBG

I probably won't even bother to do that tiny bit of extra work. We can try communicating that (B)lue now means (O)range on try, but these colours are so ingrained I doubt any but a small minority will actually pay attention to them.
When this change lands, should I remove (or disable) to current Autoland functionality for retriggering failed jobs?

Also, I'd suggest a threshold percentage of failures beyond which you don't retrigger jobs so you don't end up retriggering jobs N times for a mostly broken push.
Another downside to hiding the orange status is that the retrigger might not be finished by the time the developer comes back to look at their results. They might see a row of green with a blue and a grey in the middle, assume everything is good and land their patch. I think everyone has been guilty of prematurely landing a patch en lieu of waiting for the full set of try jobs to finish at one point or another.
There are valid reasons for keeping oranges, I agree with ahal.  With that said, I think for a developer having a simple mode of 'is this push ok' seems like the next step.  maybe we need http://www.canilandmypatch.com?tryrev=31415926535 with a yes/no answer.  All joking aside, it would be great to make a decision and auto star the failures if it is one orange and 2 greens.
(In reply to Dan Minor [:dminor] from comment #3)
> When this change lands, should I remove (or disable) to current Autoland
> functionality for retriggering failed jobs?
> 
> Also, I'd suggest a threshold percentage of failures beyond which you don't
> retrigger jobs so you don't end up retriggering jobs N times for a mostly
> broken push.

I'm planning to deal with this in the short term by ignoring any job associated with the autoland user, but we should possible merge our triggering efforts before too long since they're so closely aligned.

I'm thinking of a very coarse approach to the per push limit right now - if more than (recent orange factor * automatic retrigger number) failure retriggers have been seen for a revision, the patch probably can't land anyway, so we give up.
I'd be fine with an initial implementation that doesn't mask the oranges, it just handles automatic retriggers, which I think is what chris's prototype does.

Auto-starring for the case where there is one orange and two greens is an interesting idea, and seems relatively safe, as long as we communicate to dev.platform what's happening.
If we had a tool that allow us to auto-star. We could that through the bot besides the triggering of two more jobs.

FYI, we can also have an opt-in variable in the try-syntax before we go fully live.
Blocks: 1163751
(In reply to Chris Manchester [:chmanchester] from comment #6)
> I'm planning to deal with this in the short term by ignoring any job
> associated with the autoland user, but we should possible merge our
> triggering efforts before too long since they're so closely aligned.
> 
Yeah, I think we should turn off Autoland retriggers once this is live - I don't think there's any benefit to having two potentially different retrigger behaviours.

Is there any reason why this is 'Try' only? It seems like it would be useful elsewhere.
(In reply to Dan Minor [:dminor] from comment #9)
> (In reply to Chris Manchester [:chmanchester] from comment #6)
> > I'm planning to deal with this in the short term by ignoring any job
> > associated with the autoland user, but we should possible merge our
> > triggering efforts before too long since they're so closely aligned.
> > 
> Yeah, I think we should turn off Autoland retriggers once this is live - I
> don't think there's any benefit to having two potentially different
> retrigger behaviours.
> 
> Is there any reason why this is 'Try' only? It seems like it would be useful
> elsewhere.

Inbound is another place this seems obviously useful and I think we should turn it on there, but try seems like the place where the responsibility is most squarely on the developer to figure out what's going on and what to do next.
with SETA in place, this has a lot more use on integration branches.
Armen has agreed to take a look at my code and give feedback. For the small number of use cases we're targeting immediately we've agreed not to consumer mozci directly for the sake of triggering jobs, but rather use self-serve directly.

The next step will be getting this running on a VM somewhere so we can test it further. I've tested a few more scenarios on my own try pushes and everything works as I expect. As Armen noted in comment 8, we could use an opt-in try switch for this to get more testing.
This code watches for a lot of gotchas I could think. Good job.

For the record, I would prefer it if it we would use mozci since I'm hoping adusca to expand this service in the future for adding backfill support to treeherder. Something perhaps more general.
TH currently sends cancel and retrigger actions over pulse and we would like to administer those.
I would like to take your project with Alice and expand it. Maybe it doesn't make sense to do but we'll see.

If you want to avoid mozci fetching any files. You can call make_request() directly.
I can only see mozci being used in the "trigger_n_times()" function:
https://github.com/chmanchester/trigger-bot/blob/master/triggerbot/try_watcher.py#L191
which is the same code as these:
https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/mozci.py#L362
https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/buildapi.py#L40
https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/buildapi.py#L86

I see that you have added tracking of what was triggered which mozci does not currently have.
It has some built-in code to prevent triggering more jobs than intended.

FYI take the comments as feedback and not as request to be changed. Just some things that I noted.

>  description='A pulse service for triggering builds with mozci.',
...
> service_name = 'mozci-trigger-bot'
We don't use mozci atm.


> class TryWatcher(object):
If we were to make it generic:
class RepoWatcher(object) and subclass.


> get_users()
What is this for?
I don't see triggerbot_users being used anywhere.


> def trigger_n_times(self, branch, rev, builder, count):

You validate the revision with a regex, however, mozci had to verify that the revision is valid for buildapi.
Commits with messages like DONTBUILD, cannot be triggered on buildapi.
Since you're doing only try you are likely to not hit that issue.

If you're curious:
https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/buildapi.py#L112

In general:
* I think you could read the conf file on init of the class instead.
* I suggest using uppercase variable names for variables that can be used as global since it visually helps.
* Adding more comments and in some cases more meaningful variable and function names
* I would split try_watcher.py in more modules to separate logical components that deal with different unrelated parts
* I would add a bit of the story of what this script does at the top of the module (e.g. we analize each orange try job; we trigger N jobs and we set X property to prevent the same analysis to trigger more jobs)

If you want me to dig deeper or ellaborate let me know.
(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #13)
> This code watches for a lot of gotchas I could think. Good job.
> 
> For the record, I would prefer it if it we would use mozci since I'm hoping
> adusca to expand this service in the future for adding backfill support to
> treeherder. Something perhaps more general.
> TH currently sends cancel and retrigger actions over pulse and we would like
> to administer those.
> I would like to take your project with Alice and expand it. Maybe it doesn't
> make sense to do but we'll see.
> 
> If you want to avoid mozci fetching any files. You can call make_request()
> directly.
> I can only see mozci being used in the "trigger_n_times()" function:
> https://github.com/chmanchester/trigger-bot/blob/master/triggerbot/
> try_watcher.py#L191
> which is the same code as these:
> https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/mozci.py#L362
> https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/
> buildapi.py#L40
> https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/
> buildapi.py#L86
I filed https://github.com/armenzg/mozilla_ci_tools/issues/191 with my concerns about using mozci, but even with that moving to it should be straightforward and I'm happy to do it as soon as we feel it will be beneficial. 
> 
> I see that you have added tracking of what was triggered which mozci does
> not currently have.
> It has some built-in code to prevent triggering more jobs than intended.
> 
> FYI take the comments as feedback and not as request to be changed. Just
> some things that I noted.
> 
> >  description='A pulse service for triggering builds with mozci.',
> ...
> > service_name = 'mozci-trigger-bot'
> We don't use mozci atm.
> 
> 
> > class TryWatcher(object):
> If we were to make it generic:
> class RepoWatcher(object) and subclass.
> 
> 
> > get_users()
> What is this for?
> I don't see triggerbot_users being used anywhere.
It's used to populate the global "triggerbot_users". This is just so I don't trigger try pushes that aren't my own, this can be removed once this is live everywhere.
> 
> 
> > def trigger_n_times(self, branch, rev, builder, count):
> 
> You validate the revision with a regex, however, mozci had to verify that
> the revision is valid for buildapi.
> Commits with messages like DONTBUILD, cannot be triggered on buildapi.
> Since you're doing only try you are likely to not hit that issue.
> 
> If you're curious:
> https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/
> buildapi.py#L112

I will borrow that logic, thanks for the pointer!
> 
> In general:
> * I think you could read the conf file on init of the class instead.
> * I suggest using uppercase variable names for variables that can be used as
> global since it visually helps.
> * Adding more comments and in some cases more meaningful variable and
> function names
> * I would split try_watcher.py in more modules to separate logical
> components that deal with different unrelated parts
I'll separate out the concerned with extracting data from pulse and the part concerned with triggering jobs.
> * I would add a bit of the story of what this script does at the top of the
> module (e.g. we analize each orange try job; we trigger N jobs and we set X
> property to prevent the same analysis to trigger more jobs)
> 
> If you want me to dig deeper or ellaborate let me know.

I will update the repository this afternoon. Thanks for the feedback!
What's the procedure for getting a machine to run this on?
Flags: needinfo?(jgriffin)
Basically just file a bug under Testing:General and assign it to glob; include what instance type you'd like from http://aws.amazon.com/ec2/instance-types/, and if you have a particular OS you want.
Flags: needinfo?(jgriffin)
Depends on: 1164554
The trigger bot is working well after some testing by myself and jmaher. The issue I ran into most recently was that there appears to be a short time around when a job starts that the buildapi might not have either a request_id or build_id to rebuild on hand, so we were missing re-triggers in that case. I added a short retry loop that should hopefully address the situation.

I've had a few questions about how this works. There are two features implemented: re-triggering N times when a test job fails (N == 2 for now), and requesting N triggers for every test job in a push through try syntax (with the '--rebuild' argument, up to 20). If triggers are requested, no re-triggers happen in response to a failure, and once either failure or requested triggers happen, no further re-triggers are made for that particular revision/branch/buildername. The total failure re-triggers per push has a hard limit to avoid broken revisions from repeatedly failing each test job. The idea is that this will prevent cyclic or gratuitous re-triggers.
This sounds really useful, and something we could deploy and then iterate on as we get feedback from developers.  Great progress!
I would like to hear more about the danger of the cyclic re-triggers.  If we only trigger it at a single point in time, are there concerns about us seeing a future failure and triggering again?

Overall, I really like this- there are a few things which we should outline in our communications:
1) what are the inputs
2) what are the rules/conditions
3) what is expected (builds, tests, hidden jobs, red jobs, blue jobs, etc...)
4) what limits/safeguards we have in place
5) can we get statistics about what volume of jobs are being scheduled
6) are there any errors and how can developers get the status of these errors

Looking forward to using this again and having a general service which we as Mozilla developers can use to make our lives simpler.
There are two safeguards against cyclic re-triggers. First there's a set of already triggered rev/branch/buildername, and if there's a request to trigger anything in the set it's ignored. This set is in memory, so it gets blown away when the service is redeployed, so I also query the buildapi for any requests or builds that match the rev/branch/buildername, and if there are already more than the requested count, no triggers are made.

I'm logging the global count of triggers performed by the service and could log more statistics if necessary.
How do you handle service unavailable from buildbot?  this is something which mozci experiences a lot with some workaround in place, it isn't perfect yet.

it sounds like the only concern for cyclic re-triggers is when we deploy a session of mozABC and it picks up old requests.

Are there any concerns with pulse messages duplicating requests?  I am having trouble understanding how there can be a request to trigger additional jobs.  I assume this is when say for windows 7 opt mochitest-1 we have a failure and retrigger automatically to determine if it is intermittent.  When those results come back, we would determine that we already have a 'windows 7 opt mochitest-1' for the given revision/branch and take no action on it?

how long do we keep this history around?  I would think 2 weeks would be sufficient given that try info is magically 100% wiped out after 2 weeks.
(In reply to Joel Maher (:jmaher) from comment #21)
> How do you handle service unavailable from buildbot?  this is something
> which mozci experiences a lot with some workaround in place, it isn't
> perfect yet.
If we can't get a build_id or request_id to rebuild we aren't able to rebuild. The intention with this is that the action taken isn't any more mysterious than clicking the retrigger button, and the worst failure we would see would be to occasionally miss a trigger if the buildapi is unavailable. I'm assuming the data we get from the buildapi about existing builds is fairly reliable.
> 
> it sounds like the only concern for cyclic re-triggers is when we deploy a
> session of mozABC and it picks up old requests.
I've tested the approach of finding existing builds from the buildapi I mentioned above and it appears effective for this case.
> 
> Are there any concerns with pulse messages duplicating requests?  I am
> having trouble understanding how there can be a request to trigger
> additional jobs.  I assume this is when say for windows 7 opt mochitest-1 we
> have a failure and retrigger automatically to determine if it is
> intermittent.  When those results come back, we would determine that we
> already have a 'windows 7 opt mochitest-1' for the given revision/branch and
> take no action on it?
I don't think a duplicate message from pulse would be a problem. There's a hashmap mapping from revision to a set of buildernames populated whenever a trigger happens. If that builder has already been seen for the revision no trigger happens (I'll want to add branch to the keys of this map if when we want this for more than try).
> 
> how long do we keep this history around?  I would think 2 weeks would be
> sufficient given that try info is magically 100% wiped out after 2 weeks.
The buildapi has builds for fairly old try pushes, so this should be safe. I have to prune the map of revisions after a certain point so it doesn't keep old ones around, there's a comment in the code for the rationale here, but the idea is that forgetting revisions that are several days old shouldn't do harm.
I think as a next test I'll run this for a short period without restrictions on which users it applies to. That should give a better estimate of how many jobs this will ultimately end up triggering.
(In reply to Chris Manchester [:chmanchester] from comment #23)
> I think as a next test I'll run this for a short period without restrictions
> on which users it applies to. That should give a better estimate of how many
> jobs this will ultimately end up triggering.

I ran this for a while this afternoon (logging in "dry" mode, without actually triggering) and found 150-250 triggers per hour. This depends a lot on number of per push failures before giving up additional triggers (I tried with 5 and 7). This limit gets hit pretty frequently, so we can think of the volume as up to 10 triggers per push (or up to 14 if we need 7 trigger actions). This is a lot of jobs, but given the precedents of taskcluster and Autoland it doesn't seem entirely unreasonable.
I tested this on a try push of my own on Friday, and I found that limiting to n triggers per push has an issue -- the budget of triggers gets depleted easily by jobs that are hidden and failing. I'm going to change the budget to be based on a % of all jobs for the push (for the example on Friday, 5% would be enough).
I tested again with the limit based on 5% of all jobs for a push, and the volume was around 150 triggers/hour. I suspect this sort of this varies a lot by time of day, so we'll have a better idea once this is running for day/weeks.

I'll test one more time performing actual triggers, but this should be ready to turn on soon.
This has been running on behalf of the autoland user for a week and a half, and I haven't seen any issues. I've logged all the triggers it would have done if enabled for all users for the last day or so, and the rate is about 70 triggers/hour.

Unless something comes up I'd like to try this enabled for all users next week when volume will be low. I'm thinking it might be reasonable to trigger a failure just once while we're rolling this out, and increase to twice depending on the kind of feedback we get. I don't expect pushback per-se, but this will be a little change from what people expect out of the try server, so let's see.
> I'm thinking it might be reasonable to trigger a failure just once while we're rolling this out,

I agree, thanks.
Blocks: 1177653
I've kept a close eye on this since enabling it last week and everything looks as I expect. We've identified a non-trivial amount of work to extend this to taskcluster scheduled jobs, but we can track that work separately.
Assignee: nobody → cmanchester
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Duplicate of this bug: 700625
Duplicate of this bug: 1178549
Depends on: 1190942
Depends on: 1208104
Depends on: 1219341
No longer depends on: 1289160
Depends on: 1279676
Depends on: 1333907
You need to log in before you can comment on or make changes to this bug.