1163698 - Create a service which will automatically retrigger orange jobs on try up to N times

Reporter

Description

•

10 years ago

We want to create a service which will automatically retrigger failed jobs on try up to N times in order to eliminate the manual and time-consuming retrigger-and-wait cycles that developers currently are forced to use when evaluating ambiguous try results, and to make try results easier to parse. TaskCluster already does this for b2gdesktop jobs on try; each failing job is retried (so it shows up as blue rather than orange or red) up to 2 times. This masks intermittents and generally makes try results more meaningful, although there is a risk that a new intermittent which is introduced by a patch may be hidden in this manner. Armen took some notes on possible approaches here: https://etherpad.mozilla.org/less-orange-on-try Currently, the pulse approach seems most viable and most extensible to non-buildbot systems.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 1

•

10 years ago

I have a prototype of this (largely based on the autoland approach) at https://github.com/chmanchester/trigger-bot It needs lots of battle testing, but I've tested for short periods of time and it works as expected on my own try pushes.

Andrew Halberstadt [:ahal]

Comment 2

•

10 years ago

I'm not 100% convinced that hiding the orange status of a job is a good thing. It's smart, but it's also hiding valuable information. I think we should let the developer interpret the series of (O)ranges and (G)reens however they think appropriate. For example, if I see: OGG I'll at least click the job to see the failure string. If I see a test failures that looks related to my patch I'll dig a bit deeper, otherwise hardly any time wasted. But if I see: BBG I probably won't even bother to do that tiny bit of extra work. We can try communicating that (B)lue now means (O)range on try, but these colours are so ingrained I doubt any but a small minority will actually pay attention to them.

Dan Minor [:dminor]

Comment 3

•

10 years ago

When this change lands, should I remove (or disable) to current Autoland functionality for retriggering failed jobs? Also, I'd suggest a threshold percentage of failures beyond which you don't retrigger jobs so you don't end up retriggering jobs N times for a mostly broken push.

Andrew Halberstadt [:ahal]

Comment 4

•

10 years ago

Another downside to hiding the orange status is that the retrigger might not be finished by the time the developer comes back to look at their results. They might see a row of green with a blue and a grey in the middle, assume everything is good and land their patch. I think everyone has been guilty of prematurely landing a patch en lieu of waiting for the full set of try jobs to finish at one point or another.

Joel Maher ( :jmaher ) (UTC -8)

Comment 5

•

10 years ago

There are valid reasons for keeping oranges, I agree with ahal. With that said, I think for a developer having a simple mode of 'is this push ok' seems like the next step. maybe we need http://www.canilandmypatch.com?tryrev=31415926535 with a yes/no answer. All joking aside, it would be great to make a decision and auto star the failures if it is one orange and 2 greens.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 6

•

10 years ago

(In reply to Dan Minor [:dminor] from comment #3) > When this change lands, should I remove (or disable) to current Autoland > functionality for retriggering failed jobs? > > Also, I'd suggest a threshold percentage of failures beyond which you don't > retrigger jobs so you don't end up retriggering jobs N times for a mostly > broken push. I'm planning to deal with this in the short term by ignoring any job associated with the autoland user, but we should possible merge our triggering efforts before too long since they're so closely aligned. I'm thinking of a very coarse approach to the per push limit right now - if more than (recent orange factor * automatic retrigger number) failure retriggers have been seen for a revision, the patch probably can't land anyway, so we give up.

Jonathan Griffin (:jgriffin)

Reporter

Comment 7

•

10 years ago

I'd be fine with an initial implementation that doesn't mask the oranges, it just handles automatic retriggers, which I think is what chris's prototype does. Auto-starring for the case where there is one orange and two greens is an interesting idea, and seems relatively safe, as long as we communicate to dev.platform what's happening.

Armen [:armenzg]

Comment 8

•

10 years ago

If we had a tool that allow us to auto-star. We could that through the bot besides the triggering of two more jobs. FYI, we can also have an opt-in variable in the try-syntax before we go fully live.

Dan Minor [:dminor]

Updated

•

10 years ago

Blocks: 1163751

Dan Minor [:dminor]

Comment 9

•

10 years ago

(In reply to Chris Manchester [:chmanchester] from comment #6) > I'm planning to deal with this in the short term by ignoring any job > associated with the autoland user, but we should possible merge our > triggering efforts before too long since they're so closely aligned. > Yeah, I think we should turn off Autoland retriggers once this is live - I don't think there's any benefit to having two potentially different retrigger behaviours. Is there any reason why this is 'Try' only? It seems like it would be useful elsewhere.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 10

•

10 years ago

(In reply to Dan Minor [:dminor] from comment #9) > (In reply to Chris Manchester [:chmanchester] from comment #6) > > I'm planning to deal with this in the short term by ignoring any job > > associated with the autoland user, but we should possible merge our > > triggering efforts before too long since they're so closely aligned. > > > Yeah, I think we should turn off Autoland retriggers once this is live - I > don't think there's any benefit to having two potentially different > retrigger behaviours. > > Is there any reason why this is 'Try' only? It seems like it would be useful > elsewhere. Inbound is another place this seems obviously useful and I think we should turn it on there, but try seems like the place where the responsibility is most squarely on the developer to figure out what's going on and what to do next.

Joel Maher ( :jmaher ) (UTC -8)

Comment 11

•

10 years ago

with SETA in place, this has a lot more use on integration branches.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 12

•

10 years ago

Armen has agreed to take a look at my code and give feedback. For the small number of use cases we're targeting immediately we've agreed not to consumer mozci directly for the sake of triggering jobs, but rather use self-serve directly. The next step will be getting this running on a VM somewhere so we can test it further. I've tested a few more scenarios on my own try pushes and everything works as I expect. As Armen noted in comment 8, we could use an opt-in try switch for this to get more testing.

Armen [:armenzg]

Comment 13

•

10 years ago

This code watches for a lot of gotchas I could think. Good job. For the record, I would prefer it if it we would use mozci since I'm hoping adusca to expand this service in the future for adding backfill support to treeherder. Something perhaps more general. TH currently sends cancel and retrigger actions over pulse and we would like to administer those. I would like to take your project with Alice and expand it. Maybe it doesn't make sense to do but we'll see. If you want to avoid mozci fetching any files. You can call make_request() directly. I can only see mozci being used in the "trigger_n_times()" function: https://github.com/chmanchester/trigger-bot/blob/master/triggerbot/try_watcher.py#L191 which is the same code as these: https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/mozci.py#L362 https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/buildapi.py#L40 https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/buildapi.py#L86 I see that you have added tracking of what was triggered which mozci does not currently have. It has some built-in code to prevent triggering more jobs than intended. FYI take the comments as feedback and not as request to be changed. Just some things that I noted. > description='A pulse service for triggering builds with mozci.', ... > service_name = 'mozci-trigger-bot' We don't use mozci atm. > class TryWatcher(object): If we were to make it generic: class RepoWatcher(object) and subclass. > get_users() What is this for? I don't see triggerbot_users being used anywhere. > def trigger_n_times(self, branch, rev, builder, count): You validate the revision with a regex, however, mozci had to verify that the revision is valid for buildapi. Commits with messages like DONTBUILD, cannot be triggered on buildapi. Since you're doing only try you are likely to not hit that issue. If you're curious: https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/buildapi.py#L112 In general: * I think you could read the conf file on init of the class instead. * I suggest using uppercase variable names for variables that can be used as global since it visually helps. * Adding more comments and in some cases more meaningful variable and function names * I would split try_watcher.py in more modules to separate logical components that deal with different unrelated parts * I would add a bit of the story of what this script does at the top of the module (e.g. we analize each orange try job; we trigger N jobs and we set X property to prevent the same analysis to trigger more jobs) If you want me to dig deeper or ellaborate let me know.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 14

•

10 years ago

(In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #13) > This code watches for a lot of gotchas I could think. Good job. > > For the record, I would prefer it if it we would use mozci since I'm hoping > adusca to expand this service in the future for adding backfill support to > treeherder. Something perhaps more general. > TH currently sends cancel and retrigger actions over pulse and we would like > to administer those. > I would like to take your project with Alice and expand it. Maybe it doesn't > make sense to do but we'll see. > > If you want to avoid mozci fetching any files. You can call make_request() > directly. > I can only see mozci being used in the "trigger_n_times()" function: > https://github.com/chmanchester/trigger-bot/blob/master/triggerbot/ > try_watcher.py#L191 > which is the same code as these: > https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/mozci.py#L362 > https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/ > buildapi.py#L40 > https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/ > buildapi.py#L86 I filed https://github.com/armenzg/mozilla_ci_tools/issues/191 with my concerns about using mozci, but even with that moving to it should be straightforward and I'm happy to do it as soon as we feel it will be beneficial. > > I see that you have added tracking of what was triggered which mozci does > not currently have. > It has some built-in code to prevent triggering more jobs than intended. > > FYI take the comments as feedback and not as request to be changed. Just > some things that I noted. > > > description='A pulse service for triggering builds with mozci.', > ... > > service_name = 'mozci-trigger-bot' > We don't use mozci atm. > > > > class TryWatcher(object): > If we were to make it generic: > class RepoWatcher(object) and subclass. > > > > get_users() > What is this for? > I don't see triggerbot_users being used anywhere. It's used to populate the global "triggerbot_users". This is just so I don't trigger try pushes that aren't my own, this can be removed once this is live everywhere. > > > > def trigger_n_times(self, branch, rev, builder, count): > > You validate the revision with a regex, however, mozci had to verify that > the revision is valid for buildapi. > Commits with messages like DONTBUILD, cannot be triggered on buildapi. > Since you're doing only try you are likely to not hit that issue. > > If you're curious: > https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/ > buildapi.py#L112 I will borrow that logic, thanks for the pointer! > > In general: > * I think you could read the conf file on init of the class instead. > * I suggest using uppercase variable names for variables that can be used as > global since it visually helps. > * Adding more comments and in some cases more meaningful variable and > function names > * I would split try_watcher.py in more modules to separate logical > components that deal with different unrelated parts I'll separate out the concerned with extracting data from pulse and the part concerned with triggering jobs. > * I would add a bit of the story of what this script does at the top of the > module (e.g. we analize each orange try job; we trigger N jobs and we set X > property to prevent the same analysis to trigger more jobs) > > If you want me to dig deeper or ellaborate let me know. I will update the repository this afternoon. Thanks for the feedback!

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 15

•

10 years ago

What's the procedure for getting a machine to run this on?

Flags: needinfo?(jgriffin)

Jonathan Griffin (:jgriffin)

Reporter

Comment 16

•

10 years ago

Basically just file a bug under Testing:General and assign it to glob; include what instance type you'd like from http://aws.amazon.com/ec2/instance-types/, and if you have a particular OS you want.

Flags: needinfo?(jgriffin)

Chris Manchester (limited bugmail, email directly)

Assignee

Updated

•

10 years ago

Depends on: 1164554

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 17

•

9 years ago

The trigger bot is working well after some testing by myself and jmaher. The issue I ran into most recently was that there appears to be a short time around when a job starts that the buildapi might not have either a request_id or build_id to rebuild on hand, so we were missing re-triggers in that case. I added a short retry loop that should hopefully address the situation. I've had a few questions about how this works. There are two features implemented: re-triggering N times when a test job fails (N == 2 for now), and requesting N triggers for every test job in a push through try syntax (with the '--rebuild' argument, up to 20). If triggers are requested, no re-triggers happen in response to a failure, and once either failure or requested triggers happen, no further re-triggers are made for that particular revision/branch/buildername. The total failure re-triggers per push has a hard limit to avoid broken revisions from repeatedly failing each test job. The idea is that this will prevent cyclic or gratuitous re-triggers.

Jonathan Griffin (:jgriffin)

Reporter

Comment 18

•

9 years ago

This sounds really useful, and something we could deploy and then iterate on as we get feedback from developers. Great progress!

Joel Maher ( :jmaher ) (UTC -8)

Comment 19

•

9 years ago

I would like to hear more about the danger of the cyclic re-triggers. If we only trigger it at a single point in time, are there concerns about us seeing a future failure and triggering again? Overall, I really like this- there are a few things which we should outline in our communications: 1) what are the inputs 2) what are the rules/conditions 3) what is expected (builds, tests, hidden jobs, red jobs, blue jobs, etc...) 4) what limits/safeguards we have in place 5) can we get statistics about what volume of jobs are being scheduled 6) are there any errors and how can developers get the status of these errors Looking forward to using this again and having a general service which we as Mozilla developers can use to make our lives simpler.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 20

•

9 years ago

There are two safeguards against cyclic re-triggers. First there's a set of already triggered rev/branch/buildername, and if there's a request to trigger anything in the set it's ignored. This set is in memory, so it gets blown away when the service is redeployed, so I also query the buildapi for any requests or builds that match the rev/branch/buildername, and if there are already more than the requested count, no triggers are made. I'm logging the global count of triggers performed by the service and could log more statistics if necessary.

Joel Maher ( :jmaher ) (UTC -8)

Comment 21

•

9 years ago

How do you handle service unavailable from buildbot? this is something which mozci experiences a lot with some workaround in place, it isn't perfect yet. it sounds like the only concern for cyclic re-triggers is when we deploy a session of mozABC and it picks up old requests. Are there any concerns with pulse messages duplicating requests? I am having trouble understanding how there can be a request to trigger additional jobs. I assume this is when say for windows 7 opt mochitest-1 we have a failure and retrigger automatically to determine if it is intermittent. When those results come back, we would determine that we already have a 'windows 7 opt mochitest-1' for the given revision/branch and take no action on it? how long do we keep this history around? I would think 2 weeks would be sufficient given that try info is magically 100% wiped out after 2 weeks.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 22

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #21) > How do you handle service unavailable from buildbot? this is something > which mozci experiences a lot with some workaround in place, it isn't > perfect yet. If we can't get a build_id or request_id to rebuild we aren't able to rebuild. The intention with this is that the action taken isn't any more mysterious than clicking the retrigger button, and the worst failure we would see would be to occasionally miss a trigger if the buildapi is unavailable. I'm assuming the data we get from the buildapi about existing builds is fairly reliable. > > it sounds like the only concern for cyclic re-triggers is when we deploy a > session of mozABC and it picks up old requests. I've tested the approach of finding existing builds from the buildapi I mentioned above and it appears effective for this case. > > Are there any concerns with pulse messages duplicating requests? I am > having trouble understanding how there can be a request to trigger > additional jobs. I assume this is when say for windows 7 opt mochitest-1 we > have a failure and retrigger automatically to determine if it is > intermittent. When those results come back, we would determine that we > already have a 'windows 7 opt mochitest-1' for the given revision/branch and > take no action on it? I don't think a duplicate message from pulse would be a problem. There's a hashmap mapping from revision to a set of buildernames populated whenever a trigger happens. If that builder has already been seen for the revision no trigger happens (I'll want to add branch to the keys of this map if when we want this for more than try). > > how long do we keep this history around? I would think 2 weeks would be > sufficient given that try info is magically 100% wiped out after 2 weeks. The buildapi has builds for fairly old try pushes, so this should be safe. I have to prune the map of revisions after a certain point so it doesn't keep old ones around, there's a comment in the code for the rationale here, but the idea is that forgetting revisions that are several days old shouldn't do harm.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 23

•

9 years ago

I think as a next test I'll run this for a short period without restrictions on which users it applies to. That should give a better estimate of how many jobs this will ultimately end up triggering.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 24

•

9 years ago

(In reply to Chris Manchester [:chmanchester] from comment #23) > I think as a next test I'll run this for a short period without restrictions > on which users it applies to. That should give a better estimate of how many > jobs this will ultimately end up triggering. I ran this for a while this afternoon (logging in "dry" mode, without actually triggering) and found 150-250 triggers per hour. This depends a lot on number of per push failures before giving up additional triggers (I tried with 5 and 7). This limit gets hit pretty frequently, so we can think of the volume as up to 10 triggers per push (or up to 14 if we need 7 trigger actions). This is a lot of jobs, but given the precedents of taskcluster and Autoland it doesn't seem entirely unreasonable.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 25

•

9 years ago

I tested this on a try push of my own on Friday, and I found that limiting to n triggers per push has an issue -- the budget of triggers gets depleted easily by jobs that are hidden and failing. I'm going to change the budget to be based on a % of all jobs for the push (for the example on Friday, 5% would be enough).

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 26

•

9 years ago

I tested again with the limit based on 5% of all jobs for a push, and the volume was around 150 triggers/hour. I suspect this sort of this varies a lot by time of day, so we'll have a better idea once this is running for day/weeks. I'll test one more time performing actual triggers, but this should be ready to turn on soon.

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 27

•

9 years ago

This has been running on behalf of the autoland user for a week and a half, and I haven't seen any issues. I've logged all the triggers it would have done if enabled for all users for the last day or so, and the rate is about 70 triggers/hour. Unless something comes up I'd like to try this enabled for all users next week when volume will be low. I'm thinking it might be reasonable to trigger a failure just once while we're rolling this out, and increase to twice depending on the kind of feedback we get. I don't expect pushback per-se, but this will be a little change from what people expect out of the try server, so let's see.

Jonathan Griffin (:jgriffin)

Reporter

Comment 28

•

9 years ago

> I'm thinking it might be reasonable to trigger a failure just once while we're rolling this out, I agree, thanks.

Chris Manchester (limited bugmail, email directly)

Assignee

Updated

•

9 years ago

Blocks: 1177653

Chris Manchester (limited bugmail, email directly)

Assignee

Comment 29

•

9 years ago

I've kept a close eye on this since enabling it last week and everything looks as I expect. We've identified a non-trivial amount of work to extend this to taskcluster scheduled jobs, but we can track that work separately.

Assignee: nobody → cmanchester

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Chris Manchester (limited bugmail, email directly)

Assignee

Updated

•

9 years ago

Depends on: 1190942

Chris Manchester (limited bugmail, email directly)

Assignee

Updated

•

9 years ago

Depends on: 1208104

Chris Manchester (limited bugmail, email directly)

Assignee

Updated

•

9 years ago

Depends on: 1219341

Boris Zbarsky [:bzbarsky]

Updated

•

8 years ago

Depends on: 1289160

Armen [:armenzg]

Updated

•

8 years ago

No longer depends on: 1289160

Chris Manchester (limited bugmail, email directly)

Assignee

Updated

•

8 years ago

Depends on: 1279676

Chris Manchester (limited bugmail, email directly)

Assignee

Updated

•

8 years ago

Depends on: 1333907